BMC Bioinformatics. 2004; 5: 202.
Published online 2004 Dec 16. doi:  10.1186/1471-2105-5-202
PMCID: PMC545963

Identifying spatially similar gene expression patterns in early stage fruit fly embryo images: binary feature versus invariant moment digital representations



Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns.


The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset. For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches.


We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns.


The complexity of animal body form arises from a single fertilized egg cell in an odyssey of gene expression and regulation that controls the multiplication and differentiation of cells [1-3]. For over two decades, Drosophila melanogaster (the fruit fly) has been a canonical model animal for understanding this developmental process in the laboratory. The raw data from experiments consist of photographs (two dimensional images) of the Drosophila embryo showing a particular gene expression pattern revealed by a gene-specific probe in wildtype and mutant backgrounds. Manual, visual comparison of these spatial gene expressions is usually carried out to identify overlaps in gene expression and to infer interactions [4-6].

Whole fruit fly embryo and other related gene expression patterns have been published in a wide variety of research journals since the late 1980s. These efforts have now entered a high-throughput phase with the systematic determination of patterns of gene expression [e.g., [7]]. As a result, the amount of data currently available has doubled, leading to the imminent availability of multiple expression patterns of every gene in the Drosophila genome [7]. In addition, the use of micro-array technology to study Drosophila development has revealed additional and important insights into changes in gene expression levels over time and under different conditions at a genomic scale [8,9].

With this rapid increase in the amount of available primary gene expression images, searchable textual descriptions of images have become available [7,10,11]. However, a direct comparison of the gene expression patterns depicted in the images is also desirable to find biologically similar expression patterns, because textual descriptions (even using a highly structured and controlled vocabulary) cannot fully capture all aspects of an expression pattern. In fact, there is a need for automated identification of images containing overlapping or similar gene expression patterns [6,12] in order to assist researchers in the evaluation of similarity between a given expression pattern and all other existing (comparable) patterns in the same way that the BLAST [13] technique functions for DNA and protein sequences. Of course, unlike the genomes with four letters and proteomes with 20 letters, all gene expression anatomies cannot be easily reduced to, and thus represented by, a small number of components.

We previously proposed a binary coded bit stream pattern to represent gene expression pattern images [6]. In this digital representation, referred to as the Binary Feature Vector (BFV; BSV in [6]), the unstained pixels in the images (white regions and background) were denoted by a value of 0 and the stained areas (colored and foreground: gene expression) were denoted by a value of 1. Based on the BFV representations of the expression pattern, we proposed a Basic Expression Search Tool for Images (BESTi) [6] with an aim to produce biologically significant gene expression pattern matches using image content alone, without any reference to textual descriptions. We found that the BESTi approach generated biologically meaningful matches to query expression patterns [6].

In this paper, we explore how a more sophisticated Invariant Moment Vectors (IMV, [14]) based digital representation of gene expression patterns performs in generating an ordered list of best-matching images that contain similar/overlapping gene expression patterns to that depicted in a query image. IMV are frequently used in natural image processing (e.g., optical character recognition [15]) and have a number of desirable properties, including the compensation for variations of scale, translation, and rotation. If successful, IMV representations hold the promise of producing significantly shorter computing times for image-to-image matching compared to BFV.

Previously, we had examined the performance of the BFV representation for a limited dataset of early stage images [6]. Here we compare the relative performances of BFV and IMV first using a dataset containing 226 images (from 13 research papers). Then we test for scalability of the BESTi search by using a seven times larger dataset containing 1819 (1593 new + 226 previous) images from 262 additional research papers (list available upon request from the authors). Both datasets contained lateral views of early stage (1–8) embryos.

During these investigations, we also developed another measure of image-to-image similarity for the BFV representation. This measure is aimed at finding images that contain as much of the query image expression pattern as possible, but without penalizing for the presence of any expression outside the overlap region in the target image. In addition, we examined whether partitioning a multi-domain expression pattern into multiple BFV representations, each containing only one domain, yields a better result set.

Recently, Peng and Myers [16] have proposed a different procedure involving the global and local Gaussian Mixture Model (GMM) of the pixel intensities (of expression) to identify images with similar patterns. This GMM method is expected to find images with intensity and spatial similarities. This is different from the BFV and IMV methods examined here, which are intended to find only spatially similar patterns. This focus is important because, as mentioned in [6], the differences in gene expression intensity among images in the published literature can arise simply from the use of different techniques or illumination conditions, or for biological reasons. However, the Peng and Myers method [16] appears to be promising and we plan to examine its effectiveness in a separate paper.

Results and discussion

Data set generation

An image database of 226 gene expression pattern images was initially generated using data from the literature [17-29]. All were lateral images and exhibited early stage (1–8) expression patterns. These images were selected because they had some commonality of gene expression (as seen by the human eye), which allowed us to evaluate the performance of the BESTi in finding correct as well as false matches under controlled conditions. BESTi was also tested for scalability on a larger dataset containing 1819 (1593 plus the 226) lateral views of early stage embryos. These 1593 images were obtained from 262 articles.

In order to present comprehensible result sets in this paper, we have primarily discussed the findings from the dataset of 226 and provided information on how those queries scaled when they were conducted for the larger dataset. In general, our focus was to show the retrieval of biologically significant matches based on both the visual overlap of the spatial gene expression pattern and the genes associated with the pattern retrieved.

Each image was standardized and the binary expression pattern extracted following the procedures described previously [6]. These extracted patterns, their invariant moments (φ1 through φ7), and binary feature representations were stored in a database. We also calculated and stored the expression area (the count of the number of 1's in the binary feature represented image), the X and Y coordinates of the centroid ($\bar{x}$, $\bar{y}$), and the principal angle (θ) for each extracted pattern.

To quantify the similarity of gene expressions in two images, we computed two measures (SS, SC) based on the BFV representation (See equations 2 and 3 in Methods). SS is designed to find gene expression patterns with overall similarity to the query image, whereas SC is for finding images that contain as much of the query image expression pattern as possible without penalizing for the presence of any expression outside the overlap region in the target image. For a given pair of gene expression patterns (A and B), SS is the same irrespective of which image in the pair is the query image. That is, SS(A,B) = SS (B,A). This is not so for SC, because SC measures how much of the query gene expression pattern is contained in the image. Therefore, SC (A,B) ≠ SC(B,A).
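As an illustration of this asymmetry, consider a hypothetical pair of patterns (the pixel counts are invented for the example, not taken from the datasets): a query pattern A covering 100 pixels that lies entirely inside a target pattern B covering 400 pixels. Applying equations 1 to 3 from Methods gives:

```latex
\begin{align*}
\mathrm{Count}(A \,\mathrm{AND}\, B) &= 100, \qquad
\mathrm{Count}(A \,\mathrm{OR}\, B) = 400, \qquad
\mathrm{Count}(A \,\mathrm{XOR}\, B) = 300, \\
S_S(A,B) = S_S(B,A) &= 1 - \frac{300}{400} = 0.25, \\
S_C(A,B) &= \frac{100}{100} = 1, \qquad
S_C(B,A) = \frac{100}{400} = 0.25 .
\end{align*}
```

In this example SC ranks B as a perfect match for the query A because all of A is contained in B, whereas SS penalizes the 300 pixels of B that lie outside A.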

For the IMV representation, we computed one dissimilarity measure (Dφ, equation 13 in Methods). Results from Dφ should be compared to those from SS, because neither measure depends on which image is the reference, i.e., Dφ(A,B) = Dφ(B,A), and both capture overall similarity or dissimilarity.

Matches and their biological significance

The evaluation of BESTi was geared towards determining the biological validity of the results obtained from the image matching procedure. All results were based solely on quantitative similarities between images without using any textual descriptions. All images were lateral views from the early stages of fruit fly embryogenesis and were oriented anterior end to the left and dorsal to the top. We refer to the images retrieved as the BESTi-matches.

Performance of BFV-SS search

Figure 1A shows the query image with gene expression restricted to the anterior (left) portion of the embryo, except that the expression is absent at the anterior terminus [22]. The query image depicts the expression of the sloppy paired (slp1) gene in a wildtype embryo. The BESTi-matches based on the SS measure for the BFV representations are given in Figure 1A1–A8. BESTi retrieves images showing similar expression patterns, all of which are from the same research article as the query image [22]. These images depict the expression patterns of sloppy paired genes (slp1 and slp2) in a variety of genetic backgrounds or in combination with the head gap gene orthodenticle (otd); all of these genes are essential for pattern formation in Drosophila head development [30]. In fact, slp1 and slp2 are tightly linked genes found in the slp locus of the Drosophila genome. They are not only closely related in their primary sequence structure, but also significantly similar in their expression pattern (compare Figure 1A7 and 1A8).

Figure 1
BESTi search results with smaller dataset. Results from the BESTi-search for the same query image [22] based on (A) BFV [SS], (B) IMV [Dφ] and (C) BFV [SC] representations in the original dataset (226 images); and based on (D) BFV [SS] and (E) ...

A search was conducted using the same query image and the same similarity measure (SS) on the larger dataset. Figure 2 shows the top-35 matches, which contain all 8 matches shown in Figure 1A (images with blue colored legends). This allowed us to directly compare the quality of matches between the two datasets. Analysis of the larger database of images yields more matches for the same SS cut-off value, as expected. A visual inspection reveals that these are all relevant images (Figure 2), with the larger dataset yielding more images for otd (20 images, Figure 2C). Images with expression patterns from slp1, slp2 and combined otd expression are found in Figure 2A, 2B and 2D. More importantly, searches in the larger dataset provide images containing expression patterns of additional genes: Kruppel (Kr), hunchback (hb), bicoid (bcd), nanos, snail, hu-li tai shao (hts) and hairy (Figure 2E–K). Since these images did not exist in the smaller dataset, they were not included in the search results in Figure 1A. All are biologically useful matches because combinatorial input from gap genes (Kr, hb) along with slp1 establishes the domains of segment polarity genes in the head [22]. As for the snail, hts and hairy genes, there is no known interaction between them and slp1 (the gene in the query image) in the wildtype embryo, but the images show overlap in gene expression due to the genetic backgrounds used [31-33]. Therefore, they are also biologically relevant matches.

Figure 2
BESTi search results for SS with larger dataset. Comparison of search results from the small (226 images) and large (1819 images) dataset using the SS measure for the same query image (Figure 1A) [22]. Panels (A-K) are based on the genes whose expression ...

Performance of IMV search

We used the same query image for the IMV method applied to the smaller dataset (Dφ, results in Figure 1B) and compared the results to the BFV-SS search. In this case, we obtain images containing expressions of hb, Kr, tailless (tll), slp1, hairy and infra-abdominal (iab) (type I transcript). It is clear that IMV search produces some biologically disconnected matches. For example, Figures 1B2, 1B4–B7 exhibit no visual overlap in gene expression pattern with the query. Furthermore, even the biologically significant matches were retrieved out of order (Figure 1B1 before 1B3). This happens because Dφ retrieves expression patterns that are of similar shape and/or size, regardless of the translation or rotation with respect to the query image.

A comparison of the results from the smaller and larger dataset for the IMV measure is given in Figure 3. Twenty-six images were retrieved from the larger dataset when we used the same maximum distance value for the same query image. Of these, only two images showed the expression pattern of slp1 (Figure 3 A1–A2). The expression of bcd was found in two of the results (Figures 3 B1–B2). Thirteen images containing gap gene expression patterns of Kr, hb, tll, giant (gt) and knirps (kni) (Figures 3 C1–C4, D1–D3, E1–E2, F1–F2, I1 and 3J) were also retrieved. Images with expression patterns of hairy, achaete-scute complex (AS-C), iab (type I transcript), IAB5 enhancer, ventral nervous system defective (vnd), short gastrulation (sog) and a combined expression of bcd, nanos and cap 'n' collar (cnc) accounted for the remaining nine (Figures 3 G1–G2, H1–H2, K1, L1, M1, N1 and 3O1). We see that the new results also suffer from the same problems as before. For example, images in Figure 3 C, E, K and 3L have no common expression pattern with the query image. Hence these are not biologically significant results, even though a few of them (Figures 3 C1–C4, E1–E2) contain expression patterns of developmentally connected genes (Kr and tll with slp1).

Figure 3
BESTi search results for Dφ with larger dataset. Comparison of search results from the small (226 images) and large (1819 images) dataset using the Dφ measure for the same query image (Figure 1A) [22]. Panels (A-O) are based on the genes ...

Since both SS and Dφ measures capture the overall similarity or dissimilarity, we can use Figures 2 and 3 to compare the relative effectiveness of the BFV and IMV methods on the larger dataset. We clearly see that the BFV method performs much better in retrieving both overlapping and similar expression patterns that are also biologically significant.

In addition to the Hu moments, one could also compute Zernike moments, which are based on the polar coordinate system. Both Hu moments and Zernike moments are susceptible to the same problem, namely that expression patterns showing a similar shape but translated to different locations in the embryo would appear in the same result set. We chose to study the Hu Invariant Moment Vectors mainly because the centroid of the image can be used to distinguish between similarly shaped but translated expression patterns. With Zernike moments, the image must be inherently contained within a unit circle anchored at the centroid [34]. Thus, there is no straightforward method to eliminate the translational problem.

Using the Hu moments, the spatial location problem can be corrected by considering the Euclidean difference in the centroid location, expressed in pixels (ΔCXY), between the query and each result. In the case of BFV-SS search results in Figure 1 (A1–A8), the maximum ΔCXY is less than or only slightly greater than the minimum ΔCXY for the IMV search results (Figure 1 B1–B8). Therefore, in the present case, the IMV-based BESTi search results need to be pared down using the centroid location difference. For example, if we consider only results with a ΔCXY less than or equal to 50 pixels, the images shown in Figure 1 B2, B4–B7 would be removed, producing a more meaningful result set.
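The centroid-based pruning described here can be sketched as a simple post-filter on a ranked result list. This is a minimal illustration, not the BESTi implementation: the function names, the `(name, centroid)` tuple layout, and the example coordinates are all hypothetical; only the 50-pixel threshold comes from the text.

```python
import math

def delta_cxy(c1, c2):
    """Euclidean distance, in pixels, between two (x, y) centroids."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def filter_by_centroid(query_centroid, hits, max_delta=50.0):
    """Keep only hits whose centroid lies within max_delta of the query.

    hits is a list of (name, centroid) tuples, already ordered by the
    moment-based dissimilarity; the original ordering is preserved.
    """
    return [name for name, c in hits
            if delta_cxy(query_centroid, c) <= max_delta]

# Hypothetical stored centroids for a query and three retrieved images.
query_c = (100.0, 80.0)
hits = [("near", (120.0, 90.0)),
        ("far", (300.0, 85.0)),
        ("mid", (130.0, 110.0))]
kept = filter_by_centroid(query_c, hits)
```

Because the centroids are already stored with each extracted pattern, a filter of this kind adds only one distance computation per candidate image.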

Performance of BFV-SC search

Figure 1C shows the result for the same query image as used in Figure 1A, but using the newly devised SC measure for the BFV representation (BFV-SC search). This is expected to retrieve images with gene expression patterns that contain the largest amount of overlap with the expression pattern in the query image. The top eight hits shown (Figure 1C1–C8) all contain over 93% of the query expression pattern: five of the matches are to the expression of hunchback (hb; C1, C3–C6) and the remaining three are from slp1 under different genetic backgrounds. As mentioned above, the combinatorial input from gap genes (including hb) along with slp1 establishes the domains of segment polarity genes in the head [22]. Therefore, gene expression patterns found by BFV-SC search are for developmentally connected genes. However, using the same query image, BFV-SC search yielded only two images in common with the BFV-SS results (Figure 1; C7 and C8 are the same as A5 and A4, respectively). This difference occurs because SS is designed to find gene expression patterns with overall similarity to the query image (Figure 1A), whereas SC is intended for finding images that contain as much of the query image expression pattern as possible, regardless of any gene expression in the result image outside the region of overlap with the query image. Therefore, BFV-SS and BFV-SC have the capability of finding gene expression patterns from different biological perspectives.

Using the same minimum similarity value for the BFV-SC in the larger dataset resulted in 55 images, given in Figure 4. Gene expression patterns of slp1 and otd accounted for 8 of these images (Figure 4A and 4B). 22 images contained expression patterns of the various gap genes hb, Kr, kni and tll (Figure 4C, 4E–F, 4I–L) that were co-expressed with bcd and nanos (Figure 4E and 4J) or with en (Figure 4I). Five other genes, developmentally connected to the gene, slp1, in the query image were also retrieved in this result set (eve, twist, dpp (decapentaplegic) [35]; en (engrailed) [36]; arm (armadillo) [37]; Figure 4M–Q). These images were not found in the top-35 of SS result set, which accentuates the different capabilities of the two BFV similarity measures in retrieving biologically relevant matches. The remaining images had expression patterns of AS-C, sc (scute), snail, hairy, zen (zerknullt), run, Hsp83, nmo (nemo), Tc'hb, iab, hts and sog (Figure 4D, 4G–H, 4R–Z) which are not known to be directly related to the gene slp1. All but seven of these images (Figures 4 D3–D4, H1–H2, R1, X1 and 4Y1) were from a different developmental stage than the query image. Hence, by limiting the results to those from a specific stage, extraneous matches can be removed. The seven images having the same stage as the query image were retrieved because of their significant overlap (more than 94%) with the query gene expression pattern. Thus, we observe that the new distance measure SC has the potential to identify images containing expression patterns of developmentally connected genes, other than those retrieved by SS, thus improving the overall performance of the BFV method and the BESTi tool.

Figure 4
BESTi search results for SC with larger dataset. Comparison of search results from the small (226 images) and large (1819 images) dataset using the SC measure for the same query image (Figure 1A) [22]. Panels (A-Z) are based on the genes whose ...

Analysis of multi-domain gene expression patterns

Due to the presence of multiple areas of expression, some patterns in the database that appeared (by eye and biologically) to be much better matches to the query image were either not found or not ranked very high. Hence, we also analyzed multi-domain expression patterns separately for the smaller dataset. Developmental biologists are also interested in finding such patterns as they contain overlaps with the expression domains in the query image. In fact, a large number of the expression patterns available today contain multiple isolated domains of expression, since more than one topologically distinct region of expression may be produced by many genes, transgenic constructs, probes or experimental techniques (multiple staining). In such cases, we need to consider each of these regions individually as well as in the context of the composite pattern. Biologically, it is important to consider them separately because different regions of expression may be under the control of distinct cis-regulatory sequences [e.g., [28,38]] or may represent the expression of different genes in a multiply-stained embryo.

Separating multi-domain gene expression patterns into individual components was straightforward; we simply generated multiple images from the same initial image and included them in the target dataset. This resulted in 192 additional images (418 total) in the database all of which were components of the initial gene expression patterns. The images were separated into expression regions horizontally and/or vertically depending on the gene expression. For this new set of images, the IMV as well as BFV representations were re-calculated and the BESTi query constructed as above.
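The text does not specify how the component images were produced beyond horizontal and/or vertical separation. One way to automate a comparable split, offered here only as an assumption and not as the authors' procedure, is connected-component labeling: each 4-connected region of stained pixels becomes its own binary image of the original size. The function name `split_domains` is hypothetical.

```python
from collections import deque

def split_domains(grid):
    """Return one binary grid per 4-connected region of 1-pixels.

    grid is a list of equal-length rows of 0/1 values. Each returned
    grid has the same dimensions as the input, so its BFV can be
    compared directly against BFVs of whole-pattern images.
    """
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    domains = []
    for y0 in range(h):
        for x0 in range(w):
            if not grid[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited stained pixel.
            component = [[0] * w for _ in range(h)]
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component[y][x] = 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and grid[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            domains.append(component)
    return domains

# Toy pattern with an anterior (left) and a posterior (right) domain.
pattern = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
parts = split_domains(pattern)
```

A purely topological split like this would not separate domains that touch, which may be one reason a manual horizontal or vertical cut is preferable for some patterns.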

Results from BFV-SS and IMV queries for this data set are given in Figures 1D and 1E, respectively. Now, many images with multiple regions of expression are retrieved in the result set (Figure 1D: D1–D8) and many of them show an even better match with the query pattern than those in Figure 1A for the BFV-based BESTi search. For instance, gene expression patterns are now retrieved (with more than 55% pattern similarity) from embryos with the expression of tailless (tll), which is known to interact with slp1 in defining the embryonic head [22], and with a composite expression of race (related to angiotensin converting enzyme), sog (short gastrulation) and eve (even-skipped) due to enhanced race expression in the anterior domain caused by a transgenic construct causing ectopic expression of sog [19]. Therefore, the strategy of dividing multi-domain expression data into individual domains provides additional flexibility to query individual components or sub-sets of complex expression patterns. Results also improved for IMV (Figure 1E), but again the outcome reinforced the need to use the difference in centroid to limit the result set.

Next we examine the performance of SS, SC and Dφ in finding BESTi matches for a query pattern with multiple regions of expression (Figure 5A). This complex expression pattern consists of anterior and posterior domains caused by enhanced race expression resulting from dosage alteration of dpp in a gastrulation defective (gd) mutant background, and a middle stripe due to misexpressed sog using an eve stripe-2 enhancer [Figure 2d in [19]]. The results from this query are shown in Figure 5A1–A8 (only the original image set (226) was used as the target database in this case). We again find that SS finds many images from the same paper as well as some images from other research articles with similar expression patterns. The results correctly include expression pattern of eve (Figure 5A4), of another pair-rule gene (ftz: fushi tarazu; Figure 5A6), and of two other developmentally related genes [39,40].

Figure 5
BESTi search results with multiple domains of expression using smaller database. Results from BESTi-search for a query image with multiple domains of expression. (A) BFV [SS], (B) IMV [Dφ] and (C) BFV [SC] searches for the same expression pattern ...

When Dφ is used as a search criterion, it produces some correct matches in the result set (Figure 5B1–B8). However, it generally fails to rank biologically meaningful matches as the best matches. Use of the centroid in this case is also not productive, as most of the matches show very close centroids. The principal angle (θ) value calculated does not show a significant difference in the early stage embryos used in this study. The results using the SC based search are given in Figure 5C1–C8. They show a number of images in common with the SS results. However, as expected, there are significant differences between the two searches.

The results in Figures 5D and 5E demonstrate the power of the BESTi-search when the multi-domain expression data are represented in their component patterns (domain database). In this case, all the BESTi searches are based on the use of SS as the search criterion. These searches are based on the complete expression (Figure 5D) and on one of its components (bottom-left domain, Figure 5E). All but one of the BESTi-matches in Figure 5D contain both domains of expression. In contrast, the use of only the left, anterior domain (Figure 5E) in the BESTi search produces many other images in which the gene expression pattern is similar to only the anterior-ventral query pattern. Therefore, the use of individual expression components as search arguments increases the potential of directly identifying different overlapping expression patterns.


We have found that it is possible to identify biologically significant gene expression patterns from a dataset by first extracting numeric signature descriptors and then using those descriptors in a computerized search of the database for expression patterns with similar signatures or maximum pattern similarities. We find that the BFV methodologies provide a longer and more biologically meaningful set of expression pattern matches than IMV. Even though IMV representations will produce much faster retrieval speeds for large collections of embryogenesis images, the lack of biological validity of the BESTi-matches retrieved makes IMV undesirable for the present problem. Instead, investigations and strategies aimed at improving the real-time performance of the BFV representation will better serve developmental biology research.


The wide variety of input methodologies, illumination conditions, equipment, and publication venues involved in the acquisition and presentation of gene expression patterns makes the available gene expression pattern data rather diverse. Extracting a gene expression pattern from its background requires the use of a combination of manual and automatic techniques. Each image is first standardized into a binary image as described in [6]. The standardized images are then represented using the Binary Feature Vector (BFV) [6] and the Invariant Moment Vectors (IMV) [14]. The similarity measures SS and SC are derived from the BFV: SS is the one's complement of the distance metric DE presented in [6], and SC is a new measure introduced in this paper. The third metric, Dφ, is derived from the invariant moment vectors.

Binary Sequence Vector analysis

The binary coded bit stream pattern, in which the two possible states indicate staining over or under a threshold value, is called the Binary Feature Vector (BFV). This is referred to as the Binary Sequence Vector (BSV) in [6]. In other words, we represent each image as a sequence of 1's and 0's, where the black pixels (stained areas) are denoted by a value of 1 and the white pixels (unstained and background) are denoted by a value of 0. This BFV holds the gene expression and localization pattern information of each image.
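As a minimal sketch of this encoding, the snippet below flattens a small grayscale grid into a bit sequence. It assumes 8-bit intensities in which stained pixels are darker than a fixed threshold; the function name `to_bfv`, the threshold of 128, and the row-by-row ordering are illustrative assumptions, and the actual standardization and extraction steps are those described in [6].

```python
def to_bfv(gray, threshold=128):
    """Flatten a grayscale image (rows of 0-255 ints) into a list of bits.

    Assumption: pixels darker than the threshold are stained and map
    to 1; all other pixels (unstained tissue, background) map to 0.
    Pixels are visited row by row, left to right.
    """
    return [1 if px < threshold else 0 for row in gray for px in row]

# Toy 2 x 4 image: the right half is darkly stained.
image = [
    [255, 255, 40, 30],
    [255, 200, 20, 10],
]
bfv = to_bfv(image)
expression_area = sum(bfv)  # number of stained pixels
```

Because every standardized image has the same dimensions, position i in one BFV refers to the same embryo location as position i in any other, which is what makes the pixel-wise comparisons meaningful.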

The expression patterns are ordered by evaluating a set of difference values, DE, between the binary feature vectors of every possible pair of images in the dataset. DE was introduced in [6] and is formally given as,

DE = Count(A XOR B)/Count(A OR B)     (1)

The term Count(A XOR B) corresponds to the number of pixels not spatially common to the two images and the term Count(A OR B) provides the normalizing factor, as it refers to the total number of stained pixels (expression area) depicted in either of the two images being compared. For simplicity, we use the one's complement of DE as a measure of similarity of gene expression patterns between two images. This measure, SS, is given by the equation

SS = (1 - DE).     (2)

SS quantifies the amount of similarity based on the overlap between two expression patterns. SS is equal to 1 when the two expression patterns are identical (DE = 0).

We introduce a new similarity measure in this paper that does not penalize for any non-overlapping region. The measure SC quantifies the amount of similarity based on the containment of one expression pattern in the other given by

SC = Count(A AND B)/Count (A)     (3)

If the entire query expression pattern is contained within a target image, i.e., there is complete overlap with respect to the query image, SC is equal to 1. Note that SC(A,B) ≠ SC(B,A) in general, because the denominator corresponds to the gene expression area of the query image A.
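Equations 1 to 3 translate directly into a few lines of code. The sketch below operates on BFVs stored as equal-length lists of 0/1 values and adds a simple ranking step with a similarity cut-off. The function names, the dictionary-based database, and the handling of empty patterns (treated as DE = 0 and SC = 0) are assumptions made for illustration, not details from BESTi.

```python
def count_and(a, b):
    return sum(x & y for x, y in zip(a, b))

def count_or(a, b):
    return sum(x | y for x, y in zip(a, b))

def count_xor(a, b):
    return sum(x ^ y for x, y in zip(a, b))

def d_e(a, b):
    """Equation 1: pixels not in common, normalized by the union."""
    union = count_or(a, b)
    return count_xor(a, b) / union if union else 0.0  # assumption for empty input

def s_s(a, b):
    """Equation 2: symmetric overall similarity."""
    return 1.0 - d_e(a, b)

def s_c(query, target):
    """Equation 3: fraction of the query pattern contained in the target."""
    area = sum(query)
    return count_and(query, target) / area if area else 0.0  # assumption

def rank(query, database, measure, cutoff):
    """Return (name, score) pairs with score >= cutoff, best first."""
    scored = [(name, measure(query, bfv)) for name, bfv in database.items()]
    hits = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(hits, key=lambda item: item[1], reverse=True)

query = [1, 1, 0, 0, 0, 0]
database = {
    "same": [1, 1, 0, 0, 0, 0],
    "superset": [1, 1, 1, 1, 0, 0],
    "disjoint": [0, 0, 0, 0, 1, 1],
}
by_ss = rank(query, database, s_s, cutoff=0.5)
by_sc = rank(query, database, s_c, cutoff=0.9)
```

Since Count(A XOR B) = Count(A OR B) − Count(A AND B), SS equals Count(A AND B)/Count(A OR B), so SS and SC share a numerator and differ only in what they normalize by.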

Invariant Moment Vector (IMV) analysis

Some methodologies of image analysis produce numeric descriptors that compensate for variations of scale, translation and rotation. In the following section, we describe the invariant moment analysis of gene expression data. Invariant moment calculations have been used in optical character recognition and other applications for many years [15].

To calculate these invariant moment descriptors the standardized binary image [6] is converted to a binary representation of the same pattern (BFV). From this binary sequence of the image, the invariant moments and other descriptors are extracted using the following method [14,41]. The continuous scale equation used is

$M_{pq} = \iint x^p y^q f(x, y)\,dx\,dy$,     (4)

where $M_{pq}$ is the two-dimensional moment of the function of the gene expression pattern, f(x, y). The order of the moment is defined as (p + q), where both p and q are non-negative integers. When implemented in a digital or discrete form this equation becomes

$M_{pq} = \sum_x \sum_y x^p y^q f(x, y)$

We then normalize for image translation using $\bar{x}$ and $\bar{y}$, which are the coordinates of the center of gravity (centroid) of the area showing expression. They are calculated as

$\bar{x} = M_{10}/M_{00}, \quad \bar{y} = M_{01}/M_{00}$

Discrete representations of the central moments are then defined as follows:

$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q f(x, y)$

A further normalization for variations in scale can be implemented using the formula,

$\eta_{pq} = \mu_{pq}/\mu_{00}^{\gamma}$

where $\gamma = (p + q)/2 + 1$ is the normalization factor. From the central moments, the following values are calculated:

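$\phi_1 = \mu_{20} + \mu_{02}$

$\phi_2 = (\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2$

$\phi_3 = (\mu_{30} - 3\mu_{12})^2 + (3\mu_{21} - \mu_{03})^2$

$\phi_4 = (\mu_{30} + \mu_{12})^2 + (\mu_{21} + \mu_{03})^2$

$\phi_5 = (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2] + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2]$

$\phi_6 = (\mu_{20} - \mu_{02})[(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2] + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03})$

$\phi_7 = (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2] - (\mu_{30} - 3\mu_{12})(\mu_{21} + \mu_{03})[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2]$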

where φ7 is a skew invariant to distinguish mirror images. In the above, φ1 and φ2 are second order moments and φ3 through φ7 are third order moments. φ1 (the sum of the second order moments) may be thought of as the "spread" of the gene expression pattern; whereas the square root of φ2 (the difference of the second order moments) may be interpreted as the "slenderness" of the pattern. Moments φ3 through φ7 do not have any direct physical meaning, but include the spatial frequencies and ranges of the image.

In order to provide a discriminator for image inversion (and rotation), sometimes called the "6", "9" problem, it has been suggested [14,42] that the principal angle be used to determine "which way is up". This is extremely important in embryo images because gene expression at the anterior and posterior regions may simply appear to be mirror images of each other to the invariant moments, but biologically they are completely distinct. The principal axis of the gene expression pattern f(x, y) is the line of minimum rotational inertia that passes through the centroid ($\bar{x}$, $\bar{y}$). The moment of inertia of f about a line through the centroid at angle θ is given as:

$I(\theta) = \sum_x \sum_y [(x - \bar{x})\sin\theta - (y - \bar{y})\cos\theta]^2 f(x, y)$

The angle of the principal axis is called the principal angle θ. It is calculated by noting that $(y - \bar{y})\cos\theta = (x - \bar{x})\sin\theta$ is a line through ($\bar{x}$, $\bar{y}$) with slope tan θ. We can find the θ value at which the moment of inertia of f about this line is minimum by differentiating the moment of inertia with respect to θ and setting the result equal to zero. This produces the following equation:

$\tan 2\theta = 2\mu_{11}/(\mu_{20} - \mu_{02})$

Using the condition |θ| < 45° one can distinguish the "6" from the "9" and rotationally similar gene expression patterns.
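A minimal Python sketch of the principal angle calculation is given below. It assumes the standard closed form tan 2θ = 2μ11/(μ20 − μ02) and takes the stained pixels as a list of (x, y) coordinates. The function name `principal_angle` and the handling of the degenerate case μ20 = μ02 are illustrative choices, not details taken from this study.

```python
import math


def principal_angle(pixels):
    """Principal angle in radians, restricted to |theta| <= 45 degrees.

    pixels is a list of (x, y) coordinates of stained pixels.
    """
    n = len(pixels)
    if n == 0:
        raise ValueError("no stained pixels")
    xbar = sum(x for x, _ in pixels) / n
    ybar = sum(y for _, y in pixels) / n
    mu20 = sum((x - xbar) ** 2 for x, _ in pixels)
    mu02 = sum((y - ybar) ** 2 for _, y in pixels)
    mu11 = sum((x - xbar) * (y - ybar) for x, y in pixels)
    if mu20 == mu02:
        # tan(2*theta) is undefined here; return 0 for an isotropic pattern
        # and +/-45 degrees otherwise (an assumption made for this sketch).
        return 0.0 if mu11 == 0 else math.copysign(math.pi / 4, mu11)
    # atan keeps 2*theta within (-90, 90) degrees, so |theta| < 45 degrees.
    # When mu20 > mu02 this stationary point is the minimum-inertia axis.
    return 0.5 * math.atan(2.0 * mu11 / (mu20 - mu02))


# Three collinear pixels on a line of slope 0.5.
theta = principal_angle([(0, 0), (2, 1), (4, 2)])
print(math.degrees(theta))
```

For the collinear pixels in the example, the recovered angle equals the arctangent of the slope of the line, as expected for a pattern elongated along the horizontal axis.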

In invariant moment analysis, our initial method of image comparison calculates the Euclidean distance between the images using all moments (φ1 through φ7) and combinations of these moments. For example, if the first two invariant moments are used, then each image i is represented by the vector

$(\phi_{i1}, \phi_{i2})$

and the distance Dij, between a pair of images i and j where i, j = 1, 2,...n is given by

$D_{ij} = \sqrt{(\phi_{i1} - \phi_{j1})^2 + (\phi_{i2} - \phi_{j2})^2}$

This can be expanded to use all of the moment variables. Here, the Euclidean distance, Dφ, between any two images is calculated as

$D_{\phi} = \sqrt{\sum_{j=1}^{7} (\phi_{ij} - \phi_{qj})^2}$

where i and q designate images whose distance is being calculated and j designates the parameters used in the distance calculation and j = 1, 2, ..., 7. This assumes that all moments have the same dimensions or that they are dimensionless.

Using this method, it is possible to rank each of the images in order of their similarity based on, for example, the first two invariant moments, which have clear-cut physical meanings. Expansion to include additional moments or parameters can be performed in a number of ways. It is possible to add additional parameters to the distance calculation, making sure that each of the parameters has the same dimension. For example, φ1 has the dimension of distance squared, while φ2 has the dimension of the fourth power of distance, thus requiring the square root function to equalize dimensions for comparable distance calculation purposes. In general, the greater the number of invariant moments used in the distance calculation, the more selective the ranking. We have also allowed for the use of the centroids and principal angle as a means of list limiting.
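The ranking step can be sketched in Python as follows, assuming each image is summarized by the pair (φ1, √φ2) so that both components share the same dimension. The function names, the image labels and the moment values are all hypothetical and were invented purely for illustration. They are not data from this study.

```python
import math


def moment_features(phi1, phi2):
    """Feature pair with matched dimensions: (phi1, sqrt(phi2))."""
    return (phi1, math.sqrt(phi2))


def euclidean_distance(u, v):
    """Euclidean distance between two moment vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("moment vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(query, database):
    """Return (name, distance) pairs ordered from most to least similar."""
    scored = [(name, euclidean_distance(query, vec))
              for name, vec in database.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical (phi1, phi2) values, invented for this example only.
database = {
    "image_a": moment_features(2.0, 4.0),
    "image_b": moment_features(5.0, 1.0),
    "image_c": moment_features(2.5, 4.0),
}
query = moment_features(2.0, 4.0)

ranking = rank_by_distance(query, database)
print([name for name, _ in ranking])
```

Longer vectors (for example, all seven moments) can be passed to the same distance function unchanged, provided their components have been brought to a common dimension first.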

Authors' contributions

SK originally conceived the project, developed the image distance measures based on the BFV representation, wrote an early version of the manuscript, and edited it until the final version. RG was responsible for writing new and using pre-existing programs to perform the image distance and parameter calculations, helped prepare the figures, searched the literature for gene expression data, maintained the database of gene expression pattern images, and helped in writing the manuscript. BVE provided the IMV method description, managed the day-to-day activities in the project, and did significant editing to produce the manuscript in the desired format for the journal. SP originally proposed the use of invariant moment vectors for biological image analysis, contributed significantly for the image distance and parameter calculations and provided critical feedback during the later stages of revision.


We thank Dr. Robert Wisotzkey for biological remarks, Dr. Dana Desonie for editorial comments and Dr. Stuart Newfeld for useful suggestions. This research was supported in part by research grants from National Institutes of Health (S.K.) and the Center for Evolutionary Functional Genomics (S.K.) at the Arizona State University.


  • Carroll SB, Grenier JK, Weatherbee SD. From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design. Massachusetts, MA, Blackwell Scientific; 2000.
  • Davidson E. Genomic Regulatory Systems: Development and Evolution. New York, NY, Academic Press; 2000.
  • Rougvie AE. Control of developmental timing in animals. Nat Rev Genet. 2001;2:690–701. doi: 10.1038/35088566. [PubMed] [Cross Ref]
  • Gieseler K, Wilder E, Mariol MC, Buratovitch M, Berenger H, Graba Y, Pradel J. DWnt4 and wingless elicit similar cellular responses during imaginal development. Dev Biol. 2001;232:339–350. doi: 10.1006/dbio.2001.0184. [PubMed] [Cross Ref]
  • Takaesu NT, Johnson AN, Sultani OH, Newfeld SJ. Combinatorial Signaling by an Unconventional Wg Pathway and the Dpp Pathway Requires Nejire (CBP/p300) to Regulate dpp Expression in Posterior Tracheal Branches. Dev Biol. 2002;247:225–236. doi: 10.1006/dbio.2002.0693. [PubMed] [Cross Ref]
  • Kumar S, Jayaraman K, Panchanathan S, Gurunathan R, Marti-Subirana A, Newfeld SJ. BEST: A novel computational approach for comparing gene expression patterns from early stages of Drosophila melanogaster development. Genetics. 2002;162:2037–2047. [PMC free article] [PubMed]
  • Tomancak P, Beaton A, Weiszmann R, Kwan E, Shu S, Lewis SE, Richards S, Ashburner M, Hartenstein V, Celniker SE, Rubin GM. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 2002;3:research0088.1–88.14. doi: 10.1186/gb-2002-3-12-research0088. [PMC free article] [PubMed] [Cross Ref]
  • Montalta-He H, Reichert H. Impressive expressions: developing a systematic database of gene-expression patterns in Drosophila embryogenesis. Genome Biol. 2003;4:205. doi: 10.1186/gb-2003-4-2-205. [PMC free article] [PubMed] [Cross Ref]
  • Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP. Gene expression during the life cycle of Drosophila melanogaster. Science. 2002;297:2270–2275. doi: 10.1126/science.1072152. [PubMed] [Cross Ref]
  • FlyBase The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Research. 1999;27:85–88. doi: 10.1093/nar/27.1.85. [PMC free article] [PubMed] [Cross Ref]
  • Janning W. FlyView, a Drosophila image database, and other Drosophila databases. Seminars in Cell and Developmental Biology. 1997;8:469–475. doi: 10.1006/scdb.1997.0172. [PubMed] [Cross Ref]
  • Bard JBI. Introduction: Making and filling gene-expression developmental databases. Seminars in Cell and Developmental Biology. 1997;8:455–458. doi: 10.1006/scdb.1997.0170. [PubMed] [Cross Ref]
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1006/jmbi.1990.9999. [PubMed] [Cross Ref]
  • Hu MK. Visual pattern recognition by moment invariants. IRE Transactions of Information Theory. 1962. pp. 179–187.
  • Castleman KR. Digital Image Processing. New Jersey, Prentice Hall; 1996.
  • Peng H, Myers EW. Comparing in situ mRNA expression patterns of Drosophila embryos. San Diego, CA. ACM Journals; 2004.
  • Arnosti DN, Gray S, Barolo S, Zhou J, Levine M. The gap protein knirps mediates both quenching and direct repression in the Drosophila embryo. Embo J. 1996;15:3659–3666. [PMC free article] [PubMed]
  • La Rosee-Borggreve A, Hader T, Wainwright D, Sauer F, Jackle H. hairy stripe 7 element mediates activation and repression in response to different domains and levels of Kruppel in the Drosophila embryo. Mech Dev. 1999;89:133–140. doi: 10.1016/S0925-4773(99)00219-1. [PubMed] [Cross Ref]
  • Ashe HL, Levine M. Local inhibition and long-range enhancement of Dpp signal transduction by Sog. Nature. 1999;398:427–431. doi: 10.1038/18892. [PubMed] [Cross Ref]
  • Casares F, Sanchez-Herrero E. Regulation of the infraabdominal regions of the bithorax complex of Drosophila by gap genes. Development. 1995;121:1855–1866. [PubMed]
  • Goldstein RE, Jimenez G, Cook O, Gur D, Paroush Z. Huckebein repressor activity in Drosophila terminal patterning is mediated by Groucho. Development. 1999;126:3747–3755. [PubMed]
  • Grossniklaus U, Cadigan KM, Gehring WJ. Three maternal coordinate systems cooperate in the patterning of the Drosophila head. Development. 1994;120:3155–3171. [PubMed]
  • Gutjahr T, Frei E, Noll M. Complex regulation of early paired expression: initial activation by gap genes and pattern modulation by pair-rule genes. Development. 1993;117:609–623. [PubMed]
  • Hartmann C, Taubert H, Jackle H, Pankratz MJ. A two-step mode of stripe formation in the Drosophila blastoderm requires interactions among primary pair rule genes. Mech Dev. 1994;45:3–13. doi: 10.1016/0925-4773(94)90049-3. [PubMed] [Cross Ref]
  • Hulskamp M, Pfeifle C, Tautz D. A morphogenetic gradient of hunchback protein organizes the expression of the gap genes Kruppel and knirps in the early Drosophila embryo. Nature. 1990;346:577–580. doi: 10.1038/346577a0. [PubMed] [Cross Ref]
  • Hulskamp M, Tautz D. Gap genes and gradients - the logic behind the gaps. BioEssays. 1991;13:261–268. [PubMed]
  • Hulskamp M, Lukowitz W, Beermann A, Glaser G, Tautz D. Differential regulation of target genes by different alleles of the segmentation gene hunchback in Drosophila. Genetics. 1994;138:125–134. [PMC free article] [PubMed]
  • Gaul U, Jackle H. Role of gap genes in early Drosophila development. Adv Genet. 1990;27:239–275. [PubMed]
  • Gaul U, Jackle H. Pole region-dependent repression of the Drosophila gap gene kruppel by maternal gene products. Cell. 1987;51:549–555. doi: 10.1016/0092-8674(87)90124-3. [PubMed] [Cross Ref]
  • Royet J, Finkelstein R. Pattern formation in Drosophila head development: the role of the orthodenticle homeobox gene. Development. 1995;121:3561–3572. [PubMed]
  • Stathopoulos A, Levine M. Linear signaling in the Toll-Dorsal pathway of Drosophila: activated Pelle kinase specifies all threshold outputs of gene expression while the bHLH protein Twist specifies a subset. Development. 2002;129:3411–3419. [PubMed]
  • Brent AE, MacQueen A, Hazelrigg T. The Drosophila wispy gene is required for RNA localization and other microtubule-based events of meiosis and early embryogenesis. Genetics. 2000;154:1649–1662. [PMC free article] [PubMed]
  • Zhang H, Levine M. Groucho and dCtBP mediate separate pathways of transcriptional repression in the Drosophila embryo. Proc Natl Acad Sci U S A. 1999;96:535–540. doi: 10.1073/pnas.96.2.535. [PMC free article] [PubMed] [Cross Ref]
  • Teh C, Chin R. On Image Analysis by the Methods of Moments. IEEE Transactions on Patterns Analysis and Machine Intelligence. 1988;10:496–513. doi: 10.1109/34.3913. [Cross Ref]
  • Riechmann V, Irion U, Wilson R, Grosskortenhaus R, Leptin M. Control of cell fates and segmentation in the Drosophila mesoderm. Development. 1997;124:2915–2922. [PubMed]
  • Cadigan KM, Grossniklaus U, Gehring WJ. Localized expression of sloppy paired protein maintains the polarity of Drosophila parasegments. Genes Dev. 1994;8:899–913. [PubMed]
  • Bhat KM, van Beers EH, Bhat P. Sloppy paired acts as the downstream target of wingless in the Drosophila CNS and interaction between sloppy paired and gooseberry inhibits sloppy paired during neurogenesis. Development. 2000;127:655–665. [PubMed]
  • Sanchez L, Thieffry D. A logical analysis of the Drosophila gap-gene system. J Theor Biol. 2001;211:115–141. doi: 10.1006/jtbi.2001.2335. [PubMed] [Cross Ref]
  • Frasch M, Warrior R, Tugwood J, Levine M. Molecular analysis of even-skipped mutants in Drosophila development. Genes Dev. 1988;2:1824–1838. [PubMed]
  • Abbott MK, Kaufman TC. The relationship between the functional complexity and the molecular organization of the Antennapedia locus of Drosophila melanogaster. Genetics. 1986;114:919–942. [PMC free article] [PubMed]
  • Jayaraman K, Panchanathan S, Kumar S. Classification and indexing of gene expression images. Proceedings of Society of Photo-optical Instrumentation Engineers. 2001;4472:471–481.
  • Rosenfeld A, Kak AC. Digital Picture Processing. 2nd. New York, Academic Press; 1982.
  • Zhao C, York A, Yang F, Forsthoefel DJ, Dave V, Fu D, Zhang D, Corado MS, Small S, Seeger MA, Ma J. The activity of the Drosophila morphogenetic protein Bicoid is inhibited by a domain located outside its homeodomain. Development. 2002;129:1669–1680. [PubMed]
  • Gao Q, Finkelstein R. Targeting gene expression to the head: the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen. Development. 1998;125:4185–4193. [PubMed]
  • Wimmer EA, Cohen SM, Jackle H, Desplan C. buttonhead does not contribute to a combinatorial code proposed for Drosophila head development. Development. 1997;124:1509–1517. [PubMed]
  • Schulz C, Tautz D. Autonomous concentration-dependent activation and repression of Kruppel by hunchback in the Drosophila embryo. Development. 1994;120:3043–3049. [PubMed]
  • Tsai C, Gergen JP. Gap gene properties of the pair-rule gene runt during Drosophila segmentation. Development. 1994;120:1671–1683. [PubMed]
  • Janody F, Reischl J, Dostatni N. Persistence of Hunchback in the terminal region of the Drosophila blastoderm embryo impairs anterior development. Development. 2000;127:1573–1582. [PubMed]
  • Sauer F, Wassarman DA, Rubin GM, Tjian R. TAF(II)s mediate activation of transcription in the Drosophila embryo. Cell. 1996;87:1271–1284. doi: 10.1016/S0092-8674(00)81822-X. [PubMed] [Cross Ref]
  • Strunk B, Struffi P, Wright K, Pabst B, Thomas J, Qin L, Arnosti DN. Role of CtBP in transcriptional repression by the Drosophila giant protein. Dev Biol. 2001;239:229–240. doi: 10.1006/dbio.2001.0454. [PubMed] [Cross Ref]
  • Colas JF, Launay JM, Vonesch JL, Hickel P, Maroteaux L. Serotonin synchronises convergent extension of ectoderm with morphogenetic gastrulation movements in Drosophila. Mech Dev. 1999;87:77–91. doi: 10.1016/S0925-4773(99)00141-0. [PubMed] [Cross Ref]
  • Wu X, Vasisht V, Kosman D, Reinitz J, Small S. Thoracic patterning by the Drosophila gap gene hunchback. Dev Biol. 2001;237:79–92. doi: 10.1006/dbio.2001.0355. [PubMed] [Cross Ref]
  • Ghiglione C, Perrimon N, Perkins LA. Quantitative variations in the level of MAPK activity control patterning of the embryonic termini in Drosophila. Dev Biol. 1999;205:181–193. doi: 10.1006/dbio.1998.9102. [PubMed] [Cross Ref]
  • Pankratz MJ, Busch M, Hoch M, Seifert E, Jackle H. Spatial control of the gap gene knirps in the Drosophila embryo by posterior morphogen system. Science. 1992;255:986–989. [PubMed]
  • Melnick MB, Perkins LA, Lee M, Ambrosio L, Perrimon N. Developmental and molecular characterization of mutations in the Drosophila-raf serine/threonine protein kinase. Development. 1993;118:127–138. [PubMed]
  • Parkhurst SM, Lipshitz HD, Ish-Horowicz D. achaete-scute feminizing activities and Drosophila sex determination. Development. 1993;117:737–749. [PubMed]
  • Zhou A, Hassel BA, Silverman RH. Expression cloning of 2-5A-dependent RNAase: A uniquely regulated mediator of interferon action. Cell. 1993;72:753–765. doi: 10.1016/0092-8674(93)90403-D. [PubMed] [Cross Ref]
  • Niessing D, Dostatni N, Jackle H, Rivera-Pomar R. Sequence interval within the PEST motif of Bicoid is important for translational repression of caudal mRNA in the anterior region of the Drosophila embryo. Embo J. 1999;18:1966–1973. doi: 10.1093/emboj/18.7.1966. [PMC free article] [PubMed] [Cross Ref]
  • Yagi Y, Suzuki T, Hayashi S. Interaction between Drosophila EGF receptor and vnd determines three dorsoventral domains of the neuroectoderm. Development. 1998;125:3625–3633. [PubMed]
  • Cowden J, Levine M. The Snail repressor positions Notch signaling in the Drosophila embryo. Development. 2002;129:1785–1793. [PubMed]
  • Miskiewicz P, Morrissey D, Lan Y, Raj L, Kessler S, Fujioka M, Goto T, Weir M. Both the paired domain and homeodomain are required for in vivo function of Drosophila Paired. Development. 1996;122:2709–2718. [PubMed]
  • Schulz C, Tautz D. Zygotic caudal regulation by hunchback and its role in abdominal segment formation of the Drosophila embryo. Development. 1995;121:1023–1028. [PubMed]
  • Goff DJ, Nilson LA, Morisato D. Establishment of dorsal-ventral polarity of the Drosophila egg requires capicua action in ovarian follicle cells. Development. 2001;128:4553–4562. [PubMed]
  • Sackerson C, Fujioka M, Goto T. The even-skipped locus is contained in a 16-kb chromatin domain. Dev Biol. 1999;211:39–52. doi: 10.1006/dbio.1999.9301. [PubMed] [Cross Ref]
  • Rusch J, Levine M. Regulation of a dpp target gene in the Drosophila embryo. Development. 1997;124:303–311. [PubMed]
  • Steingrimsson E, Pignoni F, Liaw GJ, Lengyel JA. Dual role of the Drosophila pattern gene tailless in embryonic termini. Science. 1991;254:418–421. [PubMed]
  • Hamada F, Bienz M. A Drosophila APC tumour suppressor homologue functions in cellular adhesion. Nat Cell Biol. 2002;4:208–213. doi: 10.1038/ncb755. [PubMed] [Cross Ref]
  • Klinger M, Soong J, Butler B, Gergen JP. Disperse versus compact elements for the regulation of runt stripes in Drosophila. Developmental Biology. 1996;177:73–84. doi: 10.1006/dbio.1996.0146. [PubMed] [Cross Ref]
  • Bashirullah A, Halsell SR, Cooperstock RL, Kloc M, Karaiskakis A, Fisher WW, Fu W, Hamilton JK, Etkin LD, Lipshitz HD. Joint action of two RNA degradation pathways controls the timing of maternal transcript elimination at the midblastula transition in Drosophila melanogaster. Embo J. 1999;18:2610–2620. doi: 10.1093/emboj/18.9.2610. [PMC free article] [PubMed] [Cross Ref]
  • Verheyen EM, Mirkovic I, MacLean SJ, Langmann C, Andrews BC, MacKinnon C. The tissue polarity gene nemo carries out multiple roles in patterning during Drosophila development. Mech Dev. 2001;101:119–132. doi: 10.1016/S0925-4773(00)00574-8. [PubMed] [Cross Ref]
  • Wolff C, Schroder R, Schulz C, Tautz D, Klingler M. Regulation of the Tribolium homologues of caudal and hunchback in Drosophila: evidence for maternal gradient systems in a short germ embryo. Development. 1998;125:3645–3654. [PubMed]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central