Copyright © 2004 Gurunathan et al; licensee BioMed Central Ltd.

Identifying spatially similar gene expression patterns in early stage fruit fly embryo images: binary feature versus invariant moment digital representations

1 Center for Evolutionary Functional Genomics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287-5301, USA
2 Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287-8809, USA
3 School of Life Sciences, Arizona State University, Tempe, AZ 85287-4501, USA

Corresponding author.
Rajalakshmi Gurunathan: Rajalakshmi.Gurunathan/at/asu.edu; Bernard Van Emden: Bernard.VanEmden/at/asu.edu; Sethuraman Panchanathan: panch/at/asu.edu; Sudhir Kumar: s.kumar/at/asu.edu

Received April 30, 2004; Accepted December 16, 2004.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns.

Results

The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset.
For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches.

Conclusions

We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns.

Background

The complexity of animal body form arises from a single fertilized egg cell in an odyssey of gene expression and regulation that controls the multiplication and differentiation of cells [1-3]. For over two decades, Drosophila melanogaster (the fruit fly) has been a canonical model animal for understanding this developmental process in the laboratory. The raw data from experiments consist of photographs (two dimensional images) of the Drosophila embryo showing a particular gene expression pattern revealed by a gene-specific probe in wildtype and mutant backgrounds. Manual, visual comparison of these spatial gene expressions is usually carried out to identify overlaps in gene expression and to infer interactions [4-6]. Whole fruit fly embryo and other related gene expression patterns have been published in a wide variety of research journals since the late 1980s.
These efforts have now entered a high-throughput phase with the systematic determination of patterns of gene expression [e.g., [7]]. As a result, the amount of data currently available has doubled leading to the imminent availability of multiple expression patterns of every gene in the Drosophila genome [7]. In addition, the use of micro-array technology to study Drosophila development has revealed additional and important insights into changes in gene expression levels over time and under different conditions at a genomic scale [8,9]. With this rapid increase in the amount of available primary gene expression images, searchable textual descriptions of images have become available [7,10,11]. However, a direct comparison of the gene expression patterns depicted in the images is also desirable to find biologically similar expression patterns, because textual descriptions (even using a highly structured and controlled vocabulary) cannot fully capture all aspects of an expression pattern. In fact, there is a need for automated identification of images containing overlapping or similar gene expression patterns [6,12] in order to assist researchers in the evaluation of similarity between a given expression pattern and all other existing (comparable) patterns in the same way that the BLAST [13] technique functions for DNA and protein sequences. Of course, unlike the genomes with four letters and proteomes with 20 letters, all gene expression anatomies cannot be easily reduced to, and thus represented by, a small number of components. We previously proposed a binary coded bit stream pattern to represent gene expression pattern images [6]. In this digital representation, referred to as the Binary Feature Vector (BFV; BSV in [6]), the unstained pixels in the images (white regions and background) were denoted by a value of 0 and the stained areas (colored and foreground: gene expression) were denoted by a value of 1. 
Based on the BFV representations of the expression pattern, we proposed a Basic Expression Search Tool for Images (BESTi) [6] with an aim to produce biologically significant gene expression pattern matches using image content alone, without any reference to textual descriptions. We found that the BESTi approach generated biologically meaningful matches to query expression patterns [6]. In this paper, we explore how a more sophisticated Invariant Moment Vectors (IMV, [14]) based digital representation of gene expression patterns performs in generating an ordered list of best-matching images that contain similar/overlapping gene expression patterns to that depicted in a query image. IMV are frequently used in natural image processing (e.g., optical character recognition [15]) and have a number of desirable properties, including the compensation for variations of scale, translation, and rotation. If successful, IMV representations hold the promise of producing significantly shorter computing times for image-to-image matching compared to BFV. Previously, we had examined the performance of the BFV representation for a limited dataset of early stage images [6]. Here we compare the relative performances of BFV and IMV first using a dataset containing 226 images (from 13 research papers). Then we test for scalability of the BESTi search by using a seven times larger dataset containing 1819 (1593 new + 226 previous) images from 262 additional research papers (list available upon request from the authors). Both datasets contained lateral views of early stage (1–8) embryos. During these investigations, we also developed another measure of image-to-image similarity for the BFV representation. This measure is aimed at finding images that contain as much of the query image expression pattern as possible, but without penalizing for the presence of any expression outside the overlap region in the target image. 
In addition, we examined whether partitioning a multi-domain expression pattern into multiple BFV representations, each containing only one domain, yields a better result set.

Recently, Peng and Myers [16] have proposed a different procedure involving the global and local Gaussian Mixture Model (GMM) of the pixel intensities (of expression) to identify images with similar patterns. This GMM method is expected to find images with intensity and spatial similarities. This is different from the BFV and IMV methods examined here, which are intended to find only spatially similar patterns. This focus is important because, as mentioned in [6], the differences in gene expression intensity among images in published literature can arise simply due to the use of different techniques, illumination conditions, or biological reasons. However, the method of Peng and Myers [16] appears to be promising and we plan to examine its effectiveness in a separate paper.

Results and discussion

Data set generation

An image database of 226 gene expression pattern images was initially generated using data from the literature [17-29]. All were lateral images and exhibited early stage (1–8) expression patterns. These images were selected because they had some commonality of gene expression (as seen by the human eye), which allowed us to evaluate the performance of BESTi in finding correct as well as false matches under controlled conditions. BESTi was also tested for scalability on a larger dataset containing 1819 (1593 plus the 226) lateral views of early stage embryos. These 1593 images were obtained from 262 articles. In order to present comprehensible result sets in this paper, we have primarily discussed the findings from the dataset of 226 and provided information on how those queries scaled when they were conducted for the larger dataset.
In general, our focus was to show the retrieval of biologically significant matches based on both the visual overlap of the spatial gene expression pattern and the genes associated with the pattern retrieved. Each image was standardized and the binary expression pattern extracted following the procedures described previously [6]. These extracted patterns, their invariant moments (φ1 through φ7), and binary feature representations were stored in a database. We also calculated and stored the expression area (the count of the number of 1's in the binary feature represented image), the X and Y coordinates of the centroid (x̄, ȳ), and the principal angle (θ) for each extracted pattern.

To quantify the similarity of gene expressions in two images, we computed two measures (SS, SC) based on the BFV representation (see equations 2 and 3 in Methods). SS is designed to find gene expression patterns with overall similarity to the query image, whereas SC is for finding images that contain as much of the query image expression pattern as possible without penalizing for the presence of any expression outside the overlap region in the target image. For a given pair of gene expression patterns (A and B), SS is the same irrespective of which image in the pair is the query image. That is, SS(A,B) = SS(B,A). This is not so for SC, because SC measures how much of the query gene expression pattern is contained in the target image. Therefore, in general, SC(A,B) ≠ SC(B,A). For the IMV representation, we computed one dissimilarity measure (Dφ, equation 13 in Methods). Results from Dφ should be compared to those from SS, because neither measure depends on which image is the query, i.e., Dφ(A,B) = Dφ(B,A), and both capture overall similarity or dissimilarity.

Matches and their biological significance

Our evaluation of the effectiveness of BESTi in finding biologically similar expression patterns was geared towards determining the biological validity of the results obtained from the image matching procedure.
All results were based solely on quantitative similarities between images without using any textual descriptions. All images were lateral views from the early stages of fruit fly embryogenesis and were oriented anterior end to the left and dorsal to the top. We refer to the images retrieved as the BESTi-matches.

Performance of BFV-SS search

Figure 1A [...]

A search was conducted using the same query image and the same similarity measure (SS) on the larger dataset. Figure 2 [...]

Performance of IMV search

We used the same query image for the IMV method applied to the smaller dataset (Dφ, results in Figure 1B) [...] retrieves expression patterns that are of similar shape and/or size, regardless of the translation or rotation with respect to the query image.

A comparison of the results from the smaller and larger dataset for the IMV measure is given in Figure 3.
Since both SS and Dφ measures capture the overall similarity or dissimilarity, we can use Figures 2 [...]

In addition to the Hu moments, one could also compute Zernike moments, which are based on the polar coordinate system. Both Hu moments and Zernike moments are susceptible to the same problem, namely that expression patterns showing a similar shape but translated to different locations in the embryo would appear in the same result set. We chose to study the Hu Invariant Moment Vectors mainly because the centroid of the image can be used to distinguish between similarly shaped but translated expression patterns. With Zernike moments, the image must be inherently contained within a unit circle anchored at the centroid [34]. Thus, there is no straightforward method to eliminate the translational problem. Using the Hu moments, the spatial location problem can be corrected by considering the Euclidean difference in the centroid location, expressed in pixels (ΔCXY), of the query and results. In the case of BFV-SS search results in Figure 1 (A1–A8) [...]

Performance of BFV-SC search

Figure 1C [...]

Using the same minimum similarity value for the BFV-SC search in the larger dataset resulted in 55 images, given in Figure 4.
Analysis of multi-domain gene expression patterns

Due to the presence of multiple areas of expression, some patterns in the database that appeared to contain much better matches (by eye and biologically) to the query image were either not found or not ranked very high. Hence, we also analyzed multi-domain expression patterns separately for the smaller dataset. Developmental biologists are also interested in finding such patterns as they contain overlaps with the expression domains in the query image. In fact, a large number of the expression patterns available today contain multiple isolated domains of expression, since more than one topologically distinct region of expression may be produced by many genes, transgenic constructs, probes or experimental techniques (multiple staining). In such cases, we need to consider each of these regions individually as well as in the context of the composite pattern. Biologically, it is important to consider them separately because different regions of expression may be under the control of distinct cis-regulatory sequences [e.g., [28,38]] or may represent the expression of different genes in a multiply-stained embryo.

Separating multi-domain gene expression patterns into individual components was straightforward; we simply generated multiple images from the same initial image and included them in the target dataset. This resulted in 192 additional images (418 total) in the database, all of which were components of the initial gene expression patterns. The images were separated into expression regions horizontally and/or vertically depending on the gene expression. For this new set of images, the IMV as well as BFV representations were re-calculated and the BESTi query constructed as above. Results from BFV-SS and IMV queries for this data set are given in Figures 1D [...]

Next we examine the performance of SS, SC and Dφ in finding BESTi matches for a query pattern with multiple regions of expression (Figure 5A).
When Dφ is used as a search criterion, it produces some correct matches in the result set (Figure 5B1–B8) [...] The results in Figures 5D [...]

Conclusions

We have found that it is possible to identify biologically significant gene expression patterns from a dataset by first extracting numeric signature descriptors and then using those descriptors in a computerized search of the database for expression patterns with similar signatures or maximum pattern similarities. We find that the BFV methodologies provide a longer and more biologically meaningful set of expression pattern matches than IMV. Even though IMV representations will produce much faster retrieval speeds for large collections of embryogenesis images, the lack of biological validity of the BESTi-matches retrieved makes IMV undesirable for the present problem. Instead, investigations and strategies aimed at improving the real-time performance of the BFV representation will better serve developmental biology research.

Methods

The wide variety of input methodologies, illumination conditions, equipment, and publication venues involved in the acquisition and presentation of gene expression patterns makes the available gene expression pattern data rather diverse. Extracting a gene expression pattern from its background requires the use of a combination of manual and automatic techniques. Each image is first standardized into a binary image as described in [6]. The standardized images are then represented using the Binary Feature Vector (BFV) [6] and the Invariant Moment Vectors (IMV) [14]. The similarity measures SS and SC are derived from the BFV: SS is the complement (1 - DE) of the distance metric DE presented in [6], and SC is a new measure introduced in this paper.
The third metric, Dφ, is derived from the invariant moment vectors.

Binary Sequence Vector analysis

The binary coded bit stream pattern, in which the two possible states indicate staining over or under a threshold value, is called the Binary Feature Vector (BFV). This is referred to as the Binary Sequence Vector (BSV) in [6]. In other words, we represent each image as a sequence of 1's and 0's, where the black pixels (stained areas) are denoted by a value of 1 and the white pixels (unstained and background) are denoted by a value of 0. This BFV holds the gene expression and localization pattern information of each image. The expression patterns are ordered by evaluating a set of difference values, DE, between the binary feature vectors of every possible pair of images in the dataset. DE was introduced in [6] and is formally given as

DE = Count(A XOR B) / Count(A OR B)    (1)

The term Count(A XOR B) corresponds to the number of pixels not spatially common to the two images, and the term Count(A OR B) provides the normalizing factor, as it refers to the total number of stained pixels (expression area) depicted in either of the two images being compared. For simplicity, we use the complement of DE as a measure of similarity of gene expression patterns between two images. This measure, SS, is given by the equation

SS = 1 - DE    (2)

SS quantifies the amount of similarity based on the overlap between two expression patterns. SS is equal to 1 when the two expression patterns are identical (DE = 0). We introduce a new similarity measure in this paper that does not penalize for any non-overlapping region. The measure SC quantifies the amount of similarity based on the containment of one expression pattern in the other, and is given by

SC = Count(A AND B) / Count(A)    (3)

where A is the query image. If the entire query expression pattern is contained within a target image found in the database, i.e., there is complete overlap with respect to the query image, SC is equal to 1.
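Equations 1 to 3 can be illustrated with a minimal NumPy sketch. It assumes each standardized image is already a 0/1 array of identical size; the function names (`bfv`, `difference_de`, `similarity_ss`, `similarity_sc`) and the handling of empty patterns are illustrative assumptions, not part of the BESTi implementation.

```python
import numpy as np

def bfv(image):
    """Flatten a standardized 0/1 image into a boolean feature vector (True = stained)."""
    return np.asarray(image, dtype=bool).ravel()

def difference_de(a, b):
    """Equation 1: Count(A XOR B) / Count(A OR B)."""
    union = np.count_nonzero(a | b)
    if union == 0:
        # Assumption: two patterns with no expression at all are treated as identical.
        return 0.0
    return np.count_nonzero(a ^ b) / union

def similarity_ss(a, b):
    """Equation 2: SS = 1 - DE (symmetric in a and b)."""
    return 1.0 - difference_de(a, b)

def similarity_sc(query, target):
    """Equation 3: Count(A AND B) / Count(A), where A is the query pattern."""
    area = np.count_nonzero(query)
    if area == 0:
        raise ValueError("query pattern contains no stained pixels")
    return np.count_nonzero(query & target) / area

# Toy example: a 4-pixel query fully contained in an 8-pixel target.
query = bfv([[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]])
target = bfv([[0, 0, 0, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])

print(similarity_ss(query, target))   # overall similarity
print(similarity_sc(query, target))   # containment of the query in the target
print(similarity_sc(target, query))   # containment of the target in the query
```

For this toy pair, SS is 0.5 in either direction, while SC is 1.0 with the small pattern as the query and 0.5 with the roles reversed, which shows why SC depends on the choice of query image.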
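Note that, in general, SC(A,B) ≠ SC(B,A), because the denominator corresponds to the gene expression area of the query image.

Invariant Moment Vector (IMV) analysis

Some methodologies of image analysis produce numeric descriptors that compensate for variations of scale, translation and rotation. In the following section, we describe the invariant moment analysis of gene expression data. Invariant moment calculations have been used in optical character recognition and other applications for many years [15]. To calculate these invariant moment descriptors, the standardized binary image [6] is converted to a binary representation of the same pattern (BFV). From this binary sequence of the image, the invariant moments and other descriptors are extracted using the following method [14,41]. The continuous scale equation used is

$$M_{pq} = \int\!\!\int x^{p} y^{q} f(x, y)\,dx\,dy, \qquad (4)$$

where Mpq is the two-dimensional moment of the function of the gene expression pattern, f(x, y). The order of the moment is defined as (p + q), where both p and q are non-negative integers. When implemented in a digital or discrete form this equation becomes

$$M_{pq} = \sum_{x} \sum_{y} x^{p} y^{q} f(x, y). \qquad (5)$$

We then normalize for image translation using x̄ and ȳ, which are the coordinates of the center of gravity (centroid) of the area showing expression. They are calculated as

$$\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}. \qquad (6)$$

Discrete representations of the central moments are then defined as follows:

$$\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y). \qquad (7)$$

A further normalization for variations in scale can be implemented using the formula

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad (8)$$

where $\gamma = (p + q)/2 + 1$ is the normalization factor. From the normalized central moments, the following values are calculated:

$$
\begin{aligned}
\phi_1 &= \eta_{20} + \eta_{02} \\
\phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
\phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
\phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
\phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
\phi_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
\phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
\qquad (9)
$$

where φ7 is a skew invariant to distinguish mirror images. In the above, φ1 and φ2 are second order moments and φ3 through φ7 are third order moments. φ1 (the sum of the second order moments) may be thought of as the "spread" of the gene expression pattern, whereas the square root of φ2 (the difference of the second order moments) may be interpreted as the "slenderness" of the pattern.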
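The moment calculation described above can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard Hu definitions from [14] (raw moments, centroid, central moments, scale-normalized moments, then the seven invariants); the function names are hypothetical and the code is not taken from the authors' software.

```python
import numpy as np

def raw_moment(f, p, q):
    """Discrete two-dimensional moment M_pq of a binary pattern f(x, y)."""
    f = np.asarray(f, dtype=float)
    y, x = np.mgrid[0:f.shape[0], 0:f.shape[1]]  # rows are y, columns are x
    return float(np.sum((x ** p) * (y ** q) * f))

def centroid(f):
    """Centroid (x_bar, y_bar) of the area showing expression."""
    m00 = raw_moment(f, 0, 0)
    return raw_moment(f, 1, 0) / m00, raw_moment(f, 0, 1) / m00

def central_moment(f, p, q):
    """Central moment mu_pq, which removes the effect of translation."""
    f = np.asarray(f, dtype=float)
    x_bar, y_bar = centroid(f)
    y, x = np.mgrid[0:f.shape[0], 0:f.shape[1]]
    return float(np.sum(((x - x_bar) ** p) * ((y - y_bar) ** q) * f))

def normalized_moment(f, p, q):
    """Scale-normalized central moment eta_pq with gamma = (p + q) / 2 + 1."""
    gamma = (p + q) / 2 + 1
    return central_moment(f, p, q) / central_moment(f, 0, 0) ** gamma

def hu_moments(f):
    """Return the seven invariant moments phi_1..phi_7 as a NumPy array."""
    n20, n02, n11 = (normalized_moment(f, *pq) for pq in [(2, 0), (0, 2), (1, 1)])
    n30, n03, n21, n12 = (normalized_moment(f, *pq)
                          for pq in [(3, 0), (0, 3), (2, 1), (1, 2)])
    a, b = n30 + n12, n21 + n03          # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03  # recurring differences
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    phi3 = c ** 2 + d ** 2
    phi4 = a ** 2 + b ** 2
    phi5 = c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2)
    phi6 = (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b
    phi7 = d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2)
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])

# Toy L-shaped pattern and a translated copy of it.
pattern = np.zeros((20, 20))
pattern[2:5, 3:9] = 1
pattern[5:9, 3:5] = 1
shifted = np.roll(np.roll(pattern, 7, axis=0), 6, axis=1)

print(np.allclose(hu_moments(pattern), hu_moments(shifted)))
```

Because the invariants ignore translation, the pattern and its shifted copy yield the same vector. This is the property that makes the centroid necessary as an additional discriminator when location in the embryo matters.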
Moments φ3 through φ7 do not have any direct physical meaning, but include the spatial frequencies and ranges of the image.

In order to provide a discriminator for image inversion (and rotation), sometimes called the "6", "9" problem, it has been suggested [14,42] that the principal angle be used to determine "which way is up". This is extremely important in embryo images because gene expression at the anterior and posterior regions may simply appear to be mirror images of each other to the invariant moments, but biologically they are completely distinct. The principal axis of the gene expression pattern f(x, y) is the line of minimum rotational inertia that passes through the centroid (x̄, ȳ). The moment of inertia of f around a line through (x̄, ȳ) at angle θ is given as:

$$I(\theta) = \sum_{x} \sum_{y} \left[(x - \bar{x})\sin\theta - (y - \bar{y})\cos\theta\right]^{2} f(x, y). \qquad (10)$$

The angle of the principal axis is called the principal angle θ. We can find the θ value at which the moment of inertia is minimum by differentiating this equation with respect to θ and setting the result equal to zero. This produces the following equation:

$$\tan 2\theta = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}}. \qquad (11)$$

Using the condition |θ| < 45° one can distinguish the "6" from the "9" and rotationally similar gene expression patterns.

In invariant moment analysis, our initial method of image comparison calculates the Euclidean distance between the images using all moments (φ1 through φ7) and combinations of these moments. For example, if the first two invariant moments are used, then $X_1 = \phi_1$ and $X_2 = \sqrt{\phi_2}$, and the distance Dij between a pair of images i and j, where i, j = 1, 2, ..., n, is given by

$$D_{ij} = \sqrt{(X_{1i} - X_{1j})^{2} + (X_{2i} - X_{2j})^{2}}. \qquad (12)$$

This can be expanded to use all of the moment variables. Here, the Euclidean distance, Dφ, between any two images is calculated as

$$D_{\phi} = \sqrt{\sum_{j=1}^{7} (\phi_{ij} - \phi_{qj})^{2}}, \qquad (13)$$

where i and q designate the images whose distance is being calculated, j designates the parameters used in the distance calculation, and j = 1, 2, ..., 7. This assumes that all moments have the same dimensions or that they are dimensionless.
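A ranked search based on the Euclidean distance between invariant moment vectors might look like the sketch below. It assumes each image is already summarized by a seven-element vector of moments and, optionally, by a centroid in pixels; the function names and the `max_centroid_shift` cutoff are hypothetical illustrations of the list-limiting idea, not values or code from the study.

```python
import numpy as np

def imv_distance(phi_i, phi_q):
    """Euclidean distance between two invariant moment vectors (symmetric)."""
    diff = np.asarray(phi_i, dtype=float) - np.asarray(phi_q, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def rank_images(query_phi, database_phi, query_centroid=None,
                database_centroids=None, max_centroid_shift=None):
    """Return (index, distance) pairs sorted from most to least similar.

    If max_centroid_shift is given (hypothetical cutoff, in pixels), images
    whose centroid lies farther than that from the query centroid are dropped
    before ranking.
    """
    ranked = []
    for index, phi in enumerate(database_phi):
        if max_centroid_shift is not None:
            dx = database_centroids[index][0] - query_centroid[0]
            dy = database_centroids[index][1] - query_centroid[1]
            if np.hypot(dx, dy) > max_centroid_shift:
                continue  # similar shape, but at a different embryo location
        ranked.append((index, imv_distance(phi, query_phi)))
    return sorted(ranked, key=lambda pair: pair[1])

# Made-up moment vectors for three database images and one query.
query = [0.2, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
database = [
    [0.2, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0],
]
centroids = [(10, 10), (12, 11), (50, 10)]

print([i for i, _ in rank_images(query, database)])
print([i for i, _ in rank_images(query, database, (10, 10), centroids, 5)])
```

Without the centroid filter, the images are ordered purely by shape similarity; with it, the third image is excluded even though its moments are close to the query, because its pattern sits elsewhere in the embryo.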
Using this method, it is possible to rank each of the images in order of their similarity based on, for example, the first two invariant moments that have clear-cut physical meanings. Expansion to include additional moments or parameters can be performed in a number of ways. It is possible to add additional parameters to the distance calculation, making sure that each of the parameters has the same dimension. For example, φ1 has the dimension of distance squared, while φ2 has the dimension of the fourth power of distance, thus requiring the square root function to equalize dimensions for comparable distance calculation purposes. In general, the greater the number of invariant moments used in the distance calculation, the more selective the ranking. We have also allowed for the use of the centroids and principal angle as a means of list limiting.

Authors' contributions

SK originally conceived the project, developed the image distance measures based on the BFV representation, wrote an early version of the manuscript, and edited it until the final version. RG was responsible for writing new and using pre-existing programs to perform the image distance and parameter calculations, helped prepare the figures, searched the literature for gene expression data, maintained the database of gene expression pattern images, and helped in writing the manuscript. BVE provided the IMV method description, managed the day-to-day activities in the project, and did significant editing to produce the manuscript in the desired format for the journal. SP originally proposed the use of invariant moment vectors for biological image analysis, contributed significantly to the image distance and parameter calculations, and provided critical feedback during the later stages of revision.

Acknowledgements

We thank Dr. Robert Wisotzkey for biological remarks, Dr. Dana Desonie for editorial comments and Dr. Stuart Newfeld for useful suggestions.
This research was supported in part by research grants from the National Institutes of Health (S.K.) and the Center for Evolutionary Functional Genomics (S.K.) at Arizona State University.

References