![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2007 Tomancak et al.; licensee BioMed Central Ltd. Global analysis of patterns of gene expression during Drosophila embryogenesis 1Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA 2Howard Hughes Medical Institute, Cyclotron Road, Berkeley, CA 94720, USA 3Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr., Dresden, D-01307, Germany 4Department of Preventive Medicine, Keck School of Medicine of USC, Eastlake Ave, Los Angeles, CA 90033, USA 5Lawrence Berkeley National Laboratory, Cyclotron Road, Berkeley, CA 94720 6Department of Molecular Cell and Developmental Biology, University of California Los Angeles, Los Angeles, CA 90095, USA 7Janelia Farm Research Campus, HHMI, Helix Drive, Ashburn, VA 20147, USA Corresponding author.#Contributed equally. Pavel Tomancak: tomancak/at/mpi-cbg.de; Benjamin P Berman: bberman/at/usc.edu; Amy Beaton: beaton/at/fruitfly.org; Richard Weiszmann: RWeiszmann/at/lbl.gov; Elaine Kwan: ekwan/at/berkeley.edu; Volker Hartenstein: volkerh/at/mcdb.ucla.edu; Susan E Celniker: celniker/at/bdgp.lbl.gov; Gerald M Rubin: rubing/at/janelia.hhmi.org Received March 8, 2007; Revised June 5, 2007; Accepted July 23, 2007. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background Cell and tissue specific gene expression is a defining feature of embryonic development in multi-cellular organisms. However, the range of gene expression patterns, the extent of the correlation of expression with function, and the classes of genes whose spatial expression are tightly regulated have been unclear due to the lack of an unbiased, genome-wide survey of gene expression patterns. Results We determined and documented embryonic expression patterns for 6,003 (44%) of the 13,659 protein-coding genes identified in the Drosophila melanogaster genome with over 70,000 images and controlled vocabulary annotations. Individual expression patterns are extraordinarily diverse, but by supplementing qualitative in situ hybridization data with quantitative microarray time-course data using a hybrid clustering strategy, we identify groups of genes with similar expression. Of 4,496 genes with detectable expression in the embryo, 2,549 (57%) fall into 10 clusters representing broad expression patterns. The remaining 1,947 (43%) genes fall into 29 clusters representing restricted expression, 20% patterned as early as blastoderm, with the majority restricted to differentiated cell types, such as epithelia, nervous system, or muscle. We investigate the relationship between expression clusters and known molecular and cellular-physiological functions. Conclusion Nearly 60% of the genes with detectable expression exhibit broad patterns reflecting quantitative rather than qualitative differences between tissues. The other 40% show tissue-restricted expression; the expression patterns of over 1,500 of these genes are documented here for the first time. Within each of these categories, we identified clusters of genes associated with particular cellular and developmental functions. Background A defining feature of multi-cellular organisms is their ability to differentially utilize the information contained in their genomes to generate morphologically and functionally specialized cell types during development. Regulation of gene expression in time and space is a major driving force of this process. A gene's expression pattern can be defined as a series of differential accumulations of its products in subsets of cells as development progresses. Patterns of mRNA expression are studied by two principal methods - microarray analysis [1] and in situ hybridization [2,3]. Microarray analysis provides both a quantitative measure of gene expression and an overview of the temporal dynamics of gene expression regulation [4]. A major limitation of microarray analysis is that obtaining spatial information depends on the dissection or cell-sorting of specific tissues or cell types [5,6]. RNA in situ hybridization has the potential to reveal both spatial and temporal aspects of gene expression during development. However, RNA in situ hybridization is not quantitative [7]. For these reasons, we have used both methods in parallel and integrated the analysis of the resultant datasets. There are several reasons for choosing Drosophila melanogaster as an organism for the global study of gene expression during embryonic development. Genetic and molecular analyses have led to a deep understanding of many embryonic processes in this animal [8]. Classical embryology has provided a solid framework for the anatomical description of embryonic stages [9] and robust high-throughput methods for assaying gene expression by whole mount in situ hybridization have been developed [10-12]. In many cases, the wild-type gene expression pattern has informed the interpretation of the phenotype produced by its mutation [13]. Such studies have provided unprecedented insights into animal development; the process that governs the early embryonic patterning of the Drosophila body plan is now the best understood example of a complex cascade of transcriptional regulation during development [14,15]. We have assembled an atlas of gene expression patterns during Drosophila embryogenesis. Taking advantage of non-redundant gene collections [16,17], we performed an unbiased survey of gene expression by using RNA in situ hybridization of gene specific probes to fixed Drosophila embryos [12] and documented the patterns with a set of digital photographs. We describe the tissue specificity of gene expression at each stage range using selected terms from a controlled vocabulary (CV) for embryo anatomy [18]. The CV integrates the spatial and temporal dimensions of the gene expression patterns by linking together intermediate tissues that develop from one another. It also integrates morphological and molecular description of development by allowing for structures that are morphologically indistinguishable and can be defined only on the basis of gene expression. We show that the genes sampled, representing 44% of the Drosophila genes, are largely representative of the genome as a whole, allowing the global analysis of gene expression during the embryonic development of a multicellular organism. We organized the complex gene expression space by a hybrid fuzzy-clustering approach that uses microarray profiles to supplement the CV annotation of in situ patterns. We divided the resulting clusters into two categories, broad and restricted. Broad patterns are characterized by quantitative enrichment in tissues that are related by specific cellular states. Restricted patterns are highly diverse and provide a basis for defining gene sets expressed in related tissues and with related predicted functions. Results and discussion Annotation dataset The starting point for our analyses is a collection of 6,003 genes whose embryonic expression patterns we have assayed by in situ hybridization and systematically annotated with CVs (Release 2.0). The number of genes in the dataset has more than doubled from Release 1 [12], from 2,179 to 6,003, and the accuracy of the annotation has been significantly enhanced by performing a full re-evaluation of every gene by a second, independent curator (Materials and methods; Additional data file 1). Release 2.0, including 74,833 staged embryo images and accompanying CV annotations and microarray data, is publicly available via a searchable database [19], providing a convenient way to mine the dataset for particular expression patterns. To determine how representative our sample is, we compared the distribution of selected Gene Ontology (GO) functional annotations (generic GO slim [20]) between the 6,003 genes in our subset and the 14,586 genes in the Release 4.3 genome (Additional data file 2). No major biases for a specific molecular function, component or process were detected. Our dataset is slightly enriched for genes with known or inferred GO functions, and is, therefore, slightly deficient for genes with unknown assignment. Genes in this category lack conserved sequence features that would relate them to genes in other organisms, and may be expressed at very low levels, leading to a relative under-representation in expressed sequence tag (EST) collections. We conclude that our dataset contains a largely representative sample of gene expression patterns in the Drosophila genome. To annotate gene expression patterns, we used a set of 314 anatomical terms selected from the broad Drosophila Controlled Vocabulary for Anatomy maintained by FlyBase [18]. We grouped developmental structures into 16 color-coded organ systems, and reduced the full 314-term CV to 145 terms by collapsing rarely used or difficult to distinguish sub-terms to their corresponding parent term (Materials and methods; Additional data files 3-5). In order to compare the gene expression properties for a set of related genes, we created a representation of the hierarchical CV that fits on a single line, which we call an 'anatomical signature', or 'anatogram'. Figure Figure11
Organization of gene expression data using a hybrid clustering approach Of the 6,003 genes annotated, 4,759 (79%) showed detectable expression in the embryo, while the remaining 1,244 (21%) were annotated with only the 'No staining' CV term. By grouping genes with identical annotations, the 4,759 genes with detectable expression in the embryo were subdivided into 205 multi-gene groups and 2,335 'singleton' groups (that is, groups consisting of a single uniquely annotated gene). By relaxing the criteria and grouping genes that had at least 75% of their annotation terms in common, we identified 393 multi-gene groups and 1,804 singletons. If we consider each of the multi-gene groups and each of the singleton groups to represent a distinct expression pattern, this method suggests that there are up to 2,197 distinct patterns within our dataset (Additional data file 6). To further refine the number of expression categories, we developed a clustering strategy that allowed us to incorporate the quantitative temporal expression data obtained from the microarray experiments together with the qualitative, but spatially rich, data on expression patterns from the CV annotations. We implemented this approach within the framework of fuzzy c-means clustering [22,23] and developed a similarity metric that assigns different weights to the contribution of the microarray and annotation data (Materials and methods). Our goal was to find a proper balance between the contributions of annotation similarity versus microarray similarity to the overall similarity score. We desired a score that would minimize the contribution of microarray similarity for cases like those genes in Figure Figure2a,2a
The fuzzy c-means algorithm is fuzzy in that each gene is assigned to one or more clusters [24]. As multiple independent regulatory elements can drive the expression of a single gene in different tissues or at different times in development, this is a desirable property for this particular clustering problem. However, despite extensive experimentation with different clustering parameters, the large diversity of expression patterns led to clusters with ambiguous boundaries. Replication experiments using random initialization variables [25] resulted in clusters that were qualitatively similar but with numerous genes redistributed between neighboring clusters [26]. Therefore, each gene was assigned a score for each cluster, and this score was used to rank the most prototypical members of the cluster first and the most ambiguous ones last, and genes with high scores in multiple independent clusters were assigned to each cluster. This scoring allowed us to define a cutoff and determine the set of 'core' genes belonging most unambiguously to one and only one cluster (Materials and methods). Of 4,759 genes expressed in the embryo, we had microarray expression data for 4,496. The best fuzzy c-means run grouped these genes into 39 clusters, and each cluster was designated as either 'broad' or 'restricted'. Clusters containing a significant fraction of genes annotated as 'ubiquitous' were designated as broad, as were clusters containing primarily genes with unrestricted maternal only expression (Materials and methods). We also decided to include as broad those clusters of genes exhibiting maternal expression early and midgut-only expression late. Many genes annotated in this way (Figure (Figure2c)2c
Broadly expressed genes The ten clusters encompassing broadly expressed genes have relatively similar array profiles, but the diversity of annotations makes the boundaries between these clusters somewhat arbitrary (Figure (Figure3).3
Cluster 1B is one of the several broad clusters characterized by peak microarray expression around hours 4-5 (stage 10; Figure Figure4a).4a Genes in cluster 2B and many in cluster 3B are characterized by peak expression levels around hour 12 (stage 15) and by in situ hybridization appear strongest in the differentiated midgut, muscle, hindgut, and foregut (Figure 4b,c Clusters 4B and 5B are characterized by peak expression levels around hours 4-5 (stage 10) and often resolve to exhibit staining in the differentiated nervous system and midgut (Figure 4d,e Genes in cluster 6B are almost uniformly annotated as ubiquitous at all stages of embryogenesis and this annotation is supported by relatively high average array expression levels at all time points (Figure (Figure4f).4f The genes in clusters 1B-6B exhibit remarkably similar expression patterns during gastrulation and were most frequently annotated as endoderm and mesoderm anlagen (Figure (Figure4,4 Clusters 7B-10B are composed of genes with maternally deposited transcripts that diminish after stage 7 (Figure 4g,h The genes and expression patterns in broad clusters have largely failed to attract the attention of developmental biologists, as indicated by the fact that the embryonic expression of only 4.3% of them have been described in the scientific literature [18]. Yet, they represent more than half of the genes expressed in embryogenesis. Our analysis of broad patterns provides a comprehensive and unbiased overview of these neglected genes and redefines the definition of ubiquitous gene expression during development. A major lesson learned from our in situ screen is that a CV annotation strategy is insufficient to describe these patterns fully. Restricted expression patterns While the diversity of expression patterns was considerable, our hybrid clustering approach identified a number of tissue or domain specific expression patterns shared among a significant number of genes. While these clusters are more easily categorized than the broad clusters, there is still considerable ambiguity between clusters (Figure (Figure55
Clusters 1R-4R contain 383 genes expressed in various combinations of the yolk nuclei, fat body and blood related tissues (Figure 6a-c
Clusters 5R-7R contain 1,160 genes expressed late in embryogenesis (stage range 13-16) in a number of epithelial structures (Figure 6d-f Clusters 13R-16R contain 525 genes expressed specifically in the central and peripheral nervous system (Figure 6g-j Clusters 18R and 19R contain 229 genes expressed in either differentiated somatic muscle (Figure (Figure6k)6k Clusters 23R-29R contain 422 genes expressed in a domain-specific manner beginning in the blastoderm stage embryo and typically continuing in a tissue-specific manner throughout embryogenesis (Figure 6m-p An additional eight clusters contain 349 genes with late tissue-specific expression (Additional data file 9a-h). Some of these contain genes expressed throughout development in a single tissue, like the cluster of genes expressed in pole and germ-cell (Additional data file 9h), while others, like the cluster of midgut-specific genes (Additional data file 9b), are primarily expressed in a particular tissue at a particular time. Despite the significant number of genes that conform well to the patterns represented by the above clusters, a large fraction is expressed in unique combinations of tissues or organs. Fuzzy clustering assigned these genes to the set of clusters that best described their expression patterns. Of the 1,947 genes expressed in a restricted manner, 795 (41%) are assigned to more than one cluster (Table 1). We illustrate this by showing several examples of genes assigned to multiple clusters (Figure (Figure7).7
Can we estimate the number of distinct expression patterns in Drosophila embryogenesis? When we use a relatively conservative measure, requiring that genes need to share 75% or more of their annotation terms to be considered 'indistinguishable', we identify 173 multi-gene groups and 1,141 singletons among the genes in our restricted clusters. Thus, by removing the broad genes, which are prone to inconsistent annotation, the number of groups within our dataset based on this measure drops from 2,197 to 1,314, providing one estimate of the number of 'distinct' patterns (Additional data file 6). On the other hand, these patterns are not unrelated. We consider the 29 restricted clusters the most prominent recurring patterns in the dataset, and we can only speculate where to place the biologically significant number of patterns within these two extremes. It is clear that the clusters are not homogenous since 41% of the genes exhibit composite patterns. If we look at all observed combinations of cluster assignments, we find 454 distinct combinations, and 287 of these cluster combinations consist of a single gene. We favor the idea that many of the composite patterns observed result from simple additive combination of the basic patterns driven by independently acting cis-regulatory modules. Direct examination of the patterns that each of these cis-regulatory modules generates in transgenic reporter assays, rather than the patterns of entire genes, will be more powerful in revealing the underlying mechanisms and logic governing the generation and evolution of each gene's expression pattern. Relatedness of distinct tissues Besides grouping genes according to the similarity of gene expression patterns, we used our annotation dataset to define relatedness among tissues based on the similarity of the set of genes expressed in them. Figure Figure88
The largest, most interconnected set of structures roughly corresponds to the epithelial pattern defined by clusters 5R, 6R and 7R. Notably, the salivary gland duct is isolated from the salivary gland body, reflecting their functional divergence and differential gene expression. The salivary gland duct and trachea are linked by their shared expression of genes required for cuticle deposition. In terms of gene expression, the anal pads are more similar to the hindgut than to other epidermal structures. The large distance between neural and other ectodermal derivatives suggests that specification of neuronal versus epidermal cell fate leads to profound genome-wide changes in transcription. Patterns within the digestive system are interesting - while hindgut and foregut expression are strongly correlated, midgut expression is markedly different despite its functional and spatial relatedness, reflecting its distinct developmental origin. Relationship between expression and function Determining a gene's pattern of expression is a key step towards understanding its function during development. The functions of many genes have been determined, either by direct experimental analysis or by sequence homology and compiled by the GO consortium [20]. Additionally, the Uniprot database catalogs protein domains and provides phylogenetic relationships [31]. For each of our 6,003 genes, we identified associated GO terms and Uniprot domains and determined the relative distribution of these terms and domains within the broad versus restricted clusters (Figure (Figure9),9
To examine the enrichment of GO and Uniprot categories in individual gene expression clusters, we performed exhaustive pair-wise comparisons [21]. We used the binomial test to evaluate the statistical significance of overlaps between sets of genes defined by the different data-sources. In order to correct the significance estimates for multiple testing we determined the empirical chance distribution by performing a large number of random permutations of gene functional assignments and determining the rate at which we attained particular p values. We interpolated these results using a log-linear regression function to fit the empirical distribution (Materials and methods). The results of this analysis are shown in Additional data file 8, which lists all GO essential (Materials and methods [21]) and Uniprot categories significantly enriched in gene expression clusters (those with an adjusted p value of less than 0.05 and 3-fold or greater enrichment). To summarize the functional associations of gene expression clusters, we created a force-directed layout network, which brings into close proximity clusters and GO/Uniprot categories sharing a significant number of genes (Figure (Figure10).10
Interestingly, cluster 7R, containing genes with early (stage 12) onset epithelial expression, clearly separates from 5R and 6R, which contain genes with late epithelial expression (stages 13-16). Early epithelial expressing genes are associated with GO terms for tissue specific functions, such as membrane trafficking, morphogenesis, cell polarity, motility and adhesion, which makes them similar to genes found in the early blastoderm patterning gene cluster (cluster 26R). In contrast, late epithelial clusters (clusters 5R and 6R) associate clearly with cuticle formation in terminally differentiated tissues. This is the best example in our dataset of separation between regulatory developmental genes and effector genes [32] of the terminal cell fates. Genes in cluster 24R are expressed in yolk, mesoderm, dorsal ectoderm and anterior and posterior endoderm anlagen at the blastoderm stage. Consistent with this early expression, these genes are expressed later in differentiated midgut, yolk, fat body and plasmatocytes. The force directed layout suggests that these genes are functionally related to clusters 1-4R, which contain genes expressed in yolk, fat body and blood and involved in metabolite transport. Cluster 24R clearly separates from other blastoderm stage clusters, suggesting that for these particular tissues, specific effector genes are required early in and throughout embryonic development. GO terms related to membrane trafficking, such as secretory pathway, vesicle transport, Golgi apparatus, and ER, assume a central position in the layout with numerous connections to diverse clusters both broad and restricted. This likely reflects the requirement of these core cellular processes in diverse cell types, but also indicates that there are tissue specific differences in the utilization of these pathways. The modulation of these pathways is mediated by GTPases [33], which exhibit similar connectivity patterns in the force directed layout (Figure (Figure1010 CNS and muscle clusters associate with the expected GO terms for nerve impulse transmission and muscle contraction, respectively. Interestingly, despite their clear functional specialization, both tissues show a common requirement for components of the extracellular matrix. Another way to uncover relationships between gene expression and gene function is to examine the representation of GO terms in individual tissues using the 'anatograms' (Figure (Figure11).11
Cell adhesion molecules are similar to transcription factors in that they are expressed early in development in a number of anlagen, and are later abundant in the nervous system. In addition, these molecules are moderately enriched in differentiated epidermal derivatives (Figure 11d Comparison of GO terms and gene expression data often leads to self-evident observations because many functional GO assignments are based on published gene expression patterns. We used the Uniprot catalog to correlate gene expression and protein domains (Figure (Figure10).10 Conclusion We have described the most complete set of data on spatial and temporal patterns of gene expression during embryogenesis that has been compiled for any metazoan organism. The extent, quality, and unbiased nature of this dataset allowed us to describe and explore gene expression patterns during embryogenesis on a genome wide basis. Below we discuss three issues: how this data can be used as a resource by biologists; the inherent challenges in analyzing such a complex set of data; and what we learned about global strategies for regulating gene expression during embryonic development of a complex multi-cellular organism. Utility of the dataset The dataset we assembled can be used in several ways. First, it provides a rich source of candidate genes for further in-depth study. Researchers interested in a particular developmental process, for example, morphogenesis of the salivary gland, can search our annotations and retrieve a list of genes that are expressed in that structure. Such a gene set can be further subdivided by manual curation, using our primary image data. Second, the clustering classification allows one to address more abstract questions, such as: which genes are expressed in a regulated manner at cellular blastoderm? And which genes are involved in organogenesis in the late embryo? Finally, the dataset represents a starting point for an analysis of the sequence determinants of gene expression patterns. Clustering provides gene groupings based on spatio-temporal gene expression, ranging from unique patterns, through small tightly co-regulated gene sets, to large gene expression classes. These classes can be tested against cis-regulatory prediction pipelines to identify significant associations between gene expression specificity and genomic sequence features. Determining expression patterns is only a first step towards further understanding gene function and, therefore, it is important to intersect our spatial expression data with other genomic datasets. Our tools allow anyone with a list of genes, for example, derived from a targeted microarray analysis, to obtain the spatio-temporal expression patterns of these genes in the Drosophila embryo. To address the difficulty of summarizing the gene expression patterns of a group of genes, we developed a new visual aide - the anatogram. Anatograms show the 'position' of a given gene set in the complex space of spatio-temporal gene expression patterns and represent a convenient way to summarize such data for groups of genes. Anatograms also provide an intuitive comparison of differences among groups of genes, which can supplement more rigorous statistical comparisons. Any list of genes can serve to generate an anatogram; for example, the list of Drosophila genes homologous to a gene group in another organism, or the genes that contain a particular sequence motif. In this way, anatograms can be used to compare results from gene expression studies among different species. The color code is based on organ systems shared by metazoan organisms and can be adapted to spatio-temporal gene expression data from other animals, providing an organism-independent way to present spatial gene expression data. Analysis of annotated gene expression patterns Our results suggest that parallel microarray analysis should be an integral part of any in situ hybridization survey of developmental processes. Microarrays provide independent measurements that help control the artifacts of in situ hybridization methods, and also provide a quantitative measure of gene expression that is especially important for the interpretation of broadly expressed genes. The combined analysis of these two datasets is synergistic. In situ hybridization reveals the spatial diversity in tight temporal clusters and microarray clustering reduces the artificial diversity introduced by assigning annotations based on the qualitative in situ assay. In the context of an anatomically well-described system such as Drosophila embryogenesis, it is possible to achieve great precision in expression pattern description. However, making distinctions based on the fine details of patterns, such as different subsets of the CNS, can be problematic when examining genes one by one. We found that it was useful to reduce the granularity of the CV to the level where the annotation assignments are most reliable. This approach necessarily underestimates the true diversity of expression patterns; for example, the expression of GstS1 in a distinct subset of cells in the midgut was annotated simply as midgut. On the other hand, this approach enables description of undefined subsets of cells and their grouping with the correct higher order structures. The fine details of differences among expression patterns on a cellular level can be addressed by comparing images of the individual members of the broader groups defined by CV annotation, or by double labeling in situ experiments [36,37]. A complementary approach to study gene expression of transcription factors at the blastoderm stage uses high-resolution three-dimensional confocal imaging of fluorescently labeled, fixed specimen followed by computational segmentation analysis [38,39]. Many genes are expressed ubiquitously but non-uniformly, giving the appearance of a restricted expression pattern. Identifying and correctly categorizing such ubiquitous patterns is important because their description with the standard vocabulary makes it difficult to separate them from genes with true restricted expression patterns. We identified two major classes of ubiquitous patterns, a midgut CNS pattern and an endoderm mesoderm pattern. Late in embryogenesis these differentially stained structures become apparent, whereas immediately after gastrulation there are no apparent differences among the ubiquitous patterns. Gene expression patterns in development Embryonic development encompasses the complete spectrum of developmental and cell biological processes and, thus, it is not surprising that we detect the expression of 80% of the 6,003 genes we studied. Even this high number underestimates the number of genes expressed during embryogenesis. Our microarray data indicate that the late embryonic genes escaped detection in our in situ assay presumably because deposition of the cuticle prevents entry of the probe. In contrast, genes that are expressed in a very small subset of embryonic cells are more likely to be detected by in situ hybridization than by microarray analysis (data not shown). Of genes in our unbiased set, 45% are expressed in broad patterns. Broad genes tend to encode proteins that mediate core cellular processes and their apparent patterns reflect quantitative differences in requirements for basic cellular machineries in different tissues, especially late in embryogenesis. Of the genes in our dataset, 35% show spatially and/or temporally restricted gene expression. Our data reveal a tremendous diversity of gene expression patterns. Sets of genes that exhibit exactly the same tissue specific gene expression are rare and usually limited to mature organs. Genes with identical restricted expression patterns spanning multiple stages of embryogenesis were not found, even at the limited resolution level offered by our imaging technique. Genes that are expressed during mid-embryogenesis in a specific tissue very frequently show unrelated patterns earlier and later in development. Consequently, genes that serve as lineage markers by being expressed in a given organ system from anlagen, through primordia to final differentiated organs are rare and, for the most part, had already been discovered by genetic analysis. In order to classify the complex expression patterns, we used a fuzzy clustering approach that allows a gene to participate in multiple clusters. We found that nearly all genes with restricted patterns fell into one of six clearly distinguishable restricted pattern types: yolk, blood and fat; epithelia; nervous system; muscle; blastoderm; or other, less frequent, organ specific patterns. Within each of these basic types, several subtypes were distinguished by their preferential expression in particular combinations of tissues. Remarkably, 41% of the genes belong to more than one cluster, underscoring the diversity of gene expression. It is perhaps expected that the majority of gene expression patterns will be unique when one considers all developmental stages. The diversity of patterns suggests that many genes are turned on independently multiple times in development. It is less intuitive that, in terminally differentiated tissues, many genes are expressed in multiple organ systems. The existence of expression clusters indicates that the restriction of gene activities within organ systems and developmental lineages frequently occurs, whereas the fuzziness of the clusters suggests that expression in atypical combinations of tissues can be achieved. It will be interesting to investigate the cis-regulatory code that is responsible for initiating common patterns of gene expression and the potential for diversity in the control of gene expression. Since the control of gene expression is thought to be modular, it is possible that combinations of significantly smaller numbers of regulatory modules achieve the overall diversity of patterns. What is the functional significance of the observed pattern diversity? Are all the minute features of the vast number of unique patterns necessary to carry out development? Or is the complexity of patterns largely a consequence of position effects in the proximity of regulatory modules that have little deleterious effect. Careful comparisons of gene expression patterns across multiple closely related species should reveal the patterns that are under evolutionary constraint. Our genome-wide dataset of patterns in D. melanogaster serves as a starting point for further investigation of genomic regulatory networks in development and their evolution. Materials and methods Data collection Large-scale production of gene expression patterns by RNA in situ hybridization to Drosophila embryos was performed as described [12]. Briefly, we used digoxygenin-labeled RNA probes derived primarily from sequenced cDNAs to visualize gene expression patterns in Drosophila embryos by in situ hybridization and documented the expression patterns by digital microscopy. The histochemical color reaction was stopped in all wells of the 96-well plate at the same time once staining pattern appeared for three included control probes as well as in most wells of the plate (1-1.5 hours at 37°C). Individual embryo images were assigned to one of six stage ranges that coincide with major developmental transitions in embryogenesis, and the development of the pattern across time was confirmed with independently derived Affymetrix microarray time course data covering the first 12.5 hours of embryogenesis [19]. The images were annotated using a CV for embryonic anatomy. In the course of the primary screen, we performed 8,469 successful in situ experiments representing 6,580 genes. We assembled data from multiple independent experiments for 1,514 (23%) genes (labeled RNA probes were generated separately for each experiment). The same EST clone was used as the source for the probe in 1,241 of the multiple experiments, while different ESTs were used as the source in the remaining 273. Low-resolution production images were captured for all 6,003 genes. No high resolution images were captured for genes labeled as maternal, ubiquitous or no expression (2,638). At least one high-resolution image was captured for the remaining 3,365. Of these, 2,202 have high-resolution images at all six stage ranges, and 1,163 are missing at least one stage range. We captured high-resolution images only at stage ranges when a gene was expressed. Annotation The primary curator (AB) assigned anatomical terms from the CV concurrently with image acquisition for each gene, providing a first pass annotation of its expression pattern. When the dataset was finalized, a second curator (VH) reviewed and edited the initial annotation assignments. In this second round of annotation, genes with similar or related patterns of expression were examined side by side; these comparisons allowed us to significantly improve the internal consistency of our annotations. We used two approaches to review annotations (Additional data file 1). Each approach generated lists of genes with related expression patterns that were then used by the second-round curator to review the annotations of individual genes for consistency. The first approach was purely image-based and did not make use of the first round annotations. For each of the first four stage ranges, we examined images of embryos from all 6,580 genes. We developed a software tool for displaying these images in batches, and a subset of images sharing a particular feature (for example, showing regulated expression at cellular blastoderm) was manually selected. These gene lists were further subdivided until meaningful subsets could no longer be identified. The second approach used the first round annotations to generate lists of genes ordered by similarity to particular sets of CV terms. We developed a software tool that generated lists for any arbitrary set of CV terms, but we found it most productive to define 12 relatively independent sets, each focused on a single organ system, that together covered the entire annotation hierarchy. The lists generated by the two approaches were used to re-annotate similar genes en masse to make the resulting annotations as uniform as possible. CV terms were added or deleted as necessary, and genes with satisfactory and complete annotations were deemed finished and removed from all re-annotation lists. As part of this process the curator removed 577 (9%) of the genes from the dataset when the quality of the primary data was judged to be insufficient to support high-quality annotation. Annotation hierarchy To describe the spatial and temporal gene expression patterns in embryogenesis, we used only two types of relationships, 'part of' to cover spatial relations and 'develops from' to cover temporal relationships among structures. Importantly, we linked terms to six developmental stage ranges and used 'develops from' relationships exclusively to link terms that belong to consecutive stage-ranges. Our anatomical terms were organized into a hierarchical tree, starting with stage range 1-3, which used only two CV terms (maternal, pole plasm), and progressively branching through the six stage ranges until stage-range 13-16 with 126 anatomical terms (Additional data file 3). Anatomical structures that were contained within a larger structure are linked to the larger structure by the 'part of' relationships (for example, 13-16 midline glia is 'part of' 13-16 midline). Anatomical structures that develop from one another across time are linked by the 'develops from' relationship (for example, 13-16 'midline' develops from 11-12 'midline primordium' (PR)). CV terms can have simultaneously the 'part of' and the 'develops from' relationships (for example, 'midline glioblast' is part of 'midline primordium', and 'midline glia' develops from the 'midline glioblast'). Every term occurs in the hierarchy only once. In a few cases where two terms develop into a single later structure (for example, 'anterior and posterior midgut primordium' forming 'midgut'), the strictly hierarchical nature of the tree is broken, and both were linked to the child term ('midgut') by the 'develops from' relationship. This fits the directed acyclic graph (DAG) format that is used to capture many biological ontologies. Many specific structures representing small subsets of tissues had very few or none of the 6,003 interrogated genes expressed within them (50 structures had 8 genes or fewer). We summarized the data by focusing on a subset of 145 structures that make up the most common and readily distinguishable structures in our dataset. Genes annotated with more specific structures were collapsed into more general parent structures. For example, the terms 'dorsal epidermis', 'dorsal apodeme', 'dorsal histoblast nest abdominal', 'dorsal ridge' and 'leading edge cell' were collapsed into 'dorsal ectoderm'. We distinguished two levels of collapsing. Within a stage range, we collapsed the 'part of' relationships to the parent term. The resulting 'blocks' of terms represent the most relevant units of embryo anatomy for describing the RNA in situ results. Several such blocks may be defined within a single organ system, for example, 'trunk and head somatic and visceral musculature' (TrunkSomMusc, HeadSomMusc, TrunkViscMusc, HeadViscMusc) in the muscle system. We also collapsed terms referring to the same organ system across a range of stages: structures from stage range 4-6 were collapsed into early organ systems anlagen; structures from stage ranges 7-8 and 9-10 into mid organ systems (Additional data file 4); and structures from stage ranges 11-12 and 13-16 into late organ systems (Additional data file 5). For example, Endocrine_heart refers to the combined anatomical terms for all components of the circulatory and endocrine related structures at stages 11-16 (combining blocks 11-12 CardioVAsc, 11-12 RingGland, 13-16 CardMeso, 13-16 RingGland). Linear annotation profiles Enrichment in the linear annotation profiles was displayed as the statistical significance of the over- or under-representation in the number of genes annotated with the given structure. The expected number of genes was modeled as a binomial distribution with parameters n (the number of genes in the list under study) and p (the frequency of the given structure in the dataset as a whole, Fuzzy clustering There were 4,496 genes detected in at least one tissue by in situ hybridization and present on the Affymetrix Drosophila 1.0 gene chip. Microarray data were extracted and normalized as described in [12]. An additional time point of wild-type flies at 16 hours post egg-laying was obtained from Tiago Magalhães and normalized with the previous 12 time points using Bioconductor's RMA [40,41] package. Input to the fuzzy clustering algorithm was the g × s binary matrix S of CV annotations (S for 'spatial') where Si, j is 1 when gene i is annotated with term j, and 0 otherwise; g is the number of genes in the dataset and s is the number of annotation terms. We used the 145 term collapsed version of the annotations for all clustering. An additional input matrix L (for 'levels') was the real-valued g × t matrix containing the normalized microarray values [40,41]. The basic procedure was similar to that outlined in [22], where [0,1] membership levels for each gene in each of k clusters was represented by a g × k matrix M and iteratively estimated. This matrix was randomly initialized with: where r is sampled from the uniform (0,1) distribution. The matrix was then re-normalized so that Here, is a 'fuzziness' parameter that determines the level of competition between clusters for the same gene (as approaches 1, it becomes a hard partition with each gene being assigned to the single best cluster, while higher values of cause gene memberships to be more fuzzy). While fuzzy clustering is generally thought to be most useful at values of 2 through 10 [24], we found that if was set above 1.5, the dataset would converge to one or two very fuzzy clusters composed of diffuse sets of terms. On the other extreme, if we used completely hard partitions (as in k-means, = 1), the majority of clusters were empty. If we used a partitioning close to 1, we found that each resulting cluster was distinct. We tried a range of values between 1 and 1.5 and used a of 1.05, which yielded the best results (data not shown).Iterations were stopped when the average difference in the membership matrix, ΔM dropped below 5e-5, where: We tried using classical mean [22] and medoid [42] representations for cluster centroids, but these performed poorly when attempting to combine a real-valued L and a binary valued S. Instead of maintaining a discrete centroid model to obtain distance values d, we instead defined di, j as the average distance of gene i to all other genes k, where each k is given the weight of its membership in cluster j: Here, δik is the distance between gene i and gene k. In this way, distances between all pairs of genes are transformed into distances between genes and clusters. Hybrid distance function In order to use annotation similarity to determine the contribution of array similarity (as described in Results and discussion), we defined an asymmetric mixture function where microarray similarity has a significant effect when the annotation similarity was medium to high, but very little effect when the annotation similarity was low (Additional data file 7). The mixture function defines the combined similarity si, k in terms of the spatial similarity ss and the array similarity sl (the combined distance δi, k is simply 1 - si, k): The spatial similarity where ν is the median and the absolute deviation is: The microarray similarity Some stage ranges have only a single associated annotation term (stage range 1-3 maternal), while others have many more (stage range 13-16 has 37 collapsed terms of which up to 17 are used in a single annotation record). The relative abundance of stage range 13-16 terms dominate any metric where each term is given the same weight, so we calculated a stage range-specific similarity independently for each stage and then produced a weighted sum. Stage ranges 4-6 and 13-16 received higher weights because they coincide with two periods in embryogenesis when de novo transcriptional initiation most frequently occurs, cellularization and organogenesis. Stage ranges 7-8 and 9-10 represented in most cases carry-over expression from stage range 4-6, and were difficult to score and, therefore, were less reliable. The weights used for each stage range were: stage range 1-3 (7%), stage range 4-6 (36%), stage range 7-8 (7%), stage range 9-10 (7%), stage range 11-12 (7%) and stage range 13-16 (36%). The similarity score for each stage range consisted of two components: a positive 'match bonus' score for the extent to which the two genes had terms in common, and a negative 'mismatch penalty' score for the extent to which the two genes had mismatched terms. The match bonus contributed twice as much as the mismatch penalty to the overall score: Where Broad versus restricted cluster designation The classification of clusters as broad or restricted was made automatically based on the average annotation terms of the genes assigned to the cluster. We classified a cluster as broad if any of the following criteria were met: over 30% of the genes in the cluster were annotated as ubiquitous at some stage from 7 to 16; if the cluster had 75% maternal genes and no restricted annotation terms in two-thirds of the genes; if the cluster contained genes with only maternal and late midgut staining (see Results and discussion). All other clusters were classified as restricted. Assignment of genes to multiple clusters As described in the text, we did not attain a fuzzy c-means clustering result where clusters had sharp boundaries. Therefore, the raw membership matrix M often assigned a gene high membership scores in neighboring, highly related clusters. In order to mitigate this fact and assign multiple memberships in a meaningful way, we performed an exhaustive analysis of cluster similarities to assign each gene to a set of significantly unrelated clusters. First, the raw membership matrix was transformed to a binary matrix M* using a threshold of mmin. Next, a distance score Δi, j between each pair of clusters i and j was determined from M* by dividing the number of genes in common by the number in the smaller of the two clusters: Next, each gene was assigned to a set of dissimilar clusters using a greedy approach. The gene g is first assigned to the cluster inew with the highest membership score, that is, Rather than determining a single set of mmin and dmax values, we performed this procedure 10,000 times with all pairwise combinations of values within 0.15 ≤ mmin ≤ 1.0 and 0.7 ≤ dmax ≤ 1.0 (These ranges were chosen to sample the entire informative range of the space: mmin values below 0.15 produced trivial results where each gene was only assigned to its single best cluster; dmax values below 0.7 resulted in the majority of genes being assigned to multiple, highly similar, clusters.) We averaged the results across all runs, calculating the number of times a particular gene was assigned to a particular cluster. By plotting the distribution of values, we identified a natural cutoff at 400/10,000 - gene/cluster assignments occurring in at least 400 of the 10,000 runs became final. Force-directed layout The 'balls and springs' models presented in Figures Figures99 where na∩b is the number of genes shared by CV term a and CV term b, and na b is the number of genes in either a or b. For figure figure11,11Statistical comparison of large genomics datasets We used the GO and Uniprot databases from April 2006 for all comparisons. A gene was considered to belong to a particular GO category if it associated with the category itself or any descendant of the category linked via IS_A and/or PART_OF relationships in the GO DAG. For instance, if a gene were annotated as an endopeptidase (GO:0004175), it would also be considered a peptidase (GO:0008233). Because many related GO terms have highly redundant sets of genes assigned to them, we also selected a set 179 of GO terms (which we call GO essential slim [21]) and propagated genes through the DAG to these terms using the map2slim.pl tool. A gene list was created for each GO category, GO essential category and each Uniprot domain. To evaluate whether a given expression cluster was correlated with a particular functional category, we used a statistical binomial test. The background frequency for the functional category, q, was defined as the fraction of the 6,003 annotated genes assigned to the category. For an expression cluster containing n genes total, the background was modeled as a binomial (n, q) distribution. When the expression cluster contained x genes within the particular functional category, the exact probability of observing this many or more genes, Pr(X = x)~binomial (n, q), was calculated using Matlab's binocdf function (The MathWorks, Natick, MA). We corrected the statistical significance values for multiple testing by creating an empirical distribution of binomial test scores for a large number of randomized gene lists. To generate the gene lists, genes were randomly sampled (without replacement) from the entire genome of 14,586 genes (Release 4.3). We sampled 100,000 such lists for each list size from five genes to 1,000 genes (in increments of 5 genes). Each sample was then statistically tested against all GO and Uniprot categories using the binomial test as described above. The most significant (lowest) p value attained (pα) was determined within each of the four functional classes (GO process, GO component, GO function, Uniprot domain), and the empirical distribution was compiled based on the distribution of pα values obtained. In order to interpolate this empirical background distribution to any given pα value obtained for an actual test, we fit the distribution to a log curve using log-linear regression (Matlab glmval function), fitting the parameters a and b in the function log(p) = alog(pα) + b. This produced a function that was a good fit for all list sizes, whereas fitting a linear function (p = apα + b) generally produced fits that deteriorated at low p values. Additional data files The following additional data are available with the online version of this paper. Additional data file 1 is a Re-annotation flowchart. The schematic of our image-driven curation strategy is shown on the left. The curator generated lists of genes (either all genes at a stage range or subset based on annotation query) and the computer produced all the images associated with those genes. Images were presented in batches that were navigated in a manner similar to Google searches with progress tracking. Pages that were viewed were marked. Gene names were shown in the corner of the first image for that gene. The curator reviewed the annotations of all genes expressed in a tissue, for example CNS, and by clicking on an image showing CNS staining, moved that image and the associated gene name to a child gene list. The images for the selected genes were eliminated from the parent pages. Both the parent and child gene lists were stored in a specialized database called the 'list manager' [21]. Lists became inputs for further image-based sub-selection or were used as a starting point for the annotation driven curation approach. The schematic of our annotation driven curation strategy is shown on the right. The curator inputs a gene list that was based on an annotation query or based on the previously described image driven curation. The curator then selected subsets of annotation terms called the 'block'. The list was ordered according to the content of block terms in the annotations of each gene. The curator was presented with images and block-limited annotations for each gene in the list in succession and changed any annotations within the block that needed correction. The curator was then presented with the set of genes that were not annotated for block terms and corrected any omissions. Modifications to the annotations were immediately stored in the database and a history of all changes was tracked. Additional data file 2 is a comparison of the distribution of selected Gene Ontology terms (GO slim general [20]) in the 6,003 genes in our study (red bars) and the distribution of the same terms in the 14,586 genes (purple bars) in the genome (Release 4.3). Each bar represents the percentage of genes with a given GO slim category. The similarities of these two profiles suggest we have annotated a representative sample of all genes. Additional data file 3 is a Description of the annotation hierarchy. We collapsed the 16 well-defined embryonic stages [9] into six developmental stage ranges. Annotation terms were grouped into the six developmental stage ranges, which are separated by solid horizontal lines. We reduced 314 terms in the full anatomical CV to 145 terms representing the structures that were most frequently seen and most readily distinguishable in our dataset, and the 131 of these that are annotated in 10 or more genes are shown here (Additional data files 4 and 5 list all terms). Genes annotated using more specific annotation terms were collapsed to this level. For example, an annotation of 'dorsal ridge' was collapsed to 'dorsal ectoderm', because the 'dorsal ridge' is 'part of' the 'dorsal ectoderm' in our formal CV hierarchy. Relative annotation counts are shown as colored bars. The length of each bar is proportional to the number of genes annotated with that term. Each bar is in one of 16 colors indicating the particular cell fate of the lineage; for instance, endoderm structures are in red across all developmental stages. This color code follows to the 'develops from' relationship among terms in our formal annotation hierarchy. Additional data files 4 and 5 contain the raw gene counts using the same organization. Additional data file 4 is a summary by organ systems and collapsed annotation terms for early embryogenesis (stages 4-10). For each organ system, we listed: the 'total' number of genes annotated with any raw annotation term that belonged to that organ system; the number of genes restricted ('restr.') to that organ system, meaning that at the stage range indicated the gene was annotated ONLY with raw terms of that organ system; the number of genes expressed in the organ system and at the same time not expressed in an equivalent organ system at another stage of development (restr. 4-6 and restr. 7-10); the total number of genes expressed in the equivalent organ systems connected by 'develops from' relationship at stages 4-6 OR stages 7-10 (st 4-10 uni. = union); and the number of genes expressed in equivalent organ system connected by 'develops from' relationship at stages 4-6 AND stages 7-10 (stages 4-10 intrs. = intersection). For each collapsed annotation term (block) we listed: the total number of genes annotated with any raw annotation term that belongs to that block; and the number of genes restricted ('restr.') to the block, meaning that at the given stage range the gene is annotated ONLY with raw terms from the block and no other terms from another block. The organ systems and blocks are color-coded. Mapping of organ systems to blocks of terms and blocks of terms to raw terms is available as supplementary on-line material [21]. Additional data file 5 is a Summary by organ systems and collapsed annotation terms for late embryogenesis (stages 11-16). Similar to Additional data file 4, we listed for each organ system and each block of terms: the total number of genes that were annotated with ANY raw annotation term that belonged to that organ system or block; the number of genes expressed ONLY in the organ system or block at the given stage (restr.); the number of genes expressed in the organ system or block and at the same time not expressed in an equivalent organ system or block at another stage of development (restr. 11-12 and restr. 13-16); the number of genes expressed in equivalent organ systems or blocks at stage range 11-12 OR stage range 13-16 (st 11-16 uni. = union); and the number of genes expressed in equivalent organ systems at stage range 11-12 AND stage range 13-16 (stages 11-16 intrs. = intersection). The organ systems and blocks are color-coded. Mapping of organ systems to blocks of terms and blocks of terms to raw terms is available as supplementary on-line material [21]. Additional data file 6 shows the diversity of CV annotations. Genes that share between 50% and 100% of their annotation terms (x-axis; uniformity) were grouped and the resulting number of groups enumerated (y-axis). The solid line in each graph shows the total number of groups thus formed for a given level of uniformity. The dashed line shows the total number of groups with a single gene member (singletons) for a given level of uniformity. For uniformity levels of 100% (all annotation terms matched within the group) and 75% (3 out of 4 terms matched within the group) we highlight the total number of groups, the number of singletons and the difference between the two. (A) Data for all genes expressed in the embryo (B) for genes belonging to broad clusters and (C) for genes in restricted clusters. Additional data file 7 shows the asymmetric similarity mixture. Given an annotation (spatial) similarity score 0 ≤ ss ≤ 1 and an array (level) similarity score 0 ≤ sl ≤ 1, the function sc = sl + (1 - sl)sl ss gives a similarity score where microarray similarity has a significant effect when annotation similarity is medium to high, but very little effect when annotation similarity is low. Additional data file 8 is a Complete list of GO-essential terms and Uniprot domains enriched with respect to 39 clusters (both broad and restricted) formed by hybrid clustering of gene expression data. *Terms with adjusted p value smaller than 0.001. B = broad, R = restricted. ‡Total number of unique genes in cluster. §GO or Uniprot id. ¶GO or Uniprot name. ¥Ratio between observed and expected number of genes in intersection of the two gene lists. #Number of genes in cluster annotated with given GO or Uniprot term. **Number of genes in the genome annotated with given GO or Uniprot term. ttAdjusted p value of binomial z test testing the null hypothesis that the overlap between the two gene lists is random. Additional data file 9 is an overview of the remaining miscellaneous restricted expression patterns. Similar to Figure Figure66 Additional data file 1 The schematic of our image-driven curation strategy is shown on the left. The curator generated lists of genes (either all genes at a stage range or subset based on annotation query) and the computer produced all the images associated with those genes. Images were presented in batches that were navigated in a manner similar to Google searches with progress tracking. Pages that were viewed were marked. Gene names were shown in the corner of the first image for that gene. The curator reviewed the annotations of all genes expressed in a tissue, for example CNS, and by clicking on an image showing CNS staining, moved that image and the associated gene name to a child gene list. The images for the selected genes were eliminated from the parent pages. Both the parent and child gene lists were stored in a specialized database called the 'list manager' [21]. Lists became inputs for further image-based sub-selection or were used as a starting point for the annotation driven curation approach. The schematic of our annotation driven curation strategy is shown on the right. The curator inputs a gene list that was based on an annotation query or based on the previously described image driven curation. The curator then selected subsets of annotation terms called the 'block'. The list was ordered according to the content of block terms in the annotations of each gene. The curator was presented with images and block-limited annotations for each gene in the list in succession and changed any annotations within the block that needed correction. The curator was then presented with the set of genes that were not annotated for block terms and corrected any omissions. Modifications to the annotations were immediately stored in the database and a history of all changes was tracked. Click here for file(14M, eps) Additional data file 2 Comparison of the distribution of selected GO terms (GO slim general [20]) in the 6,003 genes in our study (red bars) and the distribution of the same terms in the 14,586 genes (purple bars) in the genome (Release 4.3). Each bar represents the percentage of genes with a given GO slim category. The similarities of these two profiles suggest we have annotated a representative sample of all genes. Click here for file(318K, eps) Additional data file 3 We collapsed the 16 well-defined embryonic stages [9] into six developmental stage ranges. Annotation terms were grouped into the six developmental stage ranges, which are separated by solid horizontal lines. We reduced 314 terms in the full anatomical CV to 145 terms representing the structures that were most frequently seen and most readily distinguishable in our dataset, and the 131 of these that are annotated in 10 or more genes are shown here (Additional data files 4 and 5 list all terms). Genes annotated using more specific annotation terms were collapsed to this level. For example, an annotation of 'dorsal ridge' was collapsed to 'dorsal ectoderm', because the 'dorsal ridge' is 'part of' the 'dorsal ectoderm' in our formal CV hierarchy. Relative annotation counts are shown as colored bars. The length of each bar is proportional to the number of genes annotated with that term. Each bar is in one of 16 colors indicating the particular cell fate of the lineage; for instance, endoderm structures are in red across all developmental stages. This color code follows to the 'develops from' relationship among terms in our formal annotation hierarchy. Additional data files 4 and 5 contain the raw gene counts using the same organization. Click here for file(3.0M, eps) Additional data file 4 For each organ system, we listed: the 'total' number of genes annotated with any raw annotation term that belonged to that organ system; the number of genes restricted ('restr.') to that organ system, meaning that at the stage range indicated the gene was annotated ONLY with raw terms of that organ system; the number of genes expressed in the organ system and at the same time not expressed in an equivalent organ system at another stage of development (restr. 4-6 and restr. 7-10); the total number of genes expressed in the equivalent organ systems connected by 'develops from' relationship at stage range 4-6 OR stage range 7-10 (st 4-10 uni. = union); and the number of genes expressed in equivalent organ system connected by 'develops from' relationship at stage range 4-6 AND stage range 7-10 (stages 4-10 intrs. = intersection). For each collapsed annotation term (block) we listed: the total number of genes annotated with any raw annotation term that belongs to that block; and the number of genes restricted ('restr.') to the block, meaning that at the given stage range the gene is annotated ONLY with raw terms from the block and no other terms from another block. The organ systems and blocks are color-coded. Mapping of organ systems to blocks of terms and blocks of terms to raw terms is available as supplementary on-line material [21]. Click here for file(2.0M, eps) Additional data file 5 Similar to Additional data file 4, we listed for each organ system and each block of terms: the total number of genes that were annotated with ANY raw annotation term that belonged to that organ system or block; the number of genes expressed ONLY in the organ system or block at the given stage (restr.); the number of genes expressed in the organ system or block and at the same time not expressed in an equivalent organ system or block at another stage of development (restr. 11-12 and restr. 13-16); the number of genes expressed in equivalent organ systems or blocks at stage range 11-12 OR stage range 13-16 (st 11-16 uni. = union); and the number of genes expressed in equivalent organ systems at stage range 11-12 AND stage range 13-16 (stages 11-16 intrs. = intersection). The organ systems and blocks are color-coded. Mapping of organ systems to blocks of terms and blocks of terms to raw terms is available as supplementary on-line material [21]. Click here for file(1001K, eps) Additional data file 6 Genes that share between 50% and 100% of their annotation terms (x-axis; uniformity) were grouped and the resulting number of groups enumerated (y-axis). The solid line in each graph shows the total number of groups thus formed for a given level of uniformity. The dashed line shows the total number of groups with a single gene member (singletons) for a given level of uniformity. For uniformity levels of 100% (all annotation terms matched within the group) and 75% (3 out of 4 terms matched within the group) we highlight the total number of groups, the number of singletons and the difference between the two. (A) Data for all genes expressed in the embryo (B) for genes belonging to broad clusters and (C) for genes in restricted clusters. Click here for file(396K, eps) Additional data file 7 Given an annotation (spatial) similarity score 0 ≤ ss ≤ 1 and an array (level) similarity score 0 ≤ sl ≤ 1, the function sc = sl + (1 - sl)sl ss gives a similarity score where microarray similarity has a significant effect when annotation similarity is medium to high, but very little effect when annotation similarity is low. Click here for file(52K, pdf) Additional data file 8 *Terms with adjusted p value smaller than 0.001. B = broad, R = restricted. ‡Total number of unique genes in cluster. §GO or Uniprot id. ¶GO or Uniprot name. ¥Ratio between observed and expected number of genes in intersection of the two gene lists. #Number of genes in cluster annotated with given GO or Uniprot term. **Number of genes in the genome annotated with given GO or Uniprot term. ttAdjusted p value of binomial z test testing the null hypothesis that the overlap between the two gene lists is random. Click here for file(331K, eps) Additional data file 9 Similar to Figure Figure66 Click here for file(7.7M, eps) Acknowledgements We thank Erwin Frise and Joseph Carlson for informatics support in producing the Release 2 public release, and Michael Ashburner for assistance with the anatomy CV. We also thank Tiago Magalhães for providing late embryogenesis microarray results ahead of publication, and Ann Hammonds and the anonymous reviewers for providing very useful comments on the manuscript. BPB wishes to thank Dave Hendrix for fruitful discussions concerning fuzzy clustering, and Mike Eisen for funding support. This work was supported by the Howard Hughes Medical Institute and by NIH Grants P50 GH00750 (to GMR) and R01 GM076655 (to SEC) and HG00750 (to BPB via Michael Eisen). References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||||||||||||||||||
Nat Genet. 1999 Jan; 21(1 Suppl):33-7.
[Nat Genet. 1999]Chromosoma. 1970; 31(3):331-45.
[Chromosoma. 1970]Chromosoma. 1989 Aug; 98(2):81-5.
[Chromosoma. 1989]Science. 2002 Sep 27; 297(5590):2270-5.
[Science. 2002]Science. 2001 Aug 31; 293(5535):1629-33.
[Science. 2001]Cell. 1992 Jan 24; 68(2):201-19.
[Cell. 1992]Proc Natl Acad Sci U S A. 1998 Aug 18; 95(17):9973-8.
[Proc Natl Acad Sci U S A. 1998]Genome Biol. 2002; 3(12):RESEARCH0088.
[Genome Biol. 2002]Cell. 1984 Jul; 37(3):833-41.
[Cell. 1984]Cell. 1990 Jan 12; 60(1):9-16.
[Cell. 1990]Genome Biol. 2002; 3(12):RESEARCH0080.
[Genome Biol. 2002]Genome Res. 2002 Aug; 12(8):1294-300.
[Genome Res. 2002]Genome Biol. 2002; 3(12):RESEARCH0088.
[Genome Biol. 2002]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D484-8.
[Nucleic Acids Res. 2006]Genome Biol. 2002; 3(12):RESEARCH0088.
[Genome Biol. 2002]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D484-8.
[Nucleic Acids Res. 2006]Genome Biol. 2002 Oct 10; 3(11):RESEARCH0059.
[Genome Biol. 2002]Math Biosci. 2002 Mar; 176(1):99-121.
[Math Biosci. 2002]J Biol Chem. 2003 Sep 26; 278(39):37840-8.
[J Biol Chem. 2003]Genes Dev. 2003 Sep 15; 17(18):2308-20.
[Genes Dev. 2003]Science. 2004 Feb 6; 303(5659):832-5.
[Science. 2004]Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D484-8.
[Nucleic Acids Res. 2006]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Nucleic Acids Res. 2007 Jan; 35(Database issue):D193-7.
[Nucleic Acids Res. 2007]Nat New Biol. 1973 Oct 24; 245(147):251-3.
[Nat New Biol. 1973]Nat Rev Mol Cell Biol. 2001 Feb; 2(2):107-17.
[Nat Rev Mol Cell Biol. 2001]Nat Cell Biol. 2003 Oct; 5(10):895-901.
[Nat Cell Biol. 2003]Gene Expr Patterns. 2004 Jul; 4(4):413-21.
[Gene Expr Patterns. 2004]BMC Bioinformatics. 2004 Dec 16; 5():202.
[BMC Bioinformatics. 2004]Comput Syst Bioinformatics Conf. 2006; ():293-8.
[Comput Syst Bioinformatics Conf. 2006]Genome Biol. 2006; 7(12):R124.
[Genome Biol. 2006]Genome Biol. 2006; 7(12):R123.
[Genome Biol. 2006]Genome Biol. 2002; 3(12):RESEARCH0088.
[Genome Biol. 2002]Genome Biol. 2002; 3(12):RESEARCH0088.
[Genome Biol. 2002]Genome Biol. 2004; 5(10):R80.
[Genome Biol. 2004]Nucleic Acids Res. 2003 Feb 15; 31(4):e15.
[Nucleic Acids Res. 2003]Genome Biol. 2004; 5(10):R80.
[Genome Biol. 2004]Nucleic Acids Res. 2003 Feb 15; 31(4):e15.
[Nucleic Acids Res. 2003]Genome Biol. 2002 Oct 10; 3(11):RESEARCH0059.
[Genome Biol. 2002]Proc Natl Acad Sci U S A. 1998 Dec 8; 95(25):14863-8.
[Proc Natl Acad Sci U S A. 1998]Genome Biol. 2002; 3(12):RESEARCH0088.
[Genome Biol. 2002]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]