Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures 1 The Broad Institute, Massachusetts Institute of Technology and Harvard University, Cambridge, Massachusetts 02140, USA 2 Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, USA 3 The Bioinformatics Centre, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark 4 Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA 5 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK 6 Institute of Computer Science, University of Tartu, Estonia 7 BDGP, LBNL, 1 Cyclotron Road MS 64-0119, Berkeley, California 94720, USA 8 FlyBase, The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA 9 Department of Computer Science, University of New Mexico, Albuquerque, New Mexico 87131, USA 10 Department of Biology, MIT, Cambridge, Massachusetts 02139, USA 11 Whitehead Institute, Cambridge, Massachusetts 02142, USA 12 Cold Spring Harbor Laboratory, Watson School of Biological Sciences, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA 13 University of California, San Francisco/University of California, Berkeley Joint Graduate Group in Bioengineering, Berkeley, California 97210, USA 14 EMBL Nucleotide Sequence Submissions, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK 15 Department of Cell Biology and Molecular Medicine, G-629, MSB, 185 South Orange Avenue, UMDNJ-New Jersey Medical School, Newark, New Jersey 07103, USA 16 Department of Biology and School of Informatics, Indiana University, Indiana 47405, USA 17 Department of Biology, Connecticut College, New London, Connecticut 06320, USA 18 Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, 3000 Leuven, Belgium 19 Department of 
Human Genetics, K. U. Leuven School of Medicine, 3000 Leuven, Belgium 20 Department de Biologie Moleculaire, Universite Libre de Bruxelles, 1050 Brussels, Belgium 21 Department of Biology, Indiana University, Bloomington, Indiana 47405, USA 22 Department of Mathematics and Computer Science, Wesleyan University, Middletown, Connecticut 06459, USA 23 Biology Department, Wesleyan University Middletown, Connecticut 06459, USA 24 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA 25 Department of Mathematics, University of California at Berkeley, Berkeley, California 94720, USA 26 Department of Computer Science, University of California at Berkeley, Berkeley, California 94720, USA 27 Department of Developmental Biology, Memorial Sloan-Kettering Cancer Center, New York, New York 10021, USA 28 Graduate Group in Biophysics, Department of Molecular and Cell Biology, and Center for Integrative Genomics, University of California, Berkeley, California 94720, USA 29 Lawrence Berkeley National Laboratory, Life Sciences Division, Berkeley, California 94720, USA 30 Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA 31 Agencourt Bioscience Corporation, 500 Cummings Center, Suite 2450, Beverly, Massachusetts 01915, USA 32 The Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA Author Information Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to M.K. (Email: manoli/at/mit.edu) *These authors contributed equally to this work. †Lists of participants and affiliations appear at the end of the paper. Author Contributions Organizing committee: Manolis Kellis, William Gelbart, Doug Smith, Andrew G. Clark, Michael E. Eisen, Thomas C. Kaufman; protein-coding gene prediction: Michael F. Lin, Ameya N. Deoras, Mira V. Han, Matthew W. Hahn, Donald G. 
Gilbert, Michael Weir, Michael Rice, Manolis Kellis; manual curation of protein-coding genes: Madeline A. Crosby, Harvard FlyBase curators, William M. Gelbart; validation of protein-coding genes: Joseph W. Carlson, Berkeley Drosophila Genome Project, Susan E. Celniker; non-coding RNA gene prediction: Jakob S. Pedersen, David Haussler, Yongkyu Park, Seung-Won Park, Manolis Kellis; microRNA gene prediction: Alexander Stark, Pouya Kheradpour, Leopold Parts, Manolis Kellis; microRNA cloning and sequencing: Julius Brennecke, Emily Hodges, Gregory J. Hannon; microRNA target prediction: Alexander Stark, J. Graham Ruby, Manolis Kellis, Eric C. Lai, David P. Bartel; motif identification: Alexander Stark, Pouya Kheradpour, Manolis Kellis; motif instance prediction: Alexander Stark, Pouya Kheradpour, Sushmita Roy, Morgan L. Maeder, Benjamin J. Polansky, Bryanne E. Robson, Deborah A. Eastman, Stein Aerts, Bassem Hassan, Jacques van Helden, Manolis Kellis; genome alignments: Angie S. Hinrichs, W. James Kent, Anat Caspi, Lior Pachter, Colin N. Dewey, Benedict Paten; phylogeny and branch length estimation: Matthew D. Rasmussen, Manolis Kellis; final manuscript preparation: Alexander Stark, Michael F. Lin, Pouya Kheradpour, Jakob Pedersen, Manolis Kellis. Abstract Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. 
We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies. The sequencing of the human genome and the genomes of dozens of other metazoan species has intensified the need for systematic methods to extract biological information directly from DNA sequence. Comparative genomics has emerged as a powerful methodology for this endeavour1,2. Comparison of few (two–four) closely related genomes has proven successful for the discovery of protein-coding genes3–5, RNA genes6,7, miRNA genes8–11 and catalogues of regulatory elements3,4,12–14. The resolution and discovery power of these studies should increase with the number of genomes15–20, in principle enabling the systematic discovery of all conserved functional elements. The fruitfly Drosophila melanogaster is an ideal system for developing and evaluating comparative genomics methodologies. Over the past century, Drosophila has been a pioneering model in which many of the basic principles governing animal development and population biology were established21. In the past decade, the genome sequence of D. melanogaster provided one of the first systematic views of a metazoan genome22, and the ongoing effort by the FlyBase and Berkeley Drosophila Genome Project (BDGP) groups established a systematic high-quality genome annotation23–25. Moreover, the fruit-fly benefits from extensive experimental resources26–28, which enable novel functional elements to be systematically tested and used in the evaluation of genetic screens29,30. The fly research community has sequenced, assembled and annotated the genomes of 12 Drosophila species22,31,32 at a range of evolutionary distances from D. 
melanogaster (Fig. 1a, b).
Here, we report genome-wide alignments of the 12 species (Supplementary Information 1), and the systematic discovery of euchromatic functional elements in the D. melanogaster genome. We predict and refine thousands of protein-coding exons, RNA genes and structures, miRNAs, pre- and post-transcriptional regulatory motifs and regulatory targets. We validate many of these elements using complementary DNA (cDNA) sequencing, human curation, small RNA sequencing, and correlation with experimentally supported transcription factor and miRNA targets. In addition, our analysis leads to several specific biological findings, listed below.
Comparative genomics and evolutionary signatures
Although multiple closely related genomes provide sufficient neutral divergence for recognition of functional regions in stretches of highly conserved nucleotides16,17,33, measures of nucleotide conservation alone do not distinguish between different types of functional elements. Moreover, functional elements that tolerate abundant ‘silent’ mutations, such as protein-coding exons and many regulatory motifs, might not be detected when searching on the basis of strong nucleotide conservation. Across many genomes spanning larger evolutionary distances, the information in the patterns of sequence change reveals evolutionary signatures (Fig. 2).
We find that these signatures can be much more precise for genome annotation than the overall level of nucleotide conservation (for example, Fig. 3a).
Revisiting the protein-coding gene catalogue The annotation of protein-coding genes remains difficult in metazoan genomes owing to short exons and complex gene structures with abundant alternative splicing. Comparative information has improved computational gene predictors5, but their accuracy still falls far short of well-studied gene catalogues such as the FlyBase annotation, which combines computational gene prediction37, high-throughput experimental data38–42 and extensive manual curation23. Recognizing this, we set out not only to produce an independent computational annotation of protein-coding genes in the fly genome, but also to assess and refine its already high-quality annotations43. Our analyses of D. melanogaster coding genes are based on two independent evolutionary signatures unique to protein-coding regions (Fig. 2a Assessing and refining existing gene annotations We first assessed the 13,733 euchromatic genes in FlyBase47 release 4.3. Using the above measures, we defined tests that ‘confirmed’ genes supported by the evolutionary evidence, ‘rejected’ genes inconsistent with protein-coding selection, or ‘abstained’ for genes that were not aligned or with ambiguous comparative evidence (Supplementary Methods 2a). Of the 4,711 genes with descriptive names, we confirmed 97%, rejected 1% and abstained for 2%, whereas the same criteria applied to 15,000 random non-coding regions ≥300 nucleotides rejected 99% of candidates and confirmed virtually none (Table 1). Together, these results illustrate the high sensitivity and specificity of our criteria.
Applying the same criteria to the 9,022 genes lacking a descriptive name (genes designated only by a CG identifier, referred to hereafter as CGid-only genes), our tests accepted 87%, rejected 5% (414 genes) and abstained for 8%. This provides strong evidence that most CGid-only genes encode proteins, but also suggests that they may be less constrained20,32 and/or may include incorrect annotations. Indeed, on manual review, 222 (54%) of the 414 rejected CGid-only genes were re-categorized as non-protein-coding or deleted (of which 55 were due to genomically primed clones), 73 (18%) were flagged as being of uncertain quality, and the remaining 119 (29%) were kept unchanged (Fig. 3b In addition, we proposed specific corrections and adjustments to hundreds of existing transcript models, including translation start site adjustments (Supplementary Fig. 2b), alternative splice boundaries (Supplementary Fig. 2b), recent nonsense mutations (Supplementary Fig. 2c) and alternative translational reading frames43. Identifying new genes and exons To predict new protein-coding exons, we integrated our metrics into a probabilistic algorithm that determines an optimal segmentation of the genome into protein-coding and non-coding regions (Fig. 3a We manually reviewed 928 of these predictions according to FlyBase standards23 (Supplementary Methods 2a), leading to 142 new gene models (incorporating 192 predictions) and 438 revised gene models (incorporating 562 predictions) (Fig. 3b Overall, 83% of the 948 predicted exons that we assessed by manual curation or cDNA sequencing were incorporated into FlyBase, resulting in 150 new genes and modifications to hundreds of existing gene models. Finally, the 245 predictions that we did not assess were in non-coding regions of existing transcript models, or were already included in FlyBase independent of our study. 
In an independent analysis52, we predicted 98 new genes on the basis of inferred homology to predicted genes in the informant species32, of which 63% matched the above predictions. Discovering unusual features of protein-coding genes Our analysis also predicted an abundance of unusual protein-coding genes that call for follow-up experimental investigation. First, we found open reading frames with clear protein-coding signatures and conserved start and stop sites on the transcribed strand of annotated UTRs, indicative of polycistronic transcripts23,53,54. These include 73% of 115 annotated dicistronic transcripts and 135 new candidate cistrons of 123 genes (Supplementary Fig. 2b). Second, we predicted that 149 genes undergo stop codon readthrough, with protein-coding selection continuing past a deeply conserved stop codon (Fig. 3d Third, we found four genes in which CSF signatures abruptly shift from one reading frame to another in the absence of nearby intron–exon boundaries or insertions and deletions (Fig. 3e Overall, our results affected over 10% of protein-coding genes, and will be available in future releases of FlyBase. They also suggest that several types of unusual protein-coding gene structure may be more prevalent in the fly than previously appreciated. RNA genes and structures Several comparative approaches to RNA gene identification have been developed6,7,65 that recognize their characteristic properties: compensatory double substitutions of paired nucleotides (for example, A•U↔C•G), structure-preserving single-nucleotide mutations involving G•U base pairs (G•U↔G•C and G•U↔A•U), and few nucleotide substitutions disrupting functional base pairs (Fig. 2b Our search led to 394 predictions, recovering 68 known RNA structures (primarily transfer RNA genes) in 0.02% of the genome (570-fold enrichment). The novel candidates consisted of 177 structures in intergenic regions (54%), 103 in introns (32%), 36 in 3′ UTRs (11%) and 10 in 5′ UTRs (3%). 
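The base-pair substitution classes named earlier in this section (compensatory double substitutions, structure-preserving single substitutions involving G•U pairs, and substitutions that disrupt a pair) can be illustrated with a minimal sketch. The function name and the returned labels are hypothetical and chosen for illustration only; this is not the prediction method used in the study, which evaluates whole alignments and structures rather than single pairs.

```python
# Watson-Crick pairs plus the G-U wobble pair, written as (5' base, 3' base).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}
PAIRING = CANONICAL | WOBBLE


def classify_pair_change(ref_pair, other_pair):
    """Classify how one paired position differs between two aligned species.

    ref_pair and other_pair are 2-tuples of RNA bases at the two columns
    that are paired in the reference structure (hypothetical input format).
    """
    ref_pair = tuple(base.upper() for base in ref_pair)
    other_pair = tuple(base.upper() for base in other_pair)
    if ref_pair not in PAIRING:
        raise ValueError("reference positions are not base-paired")
    if other_pair == ref_pair:
        return "unchanged"
    if other_pair not in PAIRING:
        # The substitution breaks the base pair.
        return "disruptive"
    changed = sum(a != b for a, b in zip(ref_pair, other_pair))
    if changed == 2:
        # Both bases changed and pairing is retained, e.g. A-U <-> C-G.
        return "compensatory_double"
    # One base changed and pairing is retained; with the pair set above
    # this always involves a G-U pair, e.g. G-U <-> G-C or G-U <-> A-U.
    return "structure_preserving_single"


if __name__ == "__main__":
    print(classify_pair_change(("A", "U"), ("C", "G")))
    print(classify_pair_change(("G", "U"), ("G", "C")))
    print(classify_pair_change(("A", "U"), ("A", "C")))
```

Counting these classes over all paired columns of an alignment gives a rough view of whether substitutions tend to preserve or disrupt a candidate structure.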
In addition, we predicted 200 structures in protein-coding regions (Supplementary Methods 3). Notably, 75% of 3′ UTR structures and 80% of 5′ UTR structures were predicted on the transcribed strand, suggesting that they are frequently part of the messenger RNA. In contrast, only 47% of intronic structures are on the transcribed strand, suggesting that they are largely independent of the surrounding genes. Known and novel types of RNA genes Of the 177 predicted intergenic structures, 30 were detected in a tiling-array expression study42. This fraction (17%) is significantly above that for all conserved intergenic regions (12%, P =0.007), but lower than that of known intergenic ncRNAs (21%), suggesting that these candidates may be of lower abundance, temporally or spatially constrained, or might include false positives. Two predictions were expressed throughout development, one extending the annotation of a previously reported but uncharacterized ncRNA66 and the other probably representing a novel type of ncRNA. The predictions also included nine novel H/ACA-box small nucleolar RNA candidates in introns of ribosomal genes, known to frequently contain small nucleolar RNAs that guide post-transcriptional base modifications of ncRNAs67. Likely A-to-I editing structures Many of the 48 intronic candidates on the transcribed strand and many of the 200 hairpins in coding sequence are probably involved in A-to-I editing or post-transcriptional regulation (Fig. 4a
Likely regulatory UTR structures We predicted 38 structures in 3′ UTRs, a density twofold higher than the genomic average, whereas fewer than 10 such examples are currently known70. A considerable fraction of these lies in regulatory genes (14 out of 38; P =10−4), including several transcriptional regulators (for example, cas, spen and Alh), the tyrosine phosphatase PTP-ER and the translation initiation factor eIF3-S8. This suggests that many regulatory genes may themselves be regulated post-transcriptionally through these structures. 3′ UTR structures were also enriched for genes involved in mRNA localization (3 out of 38, P =2.7 ×10−4), including oo18 RNA-binding protein (orb) and staufen (stau), both of which contain double-stranded RNA-binding domains, are involved in axis specification during oogenesis, and interact with the mRNA of maternal effect protein oskar. The hairpin in orb is known to be important for mRNA transport and localization71, whereas the highly similar stau hairpin has not been previously described to our knowledge. The ten structures found in 5′ UTRs probably contain binding sites for factors that regulate translation. For example, the fly homologue of yeast ribosomal protein RPL24 contains a hairpin structure overlapping its start codon (Fig. 4c Conserved RNA structures in roX2 recruit MSL In an independent study74, we searched for conserved regions in the non-coding roX1 and roX2 (RNA on the X) genes to gain insights into their function. Both RNAs are components of the MSL (Male-specific lethal) complex and are crucial for dosage compensation in male flies, inducing lysine 16 acetylation of histone H4, leading to upregulation of hundreds of genes on the X chromosome75. We identified several stem-loop structures with repeated sequence motifs (for example, GUUNUACG), and found that tandem repeats of one of these were sufficient to recruit MSL complexes to the X chromosome and to induce acetylation of lysine 16 of histone H4. 
Although this structure could not fully rescue roX-deficient males, our results suggest that it mediates MSL recruitment during roX2-dependent chromatin modification and dosage compensation, illustrating the power of evolutionary evidence for directing experimental studies. Prediction and characterization of miRNA genes Focusing on specific classes of RNA genes markedly increases the accuracy of RNA gene prediction, reviewed in refs 35, 76 and illustrated here for Drosophila miRNA genes. The common biogenesis and function of miRNAs77 lead to evolutionary and structural signatures (Fig. 2c Comparison of our predictions with high-throughput sequencing data of short RNA libraries from different stages and tissues of D. melanogaster78,79 revealed that 84 of the 101 predictions (83%), including 24 of the 41 novel predictions (59%), were authentic miRNA genes (Fig. 5a
Several of the validated miRNAs were on the transcribed strand of introns or clustered with other miRNAs. For example, mir-11 and mir-998 (the vertebrate homologue of which, mir-29, has been implicated in cancer80) were both found in the last intron of E2f, and might be involved in cell-cycle regulation (Fig. 5b High-throughput sequencing data discovered an additional 50 miRNAs not found computationally79,81, thereby illustrating the limitations of purely computational approaches. Some of these had precursor structures not seen previously for animal miRNAs, including unusually long hairpins79 and hairpins corresponding to short introns (mirtrons)81,82. The remaining were often less broadly conserved or showed unusual conservation properties. Signatures for mature miRNA annotation The exact position of 5′ cleavage of mature miRNAs is important, because it dictates the core of the target recognition sequence83–85. This leads to unique structural and evolutionary signatures, including direct signals, present at the 5′ cleavage site, and indirect signals, stemming from the relationship of miRNAs with their target genes (Supplementary Methods 4a, c). Combined into a computational framework78, these signatures predicted the exact start position in 47 of the 60 cloned Rfam miRNAs (78%), and were within 1 bp in 51 cases (85%). The method disagreed with the previous annotation in 9 of the 14 Rfam miRNAs that were not previously cloned, of which 6 were confirmed by sequencing reads78,79, leading to marked changes in the inferred target spectrum (Fig. 5d New insights into miRNA function and biogenesis We predicted targets for all conserved miRNAs identified by high-throughput sequencing79 searching for conserved matches to the seed region (similar to ref. 86) evaluated using the branch length score (Supplementary Methods 5a), a new scoring scheme described below. 
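As a rough illustration of searching for matches to the seed region, the sketch below derives the target site complementary to a miRNA seed and locates it in a 3′ UTR. It assumes that the seed spans positions 2 to 8 of the mature miRNA, which is a common convention but is not stated in the text above; the function names, the example miRNA sequence and the example UTR are likewise hypothetical. Conservation scoring of each match is not included here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def seed_site(mirna, start=2, end=8):
    """Return the DNA-sense site complementary to miRNA positions start..end.

    Positions are 1-based and inclusive. The default 2..8 window is an
    assumption made for this sketch.
    """
    seed = mirna.upper().replace("U", "T")[start - 1:end]
    # Reverse complement: the site is read 5'->3' on the target transcript.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_sites(utr, site):
    """Return 0-based start offsets of every exact occurrence of site in utr."""
    utr = utr.upper().replace("U", "T")
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]


if __name__ == "__main__":
    toy_mirna = "UGAGGUAGUAGGUUGUAUAGU"   # illustrative input only
    toy_utr = "AAACTACCTCAAACTACCTC"      # hypothetical 3' UTR fragment
    site = seed_site(toy_mirna)
    print(site, find_sites(toy_utr, site))
```

In a comparative setting, each offset returned by `find_sites` would then be checked in the aligned informant genomes before being scored.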
Whereas the resulting miRNA targeting network changed substantially79, we found that the novel and revised miRNAs shared many of their predicted targets with previously known miRNAs, resulting in a denser network with increased potential for combinatorial regulation78,79. For ten miRNA hairpins, the mature miRNA and the corresponding miRNA star sequence (miRNA*, the small RNA from the opposite arm of the hairpin) both appeared to be functional: both reached high computational scores and were frequently sequenced78,79, often exceeding the abundance of many mature miRNAs (Supplementary Table 4e). The Hox miRNA mir-10 showed a particularly striking example of a functional star sequence (Fig. 5e In addition, for 20 miRNA loci, the anti-sense strand also folded into a high-scoring hairpin suggestive of a functional miRNA78 (Supplementary Table 4f). Indeed, sequencing reads confirmed that four of these anti-sense hairpins are processed into small RNAs in vivo79. Thus, a single genomic miRNA locus may produce up to four miRNAs, each with distinct targets. Regulatory motif discovery and characterization Regulatory motifs recognized by proteins and RNAs to control gene expression have been difficult to identify due to their short length, their many weakly specified positions, and the varying distances at which they can act87,88. Recent studies have shown that comparative genomics of a small number of species can be used for motif discovery3,4,12–14, on the basis of hundreds of conserved instances across the genome (Fig. 2d To account for the unique properties of regulatory motifs, we developed a phylogenetic framework to assess the conservation of each motif instance across many genomes89. Briefly, we searched for motif instances in each of the aligned genomes, and based on the set of species that contained them, we evaluated the total branch length over which the D. 
melanogaster motif instance appears to be conserved (Supplementary Methods 5a, b), which we call the branch length score (BLS). We used BLS for the discovery of novel motifs (this section) and for the prediction of individual functional motif instances (next section). Predicted motifs recover known regulators To discover motifs, we estimated the conservation level of candidate sequence patterns with a motif excess conservation (MEC) score compared to overall conservation levels in promoters, UTRs, introns, protein-coding exons and intergenic regions (Supplementary Methods 5a). Our search in regions with roles in pre-transcriptional regulation resulted in 145 distinct motifs (Table 2), obtained by collapsing variants across 83 motifs discovered in promoters, 35 in enhancers, 20 in 5′ UTRs, 35 in core promoters, 30 in introns and 84 in the remaining intergenic regions. Motifs discovered in each region showed similar properties and large overlap: 66 (46%) were discovered independently in at least two regions and 40 (28%) in at least three, consistent with shared regulatory elements in these regions90.
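The branch length score introduced above can be illustrated with a small sketch. It takes the set of species in which a motif instance is found and sums the branches of the smallest subtree connecting them, then divides by the total tree length. The tree topology, the branch lengths, the species labels and the normalization step are all assumptions made for this example; the study's own definition is given in its Supplementary Methods 5a, b.

```python
# child -> (parent, branch length). Lengths are hypothetical placeholders,
# not the branch lengths estimated for the 12 Drosophila genomes.
TOY_TREE = {
    "mel": ("n1", 0.1),
    "sim": ("n1", 0.1),
    "n1": ("n2", 0.2),
    "yak": ("n2", 0.3),
    "n2": ("root", 0.4),
    "pse": ("root", 1.0),
}


def branch_length_score(tree, present, reference="mel"):
    """Fraction of total tree length spanned by species containing the motif.

    present is the set of leaf names in which the motif instance was found.
    The score is 0.0 if the reference species lacks the instance.
    """
    present = set(present)
    if reference not in present:
        return 0.0
    below = {}  # node -> number of motif-containing leaves beneath it
    for leaf in present:
        node = leaf
        while node in tree:
            below[node] = below.get(node, 0) + 1
            node = tree[node][0]
    total_present = len(present)
    # A branch lies on the connecting subtree when motif-containing leaves
    # exist both below it and elsewhere in the tree.
    covered = sum(
        length
        for child, (_, length) in tree.items()
        if 0 < below.get(child, 0) < total_present
    )
    return covered / sum(length for _, length in tree.values())


if __name__ == "__main__":
    print(branch_length_score(TOY_TREE, {"mel", "sim"}))
    print(branch_length_score(TOY_TREE, {"mel", "yak"}))
    print(branch_length_score(TOY_TREE, {"mel", "sim", "yak", "pse"}))
```

Under this sketch, an instance found only in close relatives of the reference scores low, whereas one found in distant species scores high even if some intermediate species lack it.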
The 145 discovered motifs match 40 (46%) of the 87 known transcription factors in Drosophila (Supplementary Table 5c) compared to 8% expected at random (P =1 ×10−20). Several of the non-discovered known motifs are involved in early anterior–posterior segmentation of the embryo, consistent with reports that they are largely non-conserved91; indeed, 74% of these did not exceed the conservation expected by chance in promoter regions. Other non-discovered motifs often lacked characteristics expected for transcription factor motifs, suggesting that some may be spurious: 49% were unusually long (>10 nucleotides) compared to 23% of recovered ones, and showed only one or a few total instances genome-wide, suggestive of individual regulatory sites rather than motifs.
Tissue-specific and functional enrichment of novel motifs
The discovered motifs showed strong signals with respect to embryonic expression patterns (Fig. 6a).
In total, 68% of discovered and 70% of known motifs were enriched or depleted in one of the functional categories (14% random). Noteworthy examples include motif ME93 (GCAACA), which was more highly enriched in neuroblasts (P =4 ×10−12) than either of the two well-known regulators of neuroblast development, prospero and asense (P =4 ×10−5 and 2 ×10−7, respectively). Similarly, motifs ME89 (CACRCAC), ME11 (MATTAAWNATGCR) and ME117 (MAAMNNCAA) were highly enriched in malpighian tubule (P =4 ×10−7), trachea (P =4 ×10−5) and surface glia (6 ×10−7), respectively, in each case ranking above motifs for factors known to be important in these tissues (Supplementary Table 5c). These presumably correspond to as-yet-unknown regulators for these tissues. Exclusion, clustering and positional constraints A large number of motifs were depleted in coding sequence (57% of discovered versus 57% of known and 10% of random motifs, P =3 ×10−18) and in 3′ UTRs (30% versus 22% and 0%, P =4 ×10−11), suggesting specific exclusion similar to in vivo binding92. Many of the intergenic or intronic instances occurred in clusters, a property of motifs that has been used to identify enhancer elements91,94–96. We assessed increased conservation of motifs when found near other instances of the same motif (whether conserved or not, to correct for regional conservation biases), and found significant multiplicity for 19% of the discovered motifs (compared to 24% of known and 4% of random motifs). In addition, 15 of the discovered motifs (10%) were significantly enriched near transcription start sites (compared to 14% of known and 1% of random motifs). Several were enriched at precise positions and preferred orientations (Fig. 
6b Regulatory motifs involved in post-transcriptional regulation We also used BLS/MEC to discover motifs involved in post-transcriptional regulation, and developed methods to distinguish motifs acting at the DNA level, motifs acting at the RNA level and motifs stemming from protein-coding codon biases (Supplementary Methods 5a). Motifs acting post-transcriptionally at the RNA level generally showed highly asymmetric conservation12, as functional instances can only occur on the transcribed strand. Indeed, 71 of 90 motifs (79%) discovered in 3′ UTRs showed strand-specific conservation (compared with only 3% of 5′ UTR motifs and 5% of intron motifs, suggesting that these act primarily in pre-transcriptional regulation). Overall, 33 motifs discovered in 3′ UTRs were complementary to the 5′ end of Rfam miRNAs, recovering 72% of known miRNAs (68% of 5′ unique miRNA families). An additional 21 motifs matched to 5′ ends of novel miRNAs predicted above, of which 12 were validated experimentally78,79, and 3 motifs matched uniquely to miRNA star sequences, all of which were abundantly expressed in vivo (Supplementary Table 4e). We found 33 additional motifs in 3′ UTRs that were apparently not associated with miRNAs. MO40 (TGTANWTW) closely matches the Puf-family Pumilio motif98. MO32 (AATAAA) corresponds to the polyadenylation signal and displays both very strong conservation and a sharply defined distance preference with respect to the end of the annotated 3′ UTR (P =10−69). Finally, several motifs (for example, MO24 =TAATTTAT; MO94 =TTATTTT) are variants of known AU-rich elements, which are known to mediate mRNA instability and degradation99. MicroRNA targeting in protein-coding regions Protein-coding regions can also harbour functional regulatory motifs, such as exonic splicing regulatory elements100. However, motif conservation is difficult to assess within protein-coding regions because of the overlapping selective pressures. 
Indeed, the most highly conserved nucleotide sequence patterns of length seven (7mers) in coding sequence showed strong reading-frame-biased conservation, suggesting that they reflect protein-coding constraints rather than regulatory roles at the DNA or RNA level (Fig. 6c MicroRNA motifs, which function at the RNA level, instead showed high conservation in all three reading frames, suggesting that they are specifically selected within coding regions for their RNA-level function. Indeed, previous studies have shown that miRNA motifs in coding regions are preferentially conserved in vertebrates86, that they can lead to repression in experimental assays101,102, and that they are avoided in genes co-expressed with the miRNA103. Frame-invariant conservation allows us to demonstrate the coding-region targeting of individual miRNAs, and also enables the de novo discovery of miRNA motifs in coding regions. Using frame-invariant conservation, we recovered 11 miRNA motifs within the top 20 coding-region motifs (Supplementary Table 5g), whereas using overall conservation required several hundred candidates to recover 11 miRNA motifs. Moreover, 7mers complementary to different positions in the mature miRNA show a distinctive conservation pattern indicative of functional targeting in coding regions (Fig. 6d Prediction of individual regulator binding sites Previous methods for regulatory motif discovery3,4,12–14 integrated conservation information over hundreds of motif instances across the genome, leading to an exceedingly clear signal for motif discovery even if many of these instances are only marginally conserved. In contrast, the reliable identification of individual motif instances has been hampered by lack of neutral divergence and would require many related genomes15–19. In the absence of such data, previous studies have relied on motif clustering91,94–96 or other sequence characteristics106 to predict regulatory targets or regions. 
With the availability of the 12 fly genomes, we inferred high-confidence instances of regulatory motifs by mapping the BLS of each motif instance to a confidence value (Supplementary Methods 5a). This value represents the probability that a motif instance is functional, on the basis of the conservation level of appropriate control motifs evaluated in the same type of region (promoter, 3′ UTR, coding, and so on). Because the number of conserved instances decreases much more rapidly for control motifs than for real motifs, the many genomes allowed us to reach high confidence values for many transcription factors and miRNAs, even at relatively modest BLS thresholds (Fig. 2e).
Conserved motif instances identify functional in vivo targets
We found that increasing confidence levels selected for functional instances for both transcription factor and miRNA motifs: the normalized fraction of transcription factor motif instances within promoter regions rose from 20% to 90%; that of miRNA motif instances within 3′ UTRs rose from 20% to 90%; and the fraction of miRNA motif instances on the transcribed strand of 3′ UTRs rose from 50% (uniform) to 100% (Fig. 7a).
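One way to picture the mapping from BLS to a confidence value is sketched below. It assumes that confidence at a given BLS threshold is one minus the ratio of the fraction of control-motif instances conserved at that threshold to the fraction of real-motif instances conserved at that threshold. This formula is an assumption made for illustration and is not quoted from the text; the function names and input lists are hypothetical, and the procedure actually used is described in Supplementary Methods 5a.

```python
def conserved_fraction(bls_values, threshold):
    """Fraction of instances whose BLS is at or above the threshold."""
    return sum(score >= threshold for score in bls_values) / len(bls_values)


def confidence(motif_bls, control_bls, threshold):
    """Estimate the share of conserved instances not explained by controls.

    Assumed formula: 1 - (control conserved fraction / motif conserved
    fraction), floored at 0. Returns 0.0 when no motif instance passes.
    """
    real = conserved_fraction(motif_bls, threshold)
    if real == 0:
        return 0.0
    control = conserved_fraction(control_bls, threshold)
    return max(0.0, 1.0 - control / real)


if __name__ == "__main__":
    # Hypothetical BLS values for one motif and its shuffled controls.
    motif = [0.9, 0.8, 0.7, 0.2, 0.1]
    controls = [0.9, 0.2, 0.1, 0.1, 0.0, 0.0, 0.1, 0.2, 0.3, 0.1]
    for cutoff in (0.0, 0.5):
        print(cutoff, confidence(motif, controls, cutoff))
```

Because control instances drop out faster than real instances as the threshold rises, the estimate increases with the BLS cutoff in this toy input, which mirrors the behaviour described in the text.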
We further assessed how predicted motif instances compared with in vivo targets in promoter regions, defined experimentally (without comparative information). We used a set of high-confidence direct CrebA targets107 and three genome-wide chromatin immunoprecipitation (ChIP) data sets for Snail, Mef2 and Twist92,108,109, and in each case found that the enrichment between conserved motif instances and known in vivo regions increased sharply for increasing confidence values (Fig. 7b). We also found that a large fraction of motif instances in experimentally determined target regions was conserved (Fig. 7c).

ChIP-determined and conservation-determined targets show similar enrichment

To determine whether ChIP-bound motifs that lack conservation are biologically meaningful, we studied their enrichment in muscle gene promoters. We found that motifs that were both bound and evolutionarily conserved showed very strong correlation with muscle genes for all three factors: Mef2 showed eightfold enrichment, Twist showed sevenfold enrichment and Snail, a mesodermal repressor, showed threefold depletion for muscle genes. However, when only non-conserved sites were considered, the correlation dropped significantly to 1–2-fold for all three factors, suggesting that non-conserved ChIP-bound sites may be of decreased biological significance (Fig. 7d). We also used the correlation with muscle genes to compare ChIP-on-chip and evolutionary conservation as two complementary methods for target identification (Fig. 7d).

In an independent study113 we compared several strategies for the prediction of motif instances and cis-regulatory modules and found that using the 12 fly genomes led to substantial improvements. In another study, we reported the recovery of conserved motifs for several known regulators, including Suppressor of Hairless, in genes of the Enhancer of split complex114.

A regulatory network of D. melanogaster at 60% confidence

Having established the accuracy of conserved motif instances, we present an initial regulatory network for D. melanogaster at 60% confidence (Supplementary Fig. 5i), containing 46,525 regulatory connections between 67 transcription factors and 8,287 genes, and 3,662 connections between 81 cloned miRNAs (clustered in 49 families with unique seed sequences) and 2,003 genes.

The distribution of predicted sites per target gene is highly nonuniform and indicative of varying levels of regulatory control. Genes with the highest number of sites appeared to be enriched in morphogenesis, organogenesis, neurogenesis and a variety of tissues, whereas ubiquitously expressed genes and maternal genes with housekeeping functions had the fewest sites104. Interestingly, transcription factors appeared to be more heavily targeted than other genes, both by transcription factors (10 sites versus 5.5 on average, P = 10^−15) and by miRNAs (2.3 versus 1.8 miRNAs, P = 5 × 10^−5). Moreover, genes with many transcription factor sites also had many miRNA sites, and conversely, genes with few transcription factor sites also had few miRNA sites (P = 10^−4 and P = 7 × 10^−3, respectively).

Several of the predicted regulatory connections have independent experimental support (Supplementary Table 5h), including direct regulation of achaete by Hairy115, of giant by Bicoid116, of Enhancer of split complex genes by Suppressor of Hairless117, and of bagpipe by Tinman (known to cooperate in mesoderm induction and heart specification118). More generally, when tissue-specific expression data were available, we found that on average 46% of all targets were co-expressed with their factor in at least one tissue (Supplementary Fig. 5i), which is significantly higher than expected by chance (P = 2 × 10^−3).
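Enrichment statements of the kind made in this section (for example, muscle genes among the targets of a factor, or functional categories among genes with many sites) can be quantified as a fold enrichment together with a hypergeometric tail probability. The sketch below is a generic illustration under that assumption; the function names and the gene counts in the example are hypothetical, and it is not a description of the statistical tests used for the P values reported here.

```python
from math import comb


def fold_enrichment(hits, targets, category, universe):
    """Observed hits divided by the number expected by chance.

    hits:     target genes that belong to the category
    targets:  total number of target genes
    category: genes in the category (e.g. muscle genes)
    universe: all genes considered
    """
    expected = targets * category / universe
    return hits / expected


def hypergeom_upper_tail(hits, targets, category, universe):
    """P(X >= hits) when `targets` genes are drawn without replacement."""
    total = comb(universe, targets)
    upper = min(targets, category)
    favourable = sum(
        comb(category, k) * comb(universe - category, targets - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 10 genes, 4 in the category, 3 targets, 2 of them in the category.
print(fold_enrichment(2, 3, 4, 10))
print(hypergeom_upper_tail(2, 3, 4, 10))
```

A fold enrichment below 1 corresponds to depletion, as reported above for the repressor Snail; for depletion the lower tail of the distribution would be the relevant probability.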
Scaling of comparative genomics power

Theoretical considerations and pilot studies on selected genomic regions showed that the discovery power of comparative methods scales with the number and phylogenetic distance of the species compared16–20,46,119,120. We extended these analyses by investigating the scaling of genome-wide discovery power using evolutionary signatures for each class of functional elements (Fig. 8).
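The intuition that discovery power grows with the number and distance of informant species, and with element length, can be illustrated with a deliberately simplified calculation: the probability that a neutral element of a given length appears perfectly conserved by chance. The sketch below assumes a Jukes-Cantor substitution model and a star phylogeny in which every informant diverges independently from the reference; both are simplifying assumptions introduced for this example, the branch lengths are hypothetical, and this is not the evolutionary-signature methodology used in this study.

```python
from math import exp


def jc_identity(branch_length):
    """Probability that a neutral site shows the same base after a branch
    of the given length (substitutions per site) under Jukes-Cantor."""
    return 0.25 + 0.75 * exp(-4.0 * branch_length / 3.0)


def chance_conservation(element_length, branch_lengths):
    """Probability that a neutral element of `element_length` nucleotides is
    identical to the reference in every informant, assuming independent
    sites and independent informants (star phylogeny)."""
    p_site = 1.0
    for t in branch_lengths:
        p_site *= jc_identity(t)
    return p_site ** element_length


# Hypothetical branch lengths, in substitutions per site.
for informants in ([0.2], [0.2, 0.6], [0.2, 0.6, 1.0]):
    short = chance_conservation(7, informants)     # motif-sized element
    long = chance_conservation(300, informants)    # long exon
    print(len(informants), short, long)
```

Under these assumptions the chance-conservation probability of a long element is already negligible with a single close informant, whereas a short element needs several informants or larger distances before conservation becomes informative.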
We found that recovery consistently increased with the total number of informant species, and that multi-species comparisons outperformed pairwise comparisons within the same phylogenetic clade. When we examined subsets of informants with similar total branch length (for example, several close species versus one distant species), multi-species comparisons sometimes performed better (protein-coding exons, ncRNAs), comparably (motifs), or worse (miRNAs) than pairwise comparisons. This complex relationship between total branch length and actual discovery power probably reflects imperfect genome assemblies/alignments, characteristics of each class of functional elements, and the specific methods we used. For example, ncRNA discovery probably benefits from observing more compensatory changes across more genomes, whereas miRNA discovery may be more sensitive to artefacts in low-coverage genomes, given the expected high conservation of miRNA arms.

As expected, longer elements were easier to discover than shorter elements. Long protein-coding exons (>300 nucleotides) were recovered at very high rates even with few species at close distances (leaving little room for improvement with additional species). In contrast, more informant species and larger distances were crucial for recovering short exons, miRNAs and regulatory motifs. Notably, the optimal evolutionary distance for pairwise comparisons to D. melanogaster also seemed to depend on element length: for long protein-coding exons, the best pairwise informant was the closely related D. erecta, for exons of intermediate lengths D. ananassae, and for the shortest exons the distant D. willistoni (Supplementary Table 7a). Distant species were also optimal for other classes of short elements (ncRNAs, miRNAs and motifs, Fig. 8b–d).

Finally, we investigated the effect of alignment choice on our results (Supplementary Fig. 8).
We found high similarity between different alignment strategies for longer elements (>93% agreement for exons), whereas shorter elements showed larger discrepancies between alignments (81% and 59% agreement for miRNA and motif instances, respectively).

Although factors such as genome size, repeat density, pseudogene abundance and physiological differences might confound a simple analogy to the vertebrate phylogeny based on neutral branch length (Fig. 1c)

Discussion

Our results demonstrate the potential of comparative genomics for the systematic characterization of functional elements in a complete genome. Even in a species as intensely studied as D. melanogaster, our methods predicted several thousand new functional elements, including protein-coding genes and exons, novel RNA genes and structures, miRNA genes, regulatory motifs, and regulator targets. Our novel predictions have overwhelming statistical support, often surpassing that of known functional elements, and are additionally supported by experimental evidence in hundreds of cases.

The common underlying methodology in this study has been the recognition of specific evolutionary signatures associated with each class of functional elements, which can be much more informative for genome annotation than overall measures of nucleotide conservation. These signatures are general and are immediately relevant to the analysis of the human genome and more generally of any species.

In addition to the many new elements, we gained specific biological insights and formulated hypotheses that we hope will guide follow-up experiments. We found 149 genes with potential translational readthrough, showing protein-like evolution downstream of a highly conserved stop codon, and possibly encoding additional protein domains or peptides specific to certain developmental contexts.
We also found several candidate programmed frameshifts, which might be part of regulatory circuits (as for ODC/Oda 64) or help expand the diversity of protein products generated from one mRNA, similar to their role in prokaryotes121. We also presented evidence of miRNA processing from both arms of a miRNA hairpin and from both DNA strands of a miRNA locus in some cases, potentially leading to as many as four functional miRNAs per locus. As miRNA/miRNA* pairs are expressed from a single precursor and thus co-regulated, whereas sense/anti-sense pairs are expressed from distinct promoters, the use of both arms or both strands provides compelling general building blocks for higher-level miRNA-mediated regulation.

The newly discovered elements did not dramatically increase the total number of annotated nucleotides. Known and predicted elements explain 42% of nucleotides in phastCons elements33, compared to 35.5% for previous annotations (Supplementary Fig. 6), an 18% increase (mostly owing to conserved motif instances). The remaining phastCons elements and independent estimates based on transcriptional activity42 would suggest that a much higher fraction of the genome may be functional (Supplementary Fig. 6). Although it is possible that these estimates are artificially high and that we are in fact converging on a complete annotation of the fly genome, they might instead indicate that much remains to be discovered, which may require the recognition of as-yet-unknown classes of functional elements with distinct evolutionary signatures.

Our results also allowed us to compare and contrast evolutionary and experimental methods for the recovery of functional elements, particularly for the identification of regulator targets. We found that comparative genomics resulted in many functionally meaningful sites for transcription factors Mef2, Twist and Snail outside ChIP-bound regions, probably representing targets from diverse conditions not surveyed experimentally.
Similarly, ChIP resulted in many additional sites outside those recovered by comparative genomics: some of these may have been replaced by functionally equivalent non-orthologous sequence, rendering them apparently non-conserved in sequence alignments122–124; others may have species- or lineage-specific roles, thus lacking sufficient signal for their comparative detection; finally, some bound sites may be biochemically active yet selectively neutral125. It is worth noting, however, that ChIP-bound motifs that were not conserved showed decreased enrichment in muscle/mesoderm development where the factors are known to act, suggesting that potential lineage-specific roles may lie outside the regulators’ conserved functions. To resolve these questions, comparative genomics studies would benefit greatly from experimental studies in several related species in parallel.

Overall, comparative genomics and species-specific experimental studies provide complementary approaches to biological signal discovery. Comparative studies help pinpoint evolutionarily selected functional elements across diverse conditions, whereas experimental studies reveal stage- and tissue-specific information, as well as species-specific sites. Ultimately, their integration is a necessary step towards a comprehensive understanding of animal genomes.

METHODS SUMMARY

The Methods are described in Supplementary Information, with more details found in the cited companion papers for each section. The sections of the Supplementary Methods are arranged in the same order as the manuscript to facilitate cross-referencing, with an index on the first page to aid navigation.

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgments

We thank the National Human Genome Research Institute (NHGRI) for continued support. A.S.
was supported in part by the Schering AG/Ernst Schering Foundation and in part by the Human Frontier Science Program Organization (HFSPO). P.K. was supported in part by a National Science Foundation Graduate Research Fellowship. J.S.P. thanks B. Raney and R. Baertsch, and the Danish Medical Research Council and the National Cancer Institute for support. J.B. thanks the Schering AG/Ernst Schering Foundation for a postdoctoral fellowship. L. Parts thanks J. Vilo. S.R. was supported by a HHMI-NIH/NIBIB Interfaces Training Grant and thanks T. Lane and M. Werner-Washburne. D.H., D.P.B., G.J.H. and T.C.K. are Investigators of the Howard Hughes Medical Institute, and B.P., J.G.R., E.H. and J.B. are affiliated with these investigators. J.W.C. and S.E.C. were supported by the NHGRI. M.K. was supported by start-up funds from the MIT Electrical Engineering and Computer Science Laboratory, the Broad Institute of MIT and Harvard, and the MIT Computer Science and Artificial Intelligence Laboratory, and by the Distinguished Alumnus (1964) Career Development Professorship.

Footnotes

1 FlyBase, The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA.
2 BDGP, LBNL, 1 Cyclotron Road MS 64-0119, Berkeley, California 94720, USA.

References

1. Miller W, Makova KD, Nekrutenko A, Hardison RC. Comparative genomics. Annu Rev Genomics Hum Genet. 2004;5:15–56.
2. Ureta-Vidal A, Ettwiller L, Birney E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nature Rev Genet. 2003;4:251–262.
3. Kellis M, et al. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003;423:241–254.
4. Cliften P, et al. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003;301:71–76.
5. Brent MR. Genome annotation past, present, and future: how to define an ORF at each locus. Genome Res. 2005;15:1777–1786.
6. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA. 2005;102:2454–2459.
7. Pedersen JS, et al. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006;2:e33.
8. Lim LP, et al. The microRNAs of Caenorhabditis elegans. Genes Dev. 2003;17:991–1008.
9. Lim LP, et al. Vertebrate microRNA genes. Science. 2003;299:1540.
10. Lai EC, Tomancak P, Williams RW, Rubin GM. Computational identification of Drosophila microRNA genes. Genome Biol. 2003;4:R42.
11. Berezikov E, et al. Phylogenetic shadowing and computational identification of human microRNA genes. Cell. 2005;120:21–24.
12. Xie X, et al. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature. 2005;434:338–345.
13. Ettwiller L, et al. The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biol. 2005;6:R104.
14. Chan CS, Elemento O, Tavazoie S. Revealing posttranscriptional regulatory elements through network-level conservation. PLoS Comput Biol. 2005;1:e69.
15. Boffelli D, et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003;299:1391–1394.
16. Cooper GM, et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–913.
17. Margulies EH, Blanchette M, Haussler D, Green ED. Identification and characterization of multi-species conserved sequences. Genome Res. 2003;13:2507–2518.
18. Thomas JW, et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature. 2003;424:788–793.
19. Eddy SR. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 2005;3:e10.
20. Bergman CM, et al. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol. 2002;3:RESEARCH0086.
21. Rubin GM, Lewis EB. A brief history of Drosophila’s contributions to genome research. Science. 2000;287:2216–2218.
22. Adams MD, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195.
23. Misra S, et al. Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 2002;3:RESEARCH0083.
24. Celniker SE, Rubin GM. The Drosophila melanogaster genome. Annu Rev Genomics Hum Genet. 2003;4:89–117.
25. Ashburner M, Bergman CM. Drosophila melanogaster: a case study of a model genomic sequence and its consequences. Genome Res. 2005;15:1661–1667.
26. Matthews KA, Kaufman TC, Gelbart WM. Research resources for Drosophila: the expanding universe. Nature Rev Genet. 2005;6:179–193.
27. Venken KJ, He Y, Hoskins RA, Bellen HJ. P[acman]: a BAC transgenic platform for targeted insertion of large DNA fragments in D. melanogaster. Science. 2006;314:1747–1751.
28. Dietzl G, et al. A genome-wide transgenic RNAi library for conditional gene inactivation in Drosophila. Nature. 2007;448:151–156.
29. Spradling AC, et al. The Berkeley Drosophila Genome Project gene disruption project: Single P-element insertions mutating 25% of vital Drosophila genes. Genetics. 1999;153:135–177.
30. St Johnston D. The art and design of genetic screens: Drosophila melanogaster. Nature Rev Genet. 2002;3:176–188.
31. Richards S, et al. Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res. 2005;15:1–18.
32. Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007 doi: 10.1038/nature06341. this issue.
33. Siepel A, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050.
34. Nekrutenko A, Makova KD, Li WH. The KA/KS ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 2002;12:198–202.
35. Eddy SR. Computational genomics of noncoding RNA genes. Cell. 2002;109:137–140.
36. Bompfuenewerer AF, et al. Evolutionary patterns of non-coding RNAs. Theor Biosci. 2004;123:301–369.
37. Reese MG, et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000;10:483–501.
38. Rubin GM, et al. A Drosophila complementary DNA resource. Science. 2000;287:2222–2224.
39. Stapleton M, et al. A Drosophila full-length cDNA resource. Genome Biol. 2002;3:RESEARCH0080.
40. Hild M, et al. An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome. Genome Biol. 2003;5:R3.
41. Yandell M, et al. A computational and experimental approach to validating annotations and gene predictions in the Drosophila melanogaster genome. Proc Natl Acad Sci USA. 2005;102:1566–1571.
42. Manak JR, et al. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature Genet. 2006;38:1151–1158.
43. Lin MF, et al. Revisiting the protein-coding gene catalog of Drosophila melanogaster using twelve fly genomes. Genome Res. doi: 10.1101/gr.6679507. in the press.
44. Yang Z, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 2000;15:496–503.
45. Mignone F, Grillo G, Liuni S, Pesole G. Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis. Nucleic Acids Res. 2003;31:4639–4645.
46. Zhang L, Pavlovic V, Cantor CR, Kasif S. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res. 2003;13(6A):1190–1202.
47. Crosby MA, et al. FlyBase: genomes by the dozen. Nucleic Acids Res. 2007;35(Database issue):D486–D491.
48. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000;25:25–29.
49. Ochman H, Ajioka JW, Garza D, Hartl DL. Inverse polymerase chain reaction. Bio/Technology. 1990;8:759–760.
50. Hoskins RA, et al. Rapid and efficient cDNA library screening by self-ligation of inverse PCR products (SLIP). Nucleic Acids Res. 2005;33:e185.
51. Wan KH, et al. High-throughput plasmid cDNA library screening. Nature Protocols. 2006;1:624–632.
52. Hahn MW, Han MV, Han SG. Gene family evolution across 12 Drosophila genomes. PLoS Genet. 2007;3:e197.
53. Andrews J, et al. The stoned locus of Drosophila melanogaster produces a dicistronic transcript and encodes two distinct polypeptides. Genetics. 1996;143:1699–1711.
54. Brogna S, Ashburner M. The Adh-related gene of Drosophila melanogaster is expressed as a functional dicistronic messenger RNA: multigenic transcription in higher organisms. EMBO J. 1997;16:2023–2031.
55. Hatfield DL, Gladyshev VN. How selenium has altered our understanding of the genetic code. Mol Cell Biol. 2002;22:3565–3576.
56. Kryukov GV, et al. Characterization of mammalian selenoproteomes. Science. 2003;300:1439–1443.
57. Copeland PR. Regulation of gene expression by stop codon recoding: selenocysteine. Gene. 2003;312:17–25.
58. Castellano S, et al. In silico identification of novel selenoproteins in the Drosophila melanogaster genome. EMBO Rep. 2001;2:697–702.
59. von der Haar T, Tuite MF. Regulated translational bypass of stop codons in yeast. Trends Microbiol. 2007;15:78–86.
60. Luo GX, et al. A specific base transition occurs on replicating hepatitis delta virus RNA. J Virol. 1990;64:1021–1027.
61. Casey JL, Gerin JL. Hepatitis D virus RNA editing: specific modification of adenosine in the antigenomic RNA. J Virol. 1995;69:7593–7600.
62. Steneberg P, et al. Translational readthrough in the hdc mRNA generates a novel branching inhibitor in the Drosophila trachea. Genes Dev. 1998;12:956–967.
63. Bass BL. RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem. 2002;71:817–846.
64. Ivanov IP, et al. The Drosophila gene for antizyme requires ribosomal frameshifting for expression and contains an intronic gene for snRNP Sm D3 on the opposite strand. Mol Cell Biol. 1998;18:1553–1561.
65. Eddy SR. Non-coding RNA genes and the modern RNA world. Nature Rev Genet. 2001;2:919–929.
66. Yuan G, et al. RNomics in Drosophila melanogaster: identification of 66 candidates for novel non-messenger RNAs. Nucleic Acids Res. 2003;31:2495–2507.
67. Lestrade L, Weber MJ. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 2006;34(Database issue):D158–D162.
68. Bier E. Drosophila, the golden bug, emerges as a tool for human genetics. Nature Rev Genet. 2005;6:9–23.
69. Hoopengardner B, Bhalla T, Staber C, Reenan R. Nervous system targets of RNA editing identified by comparative genomics. Science. 2003;301:832–836.
70. Mignone F, et al. UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2005;33(Database issue):D141–D146.
71. Cohen RS, Zhang S, Dollar GL. The positional, structural, and sequence requirements of the Drosophila TLS RNA localization element. RNA. 2005;11:1017–1029.
72. Allemand F, et al. Escherichia coli ribosomal protein L20 binds as a single monomer to its own mRNA bearing two potential binding sites. Nucleic Acids Res. 2007;35:3016–3031.
73. Okumura T, Matsumoto A, Tanimura T, Murakami R. An endoderm-specific GATA factor gene, dGATAe, is required for the terminal differentiation of the Drosophila endoderm. Dev Biol. 2005;278:576–586.
74. Park SW, et al. An evolutionarily conserved domain of roX2 RNA is sufficient for induction of H4-Lys16 acetylation on the Drosophila X chromosome. Genetics. in the press.
75. Park Y, Kuroda MI. Epigenetic aspects of X-chromosome dosage compensation. Science. 2001;293:1083–1085.
76. Berezikov E, Cuppen E, Plasterk RH. Approaches to microRNA discovery. Nature Genet. 2006;38(Suppl 1):S2–S7.
77. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004;116:281–297.
78. Stark A, et al. Systematic discovery and characterization of fly microRNAs using 12 Drosophila genomes. Genome Res. doi: 10.1101/gr.6593807. in the press.
79. Ruby JG, et al. Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs. Genome Res. doi: 10.1101/gr.6597907. in the press.
80. Pekarsky Y, et al. Tcl1 expression in chronic lymphocytic leukemia is regulated by miR-29 and miR-181. Cancer Res. 2006;66:11590–11593.
81. Ruby JG, Jan CH, Bartel DP. Intronic microRNA precursors that bypass Drosha processing. Nature. 2007;448:83–86.
82. Okamura K, et al. The mirtron pathway generates microRNA-class regulatory RNAs in Drosophila. Cell. 2007;130:89–100.
83. Lewis BP, et al. Prediction of mammalian microRNA targets. Cell. 2003;115:787–798.
84. Stark A, Brennecke J, Russell RB, Cohen SM. Identification of Drosophila microRNA targets. PLoS Biol. 2003;1:E60.
85. Lai EC. Micro RNAs are complementary to 3′UTR sequence motifs that mediate negative post-transcriptional regulation. Nature Genet. 2002;30:363–364.
86. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120:15–20.
87. Tompa M. Identifying functional elements by comparative DNA sequence analysis. Genome Res. 2001;11:1143–1144.
88. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
89. Kheradpour P, Stark A, Roy S, Kellis M. Reliable prediction of regulator targets using 12 Drosophila genomes. Genome Res. doi: 10.1101/gr.7090407. in the press.
90. Stathopoulos A, Levine M. Genomic regulatory networks and animal development. Dev Cell. 2005;9:449–462.
91. Schroeder MD, et al. Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol. 2004;2:e271.
92. Zeitlinger J, et al. Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes Dev. 2007;21:385–390.
93. Kanehisa M, et al. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32(Database issue):D277–D280.
94. Berman BP, et al. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA. 2002;99:757–762.
95. Markstein M, et al. A regulatory code for neurogenic gene expression in the Drosophila embryo. Development. 2004;131:2387–2394.
96. Philippakis AA, et al. Expression-guided in silico evaluation of candidate cis regulatory codes for Drosophila muscle founder cells. PLoS Comput Biol. 2006;2:e53.
97. Smale ST, Kadonaga JT. The RNA polymerase II core promoter. Annu Rev Biochem. 2003;72:449–479.
98. Gerber AP, et al. Genome-wide identification of mRNAs associated with the translational regulator PUMILIO in Drosophila melanogaster. Proc Natl Acad Sci USA. 2006;103:4487–4492.
99. Zubiaga AM, Belasco JG, Greenberg ME. The nonamer UUAUUUAUU is the key AU-rich sequence motif that mediates mRNA degradation. Mol Cell Biol. 1995;15:2219–2230.
100. Fairbrother WG, Yeh RF, Sharp PA, Burge CB. Predictive identification of exonic splicing enhancers in human genes. Science. 2002;297:1007–1013.
101. Kloosterman WP, Wienholds E, Ketting RF, Plasterk RH. Substrate requirements for let-7 function in the developing zebrafish embryo. Nucleic Acids Res. 2004;32:6284–6291.
102. Grimson A, et al. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007;27:91–105.
103. Farh KK, et al. The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science. 2005;310:1817–1821.
104. Stark A, et al. Animal microRNAs confer robustness to gene expression and have a significant impact on 3′ UTR evolution. Cell. 2005;123:1133–1146.
105. Rajewsky N. microRNA target predictions in animals. Nature Genet. 2006;38(suppl 1):S8–S13.
106. Elnitski L, et al. Distinguishing regulatory DNA from neutral sites. Genome Res. 2003;13:64–72.
107. Abrams EW, Andrew DJ. CrebA regulates secretory activity in the Drosophila salivary gland and epidermis. Development. 2005;132:2743–2758.
108. Sandmann T, et al. A temporal map of transcription factor activity: mef2 directly regulates target genes at all stages of muscle development. Dev Cell. 2006;10:797–807.
109. Sandmann T, et al. A core transcriptional network for early mesoderm development in Drosophila melanogaster. Genes Dev. 2007;21:436–449.
110. Sethupathy P, Corda B, Hatzigeorgiou AG. TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA. 2006;12:192–197.
111. Lee TI, et al. Control of developmental regulators by Polycomb in human embryonic stem cells. Cell. 2006;125:301–313.
112. Boyer LA, et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122:947–956.
113. Aerts S, van Helden J, Sand O, Hassan B. Fine-tuning enhancer models to predict transcriptional targets across multiple genomes. PLoS ONE. 2007;2(11):e1115.
114. Maeder M, Polansky B, Robson B, Eastman D. Phylogenetic footprinting analysis in the upstream regulatory regions of the Drosophila Enhancer of split genes. Genetics. in the press.
115. Van Doren M, et al. Negative regulation of proneural gene activity: hairy is a direct transcriptional repressor of achaete. Genes Dev. 1994;8:2729–2742.
116. Kraut R, Levine M. Spatial regulation of the gap gene giant during Drosophila development. Development. 1991;111:601–609.
117. Bailey AM, Posakony JW. Suppressor of hairless directly activates transcription of enhancer of split complex genes in response to Notch receptor activity. Genes Dev. 1995;9:2609–2622.
118. Yin Z, Frasch M. Regulation and function of tinman during dorsal mesoderm induction and heart specification in Drosophila. Dev Genet. 1998;22:187–200.
119. Margulies EH, et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc Natl Acad Sci USA. 2005;102:4795–4800.
120. Margulies EH, Chen CW, Green ED. Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. Trends Genet. 2006;22:187–193.
121. Farabaugh PJ. Programmed translational frameshifting. Annu Rev Genet. 1996;30:507–528.
122. Odom DT, et al. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nature Genet. 2007;39:730–732.
123. Ludwig MZ, Kreitman M. Evolutionary dynamics of the enhancer region of even-skipped in Drosophila. Mol Biol Evol. 1995;12:1002–1011.
124. Ludwig MZ, et al. Functional evolution of a cis-regulatory module. PLoS Biol. 2005;3:e93.
125. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816.
126. Kent WJ, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006.
[Bioinformatics. 2000]Nature. 2003 May 15; 423(6937):241-54.
[Nature. 2003]Science. 2003 Jul 4; 301(5629):71-6.
[Science. 2003]Nature. 2005 Mar 17; 434(7031):338-45.
[Nature. 2005]Dev Cell. 2005 Oct; 9(4):449-62.
[Dev Cell. 2005]PLoS Biol. 2004 Sep; 2(9):E271.
[PLoS Biol. 2004]Genes Dev. 2007 Feb 15; 21(4):385-90.
[Genes Dev. 2007]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D277-80.
[Nucleic Acids Res. 2004]Genes Dev. 2007 Feb 15; 21(4):385-90.
[Genes Dev. 2007]PLoS Biol. 2004 Sep; 2(9):E271.
[PLoS Biol. 2004]Proc Natl Acad Sci U S A. 2002 Jan 22; 99(2):757-62.
[Proc Natl Acad Sci U S A. 2002]PLoS Comput Biol. 2006 May; 2(5):e53.
[PLoS Comput Biol. 2006]Annu Rev Biochem. 2003; 72():449-79.
[Annu Rev Biochem. 2003]Nature. 2005 Mar 17; 434(7031):338-45.
[Nature. 2005]Proc Natl Acad Sci U S A. 2006 Mar 21; 103(12):4487-92.
[Proc Natl Acad Sci U S A. 2006]Mol Cell Biol. 1995 Apr; 15(4):2219-30.
[Mol Cell Biol. 1995]Science. 2002 Aug 9; 297(5583):1007-13.
[Science. 2002]Cell. 2005 Jan 14; 120(1):15-20.
[Cell. 2005]Nucleic Acids Res. 2004; 32(21):6284-91.
[Nucleic Acids Res. 2004]Mol Cell. 2007 Jul 6; 27(1):91-105.
[Mol Cell. 2007]Science. 2005 Dec 16; 310(5755):1817-21.
[Science. 2005]Nature. 2005 Mar 17; 434(7031):338-45.
[Nature. 2005]Cell. 2003 Dec 26; 115(7):787-98.
[Cell. 2003]Science. 2005 Dec 16; 310(5755):1817-21.
[Science. 2005]Cell. 2005 Dec 16; 123(6):1133-46.
[Cell. 2005]Cell. 2005 Jan 14; 120(1):15-20.
[Cell. 2005]Nature. 2003 May 15; 423(6937):241-54.
[Nature. 2003]Science. 2003 Jul 4; 301(5629):71-6.
[Science. 2003]Nature. 2005 Mar 17; 434(7031):338-45.
[Nature. 2005]PLoS Comput Biol. 2005 Dec; 1(7):e69.
[PLoS Comput Biol. 2005]Science. 2003 Feb 28; 299(5611):1391-4.
[Science. 2003]Development. 2005 Jun; 132(12):2743-58.
[Development. 2005]Genes Dev. 2007 Feb 15; 21(4):385-90.
[Genes Dev. 2007]Dev Cell. 2006 Jun; 10(6):797-807.
[Dev Cell. 2006]Genes Dev. 2007 Feb 15; 21(4):436-49.
[Genes Dev. 2007]Cell. 2005 Dec 16; 123(6):1133-46.
[Cell. 2005]RNA. 2006 Feb; 12(2):192-7.
[RNA. 2006]Cell. 2006 Apr 21; 125(2):301-13.
[Cell. 2006]Cell. 2005 Sep 23; 122(6):947-56.
[Cell. 2005]PLoS One. 2007 Nov 7; 2(11):e1115.
[PLoS One. 2007]Cell. 2005 Dec 16; 123(6):1133-46.
[Cell. 2005]Genes Dev. 1994 Nov 15; 8(22):2729-42.
[Genes Dev. 1994]Development. 1991 Feb; 111(2):601-9.
[Development. 1991]Genes Dev. 1995 Nov 1; 9(21):2609-22.
[Genes Dev. 1995]Dev Genet. 1998; 22(3):187-200.
[Dev Genet. 1998]Genome Res. 2005 Jul; 15(7):901-13.
[Genome Res. 2005]Genome Biol. 2002; 3(12):RESEARCH0086.
[Genome Biol. 2002]Genome Res. 2003 Jun; 13(6A):1190-202.
[Genome Res. 2003]Proc Natl Acad Sci U S A. 2005 Mar 29; 102(13):4795-800.
[Proc Natl Acad Sci U S A. 2005]Trends Genet. 2006 Apr; 22(4):187-93.
[Trends Genet. 2006]Mol Cell Biol. 1998 Mar; 18(3):1553-61.
[Mol Cell Biol. 1998]Annu Rev Genet. 1996; 30():507-28.
[Annu Rev Genet. 1996]Genome Res. 2005 Aug; 15(8):1034-50.
[Genome Res. 2005]Nat Genet. 2006 Oct; 38(10):1151-8.
[Nat Genet. 2006]Nat Genet. 2007 Jun; 39(6):730-2.
[Nat Genet. 2007]PLoS Biol. 2005 Apr; 3(4):e93.
[PLoS Biol. 2005]Nature. 2007 Jun 14; 447(7146):799-816.
[Nature. 2007]Genome Res. 2005 Aug; 15(8):1034-50.
[Genome Res. 2005]Genome Res. 2002 Jun; 12(6):996-1006.
[Genome Res. 2002]