Large-Scale Trends in the Evolution of Gene Structures within 11 Animal Genomes

1 Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
2 Howard Hughes Medical Institute, University of California Berkeley, Berkeley, California, United States of America
3 Department of Genome Sciences, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America

Peter Li, Editor
Celera Genomics, United States of America

* To whom correspondence should be addressed. E-mail: myandell/at/genetics.utah.edu
¤a Current address: Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, United States of America
¤b Current address: U.S. Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America
¤c Current address: Department of Bioinformatics, Genentech, South San Francisco, California, United States of America

Received November 7, 2005; Accepted January 18, 2006.

Copyright: © 2006 Yandell et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

We have used the annotations of six animal genomes (Homo sapiens, Mus musculus, Ciona intestinalis, Drosophila melanogaster, Anopheles gambiae, and Caenorhabditis elegans) together with the sequences of five unannotated Drosophila genomes to survey changes in protein sequence and gene structure over a variety of timescales, from the less than 5 million years since the divergence of D. simulans and D. melanogaster to the more than 500 million years that have elapsed since the Cambrian explosion. To do so, we have developed a new open-source software library called CGL (for “Comparative Genomics Library”).
Our results demonstrate that change in intron–exon structure is gradual, clock-like, and largely independent of coding-sequence evolution. This means that genome annotations can be used in new ways to inform, corroborate, and test conclusions drawn from comparative genomics analyses that are based upon protein and nucleotide sequence similarities. Synopsis Just as protein sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten; and over longer timescales the number and positions of introns in homologous genes can change. These facts suggest that the intron–exon structures of genes may provide a source of evolutionary information. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein evolution. If, for example, intron–exon structures are strongly influenced by selection at the amino acid level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Using 11 animal genomes, Yandell et al. show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. This means that gene structures provide a source of information about the evolutionary past independent of protein sequence similarities—a finding the authors employ to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals. Introduction Sequence alignment and comparison have revealed much about evolution at the nucleotide and amino acid level, but much less is known about the structural evolution of genes—how their intron–exon structures, intron lengths, alternative splicing, and untranslated regions change over time. 
Genome annotations comprise an invaluable resource for answering such questions because they describe the essential parts of a gene and their relationships to one another [1]—information that is missing from protein and transcript sequence files. Although the origins and mobility of introns are still subjects of debate, previous studies [2,3] have established that just as amino acid sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten [4]; and over longer timescales the number and positions of introns in orthologous and paralogous genes can change [5]. These facts suggest that the intron–exon structures of genes may provide a novel source of evolutionary information irrespective of the mechanistic details of intron origin and dispersal. Indeed, several studies have already employed them for this purpose [6–8]. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein sequence evolution. If, for example, intron–exon structures are strongly influenced by selection at the protein level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Also needed is a better understanding of the rates at which different aspects of gene structures evolve. Clearly, more slowly evolving aspects of gene-structure—intron positions [9–11], for example—are best suited to probing deep phylogenetic relationships, whereas more rapidly evolving components—such as intron lengths—are better suited for investigations of more recent events. Here too, comparison to protein evolution is also essential. If intron positions change more rapidly than protein sequences do, their power to resolve ancient relationships will be limited, even if they evolve independently of proteins. 
In order to address these issues, we have characterized the number, position, and length of introns and exons in 11 individual genomes representing four phyla. These data provide a panoramic perspective from which to investigate the evolution of gene structures on a variety of timescales—from the less than 5 million years since the divergence of Drosophila simulans and D. melanogaster, to the more than 500 million years that have elapsed since the Cambrian explosion. We show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. Thus, gene structures provide a source of information about the evolutionary past independent of protein sequence similarities. We use this fact to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals. Results Development of an Open-Source Software Library for Comparative Genomics In order to facilitate the use of genome annotations as substrates for computational analyses, we developed an open-source software library (CGL) for comparative genomics using genome annotations. The software and a tutorial on its use are available at http://www.yandell-lab.org/cgl. CGL can convert the annotations from many different databases into a single standardized format; thus the software can be used to assemble very large repositories of annotations that encompass the contents of multiple genome databases. For purposes of the analyses presented here, we have used CGL to convert the genome annotations of Homo sapiens [12,13], Mus musculus [14], and Caenorhabditis elegans [15,16] as distributed by GenBank; D. melanogaster annotations from FlyBase [17–19]; the Anopheles gambiae [20] annotations from Ensembl [21]; and the Ciona intestinalis [22] annotations from the JGI [23] into a single standardized file format that greatly facilitates computational analyses. 
The resulting repository is unique in that no single database or genome project maintains or distributes all of these annotations. The Bilaterian animals are generally classified as either protostomes or deuterostomes. In deuterostomes, the blastopore lip becomes the anus, whereas in the protostomes it becomes an anterior oral structure. The two lineages are believed to have last shared a common ancestor more than 500 million years ago, and the nematodes may have diverged from both lineages even earlier [24]. We chose the genomes included in this study in such a way as to facilitate inquiries into the evolution of gene structure across various timescales using a minimum number of genomes. Accordingly, we chose to analyze the genomes of three deuterostomes, H. sapiens, M. musculus, and C. intestinalis, and an equal number of protostomes: D. melanogaster, A. gambiae, and C. elegans. This dataset also contains a deep split in both the protostome and deuterostome clades. C. intestinalis, a Urochordate, is believed to have diverged from the Craniata—the phylum to which humans and mice belong—about 500 million years ago [25,26]; likewise among the protostomes, D. melanogaster and A. gambiae are believed to have diverged from one another approximately 250 million years ago [26]. The dataset thus contains a number of deeply divergent animal genomes, making it ideal for the investigation of long-term trends in the evolution of gene structures. Inclusion of the human and mouse genomes makes possible investigations of more rapidly changing aspects of gene structure, as they are believed to have diverged from one another about 70 million years ago [14]. In order to survey gene evolution during even shorter time intervals, we also included in our dataset five recently sequenced but unannotated genomes: D. simulans [27], D. yakuba [27], D. ananassae [28], D. pseudoobscura [29], and D. virilis [28]. 
These five Drosophila species are believed to have diverged from the melanogaster lineage around 5 million, 13 million, 44 million, 55 million, and 63 million years ago, respectively [30]. Because CGL can extract a wide array of information pertaining to the evolution of gene structure even from incompletely assembled and unannotated genomes, this effectively gave us a dataset of 11 genomes for our analyses. The inclusion of these provisionally assembled and unannotated genomes also allowed us to examine the utility of unfinished genomes for analyses of gene evolution.

Intron–Exon Structure in Six Animal Genomes

As our collection of annotated genomes contained more than 100,000 annotations, we sought first to survey and summarize the contents of each genome's annotations with regard to gene structure. We chose three basic measures: intron length, exon length, and intron density. These measures provide a concise summary of the similarities and differences in intron–exon structure for the six annotated genomes. Placing these data in their phylogenetic context allows trends in the evolution of gene structure to emerge.

Intron length. Figure 1
Exon length. We also characterized each genome with respect to coding-exon length (Figure 1). One process that might explain the longer exons characteristic of the protostome genomes is retro-transposition-mediated gene duplication [32]. Because this process results in intronless copies of existing genes, each event will tend to stretch the distribution of exon lengths, shown in Figure 2
Intron density. In order to further investigate the distribution of introns, we have made use of a simple summary statistic of gene structure: intron density, or the number of coding introns associated with a particular protein divided by that protein's length [33]. Although in principle, genome-wide fluctuations in protein lengths might also affect this measure, this does not appear to be the case. To control for this possibility, we recalculated the intron densities for each of the six genomes, using only conserved portions of each annotated protein (unpublished data); the resulting distributions (discussed below) were essentially unchanged, demonstrating that changes in intron density reflect differences in intron numbers, not changes in protein lengths. Intron density thus provides a precise definition with which to distinguish intron-rich from intron-poor genes. While intron density is an attribute of a single annotated transcript, when applied to entire annotated genomes it can also be used to provide a summary statistic regarding the distribution of introns within a genome. Consistent with the exon-length distributions shown in Figure 1 To explore these data more closely, we also examined the frequency distributions of intron density in each of the six annotated genomes (Figure 1 The data in Figure 1 No matter what the ancestral animal distribution may have looked like, the diversity of the present-day intron density distributions makes it certain that extensive remodeling of intron–exon structures has occurred in at least some of these genomes since the six animals last shared a common ancestor. Several lines of evidence suggest that this process has been a slow one. Current estimates of the rate of intron insertion and deletion in animal genomes have placed it at less than one event/gene/200 million years [11]. 
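As a minimal sketch of the intron density statistic defined above (coding introns divided by protein length), the snippet below computes the per-transcript value and a genome-wide median. The function name, the use of a median as the genome-level summary, and the example counts are illustrative assumptions; they are not taken from CGL or from the paper's data.

```python
from statistics import median


def intron_density(num_coding_introns: int, protein_length: int) -> float:
    """Coding introns per amino acid for a single annotated transcript."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return num_coding_introns / protein_length


# Hypothetical (coding intron count, protein length) pairs, invented for illustration.
transcripts = [(0, 250), (3, 300), (9, 450), (1, 500)]

densities = [intron_density(n, length) for n, length in transcripts]

# One possible genome-level summary; the choice of the median is an assumption.
genome_median = median(densities)
```

Because the statistic is normalized by protein length, an intronless 250-residue protein and an intronless 2,500-residue protein both score zero, which is what lets the measure separate intron-rich from intron-poor genes.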
Each of the animal genomes in our study contains tens of thousands of introns; this fact, together with the low intron indel rate, means that a vast period of time will have to elapse before any fluctuation in the ratio of intron insertion to deletion will act to alter the global distribution of introns within a genome. Intron density distributions are thus likely to be among the more slowly evolving attributes of any animal genome. The two insect distributions serve well to illustrate this point: their intron density distributions (Figure 1)

A Survey of Proteome-Wide Patterns of Protein Similarities

Next, we sought to characterize and compare the six annotated proteomes to one another with respect to protein similarities. These analyses are a necessary prerequisite for an examination of intron–exon structures in the context of protein similarities. As a first step, we performed an all-against-all BLASTP [34] search of the six annotated proteomes, and recovered sets of pair-wise reciprocal best hits. From each BLASTP hit, we then selected the highest-scoring high-scoring segment pair (HSP) to avoid complications arising from overlapping sequence alignments. These reciprocal best-hit best HSPs provide nonidentical but intersecting sets of putative orthologs with which to examine patterns of protein evolution. A strength of this approach is that it makes available the largest possible set of putative orthologs for subsequent analyses. This means that gene families restricted to a subset of the proteomes will be included, as will more rapidly evolving proteins that lack clear orthologs over all evolutionary distances. Thus, the analyses presented below provide an overview of protein similarities on the largest possible scale, and complement previous analyses employing smaller subsets of orthologous proteins drawn from different combinations of annotated proteomes [35–37].

Proteome-wide patterns of similarity.
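The reciprocal best-hit selection described in this section can be sketched as follows. This is an illustration under stated assumptions, not the CGL implementation: each hit is assumed to be a `(query, subject, score)` tuple whose score is that of the hit's highest-scoring HSP, and the function names and protein identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its top-scoring (subject, score) pair."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which a and b are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for a, (b, _score) in best_ab.items():
        if b in best_ba and best_ba[b][0] == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical search results, invented for illustration.
human_vs_mouse = [("hsA", "mmA", 900), ("hsA", "mmB", 300),
                  ("hsB", "mmB", 500), ("hsC", "mmA", 400)]
mouse_vs_human = [("mmA", "hsA", 880), ("mmB", "hsB", 510),
                  ("mmB", "hsA", 290)]

putative_orthologs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

In this toy data, `hsC` has `mmA` as its best hit, but `mmA` prefers `hsA`, so the pair is not reciprocal and is left out.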
[Figure 2]

In order to assay the impact of unequal rates of protein evolution on these data, we also compared the six animal proteomes to the Arabidopsis thaliana [41] proteome. Previous studies of C. elegans 18S ribosomal sequences and proteins have suggested that they are rapidly evolving [40,42], and our data demonstrate that this is also the case for the proteome as a whole: C. elegans reciprocal best-hit best HSPs are consistently less similar to their Arabidopsis partners than are human and A. thaliana HSPs (Figure 2).

Recasting Trends in Protein Similarity as a Phylogenetic Tree

For purposes of further analysis, we recast the distributions shown in Figure 2
This approach to consensus phylogenetic tree construction differs from standard methods [43] in that it bypasses the need to construct multiple alignments as a prelude to tree construction; thus, it is much faster than existing approaches, and scales well for comparisons of multiple annotated proteomes. An additional strength of the approach is that it lends itself in a natural fashion to bootstrap analysis [44]. Bootstrap values for each node in the tree can be obtained by randomly and repeatedly resampling a subset of the HSPs from each pair-wise comparison of proteomes, constructing a new tree using these data, and then ascertaining how frequently the resulting trees contain the same nodes as the consensus tree (see Materials and Methods for more details). As the bootstrap values in Figure 3 Intron–Exon Structures in the Context of Protein Similarities Our characterization of proteome-wide patterns of amino acid similarities (summarized in Figures 2 Cursory examination of these HSPs makes clear two important facts. First, genome-wide trends in intron–exon structural similarities roughly parallel those of phylogeny and protein similarity. For example, 92% of human–mouse, 36% of human–C. intestinalis, and 15% of human–D. melanogaster reciprocal best-hit best HSPs have identical intron–exon structures. Summarizing similarities in intron–exon structures as simple percentages, however, fails to account for the fact that intron densities vary between genomes. As our earlier characterization of intron densities revealed (Figure 1 Quantifying similarities in intron–exon structures. In order to address differences in intron density, we formulated a more exacting, though less intuitive, definition of intron–exon structural similarity that takes intron density into account. 
To do so, we calculated a log odds ratio (LOD) score for each set of concatenated reciprocal best-hit best HSPs in toto, wherein the ratio of the observed number of aligned splice junctions to the expected frequency was used as a measure of global similarities in intron positions for two genomes. To obtain the expected frequency of aligned introns, we multiplied the frequencies of introns within query and subject portions of the concatenated alignment. Thus this measure of intron–exon similarity controls for the differing frequencies of introns in the different genomes. It is also essentially identical to the standard LOD score approach used to measure protein similarities [45]. To summarize the results of this analysis, we recast the resulting matrix of LOD scores into the phylogenetic tree shown in Figure 3 As was the case for protein similarities (Figure 3 Intron–Exon Structures Evolve Independently of Protein Sequences Protein identity versus intron–exon structure. One issue not addressed by our previous analyses is the extent to which evolution of intron–exon structures is coupled to that of protein sequences. A clear understanding of the impact of protein-sequence evolution on gene structures is desirable if gene structures are to be used for phylogenetic investigations. Figure 4
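The LOD score for aligned intron positions described earlier in this section can be illustrated with a short sketch. Several details are assumptions made here rather than statements from the text: frequencies are taken per aligned column of the concatenated alignment, the observed count is normalized the same way so that it is comparable to the expected frequency, and the logarithm is base 2. The function name and the example counts are hypothetical.

```python
import math


def intron_lod(aligned_junctions: int, query_introns: int,
               subject_introns: int, alignment_columns: int) -> float:
    """Log odds of observed versus expected aligned splice junctions.

    Expected frequency is the product of the intron frequencies in the
    query and subject portions of the concatenated alignment.
    """
    counts = (aligned_junctions, query_introns, subject_introns, alignment_columns)
    if min(counts) <= 0:
        raise ValueError("all counts must be positive")
    observed = aligned_junctions / alignment_columns
    expected = (query_introns / alignment_columns) * (subject_introns / alignment_columns)
    return math.log2(observed / expected)  # base 2 is an assumption


# Hypothetical counts, invented for illustration.
lod = intron_lod(aligned_junctions=50, query_introns=200,
                 subject_introns=100, alignment_columns=10_000)
```

With these counts the observed frequency is 0.005 and the expected frequency is 0.02 × 0.01 = 0.0002, a 25-fold enrichment. Because the expectation scales with both genomes' intron frequencies, an intron-rich pair does not score higher merely for having more introns to align by chance.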
Although phylogeny is the primary factor structuring the data in Figure 4 Controlling for the impact of protein conservation. The finding that the rate of change in a gene's intron–exon structure is influenced by selection on the protein it encodes (Figure 4
Note that the tree in 5B suggests the same phylogenetic relationships as the tree shown in Figure 3 Evolution of Intron Lengths The quartet dataset. Having examined the evolution of intron–exon-structures, we next sought to investigate the evolution of intron lengths. Previous work [46] in this area has shown that similarities in intron lengths can be used for phylogenetic analyses. Our analyses further characterize the evolution of intron lengths. As a first step toward these investigations, we used a reciprocal best-hit approach to identify sets of human and mouse orthologous paralog pairs that we term “quartets.” Each quartet consists of four genes: a pair of human paralogs and their corresponding mouse orthologs, all of which share the same intron–exon structure as judged by the positions of their annotated splice junctions relative to the protein alignments of their reciprocal best-hit best HSPs. In total, we were able to identify 1,265 quartets. Note that every quartet is in theory the product of the same historical process—some gene duplicated before the time humans and mice last shared a common ancestor, and the products of this duplication event are represented today by genes i and j in the human genome and i′ and j′ in the mouse genome. This implies that the time since speciation will be less than (or equal to) the time since duplication. Hence, the orthologous members of a quartet share a more recent common ancestor than do the paralogous members of a quartet. Vertebrate intron pairs. As Figure 6
If this interpretation is correct, then the data in Figure 6 A possible clock. The data in Figure 6 To further investigate these questions, we turned to the six Drosophila genomes in our collection. Unfortunately, none of the recently sequenced Drosophila genomes has yet been annotated. Thus we could not use the reciprocal best-hits approach we used to explore correlations in intron lengths in Figure 6
As can be seen, the lengths of the inferred D. pseudoobscura orthologous introns are highly correlated with their D. melanogaster partners, despite 55 million years of independent evolution. These results show that, in both vertebrates and insects, orthologous intron lengths can remain correlated over tens of millions of years following speciation events, despite the different distributions of intron lengths (Figure 1 Forces Shaping Correlations in Intron Lengths The two distributions of orthologous intron pairs shown in Figures 6 In general, the transposon load of the vertebrate introns is higher than that of the insects, and much of the central bulge is due to the presence of additional LINE elements in either the human or mouse member of the pair (unpublished data). This is in sharp distinction to the two insects. Although some of the larger off-diagonal intron pairs in the insect distribution (Figure 7 Although transposons seem to explain the central bulge in the human–mouse distribution, they do not explain the details of the melanogaster–pseudoobscura distribution, since most of the intron pairs that comprise the arrowhead-like portion of the insect distribution are entirely transposon-free. Simple repeats and repetitive sequences also do not appear to play an important role in structuring this portion of the distribution, as there is no obvious tendency for the longer partner of the pair to contain additional low-complexity sequences (unpublished data). We also investigated the possibility that the arrowhead region might be an artifact of the assembly process. Although it is difficult to rule out this possibility, gaps in the D. pseudoobscura assembly did not seem especially over-represented in this portion of the distribution; moreover, given the mature state of the D. melanogaster assembly, it is inconceivable that there is sequence missing from a large number of D. melanogaster introns. 
Instead, we believe that the arrowhead-like portion of the insect distribution shown in Figure 7 The preceding observations imply that the rate at which intron pairs leave the diagonal in Figures 6 No doubt, other less easily measured factors also affect the rate at which intron lengths evolve within a species. If transposon load and/or rates of transposition, for example, vary greatly within two genomes, correlations in the lengths of orthologous introns will be a poor indicator of time since last common ancestor. Rather than attempt to measure the impact of a host of factors on intron lengths, we chose instead to ask a related question. Namely, how constant is the decline in the correlation in orthologous intron lengths over time? Doing so allowed us to directly assess not only whether hypothetical differences in transposition rates actually do act to modify the rates at which length correlations among homologous Drosophila introns decline with time, but also if any other factors that we have failed to consider thus far might influence the process as well. Figure 8
Intron lengths and the protein clock. Correlations in orthologous intron lengths seem to accord well with the passage of time (Figure 8 Figure 9
The strong correlation between the intron and protein clocks demonstrated in Figure 9 Discussion Intron Lengths Our investigations of intron length evolution focused on discovering the forces driving changes in intron lengths; the rate at which they change, whether or not the rate is constant; and if so, over what duration and phylogenetic scope. Intron lengths vary greatly among the six annotated genomes, yet when placed in their phylogenetic context general trends emerge. Every deuterostome genome in our collection is characterized by a predominance of class-II (>100 nt) introns, whereas class-I (<100 nt) introns predominate in the protostome genomes. The similarity in the human and mouse distributions suggests that these distributions change slowly over periods of tens of millions of years. Our examinations of intron lengths within the Drosophilae support the same conclusion. Moreover, these data suggest that introns do not simply grow longer and shorter over evolutionary timescales, but rather that the relative proportion of introns belonging to either class grows and shrinks over periods of hundreds of millions of years. In order to further investigate the evolution of intron lengths, we used a transitive reciprocal best-hit strategy to assemble a dataset of genes we term quartets. Each quartet consists of four genes: a pair of human paralogs and their mouse orthologs. In theory, the orthologous members of each quartet share a more recent common ancestor than do the paralogous members of the quartet. The strong correlation in intron lengths characteristic of orthologous quartet members demonstrates that intron lengths within the vertebrates remain correlated for tens of millions of years following speciation events. Our comparisons of orthologous and paralogous intron lengths in the Drosophilae show this to be true of these genomes as well. 
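As a minimal sketch of how a correlation in orthologous intron lengths might be quantified, the code below computes a Spearman rank correlation, the similarity measure named in the caption of Figure S2. The pure-Python implementation and the intron lengths are illustrative assumptions; they do not reproduce CGL code or values from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical orthologous intron lengths in nucleotides, invented for illustration.
species_a = [61, 72, 540, 1200, 68]
species_b = [59, 80, 610, 950, 66]
species_c = [59, 610, 80, 950, 66]

rho_ab = spearman(species_a, species_b)
rho_ac = spearman(species_a, species_c)
```

A rank-based measure suits intron lengths because their distributions are strongly skewed: a single very long intron pair would dominate a Pearson correlation on raw lengths but contributes only one rank here.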
An Intron-Based Molecular Clock To measure the rate at which intron lengths change, we examined them in the context of the protein clock. Our results show that correlations in the lengths of orthologous introns have declined at a constant rate within the Drosophilae during the past 60 million years. We also demonstrate that change in intron length is largely independent of protein evolution. These two results mean that intron lengths provide a molecular measure of time independent of the protein clock. Moreover, we show that the information necessary to employ the intron clock can be extracted from incompletely sequenced genomes. As the distributions in Figure 8 The intron and protein clocks complement one another in a number of ways. Rates of change among protein sequences are reasonably constant for any given set of orthologous genes across phyla but vary widely among different gene families. On the other hand, our results show that the speed of the intron clock may vary between phyla, but not between gene families within a genus. These facts mean that the intron clock is well suited for investigating the evolutionary history of gene families. To see why, consider that a collection of genes all having the same intron–exon structures and intron lengths are likely the result of recent duplication events, regardless of whether they encode rapidly or slowly evolving proteins. Large-Scale Trends in Intron–Exon Structures Our analyses of gene structures demonstrate that change in intron–exon structures is subject to greater lineage-specific variation than is protein sequence evolution. 
The jagged right-hand side of Figure 3 Despite the variability in their rate of evolution, the fact that genome-wide trends in intron–exon structures support the same phylogeny as proteome-wide trends in protein sequence similarities (Figure 3 Intron Densities The large numbers of introns and low rate of intron insertion and deletion characteristic of animal genomes make it likely that intron density distributions are among the more slowly evolving traits of any animal genome. Consistent with this hypothesis, the D. melanogaster and A. gambiae distributions are well correlated after 250 million years of independent evolution. Our discovery that intron density distributions (Figure 1 Materials and Methods Software. CGL can be downloaded from http://www.yandell-lab.org/cgl. This site also provides extensive documentation on how to install and use the software. We also employed the Bioperl [52] libraries in our analyses. Obtaining the genomes and their annotations. The human, mouse, and C. elegans genomes were downloaded (August 2004) from the Genomes division of GenBank (ftp://ftp.ncbi.nih.gov/genomes), and converted to Chaos.xml—an input file format to CGL—using the script cx-genbank2chaos.pl provided with CGL. The A. mellifera genome was downloaded from GenBank on 21 July 2005. The D. melanogaster genome (release 3.1) was obtained from the Berkeley GadFly database [53], converted to Chado-xml (http://www.gmod.org), and then converted to Chaos-xml using the CGL script cx-chadoxml2chaos.pl. The A. gambiae genome was downloaded as an Ensembl database [21] using the CGL script cx-download-enscore.pl and then converted to Chaos.xml using the cx-ensembl2chaos.pl. To convert the C. 
intestinalis genome to Chaos.xml, we obtained its genome and transcript fasta files from the JGI Web site [23], used sim4 [54] to realign each transcript to the genome, loaded the results into a GadFly database [53], and then converted the resulting annotations to Chaos-xml using the same process that was used for the D. melanogaster genome.

The sequences of the five unannotated Drosophilae genomes were obtained as follows. The D. simulans W501 assembly (15 March 2004) was downloaded from http://www.dpgp.org; the D. yakuba assembly was downloaded from ftp://genome.wustl.edu/pub/seqmgr/yakuba on 15 March 2004; the D. virilis and D. ananassae assemblies were downloaded from http://rana.lbl.gov/drosophila on 21 June 2004 and 30 June 2004, respectively; the D. pseudoobscura assembly is as used in [29].

Reciprocal best-hit best HSPs were recovered from proteome-versus-proteome BLASTP searches, using WU-BLAST [55] (cut off: E = 1e−5; wordmask = seg), of the two corresponding nonredundant multi-fasta files. For each search, the database size (WU-BLAST parameter Z) was fixed to the size of the combined nonredundant protein multi-fasta file for all six genomes. Details of the specific analyses are given below.

Figure 1 ... The tree shown in Figure 3 ... To produce Figure 4 ... Figure 5 ...

To extract orthologous introns from unannotated genomes, each annotated D. melanogaster protein was searched against a genome assembly using WU-TBLASTN [55] (cut off: E = 1e−5; wordmask = seg). For all searches, the database size (Z) was set to 128,000,000 nt, the approximate size of the D. melanogaster euchromatic genome. CGL was then used to infer whether or not the details of the resulting TBLASTN HSPs of the best hit to the target genome were consistent with the presence of an intron in the target genome at the same position as an annotated splice junction on the melanogaster protein.
Orthologous introns were counted as found only if the portion of the TBLASTN alignment flanking each inferred intron junction had greater than 25% identity and was at least 15 amino acids long, and only then if the putative intron began with the sequence GT and ended with an AG dimer; the procedure was thus quite stringent. The length distributions of these introns are shown in Figure S1. Current D. melanogaster annotation standards forbid the creation of an annotation having an intron less than 40 bases in length [19]. We adopted the same rule when constructing Figure 8.

Figure S1: Annotated and Inferred Intron Lengths for Six Drosophila Species
All intron lengths are inferred, with the exception of D. melanogaster. dana, D. ananassae; dmel, D. melanogaster; dpse, D. pseudoobscura; dsim6, D. simulans (strain 6); dvir, D. virilis; dyak, D. yakuba. x-axis, intron length (log10); y-axis, frequency. (2.1 MB PSD)

Figure S2: Neighbor-Joining Tree of Pair-Wise Correlations in Orthologous Intron Lengths
Pair-wise Spearman correlation coefficients were used as a similarity measure. Bootstraps were produced by randomly resampling intron pairs with replacement. All intron lengths are inferred, with the exception of D. melanogaster. The long D. simulans branch length is a consequence of the low sequence coverage and the provisional nature of its genomic assembly. dana, D. ananassae; dmel, D. melanogaster; dpse, D. pseudoobscura; dsim6, D. simulans (strain 6); dvir, D. virilis; dyak, D. yakuba. (39 KB PSD)

Figure S3: Similarity in Orthologous Intron Lengths Is Little Influenced by the Intensity of Selection on Flanking Exons
x-axis, average D. melanogaster–D. yakuba Ka/Ks for each pair of exons flanking each orthologous intron pair. y-axis, fractional difference in length of the corresponding orthologous D. melanogaster–D. yakuba intron pair, Lc, where Lc = [(Li + Lj) − |Li − Lj|] / (Li + Lj), and Li and Lj refer to the lengths of orthologous introns i and j, respectively. If the two introns are the same length, Lc equals 1. If one member of the pair is twice the length of the other, Lc equals 2/3 (about 0.67). Thus Lc provides a simple means to associate a similarity value with each pair of orthologous introns. For purposes of display, 1 − Lc is plotted so that two introns having exactly the same length, flanked by exons with a Ka/Ks = 0, will lie at the graph's origin. Orange line, best-fitting linear regression (y = 0.0457x + 0.0863; R² = 0.0015). No significant Spearman correlation coefficient was observed for these data. (2.6 MB PSD)

Acknowledgments

The authors would like to thank S. Mount, G. Marth, I. Korf, D. Shook, G. Miklos, and J. Stajich for providing constructive criticism of a draft of this manuscript; S. Shu and K. Eilbeck for database assistance; and W. Pearson for helpful suggestions regarding how to summarize large amounts of protein similarity data.

Abbreviations
Footnotes
Author contributions. MY conceived and designed the experiments, contributed to and coordinated the analyses of the data, contributed analysis tools, and wrote the paper. CJM, CS, SP, JK, and GH contributed to the analyses of the data. CJM, CS, SP, JK, GH, and SL contributed analysis tools. GMR contributed to experimental design and writing the manuscript.
Funding. This work was supported by the Howard Hughes Medical Institute and by NIH grants HG00750 and HG00739.
Competing interests. The authors have declared that no competing interests exist.
Citation: Yandell M, Mungall CJ, Smith C, Prochnik S, Kaminker J, et al. (2006) Large-scale trends in the evolution of gene structures within 11 animal genomes. PLoS Comput Biol 2(3): e15.
References