Copyright © 2008, Cold Spring Harbor Laboratory Press

Large-scale comparative analysis of splicing signals and their corresponding splicing factors in eukaryotes

1 Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv 69978, Israel; 2 Biomedical Informatics Unit, Pompeu Fabra University, PRBB E08003, Barcelona, Spain; 3 Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel; 4 Catalan Institution for Research and Advanced Studies (ICREA), E08010, Barcelona, Spain

5 Corresponding author. E-mail gilast/at/post.tau.ac.il; fax 972-3-640-5168.

Received June 17, 2007; Accepted October 10, 2007.

Abstract

Introns are among the hallmarks of eukaryotic genes. Splicing of introns is directed by three main splicing signals: the 5′ splice site (5′ss), the branch site (BS), and the polypyrimidine tract/3′ splice site (PPT-3′ss). To study the evolution of these splicing signals, we have conducted a systematic comparative analysis of these signals in over 1.2 million introns from 22 eukaryotes. Our analyses suggest that all these signals have evolved dramatically: The PPT is weak among most fungi, intermediate in plants and protozoans, and strongest in metazoans. Within metazoans it shows a gradual strengthening from Caenorhabditis elegans to human. The 5′ss and the BS were found to be degenerate among most organisms, but highly conserved among some fungi. A maximum parsimony-based algorithm for reconstructing ancestral position-specific scoring matrices suggested that the ancestral 5′ss and BS were degenerate, as in metazoans. To shed light on the evolutionary variation in splicing signals, we have analyzed the evolutionary changes in the factors that bind these signals.
Our analysis reveals coevolution of splicing signals and their corresponding splicing factors: The strength of the PPT is correlated with changes in key residues in its corresponding splicing factor U2AF2; a limited correlation was found between changes in the 5′ss and the U1 snRNA that binds it, but none between the BS and U2 snRNA. Thus, although the basic ability of eukaryotes to splice introns has remained conserved throughout evolution, the splicing signals and their corresponding splicing factors have evolved considerably, uniquely shaping the splicing mechanisms of different organisms.

Splicing of pre-mRNA is a key step in eukaryotic gene expression, contributing to gene regulation, protein diversity, and phenotypic complexity. Introns are removed from the pre-mRNA by the spliceosome, which is composed of five small nuclear ribonucleoproteins (snRNPs; U1, U2, U4, U5, and U6), each containing a small RNA bound by proteins. High-precision recognition of introns is required for correct splicing. This recognition is achieved by the binding of splicing factors to signals of varying specificity that are located both in the intron and its flanking exons (Hastings and Krainer 2001; Black 2003; Collins and Penny 2005). In vertebrates, three signals are known to direct splicing: The 5′ splice site (5′ss) at the 5′ end of the intron, the polypyrimidine tract/3′ splice site (PPT-3′ss) at the 3′ end of the intron, and a branch site (BS) upstream of the PPT-3′ss (Hastings and Krainer 2001; Black 2003). Spliceosome assembly is initiated by the binding of specific splicing factors to these signals: the U1 snRNP to the 5′ss, the protein SF1 to the BS, the U2 snRNP auxiliary factor U2AF large subunit (U2AF2; also known as U2AF65) to the PPT, and the U2AF small subunit (U2AF1; also known as U2AF35) to the 3′ss.
In an ensuing reaction, U2 snRNP associates with the pre-mRNA through a base-pairing interaction between U2 snRNA (small nuclear RNA) and the BS; and subsequent recruitment of the U4/U5/U6 tri-snRNP leads to the formation of the mature spliceosome (Kent et al. 2005).

Comparing the splicing signals among available genomes is of great interest, because these signals are also known to be regulators of alternative splicing (Cartegni et al. 2002). Reconstructing the evolution of these signals is therefore important for understanding when and how alternative splicing evolved. We have previously suggested an evolutionary process for the appearance of alternative splicing in which ancestral splicing signals that supported constitutive splicing accumulated mutations. These mutations suboptimized the splicing signals, allowing them to be used in alternative splicing as well (Ast 2004). This is of special interest, because alternative splicing is believed to have contributed to the creation of phenotypic complexity among higher eukaryotes by increasing transcriptional and proteomic diversity within a given genome (Graveley 2001). Nonetheless, only a few studies have attempted to characterize splicing signals across different organisms. A major drawback of these studies is that they are limited in their taxonomic sampling (e.g., Lim and Burge 2001; Bon et al. 2003; Kupfer et al. 2004; Abril et al. 2005; Sheth et al. 2006) and in terms of the splicing signals investigated (e.g., Irimia et al. 2007). Moreover, since different methods have been applied for analysis of the various signals, it is difficult to integrate the results from different studies. In some cases, these studies have even yielded contradictory results.
Such is the case, for example, regarding the PPT in Schizosaccharomyces pombe: Some studies maintain that this organism’s introns lack a PPT (Zhang and Marr 1994), while others maintain that its introns have a PPT (Kaufer and Potashkin 2000) and that it contributes to splicing (Romfo and Wise 1997). Splicing signals are variable. In Saccharomyces cerevisiae, for example, the 5′ss is characterized by a set of highly conserved nucleotides at the 5′ end of the intron that serve as a binding platform for U1 snRNP during an early step of splicing. However, in humans these positions are considerably less conserved. Similarly, although the BS in S. cerevisiae is a highly conserved heptamer (“TACTAAC”) that binds the U2 snRNP, this signal is much more degenerate among vertebrates (Kaufer and Potashkin 2000; Izquierdo and Valcarcel 2006). The PPT shows even greater variability: In higher eukaryotes, the PPT adjacent to the 3′ss is a clear and essential splicing signal (Moore 2000; Reed 2000; Banerjee et al. 2004), whereas in some fungi its very existence is controversial. What may underlie changes in splicing signals? Splicing signals serve as binding sites for splicing factors. Thus, changes in splicing signals may be linked with, or due to, corresponding changes in splicing factors. Different studies have examined various components of the spliceosome and found it to be highly conserved across evolution (Kaufer and Potashkin 2000; Anantharaman et al. 2002; Koonin et al. 2004; Collins and Penny 2005). However, no studies to date have attempted to correlate changes in the splicing signals, on the one hand, with the factors binding them, on the other. Shifts in terms of genome architecture and intron-exon structure may underlie changes in splicing signals as well. Intron lengths, for example, have changed dramatically across evolution (Aury et al. 2006). 
The number of introns per gene has also changed considerably, with introns being relatively scarce in fungi but much more abundant in vertebrates (Collins and Penny 2006). Splicing signals are influenced by such changes: In many organisms, for example, the strength of the splice sites correlates positively with intron length (Fields 1990; Kupfer et al. 2004; Weir and Rice 2004; Dewey et al. 2006).

The goal of this study was to characterize and analyze splicing signals and their corresponding splicing factors across the eukaryotic tree by employing a wide, systematic, comparative genomic approach to determine the extent to which changes in splicing signals can be attributed to complementary changes in splicing factors. We therefore compiled a data set of introns from 22 organisms, including organisms from each of the four major eukaryotic kingdoms: Plants, Protozoa, Fungi, and Metazoa. We adapted and developed a variety of algorithms for identifying and quantifying splicing signals in these introns. In parallel, we compiled data sets of the splicing factors that bind these signals during an early stage of spliceosome assembly. We found high variability in all splicing signals, often correlating with corresponding changes in the factors binding these signals. The most variable signal was the PPT: This signal is very weak among most fungi, intermediate in plants and protozoans, but gradually increasing in strength from invertebrates, to non-mammalian vertebrates, to mammals. This pattern correlated with changes in U2AF2, both in terms of domain conservation and in key residues that contact the PPT. Our results indicate that the three splicing signals underwent extensive changes during evolution, in parallel with considerable changes in terms of exon-intron architecture and domain structure of the splicing regulators.
These changes were presumably shaped by the lifestyle of the organism, selective pressure on maintaining multi-intron genes, and the need to support alternative splicing.

Results

Database compilation and global overview of genomes

We compiled a database of over 1.2 million introns from 22 fully sequenced organisms, including one plant, two protozoans, 12 fungi, and seven metazoans, based on the NCBI databases and GenBank annotations. These annotations were previously shown to be reliable sources for global analysis of splicing patterns (Collins and Penny 2006). We chose to include a relatively large number of fungi in our data set because, as a monophyletic group, fungi comprise both unicellular and multicellular organisms, making them good candidates for studying the changes in intron-exon structure and in splicing signals at the transition stage from unicellular to multicellular organisms. For the sake of simplicity, fungi were divided into two groups: Hemiascomycetous fungi, including all fungi between S. cerevisiae and Yarrowia lipolytica, and non-hemiascomycetous fungi, including the fungi from Neurospora crassa to Cryptococcus neoformans. This classification was based on the grouping pattern in many of our results. Notably, whereas hemiascomycetous fungi form a monophyletic group, non-hemiascomycetous fungi were paraphyletic, containing euascomycetes, archiascomycetes, and basidiomycetes.

Considerable changes in terms of exon-intron architecture

In terms of exon-intron architecture, considerable differences were found between the organisms (Table 1). Metazoans and the plant Arabidopsis thaliana were rich in introns, as reflected by the percentage of genes with introns and by the measure of intron density; hemiascomycetous fungi and the protozoan Cryptosporidium parvum were extremely intron poor; non-hemiascomycetous fungi and the protozoan Dictyostelium discoideum formed an intermediate group.
Variations were also observed with respect to intron length: Introns were found to be relatively short throughout eukaryotic evolution, with the exception of vertebrates. An opposite trend was observed with regard to exon length, where metazoans, but also the plant A. thaliana, had considerably shorter exons than most protozoans and fungi. These observations are all consistent with past findings (Ruskin et al. 1985; Zhang and Marr 1994; Deutsch and Long 1999; Bon et al. 2003; Aury et al. 2006). We could also confirm, using our large data set, that internal exons (those flanked by introns on both sides) tend to be considerably shorter than external exons (which lack an intron at one of their sides) (Chen et al. 2002). See Table 1 for further details.
Analysis of the 5′ splice site

Varying degrees of 5′ss conservation

Based on a preliminary analysis in which we found considerable nucleotide bias between positions −4 and +8 (the fourth position upstream of the 5′ss and the eighth position downstream from the 5′ss, respectively), we defined this 12-nucleotide (nt) region as the 5′ss (see also Lim and Burge 2001; Carmel et al. 2004). The most striking observation regarding the 5′ss was its varying degrees of conservation (Fig. 1
These trends were observed more clearly following quantification of the 5′ss signal by means of information content. Information content is a measure of sequence conservation, with high and low information content corresponding to high and low degrees of conservation, respectively (Hertz and Stormo 1999; Lim and Burge 2001). We divided the 5′ss region into exonic and intronic regions and calculated the information content in each of these regions (Fig. 2A
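As a rough illustration of the information content measure described above, the following minimal Python sketch sums, over the positions of a set of aligned sites, the relative entropy (in bits) of the observed nucleotide frequencies against a background distribution. This is a sketch under stated assumptions and not the authors' implementation: the function name `information_content`, the default uniform background, and the restriction to unambiguous A/C/G/T input are all assumptions made for illustration; the exact procedure used in the study is given in its Methods.

```python
import math
from collections import Counter

BASES = "ACGT"

def information_content(sites, background=None):
    """Total information content (bits) of equal-length aligned sites.

    Assumptions (not taken from the paper): sites contain only A/C/G/T,
    and the background defaults to a uniform 0.25 per nucleotide.
    """
    if background is None:
        background = {base: 0.25 for base in BASES}
    total_bits = 0.0
    for pos in range(len(sites[0])):
        # Nucleotide counts in this alignment column.
        counts = Counter(site[pos] for site in sites)
        for base in BASES:
            freq = counts[base] / len(sites)
            if freq > 0:  # terms with zero frequency contribute 0 bits
                total_bits += freq * math.log2(freq / background[base])
    return total_bits

# A perfectly conserved dinucleotide carries 2 bits per position.
print(information_content(["GT", "GT", "GT"]))
# A column with all four nucleotides equally frequent carries no information.
print(information_content(["A", "C", "G", "T"]))
```

Under these assumptions a fully conserved position contributes 2 bits and a position matching the background contributes 0 bits, so exonic and intronic parts of a signal can be compared by summing over the corresponding positions.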
In view of the low conservation levels in the exonic part of the 5′ss in fungi, we next performed a functional analysis of these positions by assessing whether adherence to the consensus nucleotides in these positions anti-correlated with adherence to the consensus nucleotides in positions in the intronic part of the 5′ss. We found that, similar to the situation in metazoans, such anti-correlations between positions −1 and −2 and different intronic positions exist, which is indicative of the functional importance of these positions. Among many organisms, positions +7 and +8 were involved in significant correlations as well, highlighting the importance of these positions (see Supplemental Material and Supplemental Fig. S3).

Reconstruction of 5′ss among early eukaryotic ancestors

We next sought to understand whether the 5′ss of early eukaryotes was conserved, as we had proposed in the past (Ast 2004), or degenerate. For this purpose, we developed a maximum parsimony-based algorithm that receives a set of position-specific scoring matrices (PSSMs) and an evolutionary tree as input and reconstructs the most parsimonious ancestral PSSM at each node (see Methods). Examining the reconstruction of the ancestral 5′ss (Fig. 1

Correlation between changes in U1 snRNA and in the 5′ss

Various fluctuations were observed when the consensus nucleotides for each of the 5′ss positions were compared among the different organisms. We were interested in assessing to what degree these fluctuations can be attributed to changes in the U1 snRNA sequence, which binds this signal. For this purpose, we compiled a data set of the U1 snRNA sequences in the various organisms. Using a battery of tools and algorithms (see Methods and Supplemental Material), we were able to identify the U1 snRNA sequence in 20 of the 22 species (Fig. 2B,C

Many changes in the 5′ss cannot be correlated with changes in U1 snRNA.
For example, the varying trends in conservation and in consensus nucleotides between position −1 and position +6 cannot be correlated with changes in U1 snRNA, as these seven positions remain unchanged throughout evolution. This underscores the importance of additional factors in determining the 5′ss. Nonetheless, various correlations between the 5′ss and U1 snRNA were found. These correlations involved predominantly positions −3 and +7, whose preferred nucleotide composition often correlated with the nucleotide binding this position in U1 snRNA (see Supplemental Material for complete analysis and discussion).

Analysis of the PPT-3′ss

PPT region

We analyzed the PPT and the 3′ss as two separate signals. The PPT analysis is presented in this section. The 3′ss analysis, involving the last four intronic positions and the first two exonic positions, is presented in Supplemental Material. The PPT is located between the BS and the 3′ss, in what we term the “PPT region.” The PPT region was defined as lying between two borders: The 3′ border was invariably set as position −5 (i.e., the fifth-to-last position within the intron), whereas the 5′ border of this region was set at the median distance between the termination of the BS and the 3′ss (see Results section “Analysis of Branch Site”). To globally assess whether PPTs exist in this region in various organisms, we first searched this region for bias in nucleotide composition. The bias of each nucleotide at each position relative to the background frequency was defined as the difference between the position-specific and the background frequency of each nucleotide. Positional bias plots, visualizing statistically significant bias of each nucleotide at each position, are shown in Figure 3
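The positional bias defined above (position-specific frequency minus background frequency) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `positional_bias` and the input layout (equal-length, pre-aligned PPT-region sequences plus a dictionary of background frequencies) are assumptions, and the test of statistical significance that the plots rely on is not reproduced here.

```python
from collections import Counter

BASES = "ACGT"

def positional_bias(region_seqs, background):
    """Per-position bias of each nucleotide relative to the background.

    bias[pos][base] = frequency of `base` at `pos` - background[base].
    Assumes equal-length sequences aligned relative to the 3' splice site
    (a hypothetical input layout chosen for illustration).
    """
    n_seqs = len(region_seqs)
    bias = []
    for pos in range(len(region_seqs[0])):
        counts = Counter(seq[pos] for seq in region_seqs)
        bias.append({base: counts[base] / n_seqs - background[base]
                     for base in BASES})
    return bias

uniform = {base: 0.25 for base in BASES}
# First column is all T (positive T bias); second column matches the background.
for column in positional_bias(["TT", "TC", "TA", "TG"], uniform):
    print(column)
```

A positive value indicates enrichment of a nucleotide at that position and a negative value indicates depletion; in practice the background frequencies would be estimated per organism rather than assumed uniform.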
Particularly interesting observations regarding the PPT region were made in metazoans, excluding Caenorhabditis elegans. In these organisms:
Comparative analysis of the PPT

We next analyzed the PPT directly. For this purpose, we developed an algorithm that identifies stretches of pyrimidines and scores them based on their length and pyrimidine content. This is a good measure of PPT strength because longer, pyrimidine-rich PPTs are more efficient than shorter, purine-rich ones in directing splicing (Roscigno et al. 1993). We applied this algorithm to the last 50 nt of all introns and accepted a stretch as a putative PPT only if it ended within 10 nt of the 3′ss. This value was set because this is the maximal distance from the 3′ss in which PPTs were shown to be functional (Coolidge et al. 1997; Kol et al. 2005). Finally, a “PPT enrichment index” was calculated for each organism. This index was defined as the quotient of the mean PPT strength in introns of the particular organism divided by mean PPT strength in a randomized data set. The random data set corresponded to the original data set in terms of nucleotide composition and incorporated the BS and 3′ss signals in order to take into account the bias introduced by these two signals. Thus, the “PPT enrichment index” represents the fold-change in the mean PPT score in the original data set relative to the random one (see Methods). Results of this analysis are plotted in Figure 4A
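To make the scoring and enrichment steps described above concrete, here is a minimal Python sketch. The scoring scheme is a stand-in chosen for illustration, not the algorithm from the study's Methods: each pyrimidine adds 1, each purine subtracts a hypothetical `purine_penalty`, a stretch must start and end on a pyrimidine, and "within 10 nt of the 3′ss" is interpreted as at most 10 nt between the last base of the stretch and the intron end. Only the 50-nt window, the 10-nt limit, and the definition of the enrichment index as a ratio of means come from the text.

```python
PYRIMIDINES = set("CT")

def best_ppt_score(intron, window=50, max_gap=10, purine_penalty=2):
    """Best-scoring pyrimidine-rich stretch near the 3' end of an intron.

    Hypothetical scoring (an assumption, not the published scheme):
    +1 per pyrimidine, -purine_penalty per purine. Returns 0 if no
    qualifying stretch is found.
    """
    tail = intron[-window:].upper()
    n = len(tail)
    best = 0
    for end in range(n, 0, -1):       # `end` is an exclusive index
        if n - end > max_gap:         # stretch would end too far from the 3'ss
            break
        if tail[end - 1] not in PYRIMIDINES:
            continue                  # stretches must end on a pyrimidine
        score = 0
        for start in range(end - 1, -1, -1):
            if tail[start] in PYRIMIDINES:
                score += 1
                best = max(best, score)   # and start on a pyrimidine
            else:
                score -= purine_penalty
    return best

def ppt_enrichment_index(real_introns, random_introns):
    """Mean PPT score of real introns divided by that of randomized ones."""
    real_mean = sum(map(best_ppt_score, real_introns)) / len(real_introns)
    random_mean = sum(map(best_ppt_score, random_introns)) / len(random_introns)
    return real_mean / random_mean    # undefined if the random mean is 0

print(best_ppt_score("GGGGTTTTTTTTAG"))
print(ppt_enrichment_index(["GGGGTTTTTTTTAG"], ["GGGGGGGGTTTTAG"]))
```

In this sketch the randomized sequences are supplied by the caller; generating a composition-matched random set that also incorporates the BS and 3′ss signals, as the text describes, is left out.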
A further observation from Figure 4A
PPT composition

Different analyses corroborated our findings regarding the compositional biases of the PPT. These analyses confirmed a general bias for “T” in the PPT, a gradual enrichment in the “C” content of the PPT when moving from nonvertebrates through vertebrates to mammals, and a general bias against “A.” In addition, these analyses confirmed that different nucleotide biases are present in the PPT in the positions preceding and following position −10. These analyses are fully presented and described in Supplemental Material.

Correlation between changes in the PPT and in factors binding it

We next set out to determine to what extent changes in the PPT were determined by corresponding changes in the splicing factors that bind the 3′ end of introns during early stages of splicing. Specifically, we focused on U2AF2 and U2AF1, which recognize the PPT and the 3′ss, respectively (Zamore and Green 1989; Kent et al. 2005), and on SF1, which binds the BS and facilitates the binding of U2AF2 to the adjacent PPT (Manceau et al. 2006). We moreover conducted an in silico functional analysis of the identified proteins: We concentrated on known functional residues in these proteins, including regions that are important for RNA binding, as well as residues that are important for interactions with other splicing factors, and predicted to what extent these residues have undergone changes compromising their functionality. The functional domains of these proteins and the full analysis are described in the Supplemental Material. Here we report the highlights of our findings.

For these three proteins, the 22 organisms can be divided into three groups: human-like (15 organisms), S. cerevisiae-like (four organisms), and human-reminiscent (three organisms). The human-like group comprises all sampled metazoans, plants, non-hemiascomycetous fungi, and the protozoan D. discoideum.
The organisms in this group have functional homologs of all three proteins and we therefore concluded that in these organisms the 3′ intron end recognition is likely to take place as it does in human, with the U2AF heterodimer interacting with SF1 and with the PPT. This correlates with the fact that a PPT signal was observed in all these organisms. The S. cerevisiae-like group comprises S. cerevisiae, Candida glabrata, Eremothecium gossypii, and K. lactis. These organisms all lack functional copies of U2AF and instead contain homologs of MUD2, which is the analog of U2AF in S. cerevisiae. Therefore, 3′ss recognition in these organisms presumably follows the pattern of S. cerevisiae. The human-reminiscent group includes the two hemiascomycetes fungi Y. lipolytica and Debaryomyces hansenii and the protozoan C. parvum. In the two hemiascomycetes, fully functional SF1 and U2AF1 homologs were found. However, while the U2AF2 homolog has retained its ability of binding U2AF1 and SF1, it appears to have lost the ability to bind the PPT due to mutations in regions responsible for binding the PPT. We concluded that in this group the recognition of the 3′ss and BS is likely performed by U2AF1 and SF1, respectively, with U2AF2 functioning as a bridge between them. This is supported by the lack of PPT in these two organisms. In C. parvum, various mutations were found in all three proteins, suggestive of a 3′ss recognition mechanism considerably divergent from human. We next focused on the three RNA recognition motifs (RRMs) of U2AF2, two of which (RRM1 and RRM2) bind the PPT and the third, called U2AF homology motif (UHM), mediates the interaction between SF1 and U2AF2. Comparing the RRMs of U2AF2 of the different species to the corresponding human RRM (Fig. 5A
While the above results suggest that the PPT coevolved with RRMs that bind it, the decreased conservation may also reflect increased phylogenetic distances. To assess the functional importance of the decreased conservation, we focused on specific, key residues in RRM1 and RRM2 that have previously been shown to be required for PPT binding in human (Sickmier et al. 2006). These included residues participating in main-chain, side-chain, and water-mediated interactions (Sickmier et al. 2006). The characteristics of these residues, in terms of polarity, charge, and aromaticity, are therefore important for U2AF2 binding to the PPT: A change in polarity will affect the water-mediated interactions, whereas any change in charge, polarity, or aromaticity is expected to affect the side-chain interactions. Among non-hemiascomycetous fungi, we identified many such changes, with respect to metazoans, in key residues both in RRM1 (Fig. 5B

Finally, we found that, in general, the UHM of the U2AF2 homologs has higher sequence similarity to the RRM domain of MUD2 than to the RRM1 and RRM2 domains (see Supplemental Fig. S17). In fact, the UHM from U2AF2 and the RRM from MUD2 present similar RNP motifs (see Supplemental Fig. S13). Moreover, the UHM in C. parvum represents an intermediate between the RRM of MUD2 and the UHM of U2AF2 in other organisms in terms of sequence conservation (see Supplemental Fig. S18). This provides further evidence for a common evolutionary history for U2AF2 and MUD2 (see also Abovich et al. 1994).

Analysis of branch site

To examine the BS signal, we developed a simple algorithm that extracts the putative BS motifs from the introns of the various organisms. The algorithm is based on the BS characteristics of two model yeasts, S. cerevisiae and S. pombe, as well as on general characteristics of the hemiascomycetous BS, as found by Bon et al. (2003) (see Methods).
We validated our results by comparing our identified BSs with those identified by algorithms that have been used in the past (Kupfer et al. 2004; Kol et al. 2005) and against a set of biologically proven BSs in human (see Supplemental Material). The stringent requirements of the algorithm yielded satisfactory results, albeit at the price of (1) discarding a relatively large percentage of introns in which it cannot decide between two putative BSs, and (2) yielding branch site motifs that tend to appear somewhat more conserved than they actually are (see Supplemental Material).

Branch site motifs and ancestral reconstruction

Branch motifs identified by the algorithm for representative organisms are shown in Figure 6A

Based on the degree of conservation in the BS (Fig. 6B

Based on the motifs of the various organisms, we reconstructed the ancestral BS motifs using the same maximum parsimony-based algorithm for reconstructing ancestral position-specific scoring matrices that was used for the reconstruction of the 5′ss. The ancestral BS at the root of the tree contained only 6.5 bits of information. This result is in agreement with the observation that highly conserved BSs were found only in one monophyletic group, the hemiascomycetous fungi, whereas among all other groups the BS was much more degenerate.

Correlation between changes in the BS and in U2 snRNA

To test whether changes in the BS motifs are correlated with changes in U2 snRNA, we compiled a data set of U2 snRNAs from the different organisms. Multiple sequence alignments of the regions binding the BS can be viewed in Supplemental Figure S2. The six positions of U2 snRNA that bind the BS were unchanged among all organisms. Therefore, variation in U2 snRNA cannot account for the variation in the BS motifs among the different organisms.
However, we observed that for various species the bias around the BS signals extends beyond the BS heptamer, possibly reflecting an extended region of base-pairing with U2 snRNA. The hemiascomycetous fungus E. gossypii presented a clear example of such extended base-pairing (Fig. 6C

Finally, we found that Y. lipolytica has a unique BS distance distribution. Although this organism has relatively long introns, the BSs were almost invariably found at the same location, immediately upstream of the 3′ss: Of the 709 introns in which BSs were identified in this organism, 560 (79%) are located at position −11 and 686 (97%) are located between positions −10 and −14. Thus, in Y. lipolytica, the BS and 3′ss form one consecutive stretch. This finding, along with the observation of the C-rich region upstream of the PPT (Fig. 6D

Discussion

In this study, the three major splicing signals were analyzed and compared across a wide array of eukaryotes. Our analyses have yielded several major findings. The first pertains to the high variability of the PPT signal. Although there is a certain bias for pyrimidines toward the 3′ end of the intron among all organisms, this signal is very weak among most fungi, stronger in plants and in protozoans, but by far strongest among metazoans. Moreover, among the latter group this signal appears to have evolved in terms of its nucleotide composition, length, and abundance. Second, we found various correlations between variations in splicing signals and in splicing factors: Such correlations were found between the 5′ss and U1 snRNA, as well as between the PPT-3′ss signal and the factors binding it. These findings highlight the importance of these splicing factors in determining splicing signals, but also underscore the importance of other factors in shaping them.
Our final major finding pertains to the relatively low conservation of splicing signals at the root of the eukaryotic tree, based on a maximum parsimony reconstruction applied to the 5′ss and the BS.

Evolution of the PPT

The evolution of the PPT splicing signal and its functional importance in eukaryotes have been poorly addressed by previous studies. By using comparative genomics to analyze this signal, we were able to ask whether a PPT exists in an organism, to what extent, and how it compares with PPT signals found in vertebrates. Although a bias toward pyrimidines was detected just upstream of the 3′ss among most organisms, this bias was intermediate among plants and protozoans, very low, but existent, among most fungi, and very high among metazoans.

Particularly interesting trends were observed in the metazoan PPT. First, we found a gradual increase in PPT strength along the metazoan lineages. In C. elegans the detected PPTs are very short and stem from a highly conserved “TTTTCAG” heptamer at the 3′ intron end. The “TTTTC” sequence has been shown to cross-link with U2AF2 (Zorio and Blumenthal 1999) and to be critical for 3′ss recognition (Hollins et al. 2005), while the “AG” interacts with U2AF1 (Zorio and Blumenthal 1999). Therefore, this heptamer appears to be a very reduced version of the PPT combined with the 3′ss. As the phylogenetic distance from human decreases, the PPT gradually increases in strength, length, and abundance. This gradual increase was apparent predominantly in D. melanogaster, zebrafish, and chicken; among the three mammals, the PPT strength is similar. These results extend the results of a previous study, which examined the last 20 nt within introns and found them homogeneous within tetrapoda but noticeably distinct from those of other vertebrate and invertebrate taxa (Abril et al. 2005).
Using different approaches for analyzing the PPT, we consistently observed two different signals within the region just upstream of the 3′ss: A signal more biased in “T”, located toward the 5′ end of the PPT, and a more “C”-rich signal, located at the 3′ end of the PPT. Position −10, relative to the 3′ss, appears to serve as a key position, with the “T” signal peaking and the “C” signal falling at this position. What mechanism may underlie these two signals? One possibility is that they serve as binding regions for the RNA recognition motifs of U2AF2. It has previously been shown that RRM1 and RRM2 bind different regions along the PPT in vertebrates: RRM2 binds the more 5′ region of the PPT, whereas RRM1 binds the more 3′ region (Banerjee et al. 2003). Thus, the two signals may reflect the differential binding affinities of the two RRMs. In addition, the invariable “T” peak at position −10 may reflect function, as the maximal distance from the 3′ss in which PPTs were shown to be functional is 10 nt (Coolidge et al. 1997; Kol et al. 2005). We further observed a gradual increase in the bias toward “C” and an increase in the length of the region with this bias from C. elegans to human: PPTs among lower metazoans are biased for “T”s with a limited “C” signal at the 3′ end of the PPT only, whereas among higher metazoans the “C” signal was more widespread throughout the entire PPT. Differential preferences of RRMs cannot explain these observations since the RRMs have remained highly conserved along these lineages. These findings may therefore reflect a gradual increase, or shift, in selective pressure exerted by factors other than U2AF2 that bind to the PPT. These factors may require a more “C”-rich PPT nucleotide composition. NOVA and PTB are examples of two such potential splicing factors. The consensus binding sequences of PTB are “UCUUC” and “CUCUCU” (Spellman and Smith 2006) and that of NOVA is “YCAY” (with Y representing pyrimidines) (Ule et al. 2006). 
Thus, the increase in “C” may reflect the growing importance of such factors in splicing regulation among higher metazoans. Finally, with regard to the PPT composition, we noted two general phenomena: a preference for “T” over “C” and a preference for “G” over “A”. The preference for “T” over “C” is explained by past findings that “T”s and “C”s do not function equivalently within PPTs, with consecutive “T”s constituting the strongest PPT (Bouck et al. 1995; Coolidge et al. 1997). The bias against “A” may be due to a tendency to avoid the formation of a cryptic branch point (Ruskin et al. 1985; Kol et al. 2005). With “A” being the preferred branch point, there may be a selection against this nucleotide at positions following the true branch point. Correlation between changes in the PPT and in U2AF2 We found that changes in the PPT signal correlate with changes in U2AF2. First, among all organisms in which the U2AF heterodimer was found to have retained its functionality, we found statistically significant PPTs. Second, the inability of U2AF2 to bind the PPT in D. hansenii and Y. lipolytica corresponds with the lack of PPT in these organisms. Third, the gradual decrease in the conservation of the RNA binding domains in U2AF2, with respect to human, correlates with the gradual decrease in strength of the PPT. Fourth, we found that various residues known to be critical for U2AF2 binding in metazoans are not conserved among fungi, but are conserved to a greater degree in D. discoideum and A. thaliana. This pattern corresponds to the pattern of PPT strengths among the various organisms, suggesting that these key residues are indeed related to the strength of the PPT. However, the correlation between changes in U2AF2 and in the PPT is not perfect. Similar PPT enrichment indexes were found among the non-hemiascomycetous fungi and yet the pattern of substitutions in key residues in the RRMs of these fungi differed considerably. For example, the non-hemiascomycetous fungi C. 
neoformans and U. maydis exhibited fewer changes in the RRMs than other fungi, and yet share similar PPT strengths. In addition, in E. gossypii, C. glabrata, K. lactis, and S. cerevisiae, we would naively expect to find PPTs, as all contain S. cerevisiae-like factors. Yet, the first two organisms do not appear to have PPTs, although the latter two do. Finally, we observed that RRM2 was the most conserved RRM among all organisms with U2AF2 homologs. This indicates that RRM2 may be the dominant RRM in terms of recognizing the PPT, in particular in non-metazoans. A factor that may have influenced the increased dominance of RRM2 is the fact that the size of the cross-linking site for RRM2 is much more variable than that of other RRMs (Banerjee et al. 2003).

Analysis and reconstruction of the 5′ss and BS

We previously suggested that the ancestral eukaryote was characterized by strong splice sites and little or no alternative splicing and that the weakening of the splicing signals brought forth the rise of alternative splicing in multicellular eukaryotes (Ast 2004). This hypothesis was based on the comparison of the human 5′ss to those of two yeasts. Based on a much broader taxonomic sampling, our current data suggest that the increased conservation of splicing signals is a phenomenon mostly limited to hemiascomycetous yeasts, whereas among most organisms the splicing signals are degenerate, probably reflecting the ancestral state. In this respect, our results agree with, and extend, recent results obtained by Irimia et al. (2007): After comparing the 5′ss of many eukaryotes, these authors suggest that the ancestral eukaryotic 5′ss was degenerate. Here, we have developed a maximum parsimony algorithm in order to quantify evolutionary shifts in splice site motifs. Our results support the degeneracy of the ancestral 5′ss and suggest that the ancestral BS was degenerate as well. We note, however, that our analysis includes only one plant species.
Increasing the taxonomic sampling of deep-branching organisms would increase the accuracy of the ancestral splicing motif reconstruction.

Splicing in the eukaryotic ancestor

Past studies concluded that the early eukaryotic ancestors were relatively rich in introns and that their genes contained relatively high intron densities (Rogozin et al. 2003; Nguyen et al. 2005; Raible et al. 2005; Roy and Gilbert 2005a, b; Sverdlov et al. 2005; Carmel et al. 2007). This observation, combined with our finding that splicing signals are degenerate, may suggest that the genome of the ancestral eukaryote was more similar to the mammalian genome than previously anticipated and opens up the possibility that alternative splicing existed early in eukaryotic evolution. However, this conclusion must be tempered by a further observation emerging from this study, namely that there is no direct correlation between degenerate splicing signals and levels of alternative splicing. This can be demonstrated by the fact that various organisms with very little, or no, known alternative splicing in the form of exon skipping, such as S. pombe, N. crassa, M. grisea, and D. discoideum (Ast 2004), have splicing signals that are as degenerate as those of metazoans, such as human and mouse, with very high levels of alternative splicing (Kim et al. 2007).

Our findings imply that the eukaryotic ancestor resembled extant vertebrates in terms of splicing factors, as well. We found that the U2AF heterodimer and SF1 are conserved in most of the analyzed organisms. This indicates that these splicing factors existed in the early eukaryotic ancestor. In support of this hypothesis, we also found that the RRM domain of MUD2 resembles the UHM domain, raising the possibility that MUD2 and U2AF2 have a common origin.
The ancestral U2AF2 presumably had a domain organization similar to U2AF2 and has evolved differently along various lineages: In hemiascomycetous fungi, the UHM domain became dominant and other domains were lost, whereas among other organisms RRM2 became more dominant. The residual similarities between RRM domains and the existence of possible pseudogenes for some of the splicing factors serve as further evidence that the changes in the splicing regulatory proteins have taken place gradually and have occurred independently in different organisms. In E. gossypii, for example, a U2AF1 homolog was found, but it appears to have lost its function. The splicing factors in the three organisms in the human-reminiscent group show similarity to, but a considerable functional divergence from, their human counterpart, serving as further evidence of their gradual evolution. Thus, our results suggest that in terms of both splicing signals and splicing factors, the eukaryotic ancestor resembled extant vertebrates. This is in agreement with results of previous studies that have reached similar conclusions both regarding the splicing signals (Irimia et al. 2007) and the spliceosome components (Kaufer and Potashkin 2000; Anantharaman et al. 2002; Koonin et al. 2004; Collins and Penny 2005).

Splicing in Y. lipolytica

Several unique features were observed in Y. lipolytica. Most prominently, we found that in this organism the BS and 3′ss form one combined sequence. It is noteworthy that a similar, 12-nt BS–3′ss juxtaposition was observed in the two deep-branching eukaryotes Trichomonas vaginalis and Giardia lamblia (Vanacova et al. 2005). With regard to these two organisms, it has been hypothesized that the juxtaposition of the BS and the 3′ss reflects a simplified spliceosomal assembly, combining the two steps of BS and 3′ss recognition (Vanacova et al. 2005). Our observation that the RRMs of U2AF2 are not conserved and apparently nonfunctional in Y.
lipolytica sheds a somewhat different light on the BS–3′ss juxtaposition. Our findings suggest that U2AF2, which retained its capability to bind SF1 and U2AF1, may serve as a bridge between the two molecules, without binding the pre-mRNA. Alternatively, U2AF2 may have lost its functionality completely in Y. lipolytica. Is U2AF2 loss, or modification of function, responsible for the BS–3′ss juxtapositions among other organisms as well? This analysis could not be carried out for G. lamblia as its genome sequence is not sufficiently complete to perform the homology search. However, an analysis of the splicing factors in T. vaginalis revealed a situation very similar to the one in Y. lipolytica. We were able to identify functional homologs for SF1 and U2AF1, but U2AF2 was found to be very divergent from the human protein. Additionally, the U2AF2 in T. vaginalis has only a UHM functional domain and no arginine-rich region, similar to the one we found in C. parvum. Thus, in this organism, too, the BS–3′ss juxtaposition appears to be coupled with modifications in the function of U2AF2, as in Y. lipolytica. In Y. lipolytica we also found a clear, C-rich signal upstream of the BS. The composition, location, and lack of functional U2AF2 indicate that this signal is not a classic PPT. What may underlie this signal? Previous studies have indicated that initial binding of U2 snRNP to the BS region must be stabilized by an interaction with an anchoring site, located upstream of the BS (Gozani et al. 1996; Kramer 1996; Ast et al. 2001). Thus, this signal may serve as such an anchoring site. Overall, our results indicate that the three major splicing signals have changed considerably throughout evolution, in parallel with shifts in exon-intron architecture and in concert with domain structure and key residues of splicing regulators. 
These components interact with and impact on each other and, though they are presumably determined by the lifestyle of organisms, they also, in turn, help determine an organism’s lifestyle by means of diversifying mechanisms such as alternative splicing.

Methods

Database assembly

The complete genomes of C. parvum (Build 1.1), D. discoideum (Build 2.1), S. cerevisiae (Build 2.1), C. glabrata (Build 1.1), K. lactis (Build 1.1), E. gossypii (Build 1.1), D. hansenii (Build 1.1), Y. lipolytica (Build 1.1), N. crassa (Build 1.1), M. grisea (Build 1.1), A. fumigatus (Build 1.1), S. pombe (Build 1.1), U. maydis (Build 1.1), and C. neoformans (Build 1.1) were downloaded from the National Center for Biotechnology Information website (http://www.ncbi.nlm.nih.gov/). Gene, intron, and exon information was extracted from the annotated genomes using a BioPerl script. This script reads a sequence of annotated GenBank files and generates database records of exons and introns, along with statistical information pertaining to the exon-intron composition, the individual genes, and the entire genome. To allow analysis of the 5′ss and 3′ss, introns were extracted along with the last 15 nt of the upstream exon and the first 10 nt of the downstream exon. Introns and exons for human (Homo sapiens, Build 35.4), mouse (Mus musculus, Build 34.1), dog (Canis familiaris, Build 2.1), chicken (Gallus gallus, Build 1.1), zebrafish (Danio rerio, release Zv4), C. elegans (release 2003), D. melanogaster (Build 4.1), and A. thaliana (release 2004) were extracted from the Exon-Intron Database (http://hsc.utoledo.edu/depts/bioinfo/database.html) (Saxonov et al. 2000), along with the upstream and downstream regions of the flanking exons. In cases of different alternatively spliced isoforms of the same gene, only the first annotated isoform was extracted; all other isoforms were excluded, in order to avoid redundancy.
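The extraction of introns together with their flanking exon context, as described above, can be illustrated with a minimal Python sketch. It is not the BioPerl script used in the study: the function name `extract_introns`, the 0-based end-exclusive coordinate convention, and the record layout are all assumptions made for illustration, and only forward-strand genes are handled.

```python
from typing import Dict, List, Tuple


def extract_introns(seq: str,
                    exons: List[Tuple[int, int]],
                    upstream: int = 15,
                    downstream: int = 10) -> List[Dict[str, str]]:
    """Return every intron of one gene with its flanking exon context.

    `exons` holds 0-based, end-exclusive (start, end) pairs sorted by start
    (an assumed convention). Flanks are clipped to the exon boundaries, so a
    short exon yields a shorter flank rather than sequence from elsewhere.
    """
    records = []
    for (up_start, up_end), (down_start, down_end) in zip(exons, exons[1:]):
        records.append({
            # The intron is everything between two consecutive exons.
            "intron": seq[up_end:down_start],
            # Last `upstream` nt of the upstream exon (15 nt in the text).
            "upstream_exon": seq[max(up_start, up_end - upstream):up_end],
            # First `downstream` nt of the downstream exon (10 nt in the text).
            "downstream_exon": seq[down_start:min(down_end, down_start + downstream)],
        })
    return records
```

A record produced this way carries enough context to examine both the 5′ss (end of the upstream exon plus start of the intron) and the 3′ss (end of the intron plus start of the downstream exon).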
Three filters were applied to the original data set: (1) all non-GT–AG introns were discarded, (2) all introns <15 nt were discarded, and (3) U12 introns were discarded. U12 introns were identified based on conformance with position-specific scoring matrices of their 5′ss and BSs. A 20-position matrix representing the U12 5′ss and a 12-position matrix representing the BS were obtained from Levine and Durbin (2001). BSs were searched in the last 38 positions of the introns, as in Sheth et al. (2006). Any sequence whose log-odds scores for the 5′ss and the BS both exceeded empirically derived thresholds of −43.54 and −19.73, respectively, was considered a potential U12 intron. These empirical thresholds were designed to include 100% of the 5′ss and 99% of the BSs in the database of U12 introns compiled by Levine and Durbin (2001). See Supplemental Table S1 for details on the number of introns filtered out at each step.

Position-specific scoring matrices (PSSMs) and scoring

Once a data set of 5′ss and BS was compiled for each organism, PSSMs containing the frequency of each nucleotide at each position were generated. Scores were assigned, both to the 5′ss and the BS of each intron, based on their adherence to their respective PSSMs. This score (S) is calculated as follows:
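For a concrete picture of PSSM-based scoring, the following minimal sketch assumes the conventional log-odds form, in which the score of a site is the sum over positions of log2 of the position-specific frequency divided by the background frequency. The exact form of the score, the pseudocount, and the function names are assumptions made for illustration; only the two U12 thresholds are taken from the text.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def build_pssm(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies from equal-length aligned sites.

    The pseudocount (an assumption) keeps every frequency above zero so that
    the logarithm in the score is always defined.
    """
    total = len(sites) + pseudocount * len(NUCLEOTIDES)
    pssm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pssm.append({n: (counts[n] + pseudocount) / total for n in NUCLEOTIDES})
    return pssm


def log_odds_score(site, pssm, background):
    """Sum over positions of log2(position frequency / background frequency)."""
    return sum(math.log2(pssm[i][n] / background[n]) for i, n in enumerate(site))


def is_potential_u12(score_5ss, score_bs,
                     threshold_5ss=-43.54, threshold_bs=-19.73):
    """Flag an intron when both U12 scores exceed the thresholds from the text."""
    return score_5ss > threshold_5ss and score_bs > threshold_bs
```

Under this form, a site that matches the most frequent nucleotide at every position receives the highest score, and a PSSM identical to the background gives a score of zero for any site.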
Background frequencies

Intronic background frequencies were calculated by separately pooling all introns of each organism and calculating the relative frequency of each nucleotide. See Supplemental Table S1 for a list of the intronic background frequencies in each organism.

Information content

Information content (I) is a measure of sequence conservation, with high and low information content corresponding to high and low degrees of conservation, respectively. It is measured as follows:
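One common way to quantify information content is the relative entropy of each PSSM position with respect to the background, in bits. The sketch below assumes that form, which may differ from the measure used in the study, and pairs it with the pooled background calculation described above; the function names are hypothetical.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def background_frequencies(introns):
    """Pool all introns of one organism and return relative nucleotide frequencies."""
    counts = Counter()
    for intron in introns:
        # Ambiguity codes such as N are ignored (an assumption).
        counts.update(c for c in intron.upper() if c in NUCLEOTIDES)
    total = sum(counts.values())
    return {n: counts[n] / total for n in NUCLEOTIDES}


def information_content(column, background):
    """Relative entropy (bits) of one PSSM position against the background.

    Terms with zero frequency contribute nothing, following the usual
    convention that 0 * log(0) = 0.
    """
    return sum(f * math.log2(f / background[n])
               for n, f in column.items() if f > 0)


def pssm_information(pssm, background):
    """Total information content of a motif, assuming independent positions."""
    return sum(information_content(column, background) for column in pssm)
```

With a uniform background, a fully conserved position scores 2 bits and a position that matches the background scores 0 bits, which agrees with the reading of high and low information content as high and low conservation.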
Pictograms

Graphical representations of PSSMs were composed by the PICTOGRAM application developed by Burge et al. (1999), downloaded from http://hollywood.mit.edu/burgelab/software. The height of each letter is proportional to the frequency of the corresponding base at the given position, and bases are listed in descending order of frequency from top to bottom.

Ancestral 5′ss and BS reconstruction

The ancestral reconstruction of the 5′ss and BS PSSMs was performed using the maximum parsimony paradigm. Specifically, a PSSM was assigned to each tree node so that the overall change in the PSSMs along the tree branches was minimized. We assume independence among PSSM positions, and in the following we formally describe the algorithm for reconstructing ancestral PSSM at a single position. We first define a distance between two PSSMs at a specific position. Let f(X) be the frequency of nucleotide X (f(A) + f(C) + f(G) + f(T) = 1). The distance between two PSSMs, x and y, at that position is defined as:
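To make the parsimony objective concrete, the sketch below evaluates the total change along the branches of a tree for one PSSM position, given frequencies at every node. The distance used here, the sum of absolute frequency differences, is only a stand-in chosen for illustration and is not the definition from the study; the function names and the parent-map tree representation are likewise assumptions. The sketch evaluates a candidate assignment and does not search for the optimal one.

```python
NUCLEOTIDES = "ACGT"


def column_distance(x, y):
    """Stand-in distance between two frequency columns (an assumption):
    the sum of absolute differences over the four nucleotides."""
    return sum(abs(x[n] - y[n]) for n in NUCLEOTIDES)


def tree_cost(parent_of, columns):
    """Total change along all branches for one PSSM position.

    `parent_of` maps every non-root node to its parent, and `columns` maps
    every node (leaf or internal) to its nucleotide frequencies. A maximum
    parsimony reconstruction seeks the internal-node columns that make this
    sum as small as possible.
    """
    return sum(column_distance(columns[child], columns[parent])
               for child, parent in parent_of.items())
```

Comparing `tree_cost` for alternative ancestral columns shows which candidate requires less overall change. Because positions are assumed independent, the same evaluation can be repeated for each position of the motif.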
Positional bias plots

To provide graphical representation of bias in nucleotide composition at given positions with respect to expected background frequencies, we used positional bias plots. These plots visualize the bias of each nucleotide at each position relative to the background frequency, by showing the difference (Δ) between the position-specific and the background frequency. Only statistically significant biases (P < 0.01) are shown. Statistical significance was determined by performing χ2 tests between the observed frequency of each nucleotide at each position and the expected background frequency. The y-axis presents the extent of the bias: For example, assuming a 25% background frequency of “C”, a value of 0.1 for “C” at a given position indicates that at this position the frequency of “C” is 35%, while a value of −0.1 would indicate that “C” appears at a frequency of only 15% at that position.

BS detection

In the case of S. cerevisiae, the BS consensus sequence is a highly conserved heptamer composed of “TACTAAC”, but with large variability in its location, appearing as far as >100 nt upstream of the 3′ss (Pikielny et al. 1983; Langford et al. 1984; Fouser and Friesen 1987). Among other hemiascomycetous yeasts, the first two positions of the BS are less conserved (Bon et al. 2003). In the case of S. pombe, the BS sequence is more degenerate, based on different variations of the NNYTRAY sequence, where Y stands for pyrimidines, R for purines, and N for any nucleotide, but with a greater tendency to appear close to the 3′ss. Thus, to detect the BS, we implemented the following algorithm (“find_bs.pl”, available upon request):
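As a rough illustration of motif-based BS detection, the sketch below lists every match of a degenerate consensus such as NNYTRAY or TACTAAC in an intron, together with its distance from the 3′ end, and accepts a BS only when exactly one candidate exists. It does not reproduce find_bs.pl: the function names, the regular-expression approach, and the strict uniqueness rule (a simple stand-in for requiring that a detected BS has no serious competitors) are assumptions.

```python
import re

# IUPAC codes used in the text: Y = pyrimidine, R = purine, N = any nucleotide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def motif_to_regex(motif):
    """Compile a degenerate motif; the lookahead reports overlapping matches too."""
    return re.compile("(?=(" + "".join(IUPAC[c] for c in motif) + "))")


def find_bs_candidates(intron, motif="NNYTRAY"):
    """Return (distance_to_intron_end, matched_sequence) for every motif match.

    The distance counts the nucleotides between the end of the match and the
    3' end of the intron.
    """
    seq = intron.upper()
    pattern = motif_to_regex(motif)
    return [(len(seq) - (m.start() + len(motif)), m.group(1))
            for m in pattern.finditer(seq)]


def unique_bs(intron, motif="TACTAAC"):
    """Return the single candidate, or None when there is none or more than one."""
    hits = find_bs_candidates(intron, motif)
    return hits[0] if len(hits) == 1 else None
```

Discarding introns with several candidates lowers the yield but leaves only unambiguous calls, which is the same trade-off described for the filtering step of the original algorithm.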
Although step 3 discards a relatively large percentage of introns, it guarantees that those BSs that are detected have no “serious” competitors, thus reducing the false-positive rate. Supplemental Table S1 summarizes the percentage of introns in which BSs were identified.

Identification of PPT

We developed an algorithm to find sequence segments in which pyrimidines are statistically enriched. We considered not only consecutive stretches but also “gapped” stretches, i.e., stretches that include non-pyrimidines. Each stretch was assigned a score quantifying its pyrimidine enrichment. The score was calculated as the χ2 test score with 1 degree of freedom, comparing the observed number of pyrimidines with the expected one, assuming a uniform nucleotide distribution (the inter-species nucleotide content heterogeneity was subsequently accounted for by using randomized data sets; see below). Having defined a score for each segment, the optimal segment was searched based on the following algorithm:
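The χ2 enrichment score can be sketched as follows, assuming that a uniform nucleotide distribution means half of the positions in a segment are expected to be pyrimidines. The exhaustive search over all segments is a simple stand-in for the search procedure of the study, and the function names, the requirement that a segment start and end with a pyrimidine, and the tie-breaking rule are assumptions.

```python
PYRIMIDINES = frozenset("CT")


def chi2_pyrimidine_score(segment):
    """Chi-square score (1 degree of freedom) for pyrimidine enrichment.

    Observed pyrimidine and purine counts are compared with the counts
    expected under a uniform distribution, i.e. half of the segment each.
    """
    n = len(segment)
    observed_py = sum(1 for c in segment if c in PYRIMIDINES)
    observed_pu = n - observed_py
    expected = n / 2.0
    return ((observed_py - expected) ** 2 + (observed_pu - expected) ** 2) / expected


# Threshold from the text: the score of six consecutive pyrimidines.
MIN_SCORE = chi2_pyrimidine_score("TTTTTT")


def best_ppt(region, min_score=MIN_SCORE):
    """Return (score, start, segment) for the best-scoring enriched segment.

    Gapped segments are allowed, but a segment must start and end with a
    pyrimidine and contain more pyrimidines than purines, because the
    chi-square score alone is symmetric. Returns None below the threshold.
    """
    seq = region.upper()
    best = None
    for start in range(len(seq)):
        if seq[start] not in PYRIMIDINES:
            continue
        for end in range(start + 1, len(seq) + 1):
            if seq[end - 1] not in PYRIMIDINES:
                continue
            segment = seq[start:end]
            n_py = sum(1 for c in segment if c in PYRIMIDINES)
            if 2 * n_py <= len(segment):
                continue
            score = chi2_pyrimidine_score(segment)
            if score >= min_score and (best is None or score > best[0]):
                best = (score, start, segment)
    return best
```

Under this scoring, an uninterrupted run of n pyrimidines scores exactly n, so the threshold for six consecutive pyrimidines is 6. The exhaustive search is cubic in the region length, which is acceptable for short regions such as 50-nt intron ends.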
As input, this algorithm receives a minimal score, serving as a threshold. This threshold score was set as the χ2 test score of six consecutive pyrimidines, as a minimal functional PPT has been shown to consist of 5–6 nt (Coolidge et al. 1997).

Random data set for the PPT analysis

To take into account the heterogeneity in nucleotide composition among the different organisms, we constructed a set of randomly permuted data sets of 50-nt intron ends. For each intron in each organism, a “random” 50-nt intron end was created by retaining a 3-nt 3′ss and a 7-nt BS (only if the BS was within the last 50 nt of the intron) and filling in the remaining 40 positions with 40 nt selected at random from within the intron. These 40 randomly selected nucleotides were obtained by first removing a 6-nt 5′ss, a 7-nt BS, and a 3-nt 3′ss from the intron sequence, randomly permuting the remaining positions, and selecting a 40-nt stretch from within the remaining sequence (or a shorter stretch, if the entire remaining intron sequence was <40 nt). We incorporated the BS and 3′ss signals into the randomized data sets, in order to take into account the bias introduced by these two signals. For organisms with only a few introns (hemiascomycetous fungi and C. parvum), we created 30 random intron ends corresponding to each intron end, to obtain a representative, random data set.

Compilation of snRNA and protein data sets

We used a combination of BLASTN, the Infernal package (Griffiths-Jones et al. 2005), and an algorithmic tool which we developed to identify the U1 snRNA and U2 snRNA homologs in the different organisms. For the identification of the SF1, U2AF2, and U2AF1 homologs, we used BLASTP (Altschul et al. 1990), Exonerate (Slater and Birney 2005), TBLASTN, and GeneWise (Birney et al. 2004). We used Pfam (http://pfam.sanger.ac.uk) and PROSITE (http://ca.expasy.org/prosite/) to confirm the existence of the characteristic domains in the three proteins.
Multiple alignments of the entire protein sequences and of the domains were performed using T-COFFEE (Notredame et al. 2000). For a full description of the compilation of these data sets, see the Supplemental Material.

Acknowledgments

We thank Yaron Racah for many insightful comments, critical observations, and stimulating suggestions. This work was supported by a grant from the Israel Science Foundation (1449/04 and 40/05), MOP Germany–Israel, GIF. E.E. is supported by the Catalan Institution of Research and Advanced Studies (ICREA) and by the grant BIO2005-01287 from the Spanish Ministry of Education and Culture. T.P. is supported by a Wolfson grant. S.S. and D.B. are fellows of the Edmond J. Safra Bioinformatics Program at Tel-Aviv University.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6818908