Copyright © 2004 by RNA Society

A computational and experimental approach toward a priori identification of alternatively spliced exons

Department of Genetics and Developmental Biology, University of Connecticut Health Center, Farmington, Connecticut 06030-3301, USA

1Present address: Center for Vertebrate Genomics, Cornell University, Ithaca, NY 14853, USA.

Reprint requests to: Brenton R. Graveley, Department of Genetics and Developmental Biology, University of Connecticut Health Center, 263 Farmington Avenue, Farmington, CT 06030-3301, USA; e-mail: graveley/at/neuron.uchc.edu; fax: (860) 679-8345.

Received July 29, 2004; Accepted September 27, 2004.

Abstract

Alternative splicing is a powerful means of regulating gene expression and enhancing protein diversity. In fact, the majority of metazoan genes encode pre-mRNAs that are alternatively spliced to produce anywhere from two to tens of thousands of mRNA isoforms. Thus, an important part of determining the complete proteome of an organism is developing a catalog of all mRNA isoforms. Alternatively spliced exons are typically identified by aligning EST clusters to reference mRNAs or genomic DNA. However, this approach is not useful for genomes that lack robust EST coverage, and tools that enable accurate prediction of alternatively spliced exons would be extraordinarily useful. Here, we use comparative genomics to identify, and experimentally verify, potential alternative exons based solely on their high degree of conservation between Drosophila melanogaster and D. pseudoobscura. At least 40% of the exons that fit our prediction criteria are in fact alternatively spliced. Thus, comparative genomics can be used to accurately predict certain classes of alternative exons without relying on EST data.
Keywords: alternative splicing, Drosophila, comparative genomics, bioinformatics

INTRODUCTION

Alternative splicing is a process by which a single gene can give rise to multiple mRNAs, each of which can encode proteins with distinct functions (Black 2000; Graveley 2001). It has recently been estimated that as many as 74% of human genes are alternatively spliced (Johnson et al. 2003). Moreover, some genes can generate an extraordinary number of isoforms. For instance, the Drosophila Dscam gene can potentially generate 38,016 different isoforms (Schmucker et al. 2000). As a result, alternative splicing profoundly expands the coding potential of eukaryotic genomes.

Alternative splicing also plays an important role in post-transcriptional gene regulation (Black 2000; Graveley 2001). The best characterized example of this is the sex-determination pathway in Drosophila (Forch and Valcarcel 2003). This pathway involves five genes that are each spliced differently in male and female flies: Sex-lethal (Sxl), transformer (tra), male-specific lethal-2 (msl-2), doublesex (dsx), and fruitless (fru). Disrupting the splicing of different genes in this pathway can cause a number of phenotypes, including male-specific lethality, transformation of the primary physical sexual traits, and alterations of male courtship behavior. Whereas alternative splicing of most of the sex-determination genes results in the production of different proteins in males and females, other alternative splicing events regulate whether or not a protein is produced. One example of this is a process called RUST (regulated unproductive splicing and translation) (Lewis et al. 2003). This process involves the alternative splicing of exons that introduce or remove premature stop codons which, in turn, control whether the mRNA is subject to nonsense-mediated decay. Thus, alternative splicing is a powerful mechanism for controlling and specifying protein production.

Current methods for identifying alternatively spliced exons involve aligning ESTs to genomic DNA or reference mRNAs (Modrek et al. 2001). These methods work well for organisms, such as human and mouse, that have extensive EST coverage. However, even when EST coverage is quite extensive, many rare alternative splicing events can still be missed (Graveley 2001). Moreover, because EST coverage is heavily biased toward the 5′ and 3′ ends of genes, many internal alternative exons are not identified by this method. These issues are even more confounding for organisms that lack extensive EST coverage. Thus, methods that facilitate the identification of alternative exons would be quite useful to assist in genome annotation. Currently, computational methods that accurately identify alternative exons do not exist. Here, we describe a comparative genomics approach that identifies alternative exons with a fairly high degree of accuracy without relying upon any EST data.

RESULTS AND DISCUSSION

Previous studies in humans and mice have shown that alternative exons often exhibit a higher degree of sequence conservation between related species than constitutive exons (Modrek and Lee 2003; Sorek and Ast 2003; Sugnet et al. 2004). In addition, the introns flanking alternative exons, but not constitutive exons, are also highly conserved (Sorek and Ast 2003). We tested whether these criteria could be used to identify novel alternative exons by simply comparing the genomes of two related species. To do this, we analyzed the genomes of Drosophila melanogaster (Adams et al. 2000) and D. pseudoobscura (http://www.hgsc.bcm.tmc.edu/projects/drosophila/), which diverged approximately 30 million years ago (Russo et al. 1995; Powell 1997). Consistent with the observations between humans and mice (Modrek and Lee 2003; Sorek and Ast 2003), we found that constitutively spliced exons are typically less conserved between D. melanogaster and D. pseudoobscura than known alternative exons, and that the introns flanking known alternative exons are frequently highly conserved (Fig. 1).
We experimentally tested whether the 117 highly conserved “constitutive” exons are actually alternatively spliced. RT-PCR was performed on a pool of RNA collected from D. melanogaster embryos, larvae, and male and female adults, and the PCR products were cloned and sequenced to verify their identity. Twenty-three of the 91 reactions that yielded RT-PCR products corresponding to the targeted gene exhibited some type of alternative splicing (Fig. 3).
To determine the extent to which these criteria improve the accuracy of alternative exon prediction, we tested whether 30 randomly selected exons that were not known to be alternatively spliced are in fact alternatively spliced. Of these 30 exons, only one is alternatively spliced (data not shown). Interestingly, the properties of the alternative exon identified from the randomly selected group, exon 3 in CG7185, resemble those of the exons selected by the criteria of our screen: it is 88.7% identical in D. pseudoobscura, and the sequence flanking this exon is also highly conserved. These results demonstrate that our selection criteria increase the accuracy of a priori prediction of alternative exons at least 12-fold (3.3% for randomly selected exons vs. 42% for predicted exons).

The alternative exons identified in our screen encompass nearly all varieties of alternative splicing, including alternative 5′ or 3′ splice sites, cassette exons, mutually exclusive exons, and intron retention. These newly identified alternative exons reside in genes that encode proteins with a wide variety of functions and are expressed in a broad spectrum of tissues (Table 1). In several instances, alternative splicing is expected to significantly affect the structure and/or function of the encoded protein. For example, CG5658 (Klp98A) encodes a component of the cytoskeleton containing a kinesin motor, forkhead domain, and a PX domain (Miki et al. 2001). Exon 8 of this gene is alternatively spliced, which results in removal of the forkhead and PX domains from the protein, thereby significantly affecting the signaling properties of the molecule (Fig. 4).
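The arithmetic behind the "at least 12-fold" claim can be sketched as follows. The text does not spell out how the 42% figure is obtained. One reading that reproduces it, offered here as an assumption and not as the authors' stated calculation, pools the 23 newly verified exons with the 45 previously known alternative exons mentioned in the feature analysis, and divides by all highly conserved exons (the 117 tested plus the 45 known). The function and variable names are illustrative only.

```python
def rate(hits: int, total: int) -> float:
    """Fraction of exons found to be alternatively spliced."""
    return hits / total

# Counts stated in the text.
random_alternative, random_tested = 1, 30        # randomly selected exons
new_alternative, reactions_with_product = 23, 91  # RT-PCR results from the screen
tested_candidates = 117                           # conserved "constitutive" exons tested
known_alternative = 45                            # previously known alternative exons

random_rate = rate(random_alternative, random_tested)

# Rate among reactions that gave a product (not the figure quoted in the text).
validated_rate = rate(new_alternative, reactions_with_product)

# ASSUMPTION: the quoted 42% pools known and newly found alternative exons
# over all highly conserved exons (117 + 45 = 162).
pooled_rate = rate(new_alternative + known_alternative,
                   tested_candidates + known_alternative)

fold_enrichment = pooled_rate / random_rate

print(f"random exons:     {random_rate:.1%}")
print(f"validated only:   {validated_rate:.1%}")
print(f"pooled (assumed): {pooled_rate:.1%}")
print(f"fold enrichment:  {fold_enrichment:.1f}")
```

Under this reading the pooled rate rounds to 42% and the random rate to 3.3%, and their ratio is a little above 12, consistent with the wording "at least 12-fold".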
In addition to the candidate alternative exons, we found a few novel alternative exons not predicted by our screen. For instance, exon 5 of CG12891 (CPTI) (Jackson et al. 1999), a carnitine ethyltransferase, was a candidate alternative exon that we found to be alternatively spliced. However, this exon was alternatively spliced in a mutually exclusive manner with a novel, unannotated, upstream exon (Fig. 4).

We analyzed several features of the highly conserved exons to identify properties that differ between those that we observed to be alternatively spliced and those for which alternative splicing was not observed. The group of highly conserved alternative exons we analyzed included the 23 new exons we experimentally identified as well as the 45 previously known alternative exons. Surprisingly, we found no significant differences in the relative strength or nucleotide composition of the 5′ or 3′ splice sites between the two sets of exons (data not shown). However, we identified two features that differed between these two groups of exons. First, the distribution of the exons between each of the three reading frames is different in each group. Whereas the exons for which alternative splicing was not observed are evenly distributed between the reading frames, the group of alternative exons is enriched in exons that maintain the reading frame (p = 0.01) (Fig. 5A).
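The reading-frame comparison can be illustrated with a minimal sketch. It assumes that an exon "maintains the reading frame" when its length is a multiple of three, so that including or skipping it does not shift the downstream frame. The statistical test behind p = 0.01 is not named in the text. The chi-square goodness-of-fit test against an even three-way split used here is a stand-in chosen for illustration, and it tests one group against a uniform expectation instead of comparing the two groups directly. The exon lengths are made up.

```python
import math
from collections import Counter

def frame_class(exon_length: int) -> int:
    """0 means the exon length is a multiple of 3 (frame preserved when skipped)."""
    return exon_length % 3

def uniform_frame_test(exon_lengths):
    """Chi-square goodness-of-fit of frame classes against an even 1/3 split.

    Returns (observed counts for classes 0, 1, 2, chi-square statistic, p-value).
    With three classes there are 2 degrees of freedom, for which the
    chi-square survival function is exactly exp(-x / 2).
    """
    counts = Counter(frame_class(n) for n in exon_lengths)
    observed = [counts.get(k, 0) for k in range(3)]
    expected = sum(observed) / 3
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    p_value = math.exp(-chi2 / 2)
    return observed, chi2, p_value

# HYPOTHETICAL lengths: a group enriched in frame-preserving exons...
enriched = [90] * 20 + [91] * 5 + [92] * 5
# ...and a group spread evenly over the three classes.
even = [90, 91, 92] * 10

enriched_result = uniform_frame_test(enriched)
even_result = uniform_frame_test(even)
print(enriched_result)
print(even_result)
```

A small p-value for one group and a large one for the other would mirror the pattern described above, although the real analysis may have used a different test.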
Our results demonstrate that comparative genomics can be used to predict whether an exon is alternatively spliced with a fairly high degree of accuracy. Although the exons we tested were identified solely on the basis of their high degree of conservation, we also identified two features (a higher degree of intron conservation and a greater tendency to maintain the reading frame) that appear to further distinguish alternative and constitutive exons. Adding these features to the criteria of high exon and intron similarity may improve the accuracy of alternative exon prediction.

The high degree of identity used in our screen (95% exon identity, 75% intron identity) most likely exceeds the lower limits of exon and intron identity useful for accurate prediction. This is supported by the fact that the only alternative exon identified in the group of randomly selected exons we tested was 88% identical in the exon and was flanked by conserved intron sequences. Thus, further experiments will be necessary to determine the lower limits of identity that can be used to accurately predict alternative exons. These limits will depend on the amount of divergence between the species being compared. For example, analysis of whole genome shotgun traces of five additional Drosophila species (D. simulans, D. yakuba, D. ananassae, D. mojavensis, and D. virilis) indicates that the percent identity of these conserved exons differs between species. Whereas exon 28 of CG1522 (cac) is 98% identical between D. melanogaster and D. pseudoobscura, the same exon is only 89% identical between D. melanogaster and D. virilis (data not shown). Determining these limits for each pair of species will be important since they will significantly increase the number of alternative exons that can be identified by this means.

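The selection criteria discussed above can be written as a small filter, which also shows how relaxing the exon threshold changes what is picked up. This is a sketch under stated assumptions, not the pipeline used in the study: the class and function names are hypothetical, the identity values in the example are invented except where noted, and the rule that a single conserved flanking intron is sufficient follows the description in MATERIALS AND METHODS.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExonConservation:
    """Percent identities (0-100) between the two species for one exon.

    A flanking intron identity of None means the exon has no intron on that side.
    """
    name: str
    exon_identity: float
    upstream_intron_identity: Optional[float]
    downstream_intron_identity: Optional[float]

def is_candidate(exon: ExonConservation,
                 min_exon_identity: float = 95.0,
                 min_intron_identity: float = 75.0) -> bool:
    """True if the exon meets the exon threshold and at least one flanking
    intron is more than min_intron_identity percent identical."""
    if exon.exon_identity < min_exon_identity:
        return False
    flanks = (exon.upstream_intron_identity, exon.downstream_intron_identity)
    return any(f is not None and f > min_intron_identity for f in flanks)

# HYPOTHETICAL records; only the 88.7% exon identity comes from the text
# (exon 3 of CG7185), and its flanking intron values are invented.
exons = [
    ExonConservation("conserved_exon_one_flank", 98.0, 80.0, None),
    ExonConservation("conserved_exon_poor_flanks", 99.0, 60.0, 70.0),
    ExonConservation("CG7185_exon3_like", 88.7, 80.0, 80.0),
]

strict = [e.name for e in exons if is_candidate(e)]
relaxed = [e.name for e in exons if is_candidate(e, min_exon_identity=88.0)]
print(strict)
print(relaxed)
```

With the default thresholds only the first record passes. Lowering the exon threshold to 88% also admits the record modeled on exon 3 of CG7185, which is the kind of exon the stricter screen would miss.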
Although this approach will be useful for identifying potential alternative exons, there are at least two classes of alternative exons that will not be identified using these criteria. The first class is small alternative exons, which will be difficult to identify based on percent identity alone. The second class of exons that will be missed by comparative genomics are those that are species specific (Modrek and Lee 2003). Recent studies in mammals have shown that a surprisingly large number of alternative exons are species specific. Additionally, there are some alternatively spliced exons that are specific to D. melanogaster or D. pseudoobscura (Graveley et al. 2004). Nonetheless, there are numerous alternative exons that are highly conserved between related species. Moreover, the finding of novel, unannotated alternative exons that are highly conserved suggests that many conserved noncoding sequences may in fact prove to be novel alternative exons. Thus, using comparative genomics to identify potential alternative exons should significantly advance our ability to accurately assess the amount of alternative splicing that occurs in any organism, thereby bringing us closer to understanding how organisms develop and function.

MATERIALS AND METHODS

Computational analysis

Percent identity data for the entire Drosophila melanogaster and Drosophila pseudoobscura genomes were downloaded from http://lbl.pipeline.gov/pseudo. All exons between 95% and 100% identical (using a window size of 50 bp) were analyzed using the VISTA browser (Mayor et al. 2000) to identify those that are flanked on one or both splice sites by intron sequence greater than 75% identical. Primers flanking all exons identified using this method were designed and the sequences are available at http://penguin.uchc.edu/~intron/philipps/oligos.html.

Experimental analysis of alternative splicing

Total RNA was isolated using Trizol (Invitrogen) from both D. melanogaster and D. pseudoobscura embryos, larvae, and adult females and males. cDNA was synthesized from 5 μg of a pool of total RNA from each developmental stage using Superscript II (Invitrogen) reverse transcriptase in a 20 μL reaction. PCR was performed using gene-specific primers and Taq DNA polymerase (Invitrogen). The reactions were incubated for 35 cycles of 94°C for 30 sec, 55°C for 15 sec, and 72°C for 1 min. PCR products were resolved by agarose gel electrophoresis. Each PCR product was excised from the gel, cloned into the pCRII-TOPO vector (Invitrogen), and sequenced.

Acknowledgments

We thank members of the Graveley laboratory and Rob Reenan for discussions and comments on the manuscript. This work was supported by an NIH grant (GM62516) to B.R.G.

Notes

Article published online ahead of print. Article and publication date are at http://www.rnajournal.org/cgi/doi/10.1261/rna.7136104.

REFERENCES