Copyright © 2004, Cold Spring Harbor Laboratory Press

C. elegans ORFeome Version 3.1: Increasing the Coverage of ORFeome Resources With Improved Gene Predictions

1 Center for Cancer Systems Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, and Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA
2 Unité de Recherche en Biologie Moléculaire, Facultés Universitaires Notre-Dame de la Paix, 5000 Namur, Belgium
3 Agencourt Biosciences Corporation, Beverly, Massachusetts 01915, USA
4 Corresponding author. E-MAIL marc_vidal/at/dfci.harvard.edu; FAX (617) 632-5739.

Received February 23, 2004; Accepted June 15, 2004.

Abstract

The first version of the Caenorhabditis elegans ORFeome cloning project, based on release WS9 of Wormbase (August 1999), provided experimental verifications for ~55% of predicted protein-encoding open reading frames (ORFs). The remaining 45% of predicted ORFs could not be cloned, possibly as a result of mispredicted gene boundaries. Since the release of WS9, gene predictions have improved continuously. To test the accuracy of evolving predictions, we attempted to PCR-amplify from a highly representative worm cDNA library and Gateway-clone ~4200 ORFs missed earlier and for which new predictions are available in WS100 (May 2003). In this set we successfully cloned 63% of ORFs with supporting experimental data (“touched” ORFs), and 42% of ORFs with no supporting experimental evidence (“untouched” ORFs). Approximately 2000 full-length ORFs were cloned in-frame, 13% of which were corrected in their exon/intron structure relative to WS100 predictions. In total, ~12,500 C. elegans ORFs are now available as Gateway Entry clones for various reverse proteomics studies (ORFeome v3.1). This work illustrates why the cloning of a complete C. 
elegans ORFeome, and likely the ORFeomes of other multicellular organisms, needs to be an iterative process that requires multiple rounds of experimental validation together with gradually improving gene predictions. The Caenorhabditis elegans genome sequence, released in December 1998, was nearly complete and highly accurate, with an error rate estimated at 1/30,000 (The C. elegans Sequencing Consortium 1998). The finished sequence was eventually released in November 2002, comprising 100,258,171 bp in six contiguous segments corresponding to the six C. elegans chromosomes (J. Sulston, pers com; http://elegans.swmed.edu/Announcements/genome_complete.html). Although the technology required for rapid and accurate whole-genome sequencing is mature, the gene prediction tools currently available to identify protein-encoding open reading frames (ORFs) and to define their exon/intron structures still need improvements. For exon prediction in mammalian genomes, these tools have an overall sensitivity and specificity of only 60% (Burset and Guigo 1996), and ~40% for the 5′ and 3′ gene boundaries specifically (Korf et al. 2001). Predicted genes can be truncated, extended, split, or merged (see Reboul et al. 2001), relative to their actual “observed” exon/intron structure. Using GeneFinder, a gene prediction tool developed for C. elegans (http://ftp.genome.washington.edu/cgi-bin/genefinder_req.pl), a total of 19,477 ORFs were annotated in Wormbase release WS9 (August 1999; http://www.Wormbase.org; Stein et al. 2001). Approximately 50% of these ORFs were predicted ab initio, without experimental support. The C. elegans ORFeome project was launched to test the accuracy of these gene predictions, while simultaneously creating a resource of cloned full-length predicted ORFs to be used in various functional genomics and reverse proteomics studies (Reboul et al. 2001, 2003). 
ORFs were PCR-amplified between their 5′- and 3′-ends, and cloned using the Gateway recombinational cloning system (Hartley et al. 2000; Walhout et al. 2000a,b). PCR amplification was performed on a highly representative cDNA library using gene-specific primer pairs for each of the 19,477 ORFs based on WS9 predictions. Gateway tails attached to all primers allowed the cloning of the ORFs into the pDONR201 vector, resulting in a total of 11,984 Entry clones (61.5% of the ORFs) in the first version of the ORFeome (v1.1; Supplemental Table 1). The C. elegans ORFeome version 1.1a (v1.1a) represents a consolidated set of 10,623 ORFs cloned in-frame; the remaining 11.4% (1361 out of 11,984) of all cloned ORFs in version 1 were cloned out-of-frame because of mispredicted gene boundaries (v1.1b).

This first version of the worm ORFeome contributed significantly to the reannotation of C. elegans gene structure. The alignment of OSTs (ORF Sequence Tags) to the corresponding predicted gene sequences allowed the improvement of C. elegans annotations by correcting the internal gene structure of 20% of v1.1a cloned ORFs. In addition, OSTs provided experimental verification for 45% of the set of “untouched” ORFs, that is, ORFs not yet detected by any mRNA or EST.

For each gene, ORFeome v1.1a contains cloned pools that result from mixing ~50 to ~1000 Escherichia coli transformants for each Entry clone. Thus, such Entry pools might contain multiple splice variants and alleles corresponding to PCR misincorporations. We are in the process of generating a new resource, ORFeome v2 (Reboul et al. 2003), in which we isolate individual wild-type clones for all detected splice variants of ORFs cloned in v1.1a. We will shortly initiate similar attempts for the ORFs cloned in the ORFeome version 3 described below.

The difficulties inherent in identifying ORFs within metazoan genomes and predicting their correct structure are not specific to C. elegans. 
Genome annotation initiatives in the model organisms Arabidopsis thaliana (Yamada et al. 2003) and Drosophila melanogaster (Hild et al. 2003) have also shown limited accuracy. The accuracy of current gene prediction algorithms is also a major issue for the human genome. High numbers of splice variants and lower signal-to-noise ratios caused by longer introns and intergenic regions render human genome annotations even more difficult than for the model systems experimentally validated so far. Hence, both in model organisms and in human, functional genomic and reverse proteomics studies, which require the use of large sets of full-length ORFs, are hampered by inaccuracies in gene prediction, limiting the usefulness of sequenced genomes.

Since the release of Wormbase WS9 in 1999, continuous efforts to reannotate the C. elegans genome have occurred. Reannotations are mainly based on new experimental data, such as mRNAs and ESTs (the EMBL nucleotide sequence database [http://www.ebi.ac.uk/embl/] and the Y. Kohara DNA databank [DDBJ, http://www.ddbj.nig.ac.jp/]), as well as splice-leader sequences (Blumenthal et al. 2002). Furthermore, more refined ab initio approaches have allowed the reprediction of genes for which no confirmatory experimental data are yet available. To experimentally validate these new predictions, improve gene annotation, and generate a more complete C. elegans ORFeome resource, we attempted to clone the 4232 ORFs originally missed in v1.1a and that have been either repredicted or newly predicted between the release of WS9 and that of WS100 (May 2003).

RESULTS

Design of Version 3 of the C. elegans ORFeome

Wormbase, the central repository for the C. elegans genome annotation, is updated biweekly, reflecting the continuous effort made both to correct the structure of previously predicted ORFs (referred to here as “repredicted ORFs”) and to predict new putative ORFs. 
To identify ORFs that could not be cloned or were cloned out-of-frame in ORFeome version 1, and have been repredicted in improved versions of the genome annotation, we chose to compare WS9 predictions to those of the recent Wormbase release WS100 (see Methods). WS100 is the first Wormbase release that has been archived in the public domain (“frozen”; http://ws100.Wormbase.org). For each of the 8854 ORFs that were not in v1.1a, we searched for repredictions that at least partially overlapped with the region between the previously predicted initiation and termination codons (“starts” and “stops”). We focused only on structure differences at the 5′- and 3′-boundaries, while ignoring internal structure differences. We found 2708 ORFs with repredicted starts or stops (Fig. 1A
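
The boundary comparison described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the `Orf` record, the function names, and the assumption that both predictions lie on the same strand and share one coordinate system are all hypothetical, introduced here only for illustration. Internal exon/intron differences are deliberately ignored, as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical record for one predicted ORF (not a Wormbase data type)."""
    name: str
    start: int  # coordinate of the predicted initiation codon
    stop: int   # coordinate of the predicted termination codon


def overlaps(a: Orf, b: Orf) -> bool:
    """True if the start-to-stop regions share at least one position."""
    lo_a, hi_a = sorted((a.start, a.stop))  # sorted() also covers minus-strand ORFs
    lo_b, hi_b = sorted((b.start, b.stop))
    return lo_a <= hi_b and lo_b <= hi_a


def classify(old: Orf, new: Orf) -> str:
    """Compare only the 5' and 3' boundaries of two predictions."""
    if not overlaps(old, new):
        return "no overlap"
    start_changed = old.start != new.start
    stop_changed = old.stop != new.stop
    if start_changed and stop_changed:
        return "both repredicted"
    if start_changed:
        return "start repredicted"
    if stop_changed:
        return "stop repredicted"
    return "unchanged"
```

For example, `classify(Orf("x", 100, 900), Orf("x", 250, 900))` reports a repredicted start, whereas a prediction with identical boundaries but a different internal exon structure would be reported as unchanged.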
These 4232 ORFs can be divided into two classes depending on whether their predicted coding sequence has been verified, at least partially, by EST and/or OST data (“touched” ORFs) or not (“untouched” ORFs; Fig. 1B

Overall Assessment of WS100

As the quality of WS100 repredictions and new predictions has not been experimentally validated yet, we first tested their overall accuracy using a subset of ORFs. We compared the ORF cloning success rate using new WS100 predictions to that of the WS9 predictions from ORFeome version 1 on ORFs that could not be cloned previously. ORFs were PCR-amplified from our highly representative C. elegans cDNA library (Walhout et al. 2000b) and cloned into the Gateway Entry vector pDONR201. Following a second round of PCR amplification from the Gateway Entry clone to confirm that inserts were present and of the correct size, ORF sequence tags (OSTs) were generated. The OSTs were then aligned to the genome to confirm the identity of the clones. The cloning success rate was 59% (n = 111) using newly designed primers. In contrast, only 2.7% of attempted ORFs were successfully cloned using WS9-designed primers, used here as a negative control. These results clearly show that the C. elegans genome annotation has improved considerably between WS9 and WS100, and that primers designed based on these reannotations can amplify a substantial number of ORFs not originally cloned in ORFeome version 1.

C. elegans ORFeome Version 3

In Version 3 of the C. elegans ORFeome project, PCR amplifications were performed for 4232 repredicted or new ORFs, using ORF-specific primers (Supplemental Fig. 1). Alignment of the resulting OSTs to the C. elegans genome revealed that 56% (2315 ORFs corresponding to 1378 repredicted ORFs and 937 new ORFs) were successfully cloned. The cloning success rate for touched ORFs (63%) is much higher than for untouched ORFs (42%), but slightly lower than the cloning success rate of touched ORFs in ORFeome Version 1 (71%; Supplemental Fig. 
2). We amplified 64% of ORFs that were cloned out-of-frame in ORFeome Version 1 (v1.1b). Among these, 87% are now cloned in-frame. Hence, reannotation efforts led to successful repredictions for 55.7% (64% × 0.87) of such ORFs, whereas wrong repredictions caused complete cloning failure in 36% of the cases. For the remaining 8.3% of originally out-of-frame ORFs, repredictions resulted again in out-of-frame PCR products. Of the ORFs cloned in ORFeome Version 3, 57% were shorter at one or both ends in WS100 relative to the gene annotation in WS9 (Fig. 2A
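
The breakdown quoted above for ORFs originally cloned out-of-frame (v1.1b) follows from the two reported rates, 64% amplified and 87% of those now in-frame. As a worked check, with a and f as labels introduced here purely for illustration:

```latex
\begin{align*}
a &= 0.64 && \text{(fraction of v1.1b ORFs amplified)}\\
f &= 0.87 && \text{(fraction of amplified ORFs now cloned in-frame)}\\
a \cdot f &= 0.5568 \approx 55.7\% && \text{(successful repredictions)}\\
a\,(1 - f) &= 0.0832 \approx 8.3\% && \text{(again out-of-frame)}\\
1 - a &= 0.36 = 36\% && \text{(complete cloning failure)}
\end{align*}
```

The three outcomes sum to 100%, so every originally out-of-frame ORF is accounted for.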

Corrections of Intron/Exon Organization

In ORFeome Version 3, we corrected internal exon/intron structures for 540 (23.3%) cloned ORFs. Compared with WS100 predictions, OSTs could be used to extend 141 exons, truncate 165 exons, add 85 unpredicted exons, and delete 327 apparently wrongly predicted exons. In addition, 104 and 130 introns were added or deleted, respectively (Fig. 3
In comparison to ORFeome Version 1, the proportion of exons needing correction in ORFeome Version 3 decreased by 8%, which can be explained by a higher rate of EST coverage for the cloned ORFs. However, these additional EST data did not reduce the rate of ORFs cloned out-of-frame in ORFeome Version 3, because 11.7% (270) of all cloned ORFs display frame errors caused by mispredicted 3′- and 5′-boundaries. We have thus cloned 2045 (2315 − 270 out-of-frame) full-length ORFs in ORFeome Version 3.

Correction of Truncated Clones

As mispredictions of the 5′- or 3′-end of an ORF do not necessarily affect its internal gene structure, primers designed on mispredicted boundaries can give rise to truncated clones. Previously cloned ORFs that were subsequently merged, two or more at a time, into one single longer ORF in WS100 represent one class of such potentially truncated clones. Merges are typically based on additional EST data spanning the intergenic region between two individually predicted, neighboring ORFs. Our data set of 2708 repredicted ORFs contains 324 ORFs that resulted from a merge of two (251; Fig. 4A
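
As an illustration of how merged ORFs of this kind might be flagged, the sketch below lists the earlier predictions that a newer, longer prediction overlaps and calls it a merge when there are at least two. The `Interval` tuple layout, the function names, and the simple overlap criterion are assumptions made for this example; they are not taken from the annotation pipeline described in this paper.

```python
from typing import List, Tuple

# Hypothetical layout: (ORF name, lower genomic coordinate, upper genomic coordinate)
Interval = Tuple[str, int, int]


def constituents(merged: Interval, old: List[Interval]) -> List[str]:
    """Names of earlier predictions whose span overlaps the newer prediction."""
    _, lo, hi = merged
    return [name for name, o_lo, o_hi in old if o_lo <= hi and lo <= o_hi]


def is_merge(merged: Interval, old: List[Interval]) -> bool:
    """A newer prediction counts as a merge if it covers two or more earlier ORFs."""
    return len(constituents(merged, old)) >= 2
```

With earlier predictions `("A", 100, 500)` and `("B", 700, 1200)`, a newer prediction spanning 100 to 1200 would be flagged as a merge of A and B, and Entry clones previously obtained for A or B alone would be candidates for truncated clones.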

Investigating the Existence of Clones Missing in Version 3.1

We next investigated whether the 44% of repredicted and newly predicted WS100 ORFs that could not be cloned here correspond to false-positive GeneFinder predictions, or genuine genes that need further exon/intron corrections. To obtain an estimate of the rate of repredicted ORFs not cloned in ORFeome Version 3 because of mispredicted ORF boundaries, we designed internal primers for a small subset of repredicted ORFs for which PCR amplification had failed (Reboul et al. 2001). These internal primers were designed to anneal to internally predicted exons, spanning at least one intron, and to amplify PCR products of 300 bp when the cDNA library is used as a template. As internal exons are easier to predict and hence more accurate than gene boundaries, many ORFs that are mispredicted at their 5′- and 3′-ends should be amplifiable using internal primers. We amplified internal PCR products of the correct length for 52% of ORFs missed in Version 3. The most likely explanation why we could not clone these ORFs in ORFeome Version 3 is that their 5′- or 3′-ends are still mispredicted. There are two possible reasons why we were unable to amplify internal PCR products for the remaining 48% of ORFs. First, ORFs could be mispredicted at the level of their internal exon/intron structure, which consequently may render them undetectable in the cDNA library using internal primers. Second, predicted ORFs that were not amplified might be absent from the cDNA library because they were wrongly predicted and do not actually exist.

We then investigated whether ORFs that we could not clone in ORFeome Version 3 were less supported by EST and Pfam data than ORFs that we successfully cloned. Of uncloned ORFs, 70% are either touched by EST data (16.5%), contain a Pfam motif (25.5%), or show evidence of both EST and Pfam data (28%). The proportion of cloned ORFs with EST and/or Pfam data is only slightly higher (74%). 
These results show that a substantial number of uncloned ORFs have experimental or bioinformatics evidence of their existence, supporting our conclusion that the main reason for cloning failure of C. elegans ORFs is the misprediction by GeneFinder of their 3′- and 5′-boundaries.

DISCUSSION

The examples presented in this paper illustrate that the goal of cloning a complete ORFeome should be organized in gradual steps (Fig. 5
With the release of ORFeome v3.1, we have validated the existence of 2045 previously uncloned ORFs. Within this set, the internal structures of 540 ORFs were corrected. For most ORFs that were missed in ORFeome v1.1a, we relied on experimental data to obtain an accurate reprediction. Hence, a continuous supply of new experimental data is essential to reannotate the genome, correcting the gene structure of ORFs that were out-of-frame, mispredicted, or missed in previous versions. Sometimes, these data also reveal ORFs cloned in-frame that represent truncated versions of longer gene structures. ORFs cloned in-frame are thus also subject to change and consequently need to be replaced in the ORFeome resource, underlining the fluid character of an ORFeome resource. The reannotation of the genome and the experimental validation of these new predictions by cloning thus go hand in hand. Iteratively repeating these two steps generates an increasingly complete representation of the C. elegans ORFeome.

The first two C. elegans ORFeome projects relied on different snapshots of the genome annotation. In both, ~60% of the attempted ORFs were successfully cloned as Gateway Entry clones. The rate of wrongly predicted exons in ORFeome Version 3 decreased by 8% compared with ORFeome Version 1, probably as a consequence of additional EST coverage for the ORFs in WS100. However, the rate of cloned ORFs that displayed frame errors, indicating mispredicted ORF boundaries, remained the same (~11%). Furthermore, the comparison of internal primer experiments performed in ORFeome Versions 1 and 3 showed a drop in the success rate of PCR amplification (73% vs. 52%) for ORFs missed in ORFeome v1.1. 
Although this lower success rate might be caused by a higher rate of false negatives and less optimal experimental conditions in ORFeome Version 3, it is more likely that, as the set of ORFs remaining to be cloned decreases, the proportion of ORFs that do not exist or that are difficult to predict increases. Ongoing cloning efforts based on continuously reannotated versions of the genome would thus have an increasing cost-to-benefit ratio.

The continuous efforts made to improve the C. elegans genome annotation during the last four years increased the size of the C. elegans ORFeome resource by ~20%. A third iterative cloning step, based on a new snapshot of Wormbase predictions, might only marginally improve the resource. A coming leap in improved gene annotations will likely result from comparative genomics, which has proven useful for genome reannotations in yeast (Cliften et al. 2001, 2003; Kellis et al. 2003) and will soon be applied to the worm. The comparison of the C. elegans genome to the newly sequenced Caenorhabditis briggsae genome (Stein et al. 2003) should result in corrected annotations for many previously predicted genes, as well as the discovery of new genes. Genome sequencing is currently underway for three additional Caenorhabditis species, Caenorhabditis remanei, Caenorhabditis japonica, and CB5161, and when available should enable accurate predictions of C. elegans ORF structures, upon which future iterations of the ORFeome project will be based.

The complete ORFeome for C. elegans is thus a long-term project relying on combined bioinformatics and experimental approaches. Besides providing a useful tool for functional genomics and reverse proteomics in the worm, these efforts might eventually define better models of metazoan genes, leading to improved gene prediction algorithms for numerous other genomes, including the human genome.

METHODS

Identification of Repredicted ORFs

To find ORFs that were repredicted between versions WS9 and WS100 of Wormbase, we compared the start and stop coordinates of each ORF to the genome sequence. The sequencing of the C. elegans genome was completed between those two versions, and, consequently, because of nucleotide additions, some of the nonrepredicted ORFs had changed coordinates in WS100. Hence, it was necessary to update the start and stop positions of WS9 predictions by aligning their corresponding primers from ORFeome Version 1 to the current genome sequence. The set of ORFs that we attempted to clone consists of ORFs in WS100 that overlap with ORFs in WS9 while having a repredicted start, stop, or both, as well as ORFs that are newly predicted in WS100.

We used the OSP program to design new primers (Hillier and Green 1991). For ORFs that were repredicted at only one end, we designed new primers at the repredicted ends and used the primers originally synthesized for ORFeome Version 1 (“v1 primers”) on the unchanged ends (mixed primer pairs). For ORFs that were repredicted at both ends, we designed new forward and reverse primers. The overall quality of the v1 primers (synthesized a few years before this work) was tested by comparing PCR amplification of pairs of v1 primers, mixed new and v1 primers, and pairs of all new primers using worm genomic DNA as template. Given that the PCR success rate on genomic DNA is independent of the quality of the annotations, similar results are expected for all primer pairs. The PCR success rates for v1 primer pairs, mixed primer pairs, and new primer pairs were 74%, 76%, and 83%, respectively. These results indicate that only a small portion of old primers have decreased in quality since their synthesis, and that v1 primers can be used in mixed primer pairs without biasing the results.

Gateway Cloning of C. elegans ORFeome 3.1

Primer pairs were organized by the expected size of ORFs and aliquoted in 96-well format to optimize PCR conditions for individual plates and to facilitate size analysis of PCR products. PCR amplification for C. elegans ORFs was performed using Platinum Taq DNA polymerase (Invitrogen), and PCR cycling conditions were as previously described (Reboul et al. 2003). For one entire plate of 77 ORFs, we failed to obtain any PCR products, leaving 4155 PCR products to be further processed. Entry clones were produced using the pDONR201 vector according to standard Gateway recombinational cloning technology protocols except that BP cloning reactions were done at one-fourth of the recommended volume (Invitrogen). Entry clones were subsequently transformed into DH5α cells rendered chemically competent with DMSO and cultured overnight in LB liquid media containing kanamycin (50 μg/mL). Cultures were then used to inoculate a second 1.0-mL liquid culture containing LB and kanamycin, which was grown overnight at 37°C. Recombinant products were archived for long-term storage as both bacterial glycerol stocks (15% glycerol in LB) and as plasmid DNA minipreps. A Qiagen 9600 robot was used to purify plasmid DNA. PCR was performed using recovered plasmid DNA and pDONR201 sequencing primers (Invitrogen), and the resulting PCR products were used as template for sequencing as described (Reboul et al. 2003).

C. elegans cDNA Library

The library used here as PCR template was described earlier (Walhout et al. 2000b).

Sequencing and Bioinformatics Analysis

All cloned ORFs were sequenced at the 5′- and 3′-ends, resulting in two OSTs (ORF Sequence Tags) for each ORF. ORFs that were not successfully cloned or sequenced (phred score below 20 over 200 bases) were not included in the analysis. All OSTs were aligned to the C. elegans genome stored in the ACeDB, using the acembly alignment software. The comparison between OSTs and corresponding predicted ORFs was done in two phases. 
First, all alignments were analyzed using a previously described protocol (Reboul et al. 2001), to detect OSTs that displayed a different internal exon/intron structure than their corresponding ORFs. In a second phase, these ORFs were analyzed manually to identify the type of structure difference and to detect frame problems in the OST. The information resulting from this analysis has been stored in a MySQL database.

We found 230 ORFs in which no splicing events could be identified. These ORFs could be categorized into three groups: those with OSTs that arose from extremely short sequencing reads that did not span predicted introns; those that gave rise to average-length OSTs but for which splicing had not been predicted in that region; and those with OSTs that were predicted to span an intron but for which no splicing event was identified. Of the latter category, only 62 ORFs could be interpreted in our analysis, as we require sequencing through a splice junction. Of these ORFs, 81% were found to be out-of-frame, suggesting that they were either mispredicted or represent pseudogenes.

Acknowledgments

We thank the C. elegans Sequencing Consortium for a complete and highly accurate genome sequence; L. Stein, D. Lawson, R. Durbin, K. Bradnam, N. Chen, and others from Wormbase for continuously improving genome annotations; the participants of the annual ORFeome meeting for their input and numerous suggestions; M. Cusick for critical reading of the manuscript; J.-F. Rual, N. Bertin, and T. Kishikawa for their input and help; T. Clingingsmith and C. McCowan for superb administrative assistance; the staff at Illumina and Agencourt for technical assistance; and C. Fraughton for laboratory support. This work was supported by grants 7 R33 CA81658-02 from the National Cancer Institute and 5R01HG01715-02 from the National Human Genome Research Institute and the National Institute of General Medical Sciences awarded to M.V.

Footnotes

[Supplemental material is available online at www.genome.org.] 
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2496804.