Copyright © 2006 Arumugam et al.; licensee BioMed Central Ltd.

Pairagon+N-SCAN_EST: a model-based gene annotation pipeline

Laboratory for Computational Genomics and Department of Computer Science, Washington University, One Brookings Drive, St. Louis, MO 63130, USA

Corresponding author. Michael R Brent: brent/at/cse.wustl.edu

Supplement: EGASP '05: ENCODE Genome Annotation Assessment Project. Roderic Guigó, Martin G Reese

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of putative homologs. Trans alignments contain a high proportion of mismatches, gaps, and/or apparently unspliceable introns, compared to alignments of cDNA sequences to their native loci.

The Pairagon+N-SCAN_EST pipeline's first stage is Pairagon, a cDNA-to-genome alignment program based on a PairHMM probability model. This model relies on prior knowledge, such as the fact that introns must begin with GT, GC, or AT and end with AG or AC. It produces very precise alignments of high quality cDNA sequences. In the genomic regions between Pairagon's cDNA alignments, the pipeline combines EST alignments with de novo gene prediction by using N-SCAN_EST. N-SCAN_EST is based on a generalized HMM probability model augmented with a phylogenetic conservation model and EST alignments. It can predict complete transcripts by extending or merging EST alignments, but it can also predict genes in regions without EST alignments.
Because they are based on probability models, both Pairagon and N-SCAN_EST can be trained automatically for new genomes and data sets.

Results

On the ENCODE regions of the human genome, Pairagon+N-SCAN_EST was as accurate as any other system tested in the EGASP assessment, including ENSEMBL and ExoGean.

Conclusion

With sufficient mRNA/EST evidence, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments.

Background

There are three fundamental approaches to automated construction of exon-intron structure for protein-coding genes: native alignment - alignment of expressed sequences (including high quality cDNA sequences, expressed sequence tags (ESTs), and protein sequences) to the loci from which they were transcribed; trans alignment - non-native alignment of expressed sequences to loci that could potentially express similar sequences (can be within or between species); and de novo - prediction using the sequences of one or more genomes as the only inputs (no expressed sequences).

Native alignments of full insert, high quality cDNA sequences are the unquestioned gold standard in high-throughput annotation. However, even a concerted, high-budget effort to sequence cDNA libraries produces a full-open reading frame (ORF) sequence for only about 50% to 60% of loci in a mammalian genome [1]. Thus, trans alignments have played a key role in producing the most trusted genome predictions, including the ENSEMBL predictions (sometimes termed 'evidence based') that have been used in the first published analyses of many new genome sequences. Nonetheless, the evidence they provide for expression is circumstantial rather than direct - for example, the annotated genomic locus may represent a pseudogene derived from the true genomic source of the expressed sequence.
Even when a trans alignment identifies a functional homologous gene locus, the alignments tend to be inaccurate in their details unless the expressed sequence is highly similar to the genomic sequence [2,3].

De novo predictions have always been viewed with some suspicion. This suspicion derives in part from the tendency of gene predictors developed in the 1990s to predict far too many false positive genes and exons. It may also result, in part, from the fact that one cannot point to the evidence supporting de novo predictions - a large ensemble of individually weak statistical patterns - the way one can point to a single expressed sequence. Nonetheless, statistical evidence is biological evidence, with a track record extending back to Gregor Mendel.

If de novo prediction were indeed inaccurate, relying heavily on trans alignments would make sense when analyzing a genome for which few EST or cDNA sequences are available. However, the rapidly increasing accuracy of de novo prediction and the large number of very high quality cDNA sequences available for human suggest the possibility that high quality annotations might be produced without using trans alignments. A system that does not use trans alignments might be more accurate than one that does, since all alignments would have near 100% identity. Even if its accuracy were merely equal to that of a system using trans alignments, the evidence supporting each prediction might be considered more direct.

To build an annotation pipeline without trans alignment, we combined a number of tools that have been recently developed in our lab. These tools include Pairagon, a cDNA-to-genome aligner, N-SCAN_EST [4], a multi-genome gene predictor capable of taking guidance from EST alignments, and PPFINDER [5], a program for eliminating pseudogenes from sets of predicted protein-coding genes.
Pairagon uses a PairHMM to produce native cDNA alignments

To produce the best possible alignments of high quality cDNA sequences, we used Pairagon, a cDNA-to-genome aligner that is based on a pairHMM probability model [6]. A pairHMM is a hidden Markov model (HMM) whose states emit alignment columns. In our case, the columns contain either a match between the two sequences, a mismatch, an insertion in the genome, a deletion in the genome, or an intron base in the genome (Figure 1).
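To make the idea of states emitting alignment columns concrete, the following is a minimal sketch that scores one fixed sequence of alignment columns under a toy pairHMM. It is not Pairagon's model: the state names, the transition table `TRANS`, and the value of `P_MATCH` are hypothetical placeholders. Only two details are taken from Materials and methods: the match probability is split evenly over the 4 matching base pairs and the mismatch probability over the 12 mismatching pairs, and single-base states emit all bases with equal probability. The sketch also assumes, for simplicity, that the path may start in any state at no cost.

```python
import math

# Hypothetical probability that a column in the aligned state is a match.
P_MATCH = 0.99

# Hypothetical transition probabilities between column types (each row sums to 1).
TRANS = {
    "aligned":       {"aligned": 0.97, "genome_insert": 0.01,
                      "genome_delete": 0.01, "intron": 0.01},
    "genome_insert": {"aligned": 0.50, "genome_insert": 0.50},
    "genome_delete": {"aligned": 0.50, "genome_delete": 0.50},
    "intron":        {"aligned": 0.001, "intron": 0.999},
}


def emission_prob(state, cdna_base, genome_base):
    """Probability that `state` emits the given alignment column."""
    if state == "aligned":
        if cdna_base == genome_base:
            return P_MATCH / 4          # 4 possible matching pairs
        return (1.0 - P_MATCH) / 12     # 12 possible mismatching pairs
    # Insert, delete and intron states emit one base, all bases equally likely.
    return 0.25


def path_log_prob(columns):
    """Log-probability of a list of (state, cdna_base, genome_base) columns."""
    total = 0.0
    previous = None
    for state, cdna_base, genome_base in columns:
        if previous is not None:
            p = TRANS[previous].get(state, 0.0)
            if p == 0.0:
                return float("-inf")    # transition not allowed in this toy model
            total += math.log(p)
        total += math.log(emission_prob(state, cdna_base, genome_base))
        previous = state
    return total


# Two exon bases, a three-base "intron" present only in the genome, one more exon base.
example = [
    ("aligned", "A", "A"),
    ("aligned", "C", "C"),
    ("intron", None, "G"),
    ("intron", None, "T"),
    ("intron", None, "A"),
    ("aligned", "G", "G"),
]
print(path_log_prob(example))
```

An aligner such as Pairagon searches over all such paths for the most probable one; this sketch only evaluates a path that is already given.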
In order to make Pairagon run faster, we ran ungapped BLASTN as a preprocessing step and used the long alignments it produced to seed exon alignments (Figure 2).
N-SCAN_EST threads complete gene structures through EST alignments

In the genomic regions between Pairagon's cDNA alignments, we combined EST alignments with de novo gene prediction by using N-SCAN_EST [4]. N-SCAN_EST is based on N-SCAN [12,13], a multi-genome de novo gene predictor, which was the most accurate de novo predictor in the EGASP assessment [14] by every measure except nucleotide sensitivity. (De novo includes both the 'ab initio' and 'multi-genome' assessment categories.) N-SCAN_EST is a version of N-SCAN that takes guidance from EST alignments. Specifically, it takes as input a representation of EST alignments that we call ESTseq, by analogy to the 'conservation sequence' used in TWINSCAN (a three-character alphabet representing genome sequence conservation between two species) [15,16].

N-SCAN_EST takes guidance from EST alignments, but it does not follow them blindly. Instead, it also considers the DNA sequence of the target genome and the evolutionary conservation information provided by alignments of the target genome with the genomes of other organisms. It predicts complete transcripts by extending or merging EST alignments or by building gene structures in which some exon regions are supported by EST evidence while others are not. We have shown elsewhere that this approach increases sensitivity and specificity not only for the genes that have EST support, but even for those that do not [4].

Pairagon+N-SCAN_EST annotates genomes without using trans alignment

To apply N-SCAN_EST, we downloaded human ESTs from dbEST and aligned them to the human genome using BLAT [8] (Figure 2).

In the remainder of this paper we present accuracy statistics for both the EGASP version of the pipeline and an updated version and analyze the relative contributions of Pairagon versus N-SCAN_EST. We then examine a series of examples where our pipeline gave a revealing result, whether correct or incorrect.
Finally, we draw some lessons about how the pipeline could be improved in the future.

Results and discussion

RefSeq and MGC cDNA sequences mapped to the ENCODE regions were downloaded from the UCSC Genome Browser and alignments were generated using the Stepping Stone implementation of Pairagon v0.5 as described in Materials and methods. GenBank's coding sequence (CDS) annotations of these cDNA sequences were used to produce 451 aligned transcripts annotated with GenBank ORFs (141 from MGC sequences and 310 from RefSeq sequences). Merging identical gene structures and removing inconsistent structures (for example, gap in the coding region leading to a frame shift in the genome) yielded 413 unique gene structures.

N-SCAN_EST predictions were generated as described in Materials and methods. The 94 N-SCAN_EST predictions that did not overlap the 413 Pairagon gene structures were added to the gene set. We obtained seven gene structures by aligning sequences from our RT-PCR experiments. Two of these did not overlap the existing set and were included in our submission to the 'any evidence' category. We do not discuss this set in detail because it is almost identical to the submission to the 'mRNA/EST evidence' category. The accuracy statistics for this set can be found in the EGASP assessment report [14].

The official assessment of Pairagon+N-SCAN_EST shows high accuracy

Table 1 compares the coding region prediction accuracy measures of three submissions to the EGASP 'mRNA/EST evidence' category at the gene, transcript, exon and nucleotide levels. Pairagon+N-SCAN_EST (Pairagon+N) is optimized for high accuracy in predicting exact exons and transcripts, so we will focus our analysis on those columns of Table 1. By both measures, ExoGean is the most sensitive of the three programs and Pairagon+N is the most specific; ENSEMBL is intermediate except in exact exon specificity, where it falls below the other two.
None of the programs completely dominates any other, although one might argue that Pairagon+N has a slight edge, since the margin by which its specificity exceeds that of the second best program is substantially larger than the margin by which its sensitivity falls below the others. In absolute numbers, our pipeline identifies almost the same number of correct Gencode transcript structures as ENSEMBL (255 versus 258, respectively), and 21 fewer than ExoGean, but we have many fewer incorrect transcripts (149 versus 205 from ENSEMBL and 237 from ExoGean). Their gene accuracy measures are slightly better than ours because ENSEMBL and ExoGean predict more transcripts per gene locus on average. Predicting more transcripts at a locus increases the chance that at least one of them is correct, yielding a true positive by the gene measure, while no penalty in false positives is incurred for the additional incorrect transcripts. This is arguably a flaw in the gene level measure when applied to systems that can predict more than one transcript per locus.
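The transcript counts quoted above can be turned into a rough worked example. The sketch below assumes the usual gene-prediction definition of specificity, correct predictions divided by all predictions, and treats each correct Gencode transcript structure as exactly one correct prediction, so the ratios it computes are derived only from the counts in the text and need not equal the official Table 1 values. The second function illustrates the gene-level effect under a hypothetical simplification that is not from the paper: each transcript predicted at a locus is correct independently with the same probability `p`.

```python
def specificity(correct, incorrect):
    """Fraction of predicted transcripts that are correct."""
    return correct / (correct + incorrect)


# (correct, incorrect) transcript counts as quoted in the text.
# ExoGean's correct count is inferred from "21 fewer than ExoGean".
COUNTS = {
    "Pairagon+N": (255, 149),
    "ENSEMBL": (258, 205),
    "ExoGean": (255 + 21, 237),
}


def gene_hit_probability(p, transcripts_per_locus):
    """Chance that at least one predicted transcript at a locus is correct.

    Assumes (hypothetically) independent transcripts, each correct with
    probability p. Extra incorrect transcripts cost nothing at gene level.
    """
    return 1.0 - (1.0 - p) ** transcripts_per_locus


for name, (correct, incorrect) in COUNTS.items():
    print(f"{name}: {specificity(correct, incorrect):.3f}")

# With p fixed, predicting more transcripts per locus raises the gene-level hit rate.
for k in (1, 2, 3):
    print(k, gene_hit_probability(0.5, k))
```

Under these assumptions the gene-level hit rate can only go up as more transcripts are predicted per locus, which is the asymmetry the paragraph above describes.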
Pairagon's cDNA alignments are highly accurate

The individual accuracies of Pairagon and N-SCAN_EST gene structures in the submission are given in Table 2. Pairagon's nucleotide and exon specificities are 98.8% and 96.1%, respectively. Pairagon is also very accurate in identifying splice sites - we estimated that 98.3% of the introns that Pairagon identified have supporting evidence in the Gencode reference genes. When there is high quality mRNA evidence, more than three-fourths of transcript structures predicted by Pairagon are correct.
Identifying the correct splice boundaries is the crucial step in cDNA-to-genome alignment, and here Pairagon proves to be extremely accurate. Out of the 1,834 introns Pairagon predicted (both within and outside coding regions), only 22 introns from 15 transcript structures were not supported by HAVANA annotation. Three of them (from a single transcript) matched the introns of a Gencode gene labeled 'putative' and eight of them were a result of using incorrect seed exons from BLASTN (discussed in detail below). The remaining 11 were from RefSeq cDNAs that have no evidence in HAVANA annotation. Two of the eleven aligned to the reference genome with numerous mismatches. There are 22 unique GC-AG introns in the protein coding part of the HAVANA annotation. Pairagon correctly identifies 12 of these. The remaining 10 are missed because they did not have supporting RefSeq or MGC cDNA sequence. When other systems prefer a GT dinucleotide, especially if it occurs close to the actual GC donor site, Pairagon gets the GC splice boundaries correct (Figure 3).
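As a small illustration of the splice-site classes discussed here, the sketch below classifies an intron by its first and last two bases and looks up a prior for that class. The three priors are the values that Materials and methods reports as assigned for Pairagon v0.5 (98.9% GT-AG, 1.0% GC-AG, 0.1% AT-AC); the function names and the choice to return 0.0 for any other combination are assumptions made for this example, not Pairagon's code.

```python
# Priors assigned to the splice-site combinations for Pairagon v0.5,
# as stated in Materials and methods.
SPLICE_PRIORS = {
    ("GT", "AG"): 0.989,
    ("GC", "AG"): 0.010,
    ("AT", "AC"): 0.001,
}


def splice_class(intron_seq):
    """Return the (donor, acceptor) dinucleotides of an intron sequence."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 bases long")
    return seq[:2], seq[-2:]


def splice_prior(intron_seq):
    """Prior for the intron's splice-site class; 0.0 if not a modeled class.

    Returning 0.0 for other combinations is a simplification for this sketch.
    """
    return SPLICE_PRIORS.get(splice_class(intron_seq), 0.0)


for intron in ("GTAAGTCTTTCAG", "gcaagtctttcag", "ATATCCTTTTAAC", "CTAAGTCTTTCAG"):
    donor, acceptor = splice_class(intron)
    print(f"{donor}-{acceptor}", splice_prior(intron))
```

Because GC-AG carries a small but nonzero prior, a model of this kind can still select a GC donor when the surrounding exon alignment supports it, rather than always shifting to a nearby GT.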
In the Stepping Stone implementation of Pairagon, the accuracy of the final alignment depends on how well the seed exons are mapped in the genome (see Materials and methods and Figure 4).
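Materials and methods notes that optimal alignment costs time and space proportional to the product of the two sequence lengths, and that seeds from BLASTN are used to reduce that cost. The sketch below is a simplified illustration of why seeds help and why their placement matters. It is not the Stepping Stone algorithm itself: it assumes, hypothetically, that each seed is accepted as fixed and that full dynamic programming is run only in the rectangles between consecutive seeds. The function names and the seed tuple layout are invented for this example.

```python
def full_dp_cells(cdna_len, genome_len):
    """Cells in the full dynamic programming matrix (per HMM state)."""
    return cdna_len * genome_len


def seeded_dp_cells(cdna_len, genome_len, seeds):
    """Cells left to fill when seeds are trusted and only gaps are aligned.

    seeds: list of (cdna_start, cdna_end, genome_start, genome_end),
    half-open intervals, sorted and collinear in both sequences.
    """
    cells = 0
    prev_cdna_end, prev_genome_end = 0, 0
    for cdna_start, cdna_end, genome_start, genome_end in seeds:
        if cdna_start < prev_cdna_end or genome_start < prev_genome_end:
            raise ValueError("seeds must be sorted and non-overlapping")
        # Rectangle between the previous seed and this one.
        cells += (cdna_start - prev_cdna_end) * (genome_start - prev_genome_end)
        prev_cdna_end, prev_genome_end = cdna_end, genome_end
    # Rectangle after the last seed.
    cells += (cdna_len - prev_cdna_end) * (genome_len - prev_genome_end)
    return cells


# Hypothetical 700-base cDNA against a 30,000-base genomic region with two seeds.
seeds = [(10, 290, 1000, 1280), (310, 690, 20000, 20380)]
print(full_dp_cells(700, 30000))
print(seeded_dp_cells(700, 30000, seeds))
```

In this simplified picture a misplaced seed fixes part of the alignment in the wrong genomic location, which is consistent with the text's observation that final accuracy depends on how well the seed exons are mapped.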
Pairagon's accuracy has improved since the official evaluation

Since the EGASP assessment, we have made several improvements to both Pairagon's probability model and its implementation. We have retrained Pairagon using its own alignments of 20,594 MGC cDNA sequences to 21,249 loci on the human genome. Several bug-fixes and optimizations have resulted in a faster and more robust program with lower memory requirements. Table 3 lists the accuracy measures of the current version of Pairagon (v0.95) when aligning the same cDNA sequences used for the assessment. Pairagon v0.95 shows improvement in all accuracy measures. It now identifies 22 more correct Gencode transcripts and 162 more correct exons with a small improvement in specificity as well. Thus, the accuracy of our pipeline using Pairagon v0.95 is substantially better than that of the version submitted for the assessment, which was already as good as, or slightly better than, that of the other entrants. Of course, other systems have likely improved as a result of this exercise, too.
A lack of biological evidence raises questions about ORF annotation

Identifying the coding region in (even) a full-length mRNA is an extremely difficult problem. NCBI and HAVANA do not always agree in their CDS annotations of mRNA sequences, even if they agree on the exon-intron structures. Because we relied on the CDS annotations from NCBI, a few of our gene predictions are incorrect according to HAVANA, although the underlying alignment is correct. For example, GenBank's annotated translation start sites for cDNA sequences BC001940 and NM_001004759.1 are 798 bases downstream and 81 bases upstream of HAVANA's annotated translation start sites in Gencode genes AC005538.1-001 and AC011711.3-001, respectively. A few more of our ORF predictions obtained from correct alignments are labeled incorrect because HAVANA has not made any CDS annotations on the exon-intron structures yet. For example, the exon-intron structure of our gene NM_181879.1 from aligning a reviewed RefSeq mRNA NM_181879.1 matches that of Gencode reference gene AC008984.1-003, which does not have a CDS annotation. Since the biological evidence supporting the GenBank ORF annotations, if any, is not available for evaluation, we might do better by using a modified version of N-SCAN to predict ORFs on aligned cDNA sequences.

N-SCAN_EST performs well on complete GENCODE test regions

After the release of the HAVANA annotations, we found that N-SCAN_EST predictions used to fill the gaps between Pairagon alignments had a very high proportion of incorrect genes - the gene/transcript specificity of the original N-SCAN_EST predictions was 8.5% in regions that did not overlap Pairagon alignments (gene and transcript specificity are the same for programs that predict only one transcript per locus). However, this is due largely to the fact that there are high quality cDNA sequences covering most of the real genes in the ENCODE regions.
When these are not used and N-SCAN_EST's predictions on the complete GENCODE test regions are evaluated, their specificity is 38.7% (Table 2). In the ENCODE regions, the accuracy of N-SCAN_EST is due in large part to the accuracy of N-SCAN itself (this may not hold in less gene-dense regions). Table 4 compares the five submissions to the Dual or Multiple Genome category of EGASP that score the highest on exons, transcripts, and genes. N-SCAN scores the highest in all categories except for nucleotide sensitivity. In terms of exon specificity, N-SCAN is 4.8% better than the next best system (Dogfish) and in transcript specificity 18% better than the next best system (Augustus-dual). For transcript and exon sensitivity, N-SCAN is 4.7% and 4.6% better, respectively, than any other system except TWINSCAN-MARS. N-SCAN outperforms TWINSCAN-MARS by about 1% transcript sensitivity and 2% exon sensitivity. TWINSCAN-MARS has relatively high sensitivity in part because it predicts several transcripts per gene, for which it pays a price in specificity. Even with the hit it takes in specificity, TWINSCAN-MARS is among the top three performers, especially at the transcript level. This may be explained, in part, by the fact that N-SCAN and TWINSCAN-MARS share nearly identical models for DNA sequence [16], although their conservation models are quite different.
N-SCAN's ability to explicitly model untranslated regions (UTRs) [12,13,19] facilitates the distinction between coding and non-coding exons (Figure 6).
When the genome sequence and conservation do not provide sufficient information about the coding potential of a gene locus, EST evidence can be very useful in gene prediction (Figure 7).
Conclusion

The results of this exercise have demonstrated two things. First, this careful community assessment has been very valuable, particularly for the way in which it uncovered weaknesses in, and inspired improvements to, Pairagon and other systems. Second, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments, under certain circumstances. However, annotation accuracy in the EGASP assessment is determined largely by the accuracy with which high quality native cDNA sequences can be aligned, and secondarily by the accuracy with which HAVANA's ORF calls on those cDNA sequences, or lack thereof, can be anticipated. We cannot extrapolate the results of this exercise to situations in which fewer full length cDNAs and/or fewer ESTs are available. In such situations, the accuracy of our pipeline would depend more on N-SCAN and N-SCAN_EST, while the accuracy of ENSEMBL would depend more on trans alignments.

In future assessments, it would be worthwhile to assess prediction pipelines under a range of scenarios between the two evaluated this time - freedom to use all available native cDNA and prohibition against using any. In particular, the selective elimination of cDNA and EST sequences from the available pool would shed light on the tradeoffs among different approaches under a range of situations of practical significance (see [4] for such a study on Pairagon+N-SCAN_EST).

Materials and methods

Pairagon gene predictions

The state diagram of Pairagon's pairHMM model for cDNA-to-genome alignment is given in Figure 1. We implemented the Viterbi algorithm, an optimal dynamic programming algorithm for finding the most probable alignment between two sequences, in C.
Although the optimal algorithm produced accurate alignments, its time and space complexity increases in proportion to the product of the sizes of the input sequences, which limits the size of the sequences that can be aligned. Therefore, we adapted the Stepping Stone algorithm [20], a heuristic modification to the optimal algorithm. Stepping Stone relies on faster seeded alignment programs like BLASTN to identify regions of high identity between the cDNA and the genomic sequence (diagonal lines in Figure 4).

Pairagon v0.5 was trained using 15,766 BLAT alignments of 15,297 MGC [1,9,10] cDNA sequences to the human genome build NCBI35 (May 2005). Transition probabilities between the states were estimated from the alignments using maximum likelihood. Because this was a bootstrap procedure, and BLAT does not pay careful attention to splice sites, we assigned reasonable estimates for probabilities of GT-AG, GC-AG and AT-AC splice site combinations (98.9%, 1.0% and 0.1%, respectively). All bases were equally probable in states RG1, RC1, RG2, RC2, G, C and Intron. The probability of a match in the aligned state was estimated using maximum likelihood and was evenly distributed among the four possible combinations. Similarly, the probability of a mismatch in the aligned state was distributed among the 12 possible combinations.

Ungapped local alignments between the cDNA sequences and the unmasked ENCODE regions were generated using BLASTN [21] with parameters M = 1 N = -3. These approximate seed exons were then used by the Stepping Stone implementation of Pairagon v0.5 to generate an alignment. GenBank CDS annotations of the cDNA sequences were used to convert these alignments into gene structures.

N-SCAN gene predictions

N-SCAN_EST gene predictions

Human ESTs, downloaded from dbEST on 20 January 2005, were aligned to the whole human genome (build NCBI35) by BLAT [8].
For each EST sequence, the alignment with the greatest number of bases matching the genome was selected. Alignments with at least 98% of the bases in the entire EST matching the genome were chosen to generate an ESTseq for each chromosome. ESTseq parameters were estimated from regions corresponding to a set of cleaned RefSeq annotations containing 17,798 transcripts. An additional 1,000 bases on either side of the genes were used to train intergenic regions. The genome sequence was masked for putative processed pseudogenes using PPFINDER [5]. ESTseqs corresponding to the ENCODE regions were obtained by cutting the relevant sections out of the chromosomal ESTseq, and N-SCAN_EST was then used to predict genes.

Pairagon+N-SCAN_EST pipeline

A block diagram showing the steps involved in generating Pairagon gene structures and N-SCAN_EST gene predictions, and combining them, is given in Figure 2.

Acknowledgements

We are grateful to Jeltje van Baren for help with her PPFINDER software for detection of processed pseudogenes in gene annotation sets. Thanks also to the organizers of the GENCODE evaluation, including especially Roderic Guigó and Paul Flicek. This work was supported in part by grants U01 HG003150 (ENCODE) and R01 HG02278 from the National Human Genome Research Institute and by Contract N01-CO-12400 from the National Cancer Institute (Mammalian Gene Collection).

This article has been published as part of Genome Biology Volume 7, Supplement 1, 2006: EGASP '05. The full contents of the supplement are available online at http://genomebiology.com/supplements/7/S1.

References