Proc Natl Acad Sci U S A. Dec 23, 2008; 105(51): 20286–20290.
Published online Dec 18, 2008. doi:  10.1073/pnas.0807813105
PMCID: PMC2629301
Cell Biology

Identification of gene 3′ ends by automated EST cluster analysis

Abstract

The properties and biology of mRNA transcripts can be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3′ ends of transcripts essential for understanding their regulation. Here we show that 22–52% of sequences in commonly used human and murine “full-length” transcript databases may not currently end at bona fide polyadenylation sites. To identify probable transcript termini over the entire murine and human genomes, we analyzed the EST databases for positional clustering of EST ends. The analysis yielded 58,282 murine and 86,410 human candidate polyadenylation sites, of which 75% mapped to 23,091 known murine transcripts and 22,891 known human transcripts. The murine dataset correctly predicted 97% of the 3′ ends in a manually curated and experimentally supported benchmark transcript set. Of currently known genes, 15% had no associated prediction and 25% had only a single predicted termination site. The remaining genes had an average of 3–4 alternative polyadenylation sites predicted for each murine or human transcript, respectively. The results are made available in the form of tables and an interactive web site that can be mined for rapid assessment of the validity of 3′ ends in existing collections, enumeration of potential alternative 3′ polyadenylation sites of known transcripts, direct retrieval of terminal sequences for design of probes, and detection of polyadenylation sites not currently mapped to known genes.

Keywords: 3′ UTR, gene prediction, alternative polyadenylation, transcriptome, transcript probe design

The 3′ ends of nascent mRNA transcripts are generated by a multifactorial complex that recognizes a well-defined hexamer polyadenylation signal (PAS, typically AAUAAA or AUUAAA) in a context that includes other less clearly defined motifs, cleaves the RNA 16–28 nt downstream of the PAS, and adds the characteristic polyA tail. Many transcripts are cleaved and polyadenylated at alternative sites that may depend on cellular contexts (1–5). The 3′ UTRs of transcripts frequently contain motifs that regulate their stability and ribosomal translation (6) and their translocation to the cytoplasm. Additionally, these 3′ UTRs of transcripts may contain miRNA targets and short hairpin loops of regulatory significance. Indeed, as many as 40% of miRNA targets have been estimated to be located in alternative 3′ UTR segments (7). The properties and biology of transcripts can therefore be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3′ ends of transcripts essential for understanding their regulation. There are additional practical motivations for determining 3′ ends. For many genes, similarities in coding regions may make it necessary to find probes that target the UTRs to achieve the desired specificity. Furthermore, methods for amplification of cDNA are finding increasing use in circumstances wherein amounts of available RNA are too small, for example, for direct application to microarrays (8, 9). Because amplification protocols are typically initiated by oligo(dT) priming on transcript polyA tails, there is an inherent bias toward best preservation of abundance of terminal 3′ sequences. Amplified cDNAs are therefore ideally interrogated by probes that target sequences as close to 3′ ends as is practical.

For access to 3′ terminal sequences, biologists usually turn to highly curated collections of complete mRNA sequences, such as the RefSeq (10), Ensembl (11), UCSC KnownGene (12), FANTOM (3) and VEGA (13) collections. Probes in widely used commercial microarrays are similarly selected from sequence collections representative of full-length transcripts, from which probes near 3′ polyadenylation sites could be generated in principle. Affymetrix, for example, has published collections of murine and human transcript sequences. However, it cannot be taken for granted that transcripts in public or commercial collections do in fact include sequences near all used alternative sites or that a given transcript sequence is complete to any of its possible termini. At the present time, no survey of the extent of completeness of the sequences in popular collections is available, leaving the onus on the user to determine whether a given transcript sequence ends at a valid polyadenylation site or not and whether other more frequently used sites exist upstream or downstream of available sequence ends.

A useful methodology exists (14–19) for locating polyadenylation sites based on the large and growing databases of EST sequences. These short sequences are sampled from larger cDNA clones, which are generally reverse-transcribed from polyadenylated cellular transcripts after priming with oligo(dT). NCBI's GenBank contained over 96 million such sequences as of March 2007. The high level of EST redundancy means that the expression of each gene is described in each organism by multiple independently derived ESTs, while their preparation based on oligo(dT) priming results in a preponderance of ESTs aligning to and terminating at 3′ polyadenylation sites.

Here we report results directly addressing the needs of researchers for methodology and readily accessible databases facilitating the identification of 3′ transcript ends. First, an analysis of the main public murine and human transcript collections provides evidence that up to half of available sequences may not end at true polyadenylation sites. We further describe and validate an EST-based method with improved prediction of candidate polyadenylation sites. The method was applied to both murine and human genomes to yield sets of predicted 3′ ends mapped against the currently known genes. The results are presented in 4 datasets (Dataset S1, Dataset S2, Dataset S3, and Dataset S4), 2 text files (UCSCSessionmm8 and UCSCSessionhg18), and an interactive web site (www.ogic.ca/ts) that can be mined for rapid assessment of the validity of 3′ ends in existing collections, enumeration of potential alternative 3′ polyadenylation sites of known transcripts, direct retrieval of terminal sequences for the design of probes, and detection of polyadenylation sites not currently mapped to known genes. (For additional details, see also SI Text, Figs. S1–S8, and Tables S1–S5.)

Results

To obtain a comprehensive first approximation of the degree of completeness of available transcript sequence collections, we determined the proportion of 3′ termini in various murine and human collections (including RefSeq, ENSEMBL, UCSC KnownGene, FANTOM, and VEGA) that contained a hexamer PAS. The results summarized in Table 1 indicate that despite the extensive hand curation involved in their creation, 22–52% of sequences in individual collections lack a terminal PAS and are thus likely not to end at sites of polyadenylation. We will refer here to 3′ ends of database transcript sequences as “nominal ends.”

Table 1.
Proportions of transcript sequence collections containing a 3′ PAS
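
The survey summarized in Table 1 amounts to asking, for each database sequence, whether a PAS hexamer lies close to its nominal 3′ end. A minimal sketch of such a check follows. The 40-nt search window and the function names are assumptions made for illustration, not values or code from this study, and the sketch presumes that any polyA tail has already been trimmed from the sequence.

```python
# Canonical hexamers named in the text, written as DNA (U -> T).
PAS_HEXAMERS = ("AATAAA", "ATTAAA")


def has_terminal_pas(transcript, window=40, hexamers=PAS_HEXAMERS):
    """Return True if a PAS hexamer occurs in the last `window` nt.

    `window` is an assumed search width, not a value from the paper.
    """
    tail = transcript.upper().replace("U", "T")[-window:]
    return any(h in tail for h in hexamers)


def fraction_without_pas(transcripts, window=40):
    """Fraction of sequences whose nominal 3' end lacks a PAS hexamer."""
    if not transcripts:
        return 0.0
    missing = sum(not has_terminal_pas(t, window) for t in transcripts)
    return missing / len(transcripts)
```

Applied to a whole collection, fraction_without_pas would give a figure comparable in kind to the percentages reported in Table 1, although the exact numbers depend on the window chosen and on which variant hexamers are accepted.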

For automating the process of detection of potential polyadenylation sites, we sought to identify clusters of EST ends associated with a nearby upstream PAS and ending within a few bases of each other when aligned with the genome. The method is summarized here and described in greater detail in the SI Text. To determine the coordinates of genomic EST alignments, we used the UCSC genome annotation (20). Because of the relatively low accuracy of EST sequencing and recent genomic duplications, some ESTs can be aligned with more than one genomic position. To avoid possible misidentifications, ambiguously aligning ESTs were excluded from our analysis. For end-cluster detection, we plotted the number of matching ESTs against position along the entire genome. This analysis revealed, as expected, that the numbers of aligned ESTs gradually increase toward transcript 3′ ends (Fig. 1) and then abruptly fall, suggesting that the plot could be used to infer the direction of transcription. We used an approach that exploited a convolution algorithm to detect and quantify the shape of such edges, allowing adjustment of the algorithm parameters. We then tuned parameters and detection thresholds on various transcript collections to optimize the relationship between the numbers of database sequence ends detected and the proportion of predicted ends corresponding to nominal ends. Fig. S3 illustrates optimization of 1 parameter, the minimum number of EST ends required to identify a potential polyadenylation site.
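
The edge-detection idea can be illustrated with a small sketch: build a per-position count of aligned ESTs, then slide a step-shaped kernel along it so that positions where coverage is high just upstream and low just downstream receive a high score. The kernel shape, the half-width of 5 nt, and the score threshold below are illustrative assumptions; the actual algorithm and its tuned parameters are described in SI Text. The sketch handles plus-strand transcripts only (the minus strand would mirror it) and does not enforce the PAS requirement.

```python
def coverage_profile(alignments, length):
    """Count ESTs covering each position.

    alignments: iterable of (start, end) pairs, 0-based, end-exclusive.
    """
    diff = [0] * (length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    cov, running = [], 0
    for d in diff[:length]:
        running += d
        cov.append(running)
    return cov


def edge_scores(cov, half_width=5):
    """Slide a step kernel (+1 upstream, -1 downstream) along the profile.

    scores[i] is large and positive when coverage is high over the
    half_width positions ending at i and low over the next half_width.
    """
    n = len(cov)
    scores = [0.0] * n
    for i in range(half_width - 1, n - half_width):
        up = sum(cov[i - half_width + 1 : i + 1])
        down = sum(cov[i + 1 : i + 1 + half_width])
        scores[i] = (up - down) / half_width
    return scores


def call_ends(cov, half_width=5, min_score=2.0):
    """Return local maxima of the edge score that reach min_score."""
    s = edge_scores(cov, half_width)
    return [
        i
        for i in range(1, len(s) - 1)
        if s[i] >= min_score and s[i] > s[i - 1] and s[i] >= s[i + 1]
    ]
```

For a set of ESTs that start at different positions but all end at the same base, call_ends returns that last aligned base, while the gradual rise in coverage further upstream produces only negative scores.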

Fig. 1.
EST evidence for alternative 3′ ends for murine Pde7a transcripts. (A) The diagram, obtained from the UCSC Genome Browser (Mouse mm8, February 2006 Assembly) (22), illustrates a region of mouse chromosome 3 spanning 8 kb, including the 3′UTR ...

The automated analysis identified 58,282 and 86,410 EST clusters on the murine and human genomes, respectively, that contain at least 2 EST ends and an appropriately positioned PAS. These candidate 3′ ends were mapped in relationship to the UCSC KnownGene (12) collection and assembled in Dataset S1, Dataset S2, Dataset S3, and Dataset S4. The legends to the datasets are located at the end of the Supplementary Text. Approximately 75% of candidate ends lay within KnownGene transcription zones or up to 10 kb downstream of their nominal ends. Within the total KnownGene collection, 15% of genes had no associated prediction, 25% had only a single predicted termination site, and the remaining 60% had an average of 3–4 alternative polyadenylation sites predicted for each murine or human transcript, respectively. The absence of predictions for 15% of the KnownGene collection could reflect termination at rare variant PAS motifs, insufficient local EST representation, or complexity in the EST signature arising from nearby transcript ends on the opposite strand. KnownGenes may also lack 3′ termini because they are incomplete; an example discussed below is the murine Mll2 gene (Table 2), the true termini of which are likely those detected in the ensuing downstream KnownGene. Of murine RefSeq genes originally lacking terminal PAS, 57% were assigned candidate ends in our analysis.
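
The mapping step can be pictured as a strand-aware interval test: a candidate end is assigned to a gene if it falls within the gene's transcription zone or up to 10 kb downstream of the nominal end. The sketch below is one possible way to do this. The record layout (name, chrom, strand, start5, end3) and the function names are hypothetical and do not reflect the format of the datasets.

```python
def offset_from_nominal_end(site, nominal_end, strand):
    """Signed distance in nt; positive means downstream in the
    direction of transcription, negative means upstream."""
    return site - nominal_end if strand == "+" else nominal_end - site


def assign_sites(genes, sites, downstream=10_000):
    """Map candidate sites to genes.

    genes: list of dicts with keys name, chrom, strand,
           start5 (5' end) and end3 (nominal 3' end); hypothetical layout.
    sites: list of (chrom, strand, position) tuples.
    Returns {gene name: [offsets from the nominal 3' end]}.
    """
    hits = {g["name"]: [] for g in genes}
    for chrom, strand, pos in sites:
        for g in genes:
            if g["chrom"] != chrom or g["strand"] != strand:
                continue
            length = abs(g["end3"] - g["start5"])
            off = offset_from_nominal_end(pos, g["end3"], strand)
            if -length <= off <= downstream:
                hits[g["name"]].append(off)
    return hits


def summarize(hits):
    """Count genes with no, one, or several assigned candidate ends."""
    counts = {"none": 0, "single": 0, "multiple": 0}
    for offs in hits.values():
        key = "none" if not offs else "single" if len(offs) == 1 else "multiple"
        counts[key] += 1
    return counts
```

On a minus-strand gene, coordinates decrease in the direction of transcription, so a site at a lower genome coordinate than the nominal end receives a positive (downstream) offset. The three categories returned by summarize correspond to the 15%, 25%, and 60% classes described above.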

Table 2.
Predicted 3′ ends for murine genes Pde7a, Rnf11, Mll2, and Zdhhc5 extracted from Dataset S1 and Dataset S2

To assess the accuracy of identification of polyadenylation sites, we needed a benchmark set of transcripts ending at validated 3′ polyadenylation sites. For this purpose, we used a collection of 113 murine genes chosen originally on the basis of their biological interest and accordingly representing an essentially random sampling relative to our objective here. For each transcript, a single likely 3′ transcript terminus was chosen by hand curation based on identification of clustered EST ends occurring closely downstream of a PAS (Table S2). The validity of each curated 3′ terminus was tested by specific secondary PCR probing of total cDNA, initially amplified globally from murine ES cells or purified hematopoietic precursor cells (21) under stringent conditions of oligo(dT) primer annealing and reverse transcription (8). The global RT-PCR amplification procedure yields a mixture of individual cDNA fragments, each confined to a 300- to 500-nt window immediately upstream of a polyA sequence (8). PCR primers (Table S3) were synthesized to target sequences contained within 300 nt upstream of the predicted polyadenylation sites and used to probe for the presence of their targets in globally amplified cDNA. Such targets would only be present in the global cDNA if they were located closely upstream of polyA sequences in the original RNA templates. Fragments of the predicted size were amplified in each instance (Fig. S5), providing experimental support for the usage of all 113 curated polyadenylation sites.
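
Selecting a primer target region of this kind requires the transcript-sense sequence immediately upstream of a predicted end, which on the minus strand means reverse-complementing the genomic sequence. The following sketch shows one way to extract such a window. The function names and the 0-based coordinate convention are assumptions for illustration, and primer design itself (melting temperature, specificity) is outside its scope.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_window(chrom_seq, site, strand, length=300):
    """Return up to `length` nt of transcript-sense sequence ending at `site`.

    chrom_seq: plus-strand chromosome sequence.
    site: 0-based index of the last transcribed base (assumed convention).
    """
    if strand == "+":
        return chrom_seq[max(site - length + 1, 0) : site + 1]
    # Minus strand: upstream in the transcript lies at higher coordinates.
    return reverse_complement(chrom_seq[site : site + length])
```

With length=300 the returned sequence corresponds to the region within which the PCR primers described above were targeted.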

Each curated end was then mapped against the complete set of predicted ends listed in Dataset S1 and Dataset S2. Table S4 lists the genome coordinates of the curated benchmark set, the corresponding nominal database ends, and the automatically identified ends. Of the benchmark set, 110 (97%) were exactly matched by automatically identified ends. Of 31 curated benchmark 3′ termini that differed from the nominal database ends, all were correctly identified by the computational procedure within a tolerance of 50 nt. Of the 3 curated ends that were undetected by the automated algorithm, Cbx1 shares 3′ terminal sequence with other genes leading to the exclusion of relevant terminal ESTs from the automated procedure, whereas Ring1 apparently uses an anomalous PAS at its curated end (AACAAA). Despite the presence of many EST ends at the Phf1 transcript terminus, ESTs originating from a gene on the opposite strand and terminating close to the 3′ end of Phf1 interfered with edge detection by the convolution function. The high proportion of valid benchmark ends included in our automated prediction set extrapolates to a high rate of inclusion of valid ends in our murine and human prediction collections. Table S5 further describes the predictive precision of our murine dataset against various transcript collections, and Fig. S4 documents the improved performance of our dataset relative to earlier analyses and available curated transcript sets.
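
The benchmark comparison reduces to a recall calculation: the fraction of curated ends for which a predicted end lies within a given tolerance on the same chromosome and strand. A minimal sketch follows; the tuple layout and function name are assumptions for illustration, while the 50-nt tolerance and the 110-of-113 count come from the text.

```python
def recall_at_tolerance(curated, predicted, tolerance=50):
    """Fraction of curated ends matched by a predicted end within
    `tolerance` nt on the same chromosome and strand.

    curated, predicted: lists of (chrom, strand, position) tuples.
    """
    by_key = {}
    for chrom, strand, pos in predicted:
        by_key.setdefault((chrom, strand), []).append(pos)
    matched = 0
    for chrom, strand, pos in curated:
        candidates = by_key.get((chrom, strand), [])
        if any(abs(pos - p) <= tolerance for p in candidates):
            matched += 1
    return matched / len(curated) if curated else 0.0


# Worked arithmetic for the benchmark reported above:
# 110 matched ends out of 113 curated ends.
print(round(110 / 113, 3))  # 0.973
```

Recall measures only how many valid ends are recovered. Precision, the fraction of predictions that are valid, is assessed separately (Table S5).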

Dataset S1, Dataset S2, Dataset S3, and Dataset S4 can be used to devise cDNA probes for the 3′ ends of any particular transcript. For each predicted terminus, 400 genomic nucleotides are included in the table corresponding to the terminal alignment of one EST ending at the predicted polyadenylation site. Priming at A-rich tracts internal to transcripts could yield ESTs whose polyA tails originate from genomic sequence rather than polyadenylation (Fig. 1). Therefore 40 nt of sequence downstream of the predicted terminus are also supplied with a flag to indicate whether a downstream A-rich tract is present.
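
A flag of this kind could be computed by scanning the 40 nt of genomic sequence downstream of a predicted end for a stretch that might itself anneal to oligo(dT). The sketch below is illustrative only: the thresholds (a run of 6 consecutive A residues, or at least 8 A residues in any 10-nt window) are assumptions and are not the criteria used to generate the datasets. The input is expected in transcript orientation.

```python
def has_a_rich_tract(downstream, run=6, window=10, min_a=8):
    """Flag downstream genomic sequence that could prime oligo(dT).

    Thresholds are illustrative assumptions: a run of `run` consecutive
    A's, or at least `min_a` A's within any `window`-nt stretch.
    """
    seq = downstream.upper()
    if "A" * run in seq:
        return True
    for i in range(max(len(seq) - window + 1, 0)):
        if seq[i : i + window].count("A") >= min_a:
            return True
    return False
```

A predicted end flagged in this way is not necessarily spurious, but it deserves closer inspection before being chosen as a probe target.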

Four practical examples of use of the prediction tables are described here in detail. Table 2 contains the pertinent information for 4 murine transcripts as extracted from Dataset S1.

Murine Pde7a.

The RefSeq for Pde7a lacks a polyadenylation signal near its nominal 3′ end (Table 2, row 1, column 9). The alignment of EST ends to the 3′ end of this transcript was shown in Fig. 1, where 3 termination maxima were clearly evident, none of which occurred at the nominal RefSeq end. Pde7a was located by a search of column 4 in Dataset S1 and Dataset S2, and the corresponding data were copied to Table 2. The Pde7a transcription interval on the genome is represented in the table by the alignment of RefSeq NM_008802 on chromosome 3 from position 19457108 (5′) to 19418068 (3′). Within the limits of −1,000 to +10,000 nt from the nominal 3′ end of NM_008802, 3 EST termination clusters are indicated (Table 2, rows 3–5, column 11), each comprising 7–33 EST ends (column 12). A possible fourth alternative ending is indicated in row 2, lying further upstream at 20,166 nt downstream of the RefSeq start. The predicted termini (Table 2, rows 3 and 4) at 138 nt upstream and 1,228 nt downstream of the nominal RefSeq end are supported by 33 and 24 ESTs and would be reasonable targets for 3′ probe construction. The values in Table 2, column 13, indicate that the EST termini are distributed over a 36-nt interval (“termination zone”) for the predictions in both rows 3 and 4, descending from the start of the interval indicated in column 6. The 3′ terminal sequence of a representative EST (Table 2, column 15) at each predicted terminus is given in column 16. For the prediction in Table 2, row 3, the value in column 14 locates the maximum number of EST endings to a position 13 nt upstream of the 3′ end of the sequence in column 16. Each prediction in Dataset S1 and Dataset S2 has an associated link to the UCSC genome browser (22). The browser view links to information on the tissue origin of the individual ESTs terminating at a predicted end that could narrow the choice of particular probes among alternative endings. The view can also be used to determine whether there is a continuous pattern of overlapping ESTs aligning with the genome between a nominal end and a predicted end further downstream. EST continuity would support membership of the predicted end in the same transcriptional unit. Additional information related to position of a predicted end and the evidence used to generate it can be examined by using the link in Dataset S1 and Dataset S2 to our Transcriptome Sailor web server (www.ogic.ca/ts).

Murine Rnf11.

Lookup of the gene symbol “Rnf11” in Dataset S1 and Dataset S2 identifies 3 predicted alternative polyadenylation sites (Table 2, rows 8–10), each of which lies downstream (column 11) of the nominal RefSeq end (row 7, columns 3 and 7). The absence of a nominal PAS flag in Table 2, column 9, indicates that the nominal RefSeq terminus lacks a PAS and is therefore unlikely to be a site of polyadenylation. One of the predicted alternative ends, supported by 17 EST ends, is located 1,750 nt downstream of the corresponding nominal RefSeq end on the genome. The view, accessible by using the corresponding link to the UCSC browser, is shown in Fig. S6. The UCSC browser data indicate a continuous pattern of overlapping ESTs aligning with the genome between the nominal and predicted ends, providing evidence for linkage of the predicted end to the same transcriptional unit. This predicted site was among the set verified experimentally within this study as described above.

Murine Mll2.

EST termination clusters were not detected in the Mll2 transcription interval on the genome as represented by the BC058659 mRNA sequence (Table 2, row 12). However, a distinct transcript AK039901 is shown (Table 2, row 13) beginning 1,500 nt downstream of the nominal end of BC058659, within which 2 potential alternative endings are detected a short distance upstream of its nominal end (Table 2, rows 14 and 15, column 11). The UCSC browser indicates a lack of continuity of overlapping ESTs in the interval between the 3′ end of BC058659 and the 5′ end of AK039901, suggesting these might be unrelated transcripts. However, analysis of the sequence in the unpopulated intervals by using an RNA folding algorithm reveals 2 regions of high-energy internal folding (Fig. S7), which could render RNA transcripts inaccessible to cDNA generation in this region and thus obscure their transcriptional continuity in the EST databases. The analysis suggests that the predicted end near position 98659766 could represent a 3′ polyadenylation site of Mll2. Supporting evidence was obtained by probing this target by specific RT-PCR in hematopoietic and other cells by using the strategy described for Fig. S5. The results yielded an expression pattern that was expected for Mll2.

Murine Zdhhc5.

The exemplar RefSeq for Zdhhc5 is shaded gray in Table 2 (row 17) and in Dataset S1 and Dataset S2, indicating that its nominal 3′ transcript end overlaps another gene transcribed in the opposite direction on the positive genome strand. The predicted polyadenylation site (Table 2, row 18) located 181 bases upstream of the nominal RefSeq end is similarly shaded in gray. The narrow width (Table 2, column 13) of the termination zone of the corresponding EST cluster suggests that it belongs to Zdhhc5 and not to the gene ending on the opposite strand, as explained in the legend to Dataset S1 and Dataset S2. The corresponding UCSC browser view is shown in Fig. S8.

Discussion

The approach outlined here, together with tables relating predicted polyadenylation sites to known genes, significantly reduces the effort required to identify bona fide transcript ends. The prediction tables are based on methodology that improves on earlier EST-based analyses (14–19) and demonstrates a 97% level of predictive recall on a hand-curated and experimentally supported benchmark transcript set. Exact and comprehensive definition of alternative 3′ transcript ends will allow for the design of probes that discriminate between alternative forms and thereby support elucidation of the regulatory impact of alternative polyadenylation. The described methods will also facilitate measurement of global gene expression in amplified cDNAs representative only of 3′ transcript termini. Redesign of microarrays with probes uniformly positioned close to true 3′ ends will significantly reduce the sequence-based, secondary structure-mediated biases that are inherent in transcribing and labeling copies of mRNA and which may be magnified when amplification of target sequences is attempted.

Materials and Methods

Implementation and tuning of the algorithm for automated recognition of EST end clusters is described in detail in SI Text, along with measurements of recall and precision and comparison of these parameters against earlier reported implementations. Procedures used in generating Dataset S1, Dataset S2, Dataset S3, and Dataset S4 are described in the legend to Dataset S1 and Dataset S2. PCR probing of globally amplified cDNA samples used to test for the presence of curated transcript ends is described in the legend to Fig. S5.

We have implemented our predictions in a web tool (Transcriptome Sailor, www.ogic.ca/ts) to allow for their examination in a genomic context. The web site also provides access to the complete datasets used for this study and to updates, as generated with future refinement of our algorithms.

Acknowledgments.

We thank Christopher J. Porter for assistance with database maintenance and Carl Virtanen for helpful discussions. This work was supported by funds from the Ontario Innovation Trust, the Canadian Foundation for Innovation, the Ontario Research and Development Challenge Fund, the Terry Fox Foundation, and the Stem Cell Network. M.A.A.-N. is a Canada Research Chair in Bioinformatics.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0807813105/DCSupplemental.

References

1. Tian B, Hu J, Zhang H, Lutz C. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 2005;33:201–212. [PMC free article] [PubMed]
2. Zhang H, Lee JY, Tian B. Biased alternative polyadenylation in human tissues. Genome Biol. 2005;6:R100. [PMC free article] [PubMed]
3. Carninci P, Kasukawa T, Katayama S, Gough J, Frith M, et al. The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563. [PubMed]
4. Kan Z, States D, Gish W. Selecting for functional alternative splices in ESTs. Genome Res. 2002;12:1837–1845. [PMC free article] [PubMed]
5. Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 2001;11:889–900. [PMC free article] [PubMed]
6. Mignone F, et al. UTRdb and UTRsite: A collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2005;33:D141–D146. [PMC free article] [PubMed]
7. Majoros WH, Ohler U. Spatial preferences of microRNA targets in 3′ untranslated regions. BMC Genomics. 2007;8:152. [PMC free article] [PubMed]
8. Iscove NN, et al. Representation is faithfully preserved in global cDNA amplified exponentially from sub-picogram quantities of mRNA. Nat Biotechnol. 2002;20:940–943. [PubMed]
9. Kenzelmann M, et al. High-accuracy amplification of nanogram total RNA amounts for gene profiling. Genomics. 2004;83:550–558. [PubMed]
10. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): A curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. [PMC free article] [PubMed]
11. Hubbard TJ, et al. Ensembl 2007. Nucleic Acids Res. 2007;35:D610–D617. [PMC free article] [PubMed]
12. Hsu F, et al. The UCSC known genes. Bioinformatics. 2006;22:1036–1046. [PubMed]
13. Ashurst JL, et al. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2005;33:D459–D465. [PMC free article] [PubMed]
14. Brockman JM, et al. PACdb: PolyA cleavage site and 3′-UTR database. Bioinformatics. 2005;21:3691–3693. [PubMed]
15. Zhang H, Hu J, Recce M, Tian B. PolyA_DB: a database for mammalian mRNA polyadenylation. Nucleic Acids Res. 2005;33:D116–D120. [PMC free article] [PubMed]
16. Yan J, Marr TG. Computational analysis of 3′-ends of ESTs shows four classes of alternative polyadenylation in human, mouse, and rat. Genome Res. 2005;15:369–375. [PMC free article] [PubMed]
17. Lopez F, Granjeaud S, Ara T, Ghattas B, Gautheret D. The disparate nature of “intergenic” polyadenylation sites. RNA. 2006;12:1794–1801. [PMC free article] [PubMed]
18. Moucadel V, Lopez F, Ara T, Benech P, Gautheret D. Beyond the 3′ end: Experimental validation of extended transcript isoforms. Nucleic Acids Res. 2007;35:1947–1957. [PMC free article] [PubMed]
19. Lee JY, Yeh I, Park JY, Tian B. PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res. 2007;35:D165–D168. [PMC free article] [PubMed]
20. Hinrichs AS, Karolchik D, Baertsch R, Barber G, Bejerano G, et al. The UCSC genome browser database: Update 2006. Nucleic Acids Res. 2006;34:D590–D598. [PMC free article] [PubMed]
21. Benveniste P, Cantin C, Hyam D, Iscove NN. Hematopoietic stem cells engraft in mice with absolute efficiency. Nat Immunol. 2003;4:708–713. [PubMed]
22. Karolchik D, Hinrichs AS, Kent WJ. The UCSC genome browser. Curr Protoc Bioinformatics. 2007;Chapter 1:Unit 1.4. [PubMed]
23. Imanishi T, Itoh T, Suzuki Y, Donovan CO, Fukuchi S, et al. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004;2:e162. [PMC free article] [PubMed]
