Proc Natl Acad Sci U S A. 2000 Mar 28; 97(7): 3491–3496.
Medical Sciences

Shotgun sequencing of the human transcriptome with ORF expressed sequence tags


Theoretical considerations predict that amplification of expressed gene transcripts by reverse transcription–PCR using arbitrarily chosen primers will result in the preferential amplification of the central portion of the transcript. Systematic, high-throughput sequencing of such products would result in an expressed sequence tag (EST) database consisting of central, generally coding regions of expressed genes. Such a database would add significant value to existing public EST databases, which consist mostly of sequences derived from the extremities of cDNAs, and facilitate the construction of contigs of transcript sequences. We tested our predictions, creating a database of 10,000 sequences from human breast tumors. The data confirmed the central distribution of the sequences, the significant normalization of the sequence population, the frequent extension of contigs composed of existing human ESTs, and the identification of a series of potentially important homologues of known genes. This approach should make a significant contribution to the early identification of important human genes, the deciphering of the draft human genome sequence currently being compiled, and the shotgun sequencing of the human transcriptome.

The identification and sequencing of human expressed sequences (cDNAs) plays a role synergistic with complete genome determination and represents a direct link to functional genomics (1–5). In particular, cDNAs greatly aid exon identification and are essential for determination of tissue and pathology-specific exon usage in the form of alternatively spliced variants (6–8). Furthermore, repeated partial sequencing of expressed sequences, yielding so-called expressed sequence tags (ESTs), has proved a powerful means of identification of genetic polymorphisms (9–11) and for determination of differential gene expression (12–18). To date, more than 1,500,000 human ESTs have been generated and deposited in GenBank, derived principally from the Merck Gene Index Project and the Cancer Genome Anatomy project (refs. 19 and 20; http://www.ncbi.nlm.nih.gov/dbEST). Clustering of these sequences shows that at least some have been derived from an estimated 86,000 different human genes but only approximately 11% of these have a full-length sequence (UniGene build 98, November 3, 1999, http://www.ncbi.nlm.nih.gov/UniGene/index.html). Moreover, approximately 65% of ESTs represent the 3′ extremity of cDNAs and 26% represent the 5′ extremity of cDNAs, resulting in a very biased representation of expressed gene sequences (see Fig. 2). In consequence, a current limitation in analyzing human genes is the relative lack of sequences derived from the central portions of transcripts. We have found, however, that such sequences can be generated systematically and efficiently in a high-throughput format, potentially permitting the rapid, shotgun sequencing of the human transcriptome.

Figure 2
A comparison of the actual percentage of ORESTES and 5′ and 3′ ESTs that pass through the relative position of full-length cDNA sequences. The figure was constructed by using all human full-length cDNAs of more than 1 kb currently in GenBank, ...

The basis of the approach we have taken is to generate short cDNA templates of less than twice the length of an average sequencing read by reverse transcription–PCR using arbitrarily selected, nondegenerate primers (either singly or in pairs) under low-stringency conditions as we have described previously (21). It is not possible to predict (in the absence of complete transcript sequence information) with which transcripts an arbitrary primer will bind or the position of primer binding within any given transcript. The position of amplified fragments within transcripts is, in contrast, highly ordered and predictable, with a high percentage of fragments encompassing the midportions of genes. To demonstrate this, we have generated more than 10,000 sequences (which we refer to as ORF ESTs or ORESTES) from PCR fragments derived from the central, coding regions of human breast tumor transcripts by using the protocol described.

Materials and Methods

Template Preparation and DNA Sequencing.

Tissue samples obtained from excised breast tumors, after explicit informed consent of patients, from the Hospital do Câncer A.C. Camargo, São Paulo, were frozen in liquid nitrogen immediately after resection. They then were allowed to partially thaw to −20°C and microdissected to enrich for tumor cells in the sample. Total RNA was extracted with Trizol, and RNA degradation was evaluated by means of a Northern Blot by using a GAPDH cDNA probe. Those samples with intact mRNA were treated with DNaseI (10 units/50 μg of total RNA), and the absence of contaminating genomic DNA was confirmed by PCR using primers for the mitochondrial D loop and for the p53 gene. The amplified product was blotted onto nylon membranes and hybridized with [α-32P]dCTP-labeled probes for the corresponding amplified sequences. Qualified samples, with no detectable DNA, then were processed for isolation of poly(A)+ RNA (MiniMacs; Miltenyi Biotec, Auburn, CA). To produce cDNA templates, samples of 10–100 ng of the purified mRNA were heated at 65°C for 5 min and then subjected to reverse transcription at 37°C for 60 min in the presence of 200 units of mouse murine leukemia virus reverse transcriptase and 15 pmol of a randomly selected primer in a final volume of 20 μl. The criteria for primer selection were GC content of more than 50% and length of 18–25 nt. No specific sequence constraints were imposed. Indeed, almost exclusively, the primers used originally had been designed for specific PCR amplification of DNA sequences in nonhuman genomes and were exploited here if they obeyed the simple criteria listed above. After cDNA synthesis, one microliter of a 1:5 dilution of the single-stranded cDNA then was amplified by PCR by using the same or a single, alternative primer. 
Amplification profiles were generated by using the following cycling parameters: an initial cycle of 95°C for 5 min, 37°C for 2 min, and 72°C for 2 min followed by 35 cycles of 95°C for 45 sec, 45°C for 1 min, and 72°C for 90 sec. Three microliters of each pool was checked for complexity on 8% silver-stained polyacrylamide gels. Product pools with a single, predominant product (≈1%) reflecting the amplification of a highly abundant gene were not processed further. The remaining amplification pools with multiple bands then were cloned into pUC18 by using the Sureclone kit (Amersham Pharmacia). Minipreps for sequencing the inserts were prepared by alkali lysis or boiling preparations and sequenced by using the Perkin–Elmer Big-Dye reagent kit with ABI377 sequencers. In general, 50–200 sequences were determined from each amplification profile.

Computational Analysis.

Simulation of the amplification process was undertaken by searching for matches (60% identity including the 3′ nucleotide) between five totally random 20-mer sequences and the sequence of all full-length cDNAs currently in GenBank. In all cases in which a match was found, a reverse, complementary match within the same cDNA was sought. In all cases in which two complementary matches were found (taken as indicating a successful PCR amplification), the relative position of the amplicon within the cDNA was noted and used in the compilation of curves showing the percentage of amplicons passing through each percentage point of the genes analyzed. When the effect of amplicon size was analyzed, the whole set of amplicons was first searched for those of the desired size range, which then were used to construct the curves against a representative transcript of unit length.
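The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the program used in the study: the function names are hypothetical, the 60% identity threshold and the requirement for an identical 3′ nucleotide are taken from the text, and the way a forward site is paired with a downstream site on the opposite strand is our assumption about what counts as "two complementary matches".

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_sites(primer: str, template: str, min_identity: float = 0.6) -> list[int]:
    """Return start positions where `primer` matches `template`.

    A site counts as a match when the primer's 3' (last) nucleotide is
    identical and at least `min_identity` of all positions are identical.
    """
    k = len(primer)
    sites = []
    for i in range(len(template) - k + 1):
        window = template[i:i + k]
        if window[-1] != primer[-1]:
            continue  # the 3' nucleotide must pair exactly
        matches = sum(a == b for a, b in zip(primer, window))
        if matches / k >= min_identity:
            sites.append(i)
    return sites


def simulate_amplicons(primer: str, cdna: str,
                       min_identity: float = 0.6) -> list[tuple[float, float]]:
    """Return (start, end) of simulated amplicons as fractions of cDNA length.

    Assumption: a site on the sense strand paired with a non-overlapping,
    downstream site on the antisense strand is a successful amplification.
    """
    k, n = len(primer), len(cdna)
    forward = primer_sites(primer, cdna, min_identity)
    # Antisense-strand sites, converted back to sense-strand start coordinates.
    antisense = primer_sites(primer, reverse_complement(cdna), min_identity)
    reverse = [n - (j + k) for j in antisense]
    amplicons = []
    for f in forward:
        for r in reverse:
            if r >= f + k:
                amplicons.append((f / n, (r + k) / n))
    return amplicons
```

Running `simulate_amplicons` over a collection of full-length cDNAs with several random 20-mers yields the relative amplicon positions from which the curves are compiled; filtering the returned pairs by length before plotting would reproduce the analysis of amplicon size.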

An automated protocol for the analysis of the experimentally generated data was used to: (i) assess sequence quality, (ii) trim vector and primer sequence, (iii) remove undesirable sequences such as bacterial, mitochondrial, and rRNA sequences, (iv) mask repetitive elements, and (v) undertake serial blast searches against existing databases. Sequence quality was determined by counting the number of “N” nucleotides generated by the ABI base caller. We excluded sequences whose level of “N” nucleotides was higher than 20%. We also trimmed sequences by analyzing a window of 30 nt. Windows with more than six “Ns” were deleted. Mitochondrial and rRNA sequences were identified by fasta searches against the GenBank entry corresponding to the human mitochondria complete genome sequence and against a locally developed human rRNA database, respectively. Masking of repetitive elements was performed by using repeatmasker under default parameters. Searches against existing databases were performed with blast (22) by using default parameters. Significant hits were determined by using an E value of 10⁻⁵ for searches against protein databases and an E value of 10⁻¹⁵ for searches against DNA databases.
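The two quality rules stated above (rejecting reads with more than 20% “N” and deleting 30-nt windows with more than six “Ns”) can be illustrated as follows. This is only a sketch under stated assumptions: the function names are hypothetical, and because the text does not say whether the windows slide or are taken consecutively, the example assumes consecutive, non-overlapping windows.

```python
def n_fraction(seq: str) -> float:
    """Fraction of bases called as 'N' (0.0 for an empty sequence)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def passes_quality(seq: str, max_n_fraction: float = 0.20) -> bool:
    """Keep a read unless its 'N' content is higher than the cutoff."""
    return n_fraction(seq) <= max_n_fraction


def drop_poor_windows(seq: str, window: int = 30, max_n: int = 6) -> str:
    """Delete windows that hold more than `max_n` Ns.

    Assumption: windows are consecutive and non-overlapping; the last
    window may be shorter than `window`.
    """
    kept = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        if chunk.upper().count("N") <= max_n:
            kept.append(chunk)
    return "".join(kept)
```

In a pipeline of the kind described, a read would first be tested with `passes_quality` and, if retained, cleaned with `drop_poor_windows` before vector trimming and the database searches.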

Results and Discussion

The amplification of any given point within a transcript, by reverse transcription–PCR, requires primer binding on both sides of that point. The chance of each of these events occurring will be proportional to the lengths of the sequences on both sides of the point. Thus, if the size of a transcript is taken as 1 and the distance from the 3′ end of the cDNA to the point in question is taken as S, then the probability of each appropriate primer template interaction will be S and 1 − S, respectively. In consequence, the probability of amplification will be S(1 − S). The chance of any fragment containing the midpoint of the gene will be 0.5 × 0.5 = 0.25, whereas that of any fragment containing a point 1/10th of the way along the gene will be 0.1 × 0.9 = 0.09, etc. Plotting the values of this expression for multiple points along a transcript of unit length results in a symmetrical curve around the midpoint of the transcript (Fig. 1). This simple concept does not take into account the actual proportion of primer template interactions that yield products or the influence of amplicon length on the coverage of the transcript. We therefore explored the concept by undertaking a series of simulations of PCR under low-stringency conditions using the known sequence of all available, full-length human cDNAs in GenBank. The plots generated confirmed the predicted symmetry around the midpoint of the transcripts and revealed, as would be expected, a higher percentage of amplicons passing through the midpoint of the transcript as amplicon size is increased (Fig. 1).
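The expression S(1 − S) is simple enough to tabulate directly. The short Python sketch below evaluates it along a transcript of unit length; the function names and the choice of 101 sampling points are ours and serve only as an illustration of the predicted curve.

```python
def amplification_probability(s: float) -> float:
    """Relative probability, S(1 - S), that a fragment spans the point at `s`."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie between 0 and 1")
    return s * (1.0 - s)


def predicted_curve(points: int = 101) -> list[tuple[float, float]]:
    """Sample S(1 - S) at evenly spaced positions along a unit-length transcript."""
    step = points - 1
    return [(i / step, amplification_probability(i / step)) for i in range(points)]
```

Evaluating the function at 0.5 and 0.1 gives the two values quoted in the text (0.25 and 0.09, up to floating-point rounding), and the sampled curve peaks at the midpoint and is symmetrical about it.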

Figure 1
The predicted, simulated, and experimentally determined position of ORESTES. The smooth, solid curve shows the predicted percentage of ORESTES that should contain the point, with the relative position shown within a hypothetical transcript. The curves ...

On the basis of the theoretical considerations and simulated amplifications, we generated a set of 10,122 ORESTES from human breast tumor mRNA, of which 1,207 were derived from either mitochondrial transcripts or rRNA, 135 were derived from bacterial contaminants, whereas 855 consisted entirely or almost entirely of repetitive elements precluding their further useful analysis (Table 1). Of the remaining 8,058 sequences, 6,501 were unique. These were divided among sequences that exhibited similarity with known, full-length cDNA sequences, those with similarity to human ESTs, and those without significant similarity to previously identified human transcripts.

Table 1
Categories of ORESTES

We used the nonredundant compilation of those sequences that matched reportedly full-length cDNA sequences to investigate the distribution of ORESTES within transcripts. The percentage of sequences passing through different points along the transcripts followed very closely the predicted distribution with an almost symmetrical curve around the midpoint of the transcripts (Fig. 1). In accord with their centralized distribution, 71% of ORESTES that matched full-length cDNAs were wholly or partially composed of known ORFs as judged by blast against the nonredundant protein database. To investigate whether the ORESTES strategy was likely to add a significant percentage of new sequences to the public databases, we also compared their distribution with those of 5′ and 3′ ESTs against the same full-length genes for which ORESTES sequences were generated (Fig. 2). The data show a clear complementarity of the ORESTES data with that already deposited in GenBank that should permit the rapid construction of contigs covering full-length cDNAs. In the generation of this figure, only complete cDNAs of more than 1 kb were used (excluding some 30% of complete gene sequences) because in very short sequences all the EST data essentially are superimposed, making their relative contributions hard to distinguish. Thus, the ORESTES curve is slightly lower in Fig. 2, as would be predicted from the effect of the relative sizes of the amplicon and transcript.
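A curve of the kind described, giving the percentage of sequences that pass through each relative position of a transcript, can be computed from the aligned coordinates of each sequence. The following Python sketch assumes that every match has already been reduced to a (start, end) pair expressed as fractions of the length of the matched full-length cDNA; the function name and input format are hypothetical.

```python
def coverage_curve(fragments: list[tuple[float, float]],
                   points: int = 101) -> list[float]:
    """Percentage of fragments passing through each relative position.

    `fragments` holds (start, end) pairs given as fractions (0 to 1) of the
    length of the full-length cDNA that each fragment matches.
    """
    if not fragments:
        return [0.0] * points
    curve = []
    for i in range(points):
        position = i / (points - 1)
        covering = sum(1 for start, end in fragments if start <= position <= end)
        curve.append(100.0 * covering / len(fragments))
    return curve
```

Applying the same function separately to ORESTES, 5′ ESTs, and 3′ ESTs aligned against one set of full-length cDNAs would give curves that can be compared directly.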

To examine whether the strategy adopted permits sequence analysis of less abundant gene transcripts, as we would expect based on our previous work (21), we listed the UniGene cluster size of the nonredundant compilation of ORESTES that matched fully sequenced human genes. By way of comparison, similar data were generated from nonnormalized and normalized human breast cDNA libraries (Fig. 3). We used only contigs containing full-length cDNAs for the analysis to avoid the strong bias that otherwise would have occurred toward the relative lack of ORESTES matches against small UniGene clusters that contain almost exclusively 3′ reads. The mean cluster size containing ESTs derived from the nonnormalized breast library was 649, the mean cluster size containing ESTs derived from the normalized breast library was 351, and the mean size of clusters against which ORESTES exhibited significant sequence similarity was 318. The median values for the cluster sizes were 317, 138, and 125, respectively. A rigorous comparative analysis of the ORESTES and standard ESTs cannot be pursued because different tissue samples were used. Nevertheless, ORESTES appears to exhibit a performance similar to normalized libraries in terms of accessing genes with lower levels of expression. This is indicated further by the percentage of ORESTES (25.26%) and standard ESTs (10.23% and 18.06% for the nonnormalized and normalized breast libraries, respectively) that exhibited sequence similarity to UniGene clusters of 50 or fewer entries. The basis of this partial equalization of the frequency of sequences is that the chance of the amplification of a particular transcript is dependent on its sequence and not its abundance. Because the percentage of very highly abundant transcripts is very small (23), most amplifications do not result in products derived from very highly expressed genes.
This immediately increases the relative abundance of the products derived from the less abundant transcripts in the cell. (Note that in those amplification profiles in which a highly abundant transcript has been amplified this is immediately apparent upon electrophoretic analysis, and the profile is eliminated from further processing.) In addition to this selection, there is also a significant tendency to equalize the relative abundance of amplicons in individual product pools because of the Cot effect of PCR (24).

Figure 3
Comparison of abundance of ORESTES and ESTs. (A) Nonnormalized breast tumor cDNA library NCI CGAP Br1.1. (B) Normalized breast tumor cDNA library NCI CGAP Br2. (C) ORESTES. The bars show the percentage of nonredundant sequences with similarity to full-length ...

The evaluation of the ORESTES strategy provided a considerable amount of new sequence information. Thirty-eight percent of the nonredundant compilation of ORESTES did not exhibit significant similarity with expressed human sequences (Table 1). A small fraction of these exhibits significant similarity against full or partial cDNA sequences from other organisms (Tables 1 and 2) or low levels of similarity against known human transcripts, indicating that they probably derive from orthologous or paralogous genes, respectively (Table 2). A total of 32% of the nonredundant compilation, however, showed no similarity to known expressed genes from any organism. We would expect these sequences to include those of marginal quality, undetected bacterial sequences, sequences derived from immature mRNA, or, possibly, sequences derived from undetected trace amounts of contaminating DNA. Nevertheless, 40% of those ORESTES that exhibited no database matches were predicted to contain ORFs by using estscan (25) or grail (26), and a number of these exhibited a significant match against a variety of domain profiles (Table 3). By way of comparison, 68% of ORESTES that exhibited similarity to known genes were identified as containing coding regions by estscan. Thus, we can predict that the majority of ORESTES with no matches contain high-quality data derived from expressed human genes.

Table 2
Putative paralogs–orthologs
Table 3
ORESTES with matches to protein domains

The potential of ORESTES to act as the basis of a shotgun approach to the sequencing of human transcripts was demonstrated by the 21% of sequences that partially or wholly matched preexisting human ESTs, from which we were able to construct 783 contigs, each one corresponding to a different UniGene cluster. Of the total number of bases contained in these contigs, 19% represented new sequence contributed by ORESTES. The contigs assembled to date mostly comprise extensions of the sequences contained in 3′ or 5′ ESTs but also a few instances in which the reads from the ends of cDNAs are joined by an ORESTES sequence or, indeed, different clusters are joined.

Knowledge of the complete sequence of the noncoding regions of the human genome will provide the basis of the definition of many facets of gene structure and expression. Nevertheless, it is the coding regions contained within the genome that represent the information of most crucial and immediate importance. These regions probably constitute only about 3% of the human genome. Although 5′ ESTs often fall within coding regions and have been pursued vigorously with this characteristic in mind, ORESTES offer the possibility of significantly extending the coverage of coding regions with ESTs. Using the different complementary EST approaches now available, it eventually may even be possible to contemplate the generation of a shotgun sequence of the human transcriptome. We currently are embarking on the production of approximately 500,000 human ORESTES that will be deposited in the public databases as they are generated. Given the simplicity of the methodology, the small amounts of starting material required, and the speed of data generation, the simultaneous adoption of this approach in diverse laboratories could lead rapidly to the determination of the majority of the human transcriptome.


We thank Rui C. Serafim, Ricardo P. Moura, Elisangela Monteiro, Anna Christina de Matos Salim, and Daniel F. Simão for dedicated and expert technical assistance and Juçara Parra for acting as the administrative coordinator of this project. E.D.N., R.G.C., W.S.J., and S.B. were supported by doctoral or postdoctoral fellowships from the Fundaçao de Amparo à Pesquisa do Estado de São Paulo. The work also was supported in part by the Programa de Apoio a Nucleos de Excelencia/Financiadora de Estudos e Projetos and the Fundaçao de Amparo à Pesquisa do Estado de São Paulo.


Abbreviation: EST, expressed sequence tag.


1. Adams M D, Dubnick M, Kerlavage A R, Moreno R, Kelley J M, Utterback T R, Nagle J W, Fields C, Venter J C. Nature (London) 1992;355:632–634.
2. Adams M D, Kerlavage A R, Fleischmann R D, Fuldner R A, Bult C J, Lee N H, Kirkness E F, Weinstock K G, Gocayne J D, White O. Nature (London) 1995;377:3–174.
3. Hillier L D, Lennon G, Becker M, Bonaldo M F, Chiapelli B, Chissoe S, Dietrich N, DuBuque T, Favello A, Gish W, et al. Genome Res. 1996;6:807–828.
4. Nomura N, Miyajima N, Sazuka T, Tanaka A, Kawarabayasi Y, Sato S, Nagase T, Seki N, Ishikawa K, Tabata S. DNA Res. 1994;1:27–35.
5. Pandey A, Lewitter F. Trends Biochem Sci. 1999;24:276–280.
6. Bailey L C J, Searls D B, Overton G C. Genome Res. 1998;8:362–376.
7. Burke J, Wang H, Hide W, Davison D B. Genome Res. 1998;8:276–290.
8. Jiang J, Jacob H J. Genome Res. 1998;8:268–275.
9. Buetow K H, Edmonson M N, Cassidy A B. Nat Genet. 1999;21:323–325.
10. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Lane C R, Lim E P, Kalayanaraman N, Nemesh J, et al. Nat Genet. 1999;22:231–238.
11. Picoult-Newberg L, Ideker T E, Pohl M G, Taylor S L, Donaldson M A, Nickerson D A, Boyce-Jacino M. Genome Res. 1999;9:167–174.
12. Gress T M, Muller-Pillasch F, Geng M, Zimmerhackl F, Zehetner G, Friess H, Buchler M, Adler G, Lehrach H. Oncogene. 1996;13:1819–1830.
13. Huang G M, Ng W, Farkas J, He L, Liang H A, Gordon D, Yu J, Hood L. Genomics. 1999;59:178–186.
14. Jay P, Diriong S, Taviaux S, Roeckel N, Mattei M G, Audit M, Berge-Lefranc J L, Fontes M, Berta P. Genomics. 1997;39:104–108.
15. Malone K, Sohocki M M, Sullivan L S, Daiger S P. Mol Vis. 1999;5:5.
16. Nelson P S, Ng W L, Schummer M, True L D, Liu A Y, Bumgarner R E, Ferguson C, Dimak A, Hood L. Genomics. 1998;47:12–25.
17. Neubauer G, King A, Rappsilber J, Calvio C, Watson M, Ajuh P, Sleeman J, Lamond A, Mann M. Nat Genet. 1998;20:46–50.
18. Vasmatzis G, Essand M, Brinkmann U, Lee B, Pastan I. Proc Natl Acad Sci USA. 1998;95:300–304.
19. Aaronson J S, Eckman B, Blevins R A, Borkowski J A, Myerson J, Imran S, Elliston K O. Genome Res. 1996;6:829–845.
20. Strausberg R L, Dahl C A, Klausner R D. Nat Genet. 1997;15:415–416.
21. Neto E D, Harrop R, Correa-Oliveira R, Wilson R A, Pena S D, Simpson A J. Gene. 1997;186:135–142.
22. Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Nucleic Acids Res. 1997;25:3389–3402.
23. Zhang L, Zhou W, Velculescu V E, Kern S E, Hruban R H, Hamilton S R, Vogelstein B, Kinzler K W. Science. 1997;276:1268–1272.
24. Mathieu-Daude F, Welsh J, Vogt T, McClelland M. Nucleic Acids Res. 1996;24:2080–2086.
25. Iseli C, Jongeneel C V, Bucher P. Ismb. 2000, in press.
26. Uberbacher E C, Mural R J. Proc Natl Acad Sci USA. 1991;88:11261–11265.
