Copyright © 2005 by RNA Society

Detecting novel low-abundant transcripts in Drosophila

1Department of Medicine, 2Department of Ecology and Evolution, 3Ben May Institute for Cancer Research, and 4Department of Computer Science, University of Chicago, Chicago, Illinois 60637, USA
5Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
6Department of Biological Sciences, Purdue University Calumet, Hammond, Indiana 46323, USA
7Center for Functional Genomics, ENH Research Institute, Northwestern University, Evanston, Illinois 60201, USA

Reprint requests to: San Ming Wang, Center for Functional Genomics, ENH Research Institute, Northwestern University, Evanston, IL 60201, USA; e-mail: swang1/at/northwestern.edu; fax: (224) 364-5003.

Received November 17, 2004; Accepted February 23, 2005.

Abstract

Increasing evidence suggests that low-abundant transcripts may play fundamental roles in biological processes. In an attempt to estimate the prevalence of low-abundant transcripts in eukaryotic genomes, we performed a transcriptome analysis in Drosophila using the SAGE technique. We collected 244,313 SAGE tags from transcripts expressed in Drosophila embryonic, larval, pupal, adult, and testicular tissue. From these SAGE tags, we identified 40,823 unique SAGE tags. Our analysis showed that 55% of the 40,823 unique SAGE tags are novel without matches in currently known Drosophila transcripts, and most of the novel SAGE tags have low copy numbers. Further analysis indicated that these novel SAGE tags represent novel low-abundant transcripts expressed from loci outside of currently annotated exons including the intergenic and intronic regions, and antisense of the currently annotated exons in the Drosophila genome.
Our study reveals the presence of a significant number of novel low-abundant transcripts in Drosophila, and highlights the need to isolate these novel low-abundant transcripts for further biological studies.

Keywords: low abundant, transcript, SAGE, EST, Drosophila, genome

INTRODUCTION

RNA reassociation experiments indicate that the majority of transcripts expressed in eukaryotic genomes are present at low levels (Bishop et al. 1974). While low-abundant transcripts were traditionally considered unimportant “noise” transcripts, recent studies suggest that they may be biologically significant, with roles in cellular differentiation, metabolism, and phenotypic alteration (Reanney et al. 1983; Elowitz et al. 2002; Kuznetsov et al. 2002; Ozbudak et al. 2002; Blake et al. 2003; Paulsson 2004). Low-abundant transcripts may also be a driving force in the evolutionary process (Alvarez 2001). Answers to some fundamental biological questions may emerge from the systematic study of low-abundant transcripts.

Low-abundant transcripts must be isolated before their biological roles can be determined. Despite intensive efforts on transcript identification, however, little is known about the prevalence of low-abundant transcripts, primarily because of isolation difficulties due to their small mass and high heterogeneity (Bishop et al. 1974; Holland 2002; Czechowski et al. 2004).

SAGE (serial analysis of gene expression) is a method for genome-level transcript analysis (Velculescu et al. 1995). By isolating a short tag from each transcript and concatemerizing multiple tags for a single sequencing reaction, SAGE provides high sensitivity for transcript detection. SAGE can detect both known and novel transcripts, and provides quantitative information about the detected transcripts (Zhou et al. 2001; Saha et al. 2002).

Drosophila is a well-established eukaryotic animal model.
The Drosophila genome has been well sequenced and annotated, and its transcriptome has been extensively characterized by the large-scale EST approach (Adams et al. 2000; Rubin et al. 2000; Stapleton et al. 2002). Compared with higher eukaryotic genomes, the smaller size of the Drosophila genome enables Drosophila SAGE tags to represent their original transcripts and to map in the Drosophila genome with high specificity (Jasper et al. 2001, 2002; Fujii and Amrein 2002; Pleasance et al. 2003). Taking advantage of the vast information known about the Drosophila genome and the high sensitivity of the SAGE technique for transcript detection, we performed a thorough Drosophila transcriptome analysis using the SAGE method to investigate the prevalence of low-abundant transcripts. We expected to detect low-abundant transcripts as long as a significant quantity of low-abundant transcripts exists and a sufficient number of SAGE tags could be collected. Here we report the results from this study.

RESULTS

General procedures of the study

Figure 1

Data collection and processing

We collected 359,139 SAGE tags from normal and irradiated embryos, larvae, pupae, male and female adults, and testes from male adults. To ensure high confidence in the downstream analysis, we excluded 114,826 uncertain SAGE tags: tags that did not map to the Drosophila genome, tags from potentially contaminating transcripts of the yeast used to feed Drosophila in the laboratory, tags of SAGE linkers, and tags from the Drosophila mitochondrial genome. This left 244,313 final SAGE tags, from which we identified a total of 40,823 unique SAGE tags. Each unique SAGE tag has quantitative information according to its copy number and maps to the Drosophila genome (Table 1; [http://www.ncbi.nlm.nih.gov/geo; accession no. GSE2347]; Supplementary Table 1 and Supplementary Table 2 [http://www.biochem.northwestern.edu/ibis/faculty/smwang.htm]).
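
The filtering and counting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name `summarize_tags` and the toy tag list are hypothetical, and only the two linker tags are taken from Materials and Methods.

```python
from collections import Counter

# Linker tags named in Materials and Methods. Other exclusion classes
# (unmapped, yeast, mitochondrial tags) could be added to this set.
LINKER_TAGS = {"TCCCTATTAA", "TCCCCGTACA"}


def summarize_tags(raw_tags, excluded=LINKER_TAGS):
    """Drop excluded tags, then count the copies of each remaining unique tag."""
    kept = [tag for tag in raw_tags if tag not in excluded]
    return Counter(kept)


# Toy input: hypothetical sequences, not data from the study.
raw = ["AAACCCGGGT", "AAACCCGGGT", "TCCCTATTAA", "GGGTTTAAAC"]
counts = summarize_tags(raw)
total_kept = sum(counts.values())  # tags remaining after filtering
unique_kept = len(counts)          # number of unique tags

# Consistency check on the totals reported in the text.
assert 359_139 - 114_826 == 244_313
```

Here `counts[tag]` plays the role of the copy number attached to each unique tag.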

Comparison of SAGE tags with known Drosophila transcripts

We compared the 40,823 unique SAGE tags with known Drosophila transcripts physically isolated from the Drosophila genome, including full-length cDNAs, 3′ ESTs, and 5′ ESTs. First, we matched the SAGE tags to the 10 bases adjacent to the last CATG in the full-length cDNAs and 3′ ESTs. Because SAGE tags are adjacent to the last CATG of the detected transcripts, a SAGE tag matched to such a location is an indication that the SAGE tag originates from the matched transcript. Second, we matched the SAGE tags to the 10 bases adjacent to all CATG sites in full-length cDNAs, 3′ ESTs, and 5′ ESTs (except those adjacent to the last CATG in full-length cDNAs and 3′ ESTs). A SAGE tag matched to those locations suggests the presence of an alternatively spliced or polyadenylated transcript in which the sequences around the last CATG were removed and an upstream CATG was exposed for SAGE tag release. From these two comparisons, we observed that 45% of the 40,823 unique SAGE tags match the known expressed sequences. The remaining 55% do not have matches to the known Drosophila transcripts (Table 2; Supplementary Tables 3, 4 [http://www.biochem.northwestern.edu/ibis/faculty/smwang.htm]).
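
The two-step matching described above can be illustrated with a short sketch. It assumes each tag is the 10 bases immediately 3′ of a CATG site; the function names `catg_tags` and `classify_tag` and the toy transcript are hypothetical and are not taken from the study's software.

```python
TAG_LENGTH = 10
ANCHOR = "CATG"


def catg_tags(seq, tag_length=TAG_LENGTH):
    """Return the tag next to every CATG in 5'->3' order (None if truncated)."""
    seq = seq.upper()
    tags = []
    pos = seq.find(ANCHOR)
    while pos != -1:
        start = pos + len(ANCHOR)
        tag = seq[start:start + tag_length]
        tags.append(tag if len(tag) == tag_length else None)
        pos = seq.find(ANCHOR, pos + 1)
    return tags


def classify_tag(tag, transcript):
    """Say whether a tag matches the last CATG, an upstream CATG, or neither."""
    tags = catg_tags(transcript)
    if not tags:
        return "no match"
    if tags[-1] == tag:
        return "last CATG"
    if tag in tags[:-1]:
        return "upstream CATG"
    return "no match"


# Toy transcript with two CATG sites (hypothetical sequence).
transcript = "GG" + "CATG" + "AAAAACCCCC" + "TT" + "CATG" + "GGGGGTTTTT" + "AAAA"
last_site = classify_tag("GGGGGTTTTT", transcript)
upstream_site = classify_tag("AAAAACCCCC", transcript)
unmatched = classify_tag("ACGTACGTAC", transcript)
```

In this sketch a match at an upstream CATG is reported separately, mirroring the second comparison, where such a match is read as a possible alternatively spliced or polyadenylated transcript.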
We also compared the 40,823 unique SAGE tags with the annotated Drosophila transcripts. SAGE tags are located toward the 3′ part of the transcripts. Because the majority of known Drosophila transcripts are ESTs and most of the ESTs are 5′ ESTs, the comparison between SAGE tags and physically isolated transcripts may be biased due to the unbalanced 3′ EST and 5′ EST collection. The Drosophila genome project provides a full set of annotated transcripts. These annotated transcripts were generated using current knowledge of gene structure, computational gene prediction, and experimental evidence, including, but not necessarily restricted to, the full-length cDNAs, 5′ ESTs, and 3′ ESTs. The biased distribution of 5′ ESTs and 3′ ESTs has been normalized in the annotated transcripts. This comparison shows that 41% of the 40,823 unique SAGE tags match the annotated transcripts, whereas 59% do not match.

Taken together, the results from these two comparisons show that over half of the SAGE tags do not match the known Drosophila transcripts.

Verification of the origins of unmatched SAGE tags

We performed four types of experiments to verify the origins of the unmatched SAGE tags.
In summary, these experimental results indicate that most of the unmatched SAGE tags are novel SAGE tags representing currently unidentified novel transcripts.

Location of novel SAGE tags in the Drosophila genome

To investigate the correlation between the novel transcripts detected from novel SAGE tags and known genes, we mapped the novel SAGE tags in the Drosophila genome. To provide high mapping specificity, we focused on the 18,913 unique SAGE tags that map only to a single locus in the Drosophila genome. These tags include 7,106 matched SAGE tags and 11,807 novel SAGE tags. Of the 7,106 matched SAGE tags, 88% are located in the annotated exons, and 12% are mapped in the unannotated loci. In contrast, of the 11,807 novel SAGE tags, only 1% are mapped within the annotated exons, while 99% are mapped in the unannotated loci (Table 4A). Further analysis revealed that 48% of those mapped in the unannotated loci are located in the intergenic regions, 16% are located in the intragenic regions (most of which are intronic), and 36% are antisense of the intragenic regions (two-thirds of which are exonic). Since some annotated genes may end at a translational stop codon without 3′ UTR sequences, we further mapped the 11,807 novel SAGE tags to the genomic sequences 1,000 bp downstream of the annotated genes. Only 7.5% of these novel tags mapped within the region. The 11,807 novel SAGE tags were rather uniformly distributed among different chromosomes (Table 4B), although many tags mapped in particular chromosomes tend to be clustered. In conclusion, most novel transcripts detected by novel SAGE tags were expressed outside the annotated exons or genes, or were antisense of annotated exons or genes in the Drosophila genome.
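
The location categories used above (exonic, intronic, antisense, intergenic) can be made concrete with a small sketch. It is an illustration under simplifying assumptions rather than the mapping procedure used in the study: the `Gene` record, the half-open coordinate convention, and the `locate` function are hypothetical, and a tag that overlaps several genes is assigned to the first one found.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    start: int                 # first base of the gene (inclusive)
    end: int                   # one past the last base (exclusive)
    strand: str                # "+" or "-"
    exons: list = field(default_factory=list)  # (start, end) half-open pairs


def locate(pos, strand, genes):
    """Classify a uniquely mapped tag position relative to annotated genes."""
    for gene in genes:
        if gene.start <= pos < gene.end:
            in_exon = any(s <= pos < e for s, e in gene.exons)
            if strand == gene.strand:
                return "exon" if in_exon else "intron"
            return "antisense exon" if in_exon else "antisense intron"
    return "intergenic"


# Toy annotation: one two-exon gene on the plus strand (hypothetical).
genes = [Gene(start=100, end=500, strand="+", exons=[(100, 200), (400, 500)])]
labels = [
    locate(150, "+", genes),
    locate(300, "+", genes),
    locate(150, "-", genes),
    locate(50, "+", genes),
]

# The single-locus tag counts reported in the text add up as stated.
assert 7_106 + 11_807 == 18_913
```

A fuller version would also need a rule for overlapping genes and for the 1,000-bp downstream window examined in the text.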

Quantitative distribution of novel SAGE tags

We measured the quantitative distribution of the novel SAGE tags and compared it with that of the matched SAGE tags. The matched SAGE tags have proportionally higher copy numbers, while the novel SAGE tags have proportionally lower copy numbers (Fig. 5).
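
One simple way to express such a comparison is the fraction of unique tags at or below a given copy number. The sketch below is illustrative only: the function `fraction_low_copy`, the threshold, and the copy-number lists are hypothetical and are not data from the study.

```python
def fraction_low_copy(copy_numbers, threshold=1):
    """Return the fraction of unique tags whose copy number is <= threshold."""
    if not copy_numbers:
        return 0.0
    low = sum(1 for n in copy_numbers if n <= threshold)
    return low / len(copy_numbers)


# Hypothetical copy numbers, one entry per unique tag.
matched_copies = [1, 3, 12, 40, 2]
novel_copies = [1, 1, 2, 1, 1]

matched_low = fraction_low_copy(matched_copies)
novel_low = fraction_low_copy(novel_copies)
```

With these toy values, a larger share of the novel tags are single-copy than of the matched tags, which is the kind of pattern the comparison above describes.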

DISCUSSION

The total number of Drosophila SAGE tags collected in this study approximates the number of Drosophila EST sequences. However, over half of the transcripts detected by SAGE tags are not present in the EST collection. The discrepancy between the SAGE data and the EST data may be explained by the following:
We consider it unlikely that the following possibilities could be the major source of novel SAGE tags detected in this study.
Identification of low-abundant transcripts is more difficult than identification of high-abundant transcripts due to the redundancy and complexity of transcripts. In the last decade, new technologies with increased sensitivity, such as EST, subtraction/normalization EST, SAGE, and MPSS, have been developed and applied in transcriptome studies (Adams et al. 1992; Velculescu et al. 1995; Bonaldo et al. 1996; Brenner et al. 2000). When techniques with higher sensitivity are used, greater numbers of less abundant novel transcripts are identified. However, identification of all low-abundant transcripts remains a challenge, as evidenced by our current study. Although mathematical calculations can be used to estimate the scope of transcript collection needed to identify the full set of transcripts (Stern et al. 2003; Reverter et al. 2005), the final answer will likely come from experimental data showing that few or no novel transcripts can still be identified. Thus far, this stage of transcript identification has not been achieved in most of the genomes studied (Kapranov et al. 2002; Okazaki et al. 2002; Seki et al. 2002; Bertone et al. 2004; Imanishi et al. 2004; Schadt et al. 2004; Scheetz et al. 2004).

The genomic locations of the novel SAGE tags are interesting. Among the novel SAGE tags, nearly half are located intergenically, implying that more novel transcribed regions than currently annotated ones are present in the Drosophila genome. Using a tiling array technique, a recent study also detected transcriptional activities in 41% of the intergenic region and 43% of the intronic region in the Drosophila genome (Stolc et al. 2004). Furthermore, a third of the novel SAGE tags are antisense transcripts of the annotated genes, most of which are located in the known exons. The wide presence of antisense novel transcripts for known genes revealed in this study supports the concept that antisense transcription is one of the major means of gene expression regulation (Yelin et al. 2003).

In conclusion, our study demonstrates the presence of a large quantity of low-abundant transcripts in Drosophila, which may also occur in other species (Bertone et al. 2004). Systematic identification of low-abundant transcripts in model species is an important step toward the elucidation of the biological roles of low-abundant transcripts.

MATERIALS AND METHODS

Collection of SAGE tags

The Drosophila melanogaster strain (y; cn bw sp) used in the Drosophila genome and EST projects was used for the study. RNA was extracted from six types of samples, including embryo (0–24 h), embryo with radiation (0- to 24-h embryos, 30 min after treatment with 40 Gy γ radiation), larvae (second and third day), pupae (third day), adults (10 males and 10 females, up to 10 d), and testes from adult males. Four SAGE libraries were constructed: (1) pooled sample that included an equal amount of total RNA from embryo, larvae, pupae, and young and aged adults; (2) embryo; (3) irradiated embryo; and (4) testis. SAGE libraries were constructed following published procedures (Lee et al. 2001). Large-scale SAGE tag sequence collection was performed using the DYEnamic ET Terminator Cycle Sequencing kit in MegaBACE 1000 DNA sequencers (Amersham) with Phred20 as the cutoff. SAGE tags were extracted using SAGE300 software. Questionable SAGE tags were removed from the collected SAGE tags, including yeast SAGE tags (http://www.sagenet.org/SAGEData/sagedata.htm), the SAGE tag linkers TCCCTATTAA and TCCCCGTACA, the primer tag AAAGCGGCCG and its derivatives, and the mitochondrial tags (extracted from the Drosophila melanogaster mitochondrial genome, NC_001709, http://www.ncbi.nlm.nih.gov/).

Construction of SAGE reference databases

The Drosophila genome sequences used were Drosophila Release 3.1 (http://www.fruitfly.org/cgi-bin/seq_tools/fasta_download.cgi).
The genomic tag reference database was generated by extracting 10-base tags from all CATG sites in the genomic sequences, including the tags from the sense strand immediately adjacent to the CATGs and the tags from the antisense before CATG with reverse/complementary sequences. The SAGE tag reference database from the physically isolated known transcripts was constructed by extracting 10 bases adjacent to CATGs in the full-length cDNA, 3′ ESTs, and 5′ ESTs (UniGene Drosophila melanogaster database release 17, http://www.ncbi.nlm.nih.gov/). Of these sequences, 94% of mRNA, 78% of 3′ ESTs, and 80% of 5′ ESTs contain CATG and are therefore detectable by SAGE. The SAGE tag reference database from the Drosophila annotated transcripts was constructed by extracting 10 bases adjacent to all CATGs in each annotated “transcript” sequence in Release 3.1 (http://flybase.net/annot/download_sequences.html).

Conversion of SAGE tags into 3′ cDNAs

A set of unmatched SAGE tags was randomly selected from the total unmatched SAGE tag list and converted into 3′ cDNAs using the GLGI method (generation of longer 3′ cDNA from SAGE tags for gene identification) (Chen et al. 2002; Supplementary Table 5 [http://www.biochem.northwestern.edu/ibis/faculty/smwang.htm]). The 3′ cDNA sequences were deposited in GenBank with accession numbers CB305186–CB305318.

RT-PCR confirmation of novel transcripts detected by novel SAGE tags

RT-PCR was used to confirm novel transcripts detected by novel SAGE tags. Sense primers and antisense primers were designed based on the 3′ cDNAs converted from novel SAGE tags. Total RNA samples from embryonic, early larval, later larval, pupal, male and female adult, and pooled tissues were used as the templates for the analysis (Supplementary Table 6 [http://www.biochem.northwestern.edu/ibis/faculty/smwang.htm]). RNase A-treated RNA samples were used as negative control.
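
The genomic tag reference database described under "Construction of SAGE reference databases" can be sketched as follows. This is a minimal illustration, not the software used in the study: it assumes that the antisense tag at a CATG site is the reverse complement of the 10 bases immediately upstream of that site on the sense strand, and the function names and the toy sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def genomic_tags(genome, tag_length=10):
    """Yield (tag, strand, site_position) for every CATG in the sequence."""
    genome = genome.upper()
    pos = genome.find("CATG")
    while pos != -1:
        # Sense-strand tag: bases immediately 3' of the CATG.
        downstream = genome[pos + 4:pos + 4 + tag_length]
        if len(downstream) == tag_length:
            yield downstream, "+", pos
        # Antisense tag: reverse complement of the bases 5' of the CATG.
        if pos >= tag_length:
            upstream = genome[pos - tag_length:pos]
            yield reverse_complement(upstream), "-", pos
        pos = genome.find("CATG", pos + 1)


# Toy sequence with a single CATG site (hypothetical).
genome = "AACCGGTTAC" + "CATG" + "TTTTTGGGGG"
reference = list(genomic_tags(genome))
```

Because CATG is its own reverse complement, each site contributes one candidate tag per strand; counting how many sites yield the same tag would then identify the tags that map to a single locus.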

Northern blot confirmation of novel transcripts detected by novel SAGE tags

Northern blot was used to confirm novel transcripts detected by novel SAGE tags. RNA samples from whole adults were used for the detection. The 3′ cDNAs converted from novel SAGE tags were used as probes. Probe labeling, hybridization, and signal detection were performed using the Bright Star Bio-Detection system (Ambion) following the protocol.

RT-PCR detection of novel transcripts expressed from novel SAGE tag-mapped, unannotated genomic regions

Genomic segments mapped by novel SAGE tags were used for the test. Each segment starts at the novel SAGE tag-mapped location and moves downstream to the polyA signal sequences AATAAA or ATTAAA. Sense primers were designed based on the mapped novel SAGE tag; antisense primers were designed based on the genomic sequences upstream of AATAAA or ATTAAA (Supplementary Table 8 [http://www.biochem.northwestern.edu/ibis/faculty/smwang.htm]). The pooled total RNA samples were used as the templates for the detection. RNase A-treated RNA samples were used as control for monitoring genomic DNA contamination.

Acknowledgments

This study was funded by a Howard Hughes Fellowship (J.S.), a Department of Defense MURI program (C.T., S.M.W.), the Ludwig Fund for Cancer Research (W.D., also a Leukemia and Lymphoma Society Scholar), National Institutes of Health (C-I.W.), Natural Science Foundation of China and Chinese Academy of Sciences (H.Y.), the G. Harold and Lelia Y. Mathers Charitable Foundation (S.M.W.), the Daniel F. and Ada L. Rice Foundation (S.M.W.), and National Institutes of Health (S.M.W.).

Notes

Article and publication are at http://www.rnajournal.org/cgi/doi/10.1261/rna.7239605.

REFERENCES