Nucleic Acids Res. 2006 Nov; 34(19): e133.
Published online 2006 Oct 5. doi:  10.1093/nar/gkl714
PMCID: PMC1636492

DeepSAGE—digital transcriptomics with high sensitivity, simple experimental protocol and multiplexing of samples


Digital transcriptomics with pyrophosphate-based ultra-high throughput DNA sequencing of ditags provides high sensitivity and cost-effective gene expression profiling. Sample preparation and handling are greatly simplified compared to Serial Analysis of Gene Expression (SAGE). We compare DeepSAGE and LongSAGE data and demonstrate greater power of detection and multiplexing of samples derived from potato. The transcript analysis revealed a great abundance of up-regulated potato transcripts associated with stress in dormant potatoes compared to potatoes at harvest. Importantly, many transcripts were detected that cannot be matched to known genes but are likely to be part of the abiotic stress response in potato.


Transcriptomics is essential to monitoring the genomic activation of cells or organisms in response to environmental signals. Global gene expression analysis has been conducted either by hybridization with oligonucleotide microarrays (1), or by counting of sequence tags. An advantage of microarray analysis is that once the array has been made at a high cost, many measurements can be made at a relatively low cost. However, only known genes can be spotted on the array. In contrast, sequence tag-based approaches, like Serial Analysis of Gene Expression (SAGE) (2) and massively parallel signature sequencing (MPSS) (3), can measure the expression of both known and unknown genes. The MPSS technology, however, is very expensive and too complex to be performed in non-specialized laboratories. A SAGE experiment, in contrast, consists of a series of molecular biology manipulations that, in principle, can be carried out in any molecular biology laboratory with access to a 96-capillary DNA sequencer. SAGE relies on the extraction of one 14–21 nt sequence tag from each mRNA. Tags are ligated together, cloned and sequenced. In a typical sequence run of 96 samples, ∼1500 tags of corresponding mRNAs can be detected. Due to the cost of sequencing, a SAGE study typically encompasses 50 000 tags and provides detailed knowledge of the 2000 most highly expressed genes in the tissue analyzed. In practice, it can be difficult to obtain enough clones of the appropriate insert length (4) to facilitate efficient detection.

Here we describe an experimentally simple method for ditag-based transcript detection, DeepSAGE, which combines the initial steps of LongSAGE (5) with emulsion-based amplification and pyrophosphate-based ultra-high throughput DNA sequencing (6). DeepSAGE allows the counting of more than 300 000 tags with less effort and cost than a typical LongSAGE study encompassing 50 000 tags. The deep sampling facilitates the measurement of rare transcripts below the detection limit of existing global transcript profiling technologies. Moreover, multiple samples can be sequenced in a single run.


DeepSAGE sample preparation

RNA was isolated (7) from field-grown potato tubers cv. Kuras at the time of harvest (HAR) and at dormancy after 60 days of storage at 10°C (DOR). The quality of the RNA was verified from the integrity and intensity of ribosomal RNA following 1% TAE-agarose gel electrophoresis. Fifty micrograms of RNA were used to construct LongSAGE ditags as described by Saha et al. (5). Following ligation of linker-tags to form ditags, six 50 μl PCRs were prepared, each consisting of 2.5 U Taq polymerase (Ampliqon, Copenhagen, Denmark), 0.5 mM deoxynucleotide triphosphates, 1 μl of a 1:160 dilution of the ligation reaction, 2 μM of 5′-GCCTTGCCAGCCCGCTCAGCAAGCTTCTAACGATGTACGT-3′ and 2 μM of either 5′-GCCTCCCTCGCGCCATCAGAAGTGGTGCAGTACAACTAGGCT-3′ (HAR) or 5′-GCCTCCCTCGCGCCATCAGACGTGGTGCAGTACAACTAGGCT-3′ (DOR) in 10 mM Tris–HCl, 50 mM KCl, 3 mM MgCl2, 1% Triton X-100. The PCRs were subjected to 26 cycles of amplification at 94°C for 30 s, 55°C for 1 min and 70°C for 1 min. The presence of a 125 bp ditag band was verified by 15% TAE–PAGE prior to pooling and ethanol precipitation by addition of 2 μl 20 mg/ml glycogen (Fermentas, Burlington, Canada), 50 μl 7.5 M ammonium acetate, 1 ml 100% ethanol (De Danske Spritfabrikker, Aalborg, Denmark) and incubation at −80°C for 1 h. The tubes were centrifuged at maximum speed at room temperature for 20 min. The pellets were washed with 1 ml 70% ethanol and redissolved in 75 μl 10 mM Tris–HCl, 0.1 mM EDTA, pH 7.5. The two amplified ditag samples were separated by 12% TAE–PAGE. Following staining of the gel for 2 min with ethidium bromide (2 μg/ml), the 130 bp band was excised using a clean scalpel, and the gel piece was transferred into a 0.6 ml tube that had been punctured in the bottom with a 12-gauge needle. The tube was inserted into a 1.5 ml tube and centrifuged at maximum speed for 1 min in a benchtop centrifuge. Then 375 μl 10 mM Tris–HCl, 0.1 mM EDTA, pH 7.5 and 125 μl 7.5 M ammonium acetate were added to the crushed gel pieces, and the tubes were incubated at 4°C overnight. 
The entire contents of each sample were transferred to two Spin-X filter tubes (Corning, New York, USA) and centrifuged at maximum speed for 30 s. The eluates were transferred to a 2 ml tube prior to addition of 2 μl 20 mg/ml glycogen and 1500 μl 100% ethanol. Following incubation at −80°C for 1 h, the tubes were centrifuged at maximum speed at room temperature for 20 min, and the pellets were washed with 1 ml 70% ethanol and redissolved in 20 μl 10 mM Tris–HCl, 0.1 mM EDTA, pH 7.5. The integrity of the 130 bp ditag band was checked by 15% TAE–PAGE and the concentration was determined by absorption at 260 nm. The two samples were mixed in equimolar amounts prior to sequencing by 454-Life Science Corp., Branford, CT, USA according to Ref. (6).

Tag extraction and data analysis

Tags were extracted from sequence FASTA files containing ditags using the PERL script DeepSAGE_extract.pl (see Supplementary Data). Linker- and poly-A-derived tags were removed, but duplicate ditags were not. The tags were mapped to potato tentative contiguous sequences (www.tigr.org) using Sagemap-tsv.pl (www.bio.aau.dk/en/biotechnology/software_applications). The resulting tab-separated value files were imported into Excel for further analysis. The entire dataset, including tags observed in only one of the datasets, was used for the calculation of correlation coefficients. To improve interpretability, however, Figures 2 and 3 are displayed on a logarithmic scale, which omits tags observed in only one of the datasets. Statistically significant gene expression changes were detected (8) using strict Bonferroni correction.


mRNA was extracted from two stages of potato tuber development, at harvest (HAR) and at dormancy (DOR). Following the preparation of LongSAGE ditags (5), 100–400 PCRs of 50 μl are usually pooled to provide enough concatemers for LongSAGE. In the present study, only six 50 μl reactions yielded more than 10 times the material used for a DeepSAGE experiment. Amplification of ditags was carried out using primers containing a sequence primer recognition site, a 3 nt sample identification key (AAG for HAR, ACG for DOR) and a sequence complementary to the linkers used in LongSAGE. Both samples yielded amplification products of 125 bp, which were purified by gel electrophoresis. DNA concentration was determined and equimolar amounts of the two samples were pooled. In contrast to LongSAGE, these amplified ditags were used directly for sequencing. Preparation of beads carrying sequence templates, clonal amplification in emulsion and DNA sequencing were done according to Margulies et al. (6).

A total of 224 310 sequences were obtained in a single sequence run, which included both forward and reverse sequences (Table 1). The distribution of the length between the two CATG sequences flanking the ditags (Figure 1) was found to be very similar to traditional LongSAGE. A PERL script (DeepSAGE_extract.pl) was used to extract 314 212 tags of 19 nt (167 367 from forward sequences and 146 845 from reverse sequences). Overall, 70% of these sequences yielded a good ditag sequence. This is comparable to our experience with traditional LongSAGE, where 73% of sequenced clones contained at least one ditag.
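As an illustration of the extraction step, the following Python sketch pulls the two tags out of a single ditag read. It is not the DeepSAGE_extract.pl script: the function names, the 19 nt tag length counted from and including CATG, and the accepted spacing between the two CATG sites are assumptions made for this example.

```python
import re

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_tags(read, tag_len=19, min_insert=30, max_insert=38):
    """Return the (left, right) tags of the first ditag in a read, or None.

    A ditag is taken to be CATG, an insert of min_insert..max_insert nt,
    and a second CATG. Both tags are reported in the CATG-first orientation.
    tag_len, min_insert and max_insert are hypothetical defaults.
    """
    pattern = "CATG[ACGT]{%d,%d}CATG" % (min_insert, max_insert)
    match = re.search(pattern, read.upper())
    if match is None:
        return None  # no plausible ditag in this read
    ditag = match.group(0)
    left = ditag[:tag_len]
    # The second tag lies on the opposite strand, so it is reverse-complemented.
    right = reverse_complement(ditag[-tag_len:])
    return left, right
```

Reads without two suitably spaced CATG sites return None, which corresponds to the sequences in the text that did not yield a good ditag.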

Figure 1
Distribution of ditag length in LongSAGE (solid) and DeepSAGE (hatched).
Table 1
Summary of sequencing statistics

It has been reported that the pyrosequencing employed in this study has a somewhat higher error rate than Sanger sequencing, especially in homopolymer regions of four or more nucleotides (6). Therefore, we inspected our dataset for tags containing homopolymers that were truncated or elongated. Surprisingly, we found such tags only in very low abundance, similar to other types of sequencing errors, even though several abundant tags contained homopolymers.

To further address sequence accuracy and its impact on tag-based transcriptome analysis, the forward sequences from both runs were sorted by their identification key into 91 580 from the HAR sample and 122 100 from the DOR sample. We determined the sequence error rates using SAGEscreen (9) for both the DeepSAGE datasets and for a LongSAGE DOR library of 53 688 tags. Tags observed more than 50 times (87, 229 and 141 tags for LongSAGE, DeepSAGE DOR and HAR, respectively) were considered. The results are shown in Table 1. Overall estimates of the proportion of tags containing sequence errors are in fact lower in DeepSAGE (9.1–12.4%) than in LongSAGE (16.6%). The overall estimates are composed of a lower substitution error rate in DeepSAGE compared to LongSAGE (5.2–9.2% versus 15.3% of tags) and a higher insertion rate (2.2–2.5% versus 0.72%) and deletion rate (1.3–1.7% versus 0.8%), in agreement with what was previously found for ultra-high throughput pyrophosphate sequencing (6). Presumably, the higher sequence accuracy of DeepSAGE is obtained because tag sequences are extracted from nt 33 to approximately nt 73 (depending on variation in MmeI cleavage) of the DNA sequences, well within the first 90 nt, which are determined with the highest accuracy (6). Indeed, correlation analysis of tags extracted from forward and reverse sequences (Figure 2A) indicated good sequence fidelity and reliable tag extraction (R2 = 0.96). Reproducibility was confirmed by performing a second limited run yielding 119 835 tags and comparing the two runs (R2 = 0.96) (Figure 2B).
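The correlation between two tag-count tables can be sketched as follows. This is a minimal example that assumes R2 denotes the squared Pearson correlation of raw counts and that a tag absent from one dataset is counted as zero there, as described for the full dataset in the methods; the function name and the dictionary layout are illustrative only.

```python
def r_squared(counts_a, counts_b):
    """Squared Pearson correlation of two tag -> count dictionaries.

    Tags seen in only one dataset enter with a count of 0 in the other
    (an assumption based on the use of the entire dataset).
    """
    tags = sorted(set(counts_a) | set(counts_b))
    xs = [counts_a.get(tag, 0) for tag in tags]
    ys = [counts_b.get(tag, 0) for tag in tags]
    n = len(tags)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant counts")
    return (sxy * sxy) / (sxx * syy)
```

Because tag counts span several orders of magnitude, a value computed this way is dominated by the most abundant tags, which is worth keeping in mind when comparing libraries.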

Figure 2
Correlation of tag counts extracted from (A) forward and reverse sequences, respectively. Data sets consisted of 167 159 forward sequences and 199 413 reverse sequences. Using tags observed at least once in both directions only (12 025 tags) the R2 = ...

Figure 3 shows a comparison of the numbers of LongSAGE DOR tags versus DeepSAGE DOR tags and DeepSAGE HAR tags, respectively. The distribution of DOR tags was very similar for the LongSAGE and DeepSAGE methods (R2 = 0.96), showing that these measurements of the transcriptome are equivalent. As expected, the transcriptomes at dormancy and at harvest differed markedly (R2 = 0.33). A similar correlation of R2 = 0.35 was obtained for DeepSAGE DOR versus DeepSAGE HAR (data not shown).

Figure 3
Correlation of LongSAGE and DeepSAGE DOR tags (A) and DeepSAGE HAR (B). Data sets consisted of 51 918 LongSAGE tags, 122 100 DeepSAGE DOR tags and 91 580 DeepSAGE HAR tags. The most abundant DOR tag was encountered 1397 times in LongSAGE and 3145 times in DeepSAGE. ...

Little is known about the potato tuber's adaptation to the abiotic stress imposed by the unnatural environment above ground during storage. Comparing the gene expression of the analyzed potato libraries, 69 genes were up-regulated and 65 genes were down-regulated (P < 0.05 with Bonferroni correction) in DeepSAGE DOR compared to DeepSAGE HAR (Supplementary Table 1). Strikingly, among the 69 up-regulated transcripts, 22 of the 42 transcripts that can be matched to a known sequence are homologues of chaperones (4 transcripts), of genes involved in the ubiquitin protein degradation pathway (5 transcripts) or of genes suggested to be otherwise stress-related (13 transcripts) (see Table 2). Intriguingly, among these transcripts are three members of Ca2+ signal transduction pathways: TC112122 (Calmodulin, 4-fold up-regulated), TC126796 (Phospholipase C, 15-fold up-regulated) and TC119057 (Annexin P34, 5-fold up-regulated). Interestingly, Phospholipase C and Annexin P34 transcripts were also observed during EST sequencing in three cDNA libraries derived from abiotically stressed tissue [www.tigr.org and Ref. (10)]. In addition, expression of the cell wall protein (Q40142), which has been shown to be involved in thickening the protective periderm layer (11), was increased, consistent with the fact that potatoes thicken their skin during storage, presumably as a response to drought. In stark contrast, only 3 of the 65 down-regulated genes encode chaperones and no other stress-related transcripts were identified. It seems likely that several of the unknown up-regulated transcripts are additional parts of the potato tuber's response to the abiotic stress induced by storage. Thirty-eight of the 65 down-regulated genes can be matched to a potato tentative consensus sequence (TC) and of these more than half (25) match storage proteins, such as patatins, metallocarboxypeptidase inhibitors or Kunitz-type protease inhibitors. 
This indicates that loading of storage proteins into the potato tuber can, at least to some extent, take place up to the point of harvest, but is completely shut down after the tuber has been taken out of the ground. Using the LongSAGE DOR tags (see Supplementary Table 2), only 97 of the 134 regulated transcripts were identified as regulated by comparison to DeepSAGE HAR. Importantly, 26 additional ‘false positive’ transcripts, which fail to meet the statistical significance criteria using the larger dataset, were deemed differentially regulated using the smaller dataset.
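The significance test of Audic and Claverie (8) can be sketched in Python as follows. The probability of observing y copies of a tag in a library of N2 tags, given x copies in a library of N1 tags, is p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The way the two-sided P-value is formed here (twice the smaller tail, capped at 1), the function names and the number of tests used for the Bonferroni threshold are assumptions of this example and may differ from the analysis actually performed.

```python
from math import exp, lgamma, log


def ac_log_pmf(y, x, n1, n2):
    """log p(y | x) of Audic & Claverie for library sizes n1 and n2."""
    ratio = n2 / n1
    return (y * log(ratio)
            + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
            - (x + y + 1) * log(1.0 + ratio))


def ac_p_value(x, y, n1, n2):
    """Two-sided P-value: twice the smaller tail of p(. | x), capped at 1."""
    lower = sum(exp(ac_log_pmf(k, x, n1, n2)) for k in range(y + 1))
    # The upper tail is summed directly; 1 - lower would lose all precision
    # for very small tail probabilities.
    mode = x * n2 / n1  # terms decrease monotonically beyond the mode
    upper, k = 0.0, y
    while True:
        term = exp(ac_log_pmf(k, x, n1, n2))
        upper += term
        if k > mode and term <= upper * 1e-16:
            break
        k += 1
    return min(1.0, 2.0 * min(lower, upper))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under strict Bonferroni correction."""
    return alpha / n_tests
```

A tag would then be called regulated when ac_p_value(...) falls below bonferroni_threshold(0.05, n_tests), where n_tests is the number of tags tested.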

Table 2
Stress genes regulated between harvest and dormant potato tubers

Typically, 225 000 sequences are obtained in a single experiment, and these can be divided into forward and reverse sequences (as in this study) or be generated from one end only. In DeepSAGE, each sequence contains a ditag. Therefore, at the 70% success rate of this study, 225 000 sequences represent 315 000 tags determined in a single sequence run. This might be compared to microarray transcriptomics, as Lu et al. (12) have estimated that the sensitivity towards detecting rare transcripts in an Affymetrix gene chip experiment is comparable to a SAGE study of 120 000 tags. Therefore, three samples can be multiplexed in a single run to generate this sensitivity.
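The arithmetic behind these figures can be written out as a short Python sketch; the function names are illustrative, and the numbers are those given in the text.

```python
def expected_tags(n_sequences, ditag_fraction):
    """Tags per run: each successful sequence carries one ditag, i.e. two tags."""
    return round(2 * n_sequences * ditag_fraction)


def samples_per_run(n_tags, tags_per_sample):
    """How many samples of a given depth fit into one run (may be fractional)."""
    return n_tags / tags_per_sample


tags = expected_tags(225000, 0.70)        # 315 000 tags per run
ratio = samples_per_run(tags, 120000)     # about 2.6 samples at 120 000 tags each
```

The ratio of about 2.6 corresponds to the roughly three samples per run mentioned above, each sequenced to slightly less than 120 000 tags.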

To estimate the usefulness of the increased sensitivity, we analyzed the expression of transcription factors, a class of genes known to be expressed at very low abundance. Mapping the DOR-derived tags to the ∼190 000 tentative contiguous sequences of potato (www.tigr.org), only 67 different gene matches were identified as homologues of transcription factors among LongSAGE tags (see Supplementary Table 3 for details). Of these, 13 were observed >5 times and 8 tags were observed 3, 4 or 5 times. The majority (69%) of the 67 tags were seen only twice (17) or once (29). Because a large proportion of the very rare tags might be generated by sequencing errors (13), it cannot be established with certainty whether the corresponding transcripts are present. Furthermore, reliable expression levels cannot be established for rare tags because of random sampling (13). In comparison, 94 different gene matches to putative transcription factors were found in the DeepSAGE analysis of the DOR sample. Twenty-two tags were observed >5 times, whereas 51 tags (54%) were seen twice (16) or once (36). Interestingly, 21 tags, or almost three times as many, were encountered 3, 4 or 5 times. This underscores that deeper sampling does indeed detect more rare transcripts and provides more reliable gene expression estimates.
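The abundance classes used in this comparison (once, twice, 3 to 5 times, more than 5 times) can be tallied with a few lines of Python. This is only a sketch; the function names and class labels are chosen for the example.

```python
def abundance_classes(tag_counts):
    """Count tags per abundance class from a tag -> count dictionary."""
    classes = {"once": 0, "twice": 0, "3-5": 0, ">5": 0}
    for count in tag_counts.values():
        if count == 1:
            classes["once"] += 1
        elif count == 2:
            classes["twice"] += 1
        elif count <= 5:
            classes["3-5"] += 1
        else:
            classes[">5"] += 1
    return classes


def rare_fraction(classes):
    """Fraction of tags seen only once or twice."""
    total = sum(classes.values())
    return (classes["once"] + classes["twice"]) / total
```

Applied to the LongSAGE classes reported above (29, 17, 8 and 13 tags), rare_fraction gives 46/67, the 69% quoted in the text.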

Multiplexing of different samples or replicates of the same sample, each tagged with a unique nucleotide identification key, is a further possibility of DeepSAGE, as we have shown here by co-analyzing potato tuber ditag libraries at dormancy and harvest. Until now, the lack of replicates has been a severe drawback of SAGE. Following sequencing, different samples are first sorted according to their identification keys and the tags are counted prior to comparison of gene expression.
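Sorting reads by identification key can be sketched in Python as follows. The keys (AAG for HAR, ACG for DOR) and the linker-complementary sequence are taken from the amplification primers described in the methods; the assumption that the key sits immediately 5′ of that sequence in a forward read, as well as the function names, are made for this illustration only.

```python
# Linker-complementary part shared by both forward amplification primers.
LINKER = "TGGTGCAGTACAACTAGGCT"
# 3 nt sample identification keys from the primer sequences.
KEYS = {"AAG": "HAR", "ACG": "DOR"}


def sample_of(read):
    """Return the sample name encoded by the key preceding LINKER, or None."""
    pos = read.find(LINKER)
    if pos < 3:
        return None  # linker missing, or no room for a 3 nt key before it
    return KEYS.get(read[pos - 3:pos])


def demultiplex(reads):
    """Group reads by sample; reads without a recognizable key are dropped."""
    by_sample = {name: [] for name in KEYS.values()}
    for read in reads:
        sample = sample_of(read)
        if sample is not None:
            by_sample[sample].append(read)
    return by_sample
```

Adding a further sample or replicate would amount to adding another entry to KEYS, provided the corresponding primer carries a distinct key.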

The DeepSAGE protocol omits the sample-consuming concatenation of ditags, the tedious clone picking and the sequence template preparation, which constitute most of the experimental SAGE protocol. As an example, a single person in our laboratory has consistently spent six weeks to generate data from two LongSAGE libraries, including sequencing of concatemers. Using DeepSAGE, the same person has recently generated 19 SAGE libraries from pig mRNA in 2 weeks. Sequence template preparation and analysis were performed in 1 week.

Tag-based gene expression profiling methods have been held back by the cost of DNA sequencing, despite their advantageous global and digital nature, but will become increasingly cost-effective as the $1000 genome is approached. The total cost of DeepSAGE, including labor costs, was reduced at least 10-fold compared to LongSAGE, from 25 cents to ∼2.5 cents per tag.


The authors wish to thank Dr Michael Egholm, 454-Life Science Corp., CT, USA for help with sequencing and Professor Karen G. Welinder, Aalborg University for the critical reading of this manuscript. The work was supported by the Danish Veterinarian and Agricultural Research Council (23-02-0034) and Technical Research Council (26-00-0141). Funding to pay the Open Access publication charges for this article was provided by the Danish Research Councils.

Conflict of interest statement. None declared.


1. Lockhart D.J., Dong H.L., Byrne M.C., Follettie M.T., Gallo M.V., Chee M.S., Mittmann M., Wang C.W., Kobayashi M., Horton H., et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 1996;14:1675–1680.
2. Velculescu V.E., Zhang L., Vogelstein B., Kinzler K.W. Serial analysis of gene expression. Science. 1995;270:484–487.
3. Brenner S., Johnson M., Bridgham J., Golda G., Lloyd D.H., Johnson D., Luo S., McCurdy S., Foy M., Ewan M., et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 2000;18:630–634.
4. Gowda M., Jantasuriyarat C., Dean R.A., Wang G.L. Robust-LongSAGE (RL-SAGE): a substantially improved LongSAGE method for gene discovery and transcriptome analysis. Plant Physiol. 2004;134:890–897.
5. Saha S., Sparks A.B., Rago C., Akmaev V., Wang C.J., Vogelstein B., Kinzler K.W., Velculescu V.E. Using the transcriptome to annotate the genome. Nat. Biotechnol. 2002;20:508–512.
6. Margulies M., Egholm M., Altman W.E., Attiya S., Bader J.S., Bemben L.A., Berka J., Braverman M.S., Chen Y.J., Chen Z., et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
7. Scott D.L., Clark C.W., Deahl K.L., Prakash C.S. Isolation of functional RNA from periderm tissue of potato tubers and sweet potato storage roots. Plant Mol. Biol. Rep. 1998;16:3–8.
8. Audic S., Claverie J.M. The significance of digital gene expression profiles. Genome Res. 1997;7:986–995.
9. Akmaev V.R., Wang C.J. Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics. 2004;20:1254–1263.
10. Rensink W., Hart A., Liu J., Ouyang S., Zismann V., Buell C.R. Analyzing the potato abiotic stress transcriptome using expressed sequence tags. Genome. 2005;48:598–605.
11. Domingo C., Gomez M.D., Canas L., Hernandez-Yago J., Conejero V., Vera P. A novel extracellular matrix protein from tomato associated with lignified secondary cell walls. Plant Cell. 1994;6:1035–1047.
12. Lu J., Lal A., Merriman B., Nelson S., Riggins G. A comparison of gene expression profiles produced by SAGE, long SAGE, and oligonucleotide chips. Genomics. 2004;84:631–636.
13. Anisimov S.V., Sharov A.A. Incidence of ‘quasi-ditags’ in catalogs generated by Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics. 2004;5:152.
14. Kim Y.O., Kang H. The role of a zinc finger-containing glycine-rich RNA-binding protein during the cold adaptation process in Arabidopsis thaliana. Plant Cell Physiol. 2006;47:793–798.
15. Sieburth L.E., Muday G.K., King E.J., Benton G., Kim S., Metcalf K.E., Meyers L., Seamen E., Van Norman J.M. SCARFACE encodes an ARF-GAP that is required for normal auxin efflux and vein patterning in Arabidopsis. Plant Cell. 2006;18:1396–1411.
16. Sathyanarayanan P.V., Poovaiah B.W. Decoding Ca2+ signals in plants. CRC Crit Rev. Plant Sci. 2004;23:1–11.
17. Zeeman S.C., Thorneycroft D., Schupp N., Chapple A., Weck M., Dunstan H., Haldimann P., Bechtold N., Smith A.M., Smith S.M. Plastidial alpha-glucan phosphorylase is not required for starch degradation in Arabidopsis leaves but has a role in the tolerance of abiotic stress. Plant Physiol. 2004;135:849–858.
18. Kerk D., Bulgrien J., Smith D.W., Gribskov M. Arabidopsis proteins containing similarity to the universal stress protein domain of bacteria. Plant Physiol. 2003;131:1209–1219.
19. Sanchez-Aguayo I., Rodriguez-Galan J.M., Garcia R., Torreblanca J., Pardo J.M. Salt stress enhances xylem development and expression of S-adenosyl-l-methionine synthase in lignifying tissues of tomato plants. Planta. 2004;220:278–285.
20. Lin W.H., Ye R., Ma H., Xu Z.H., Xue H.W. DNA chip-based expression profile analysis indicates involvement of the phosphatidylinositol signaling pathway in multiple plant responses to hormone and abiotic treatments. Cell Res. 2004;14:34–45.
21. Nakane E., Kawakita K., Doke N., Yoshioka H. Elicitation of primary and secondary metabolism during defense in the potato. J. Gen. Plant Pathol. 2003;69:378–384.
22. Hwang E.W., Kim K.A., Park S.C., Jeong M.J., Byun M.O., Kwon H.B. Expression profiles of hot pepper (Capsicum annuum) genes under cold stress conditions. J. Biosci. 2005;30:657–667.
23. Clark G.B., Thompson G., Roux S.J. Signal transduction mechanisms in plants: an overview. Curr. Sci. 2001;80:170–177.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press