Copyright © 2008 The Author(s)

MachiBase: a Drosophila melanogaster 5′-end mRNA transcription database

1Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-0882, 2Japan Science and Technology Agency (JST), Tokyo 102-8666, 3Department of Molecular Preventive Medicine, School of Medicine, The University of Tokyo, Tokyo 113-0033 and 4Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Tokyo, Japan

*To whom correspondence should be addressed. Tel: +81 47 136 3984; Fax: +81 47 139 3977; Email: moris/at/k.u-tokyo.ac.jp, Email: moris/at/cb.k.u-tokyo.ac.jp

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received August 15, 2008; Accepted September 25, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

MachiBase (http://machibase.gi.k.u-tokyo.ac.jp/) provides a comprehensive and freely accessible resource regarding Drosophila melanogaster 5′-end mRNA transcription at different developmental states, supporting studies on the variability of promoter transcriptional activities and gene-expression profiles in the fruit fly. The data were generated on the recently developed high-throughput Illumina/Solexa genome sequencer using a newly developed 5′-end mRNA collection method.

INTRODUCTION

Characterization of the complete repertoire of expressed messenger RNA (mRNA) is central to the functional analysis of a genome. To date, several studies have been undertaken to achieve a better understanding of the Drosophila melanogaster genome (1–4).
The technical approaches used in these studies included in-depth, full-length cDNA cloning and tiling microarrays. However, even without prior knowledge of the locations of previously identified genes, the 5′-end SAGE method (5) has proved effective in cataloging large numbers of expressed genes. With the simple modification of adopting the recently developed high-throughput Illumina/Solexa genome sequencer, 5′-end SAGE has become a potent tool for elucidating transcriptional mechanisms.

To achieve a deeper insight into transcriptional activity, we collected approximately 25 million 25–27 nt 5′-end mRNA tags from embryos, larvae, young males, young females, old males, old females and S2 cells (a cultured cell line) of D. melanogaster with high mechanical reproducibility. After aligning these tags to unique positions in the fly genome while allowing up to three mismatches, 2.87–4.05 million uniquely mapped tags were amassed for each of the seven samples. These data constitute the most substantial transcriptional start site (TSS) and gene-expression database for D. melanogaster currently available.

MachiBase is designed to assist fly biologists in their analyses of gene expression and in placing expression data in the context of functional genomics through genomic orientation. Thus, information on differentially expressed genes can be accessed either by entering the gene name as a keyword or by selecting a chromosomal location.

Aside from providing information on gene expression, these data constitute a potent resource for analyses of transcriptional regulation. The core promoter, which is the region surrounding the TSS of a gene required for recruitment of the transcription apparatus, warrants analysis. However, TSSs and core promoters have previously been identified on a gene-by-gene basis.
With the help of this database, biologists can explain transcriptional initiation mechanisms by combining additional information on chromatin structure and DNA methylation. In addition, these data allow accurate predictions of gene structures, particularly of the 5′-untranslated region (5′-UTR).

METHODS

The newly developed 5′-end mRNA collection method extends the range of the original 5′-end SAGE technique developed by Hashimoto et al. (5). This method initially profiles 25–27 nt tags using a novel strategy that incorporates the oligo-capping method (6). The 5′-end tags are then ligated directly to the Illumina/Solexa linker to prepare them for sequencing with the Illumina/Solexa system. Prior to construction of the Illumina/Solexa libraries, we confirmed the integrity of the cDNA using the Agilent 2100 Bioanalyser.

Collection of numerous 5′-end tags from seven libraries and testing the reproducibility of the method used

To characterize the transcriptional activity patterns of the D. melanogaster genome, we collected 25–27 nt 5′-end mRNA tags from embryos, larvae, young males, young females, old males, old females and the S2 cell line. Table 1 presents the results of this process. The second column shows that more than five million raw tags were collected from each of the seven libraries. As most of these tags were redundant, they were grouped into non-redundant representative tags, the statistics for which are shown in the third column. Each non-redundant tag stands for all of its repeated occurrences and is therefore associated with a frequency, i.e. the number of times that it occurs.
The frequency is expected to be reproducible, in that the frequency of each non-redundant tag is proportional to the total number of tag occurrences in independent experiments. To test for reproducibility, we performed an additional collection of 5′-end tags from the same young female Drosophila library.

Figure 1
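The grouping of raw tags into non-redundant tags with frequencies, and the expectation that frequencies scale with library size, can be illustrated with a short sketch. This is not the authors' pipeline: the function names `collapse_tags` and `tags_per_million`, the per-million scaling, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def collapse_tags(raw_tags):
    """Group redundant raw tags into non-redundant tags with frequencies."""
    return Counter(raw_tags)

def tags_per_million(freqs):
    """Scale each frequency by the library's total tag count.

    Hypothetical normalization: it makes libraries of different
    sizes directly comparable.
    """
    total = sum(freqs.values())
    return {tag: count * 1_000_000 / total for tag, count in freqs.items()}

# Toy data (invented): run_b mimics the same library sequenced twice as deeply.
run_a = ["ACGT", "ACGT", "ACGT", "TTGA"]
run_b = run_a * 2

freq_a = collapse_tags(run_a)
freq_b = collapse_tags(run_b)

# If frequencies are proportional to library size, the scaled values agree.
tpm_a = tags_per_million(freq_a)
tpm_b = tags_per_million(freq_b)
```

In this toy case the two runs give identical scaled frequencies. With real libraries one would instead compare the scaled values of each tag across runs, for example on a scatter plot.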
Identification of transcription start sites by millions of 5′-end tags

For the identification of TSSs, non-redundant tags were aligned to the genome of D. melanogaster (R5.3) in FlyBase (8). We observed that 5′-end tags tended to contain read errors, especially towards their termini. To correct these read errors, the tags were aligned to the genome while allowing, at most, three mismatches. The efficient mapping of millions of tags was an issue that needed to be resolved. We developed and used a parallel version of BLAT (9), which operates on massively parallel clusters.

Another major technical issue was that a single 5′-end tag could be mapped to multiple locations, making it difficult to determine the original location of the tag. To eliminate false-positive positional data, these ambiguous tags were simply excluded from our analysis, so that only uniquely aligned tags were considered. A tag was considered to be uniquely aligned if, for a non-negative integer k (≤3), the tag was mapped to a unique location with k mismatches, although it could be mapped to multiple positions with more than k mismatches. The number of uniquely aligned and redundant tags in each library, and their ratios to the number of raw redundant tags, are shown in the fourth and fifth columns of Table 1, respectively.

A uniquely aligned 5′-end tag identified a TSS in the genome. Distinct tags could be mapped to the same TSS, since the alignment step tolerated mismatches and replaced erroneous nucleotides with the correct nucleotides in the genome. From all seven libraries, a total of 25 083 481 tags were mapped to unique locations, thereby identifying 1 773 851 TSSs; the data breakdown in terms of chromosomes is presented in Table 2.
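The unique-alignment rule and the pooling of distinct tags onto one TSS can be sketched as follows. This is a minimal illustration under one reading of the rule: k is taken to be the smallest mismatch count among a tag's hits, and the tag is kept only if exactly one location attains it. The hit format, the function names `unique_location` and `count_tss`, and the toy data are hypothetical and are not taken from the authors' parallel BLAT pipeline.

```python
from collections import defaultdict

MAX_MISMATCHES = 3  # the "at most three mismatches" limit from the text

def unique_location(hits, max_mismatches=MAX_MISMATCHES):
    """Return the single best location of a tag, or None if ambiguous.

    hits: list of (location, mismatches) pairs, where location is a
    (chromosome, strand, position) tuple (assumed format).
    """
    allowed = [(loc, mm) for loc, mm in hits if mm <= max_mismatches]
    if not allowed:
        return None  # no alignment within the mismatch limit
    k = min(mm for _, mm in allowed)
    best = {loc for loc, mm in allowed if mm == k}
    # Hits with more than k mismatches do not make the tag ambiguous.
    return next(iter(best)) if len(best) == 1 else None

def count_tss(tag_freqs, tag_hits):
    """Sum tag frequencies per TSS over uniquely aligned tags only."""
    tss_counts = defaultdict(int)
    for tag, freq in tag_freqs.items():
        loc = unique_location(tag_hits.get(tag, []))
        if loc is not None:
            tss_counts[loc] += freq
    return dict(tss_counts)

# Toy data (invented for illustration).
tag_freqs = {"tagA": 5, "tagB": 2, "tagC": 4, "tagD": 1}
tag_hits = {
    "tagA": [(("2L", "+", 1000), 0), (("3R", "-", 500), 2)],  # unique at k = 0
    "tagB": [(("2L", "+", 1000), 1)],  # read-error variant of the same TSS
    "tagC": [(("X", "+", 42), 1), (("X", "-", 99), 1)],  # tie: excluded
    "tagD": [(("2R", "+", 7), 4)],  # exceeds the mismatch limit
}
tss_counts = count_tss(tag_freqs, tag_hits)
```

Here `tagA` and `tagB` are distinct sequences that land on the same TSS, so their frequencies are pooled, while `tagC` and `tagD` are dropped.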
Discrepancy between the known representative TSS and the most frequent TSS

In attempting genome annotation, it is usual to choose the longest cDNA sequence in a specific locus to define the representative cDNA. To examine the level of agreement between the newly collected 5′-end tags and the known representative cDNA sequences, we calculated how many of the uniquely aligned, redundant tags were located in the promoters and 5′-UTRs of the representative sequences, and found 96.2% of the 5′-end tags in the UTR regions (Table 3).

Figure 1
Database features and applications

We visualized the numbers of 5′-end tags for each position in a vertical bar (Figure 2
DISCUSSION

The vast transcriptional datasets have been used to characterize differentially expressed genes, especially in relation to age and sexual development. Using these datasets, we have confirmed that the representative TSSs, i.e. the abundantly expressed TSSs flanking FlyBase-annotated TSSs, differ from many of the known FlyBase-annotated TSSs. It has become evident that the rules for start site selection are fundamentally different for different promoters, and large-scale studies have given us the tools to partition promoters into functional classes with respect to TSS information in future studies. As a novel and high-quality data resource, MachiBase is a valuable tool for experimental biologists who are working on D. melanogaster. In the future, we will enrich this database with various annotated data on the fly genome.

FUNDING

Scientific Research on Priority Areas (C) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (in part); Bioinformatics Research and Development (BIRD); the Japan Science and Technology Agency (JST). Funding for open access charge: JST.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

Computational time was provided by the Information Technology Center and the Human Genome Center, at the University of Tokyo. All 5′-end sequence data are deposited in the NCBI Short Read Archive under the accession number SRA002200.

REFERENCES

1. Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP. Gene expression during the life cycle of Drosophila melanogaster. Science. 2002;297:2270–2275.
2. Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, et al. The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002;12:1294–1300.
3. Stolc V, Gauhar Z, Mason C, Halasz G, van Batenburg MF, Rifkin SA, Hua S, Herreman T, Tongprasit W, Barbano PE, et al. A gene expression map for the euchromatic genome of Drosophila melanogaster. Science. 2004;306:655–660.
4. Tomancak P, Berman BP, Beaton A, Weiszmann R, Kwan E, Hartenstein V, Celniker SE, Rubin GM. Global analysis of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 2007;8:R145.
5. Hashimoto S, Suzuki Y, Kasai Y, Morohoshi K, Yamada T, Sese J, Morishita S, Sugano S, Matsushima K. 5′-end SAGE for the analysis of transcriptional start sites. Nat. Biotechnol. 2004;22:1146–1149.
6. Maruyama K, Sugano S. Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides. Gene. 1994;138:171–174.
7. Hashimoto S, Qu W, Budrul A, Ogoshi K, Nakatani Y, Lee Y, Ogawa M, Ametani A, Suzuki Y, Sugano S, et al. High-resolution analysis of the 5′-end transcriptome using a next generation DNA sequencer. in press.
8. Drysdale RA, Crosby MA. FlyBase: genes and gene models. Nucleic Acids Res. 2005;33:D390–D395.
9. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12:656–664.
10. Yan C, Boyd DD. Histone H3 acetylation and H3 K4 methylation define distinct chromatin regions permissive for transgene expression. Mol. Cell Biol. 2006;26:6357–6371.
11. Pokholok DK, Harbison CT, Levine S, Cole M, Hannett NM, Lee TI, Bell GW, Walker K, Rolfe PA, Herbolsheimer E, et al. Genome-wide map of nucleosome acetylation and methylation in yeast. Cell. 2005;122:517–527.
12. Wiren M, Silverstein RA, Sinha I, Walfridsson J, Lee HM, Laurenson P, Pillus L, Robyr D, Grunstein M, Ekwall K. Genomewide analysis of nucleosome density histone acetylation and HDAC function in fission yeast. EMBO J. 2005;24:2906–2918.
13. Nishida H, Suzuki T, Kondo S, Miura H, Fujimura YI, Hayashizaki Y. Histone H3 acetylated at lysine 9 in promoter is associated with low nucleosome density in the vicinity of transcription start site in human cell. Chromosome Res. 2006;14:203–211.
14. Mavrich TN, Jiang C, Ioshikhes IP, Li X, Venters BJ, Zanton SJ, Tomsho LP, Qi J, Glaser RL, Schuster SC, et al. Nucleosome organization in the Drosophila genome. Nature. 2008;453:358–362.
15. Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu YT, Weng ZP, et al. A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006;124:207–219.