A high-resolution map of active promoters in the human genome
1 Ludwig Institute for Cancer Research, 9500 Gilman Drive, La Jolla, CA 92093-0653, USA
2 8125 Math Sciences Building, UCLA Department of Statistics, Los Angeles, CA 90095-1554
3 Nimblegen Systems, Inc., 1 Science Court, Madison, WI 53711
4 Department of Cellular and Molecular Medicine, UCSD School of Medicine, 9500 Gilman Drive, La Jolla, CA 92093-0653, USA
5 These two authors contributed equally to this work.
6 To whom correspondence and material requests should be addressed: Bing Ren. Email: biren/at/ucsd.edu. Phone: 858 822 5766; Fax: 858 534 7750.
The microarray datasets are available from GEO (accession numbers to be provided).
Abstract
In eukaryotic cells, transcription of every protein-coding gene begins with the assembly of an RNA Polymerase II preinitiation complex (PIC) on the promoter [1]. The promoters, in conjunction with enhancers, silencers and insulators, define the combinatorial codes that specify gene expression patterns [2]. Our ability to analyze the control logic encoded in the human genome is currently limited by a lack of accurate information on the promoters of most genes [3]. Here, we describe a genome-wide map of active promoters in human fibroblast cells, determined by experimentally locating the sites of PIC binding throughout the human genome. This map defines 10,571 active promoters corresponding to 6,763 known genes and at least 1,199 un-annotated transcriptional units. Features of the map suggest extensive usage of multiple promoters by human genes and widespread clustering of active promoters in the genome. In addition, examination of the genome-wide expression profile reveals four general classes of promoters that define the transcriptome of the cell. These results provide a global view of the functional relationship among the transcriptional machinery, chromatin structure and gene expression in human cells.
The PIC consists of RNA Polymerase II (RNAP), transcription factor IID (TFIID) and other general transcription factors [4]. Our strategy to map the PIC binding sites involves chromatin immunoprecipitation coupled with DNA microarray analysis (ChIP-on-chip), which combines the immunoprecipitation of PIC-bound chromatin from formaldehyde-crosslinked cells with parallel identification of the resulting bound DNA sequences using DNA microarrays [5,6]. Previously, we demonstrated the feasibility of this strategy by successfully mapping active promoters in the 1% of the human genome that corresponds to the 44 genomic loci known as the ENCODE regions [6,7]. To apply this strategy to the entire human genome, we fabricated a series of DNA microarrays [8] containing roughly 14.5 million 50-mer oligonucleotides, designed to represent all the non-repeat DNA throughout the human genome at 100 basepair (bp) resolution. We immunoprecipitated TFIID-bound DNA from primary fibroblast IMR90 cells with a monoclonal antibody that specifically recognizes the TAF1 subunit of this complex (TBP associated factor 1, formerly TAFII250 [9]; Fig. 1a).
Next, we matched these 12,150 TFIID-binding sites to the 5′ ends of known transcripts in three public transcript databases (DBTSS [10], RefSeq [11], GenBank human mRNA collection [12]) and the EnsEMBL gene catalog [13]. To account for uncertainty in the true 5′ ends of transcripts and in the predicted TFIID-binding positions due to noise within the microarray data, we chose an arbitrary distance of 2.5 Kbp as a measure of close proximity. We found that 10,553 (87%) TFIID-binding sites were within 2.5 Kbp of annotated 5′ ends of known mRNA. We resolved common TFIID-binding sites mapping to similar 5′ ends to define a non-redundant set of 9,330 5′ end-matched TFIID-binding sites. Of these TFIID-binding sequences, 7,789 (83%) were found within 500 bp of the putative transcription start sites (TSS) (Fig. 1c).
Four independent analyses validated the high specificity and accuracy of the active promoters detected in IMR90 cells. First, ChIP-on-chip analysis using an anti-RNAP antibody (8WG16) confirmed the binding of RNAP to at least 9,050 (97%) of the 9,330 promoters in IMR90 cells (Fig. S1). Second, standard chromatin immunoprecipitation (ChIP) performed on 28 promoters randomly selected from the above list confirmed the occupancy of RNAP on all but one promoter (Fig. S2). Third, the 9,330 active promoters are enriched for known promoter-associated sequences such as CpG islands, and the INR and DPE core promoter elements (Fig. 1f).
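The proximity matching of TFIID-binding sites to annotated 5′ ends described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the input layout (dictionaries of positions keyed by chromosome) and the use of a single coordinate per site are assumptions made for illustration, and strand is ignored. Only the 2.5 Kbp cutoff comes from the text.

```python
from bisect import bisect_left

MAX_DISTANCE_BP = 2500  # proximity cutoff stated in the text (2.5 Kbp)


def nearest_tss_distance(site, sorted_tss):
    """Distance in bp from a binding-site position to the closest annotated 5' end.

    sorted_tss must be a non-empty, ascending list of positions.
    """
    i = bisect_left(sorted_tss, site)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(site - t) for t in candidates)


def match_sites(sites_by_chrom, tss_by_chrom, max_distance=MAX_DISTANCE_BP):
    """Split binding sites into 5'-end-matched and unmatched lists (hypothetical helper)."""
    matched, unmatched = [], []
    for chrom, sites in sites_by_chrom.items():
        tss = sorted(tss_by_chrom.get(chrom, []))
        for site in sites:
            # Including the boundary (<=) is an assumption of this sketch.
            if tss and nearest_tss_distance(site, tss) <= max_distance:
                matched.append((chrom, site))
            else:
                unmatched.append((chrom, site))
    return matched, unmatched
```

In this sketch a site counts as matched when its nearest annotated 5′ end lies within the cutoff; sites on chromosomes with no annotation are always reported as unmatched.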
Among the 12,150 mapped TFIID-binding sites, 1,597 are found more than 2.5 Kbp away from previously defined 5′ ends of mRNA, and may represent promoters for novel transcripts or genes (Table S2). Of these, 607 non-redundant TFIID-binding sites were matched within 2.5 Kbp of the 5′ ends of the Expressed Sequence Tag (EST)-based gene models, indicating that they may indeed produce mRNA (Table S2). The remaining TFIID-binding sites were further filtered to a set of 634 putative promoters by requiring the occupancy of RNAP and the presence of AcH3 and MeH3K4 within 1 Kbp of these sites (Fig. S3). To verify that these promoters drive transcription, we analyzed mRNA from the IMR90 cells, using 50-mer oligonucleotide arrays that represent a 28 Kbp sequence surrounding 569 of the 634 unmatched putative promoters. At least 36 novel transcription units were identified near the putative promoter regions, suggesting that these may represent new transcription units yet to be annotated in the human genome (Table S3). The failure to detect mRNA from the other putative promoters may indicate that these transcripts are highly unstable. Indeed, at least one putative promoter is located within 250 bp upstream of a predicted miRNA [17] (Fig. S4), suggesting that some putative promoters could transcribe non-coding RNA that might have escaped detection by conventional mRNA isolation techniques. In all, we defined a set of 1,241 putative promoters that correspond to previously un-annotated transcription units (Fig. 4b).
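The filtering step described above, which keeps an unmatched TFIID-binding site only when RNAP, AcH3 and MeH3K4 are all found within 1 Kbp, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, each signal is reduced to a list of single positions on one chromosome, and the boundary of the window is treated as inclusive.

```python
WINDOW_BP = 1000  # window stated in the text (1 Kbp)


def has_signal_within(site, signal_positions, window=WINDOW_BP):
    """True if any position of the given signal lies within `window` bp of the site."""
    return any(abs(site - p) <= window for p in signal_positions)


def filter_putative_promoters(sites, rnap, ach3, meh3k4, window=WINDOW_BP):
    """Keep sites supported by RNAP occupancy and both histone marks (hypothetical helper)."""
    return [
        site
        for site in sites
        if all(has_signal_within(site, signal, window) for signal in (rnap, ach3, meh3k4))
    ]
```

With the counts reported above, the 607 EST-matched sites and the 634 sites that pass this filter together give the 1,241 putative promoters.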
Two notable features were apparent in this map of active promoters. First, large domains of four or more consecutive genes were found to be simultaneously bound by the PIC and likely transcribed in the IMR90 cells. At least 256 clusters, consisting of 1,668 EnsEMBL genes, can be classified into such regions, and the number of clustered promoters is highly significant (P < 0.001, Table S5). The clustering of active promoters is consistent with previous findings that co-regulated genes tend to be organized into coordinately regulated domains [23–26]. Second, a large number of genes contained two or more active promoters (Table S4). In general, these multiple promoters correspond to transcripts with either different 5′ UTR sequences or distinct first exons (e.g., PTEN) but do not affect the open reading frames. In some cases, however, distinct proteins were produced from multiple promoters (e.g., NR2F2, WEE1). In other cases, transcripts undergo differential splicing and polyadenylation (e.g., NFKB2, STAT3). The widespread usage of multiple promoters in this single cell type indicates a greater complexity of the cellular proteome than previously expected and also reveals highly coordinated regulation of transcriptional initiation, splicing, and polyadenylation throughout the genome [27].
To experimentally verify our observations regarding multiple promoter utilization in IMR90 cells, we selected the WEE1 gene for further analysis. Two TFIID-binding sites were mapped within this gene, corresponding to the 5′ ends of two distinct mRNAs, NM_003390 and AK122837 (Fig. 3a).
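The first feature noted above, domains of four or more consecutive PIC-bound genes, can be illustrated with a short sketch. The function name and input layout (a list of gene name and bound-status pairs in chromosomal order for a single chromosome) are assumptions made for illustration. The sketch only finds the runs; it does not reproduce the significance estimate reported in Table S5.

```python
from itertools import groupby

MIN_CLUSTER_SIZE = 4  # "four or more consecutive genes", as stated in the text


def find_bound_clusters(genes, min_size=MIN_CLUSTER_SIZE):
    """Return runs of consecutive PIC-bound genes of at least `min_size` genes.

    genes: list of (gene_name, is_bound) tuples in chromosomal order
    for one chromosome (hypothetical input layout).
    """
    clusters = []
    # groupby collects adjacent genes that share the same bound status.
    for is_bound, run in groupby(genes, key=lambda gene: gene[1]):
        names = [name for name, _ in run]
        if is_bound and len(names) >= min_size:
            clusters.append(names)
    return clusters
```

Because the scan works on gene order rather than genomic distance, a single unbound gene is enough to split a domain; a distance-aware definition would be a different design choice.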
The active promoter map in IMR90 cells allowed us to systematically investigate the functional relationship between the transcription machinery and gene expression. We examined the genome-wide expression profiles of IMR90 cells and correlated the expression status of 14,437 EnsEMBL genes with promoter occupancy by the PIC. The comparison revealed four general classes of genes (Fig. 4).
The genes in class I and class IV, representing over 75% of the genes examined, support the general model that formation of the PIC on the promoters leads to transcription. The class II and III genes, on the other hand, are inconsistent with this model and may indicate that other mechanisms are responsible for the expression of these genes. We postulate that the discrepancy between PIC formation and transcription on the class II promoters has at least two possible explanations. The first possibility is that the PIC assembles on these promoters, but PIC formation is not sufficient to initiate transcription. Additional regulatory steps, such as promoter clearance or elongation, may be rate-limiting in transcription of these genes [28]. Some notable examples in class II are the immediate early genes, FOS and FOSB; the heat shock protein genes, HSPA6 and HSPD1; and the DNA damage repair genes, MSH5 and ERCC4. The second possibility is that transcription actually takes place at these promoters, but the resulting mRNAs are post-transcriptionally degraded, as in miRNA-mediated post-transcriptional silencing [29]. In contrast to class II, genes in class III appear to be transcribed, but PIC binding on their promoters was not detected. This could simply be due to the moderate sensitivity of our method [6]. To address this issue, we performed standard ChIP assays to detect binding of TFIID and RNAP on 10 randomly selected class III gene promoters.
Nearly 60% of the promoters were weakly associated with TFIID and RNAP in these cells, and were marked by enrichment ratios less than 2-fold but nonetheless above the observed background (Fig. S2). Hence, the failure to detect TFIID and RNAP occupancy in roughly 60% of the class III promoters (~1,700) may be due to weak signals that fall below the detection sensitivity of our method. This result indicates that the promoters of a significant fraction of class III genes are open and accessible for transcription, but that the PIC assembles on these promoters transiently, weakly or only during the early stage of fibroblast differentiation.
In order to understand the functional relationship between histone modification status and gene expression, we examined the histone modifications (AcH3 and MeH3K4) in 29 ENCODE regions [7] (Table S7), with a specific focus on the four classes of gene promoters. As expected, these epigenetic markers were associated with virtually all class I and class II genes, and the vast majority of class III genes. However, roughly 20% of the class IV genes were also associated with these markers (Fig. 4).
Our results provide an initial framework for analysis of the cis-regulatory logic [30] in human cells. The high-resolution map of active promoters in IMR90 cells will enable detailed analysis of transcription factor binding sites within these regions. The promoter map described here can also serve as a reference to understand gene expression in other cell types. We expect that a survey of additional cell types using the same approach will allow comprehensive mapping of all promoters in the human genome, and help elucidate the control logic that governs gene expression in different cell types in the body.
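The four promoter classes discussed above can be expressed as a simple two-by-two classification on PIC occupancy and expression status. The class definitions used here are inferred from the discussion rather than quoted from Fig. 4: class II (PIC detected, no transcript) and class III (transcript detected, no PIC) follow directly from the text, while assigning class I to bound and expressed genes and class IV to genes with neither is an assumption based on the statement that these two classes support the general model. The function names and input layout are hypothetical.

```python
from collections import Counter


def promoter_class(pic_bound, expressed):
    """Assign a gene to class I-IV from two booleans (inferred definitions, see lead-in)."""
    if pic_bound and expressed:
        return "I"    # PIC detected and transcript detected
    if pic_bound:
        return "II"   # PIC detected, transcript not detected
    if expressed:
        return "III"  # transcript detected, PIC not detected
    return "IV"       # neither detected


def class_counts(genes):
    """Count genes per class.

    genes: dict mapping gene name to a (pic_bound, expressed) tuple.
    """
    return Counter(promoter_class(bound, expr) for bound, expr in genes.values())
```

Applied to a table of occupancy and expression calls, the fraction of genes in classes I and IV is the fraction consistent with the model that PIC formation leads to transcription.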
Methods
Supplementary files: Suppl_info (196K, doc); Suppl_figs (4.4M, pdf); Table S1 (2.4M, xls); Table S2 (228K, xls); Table S3 (186K, xls); Table S4 (317K, xls); Table S5 (60K, xls); Table S6 (1.0M, xls); Table S7 (47K, xls)
Acknowledgments
We thank Jim Kadonaga, Richard A. Young, Richard Kolodner, Webster K. Cavenee, Sara Van Calcar, and Christopher K. Glass for discussion and their comments on the manuscript. This research was supported by Ruth L. Kirschstein National Research Service Award F32CA108313 (T. H. K.); Ford Foundation Predoctoral Fellowship (L. O. B.); Ludwig Institute for Cancer Research (B. R.); U01HG003151 (B. R.) and R21CA105829 (B. R.) from NIH; and IIS-0222967 from NSF (Y. W.).
References
1. Smale ST, Kadonaga JT. The RNA polymerase II core promoter. Annu Rev Biochem. 2003;72:449–79.
2. Tjian R, Maniatis T. Transcriptional activation: a complex puzzle with few easy pieces. Cell. 1994;77:5–8.
3. Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM. Identification and functional analysis of human transcriptional promoters. Genome Res. 2003;13:308–12.
4. Reinberg D, et al. The RNA polymerase II general transcription factors: past, present, and future. Cold Spring Harb Symp Quant Biol. 1998;63:83–103.
5. Ren B, et al. Genome-wide location and function of DNA binding proteins. Science. 2000;290:2306–9.
6. Kim TH, et al. Direct isolation and identification of promoters in the human genome. Genome Res. 2005;15 in press.
7. The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306:636–40.
8. Singh-Gasson S, et al. Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat Biotechnol. 1999;17:974–8.
9. Ruppert S, Wang EH, Tjian R. Cloning and expression of human TAFII250: a TBP-associated factor implicated in cell-cycle regulation. Nature. 1993;362:175–9.
10. Suzuki Y, Yamashita R, Sugano S, Nakai K. DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res. 2004;32(Database issue):D78–81.
11. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence project: update and current status. Nucleic Acids Res. 2003;31:34–7.
12. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004;32(Database issue):D23–6.
13. Birney E, et al. Ensembl 2004. Nucleic Acids Res. 2004;32(Database issue):D468–70.
14. Antequera F, Bird A. Number of CpG islands and genes in human and mouse. Proc Natl Acad Sci U S A. 1993;90:11995–9.
15. Ohler U, Liao GC, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002;3:RESEARCH0087.
16. Schubeler D, et al. The histone modification pattern of active genes revealed through genome-wide chromatin analysis of a higher eukaryote. Genes Dev. 2004;18:1263–71.
17. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004;32(Database issue):D109–11.
18. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–45.
19. Bertone P, et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306:2242–6.
20. Kampa D, et al. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 2004;14:331–42.
21. Saha S, et al. Using the transcriptome to annotate the genome. Nat Biotechnol. 2002;20:508–12.
22. Rinn JL, et al. The transcriptional activity of human Chromosome 22. Genes Dev. 2003;17:529–40.
23. Su AI, et al. Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci U S A. 2002;99:4465–70.
24. Spellman PT, Rubin GM. Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002;1:5.
25. Roy PJ, Stuart JM, Lund J, Kim SK. Chromosomal clustering of muscle-expressed genes in Caenorhabditis elegans. Nature. 2002;418:975–9.
26. Caron H, et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science. 2001;291:1289–92.
27. Maniatis T, Reed R. An extensive network of coupling among gene expression machines. Nature. 2002;416:499–506.
28. Krumm A, Hickey LB, Groudine M. Promoter-proximal pausing of RNA polymerase II defines a general rate-limiting step after transcription initiation. Genes Dev. 1995;9:559–72.
29. Ambros V. The functions of animal microRNAs. Nature. 2004;431:350–5.
30. Yuh CH, Bolouri H, Davidson EH. Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science. 1998;279:1896–902.