Copyright © 2008, Cold Spring Harbor Laboratory Press

Genome-wide mapping and analysis of active promoters in mouse embryonic stem cells and adult organs

1 Ludwig Institute for Cancer Research, UCSD School of Medicine, La Jolla, California 92093-0653, USA; 2 UCSD Bioinformatics Graduate Program, UCSD, La Jolla, California 92093-0653, USA; 3 Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724; 4 Department of Medicine, UCSD School of Medicine, La Jolla, California 92093-0660, USA; 5 NimbleGen Systems Inc., Madison, Wisconsin 53711, USA; 6 Department of Cellular and Molecular Medicine, UCSD School of Medicine, La Jolla, California 92093-0653, USA

7 These authors contributed equally to this work. 8 Corresponding authors. E-mail biren/at/ucsd.edu; fax (858) 534-7750. E-mail rgreen/at/nimblegen.com; fax (608) 218-7601.

Received May 3, 2007; Accepted October 12, 2007.

Abstract

By integrating genome-wide maps of RNA polymerase II (Polr2a) binding with gene expression data and H3ac and H3K4me3 profiles, we characterized promoters with enriched activity in mouse embryonic stem cells (mES) as well as adult brain, heart, kidney, and liver. We identified ~24,000 promoters across these samples, including 16,976 annotated mRNA 5′ ends and 5153 additional sites validating cap-analysis of gene expression (CAGE) 5′ end data. We showed that promoters with CpG islands are typically non-tissue specific, with the majority associated with Polr2a and the active chromatin modifications in nearly all the tissues examined. By contrast, the promoters without CpG islands are generally associated with Polr2a and the active chromatin marks in a tissue-dependent way. We defined 4396 tissue-specific promoters by adapting a quantitative index of tissue-specificity based on Polr2a occupancy.
While there is a general correspondence between Polr2a occupancy and active chromatin modifications at the tissue-specific promoters, a subset of them appear to be persistently marked by active chromatin modifications in the absence of detectable Polr2a binding, highlighting the complexity of the functional relationship between chromatin modification and gene expression. Our results provide a resource for exploring promoter Polr2a binding and epigenetic states across pluripotent and differentiated cell types in mammals.

Over 200 different cell types underscore the functional complexity of mammals (Alberts et al. 2002). In turn, the complement of genes expressed in each cell type specifies its unique functions (Okazaki et al. 2002; Su et al. 2002, 2004; Sharov et al. 2003; Zhang et al. 2005). Throughout the genome, regulatory sequences such as promoters, enhancers, and insulators control gene expression by interacting with specific transcription factors, many of which exert their effect by modulating the local chromatin modification states (Lee et al. 2004; Guillemette et al. 2005; Mito et al. 2005; Pokholok et al. 2005; Raisner et al. 2005; Yuan et al. 2005; Zhang et al. 2005; Heintzman et al. 2007). Thus, unbiased genome-wide profiles of transcription factor binding and chromatin modifications at these regulatory sequences, across a panel of mammalian cell types, are expected to provide insights into the regulatory mechanisms of tissue-specific gene expression (Levine and Tjian 2003).

Previously, large-scale efforts to understand mammalian tissue-specific expression have been devoted to the investigation of transcript expression patterns across cell and tissue types. Microarray-based technologies and high-throughput sequencing methods have been used to determine steady-state mRNA levels of genes in a compendium of cell and tissue types under normal or pathological conditions (Okazaki et al. 2002; Su et al. 2002, 2004; Sharov et al. 2003; Zhang et al. 2004).
These data sets have been valuable for understanding the tissue-specific gene expression programs and provide a rich source of information for defining common transcription factor motifs that may underlie tissue-specific patterns of expression (Wasserman and Fickett 1998; Wasserman et al. 2000; Smith et al. 2005, 2006, 2007; Xie et al. 2005; Xuan et al. 2005). Recently, advances in the sequencing of transcript 5′ ends have also expanded the annotation of mammalian promoters in different mammalian tissues and provided references of potential transcriptional start sites for most mammalian genes (Carninci et al. 2005, 2006; Kimura et al. 2006). These recent studies have revealed a large spectrum of transcripts for each gene generated by extensive usage of alternative promoters, alternative splicing, and alternative polyadenylation sites. The extent of alternative promoter usage and the identification of transcription factor motifs suggest the key role of promoters in contributing to the control of gene expression leading to mammalian cell-type diversity. While measuring the abundance and defining the 5′ ends of RNA transcripts are crucial for the understanding of mechanisms that drive tissue-specific gene expression programs, such information is not sufficient to resolve the complex mechanisms of gene regulation. For example, we and others have recently shown that a significant number of promoters are in a poised state of transcription—they are bound by the general transcription machinery but do not have detectable transcription activities in steady-state cells (Kim et al. 2005b; Guenther et al. 2007). To this end, it is necessary to directly analyze transcription factor loading and chromatin structures at promoters. 
As a first step toward understanding the gene regulatory mechanisms in mammalian cells, we now directly identify active promoters by unbiased mapping of the RNA polymerase II pre-initiation complex (PIC) in the mouse genome across a panel of mouse organs (brain, heart, kidney, liver) and mouse embryonic stem cells (hereafter collectively referred to as “tissues”). In addition, we profiled two active chromatin modifications (H3ac and H3K4me3) at each identified promoter and tracked the corresponding gene transcript levels. By examining these complementary data sets across the tissues surveyed, we identified a complex relationship among chromatin modifications, Polr2a occupancy, and tissue-specific gene expression. The majority of CpG island-containing promoters are associated with Polr2a and the active chromatin marks, regardless of tissue type. By contrast, non-CpG island promoters are typically associated with the active chromatin marks and occupied by Polr2a in a tissue-restricted manner. We developed a quantitative measure of promoter tissue-specificity based on Polr2a binding that defined 4396 tissue-specific promoters. Detailed motif analysis of the tissue-specific promoters and functional annotation of corresponding genes showed an enrichment of known tissue-specific transcription factors and functional groups in these tissue-specific promoters. Interestingly, comparisons of H3K4me3 and H3ac profiles across tissues for these tissue-specific promoters showed unexpected patterns of enrichment of these marks in adult tissues for promoters with enriched activity in ES cells. These results suggest the importance of characterizing epigenetic profiles in addition to motif analysis in cataloguing the regulatory sequences that contribute to mammalian cell-type diversity.

Results

Genome-wide mapping of PIC-binding sites in mouse mES cells and adult organs

We adapted the strategy we previously used to map active promoters in human fibroblast cells (Fig. 1
Using the procedure summarized in Figure 1

Since the hypophosphorylated form of Polr2a is expected to localize over transcription initiation sites in the genome (Cheng and Sharp 2003; Brodsky et al. 2005; Kim et al. 2005a), we compared the location of these binding regions with annotated mRNA transcript start sites (TSS) downloaded from the UCSC Genome Browser (MM5; refGene, knownGene, ensGene, and all_mrna) (Hinrichs et al. 2006); 16,976 (70%) of these sites mapped within 2.5 kbp of 66,559 distinct TSS based on RefSeq, Ensembl, UCSC knownGene, or GenBank annotation. These transcripts in turn correspond to 11,000 out of ~24,000 mouse genes based on Entrez Gene annotation (Maglott et al. 2005). Of the remaining unmatched sites within and outside of known gene loci, 5153 mapped within 2.5 kbp of TSS based on 5′ cap analysis of gene-expression (CAGE) sequencing from a panel of 145 mouse cDNA libraries (Shiraki et al. 2003; Carninci et al. 2005). Taken together, these two lines of evidence provide independent support that 91% of these Polr2a binding regions correspond to known transcription initiation sites (Table 1).
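The matching of binding sites to annotated start sites within 2.5 kbp can be illustrated with a minimal sketch. It assumes that each site is represented by a single midpoint coordinate on one chromosome and that distance is measured from that midpoint to the nearest TSS; the text does not state which reference point was used, so this is an assumption, and the function names `nearest_tss_distance` and `match_sites` are hypothetical.

```python
from bisect import bisect_left


def nearest_tss_distance(site_mid, sorted_tss):
    """Distance in bp from a site midpoint to the closest TSS (sorted_tss must be sorted)."""
    i = bisect_left(sorted_tss, site_mid)
    # Only the TSS just below and just above the midpoint can be the nearest one.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(site_mid - t) for t in candidates)


def match_sites(site_mids, tss_positions, max_dist=2500):
    """Flag each site midpoint that lies within max_dist bp of any TSS."""
    sorted_tss = sorted(tss_positions)
    return [nearest_tss_distance(m, sorted_tss) <= max_dist for m in site_mids]


# Toy coordinates (not real data): only the first site is within 2.5 kbp of a TSS.
print(match_sites([1000, 10000, 50000], [3000, 20000]))

# Combined support reported in the text: annotated plus CAGE-matched sites
# out of the 24,363 sites in the catalog.
print(round(100 * (16976 + 5153) / 24363))
```

The last line reproduces the 91% figure from the two counts given in the text and the catalog size of 24,363 reported in the Discussion.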
The distance distribution of Polr2a binding sites to matching TSS clearly supports the accuracy of our method in defining known transcription initiation sites (Supplemental Fig. S3). In addition, the number of promoters relative to the number of genes suggests the prevalence of alternative promoter usage. For instance, a recent RNA interference study defined estrogen-related receptor beta (Esrrb) as one of seven genes that are critical for embryonic stem cell renewal in vitro (Ivanova et al. 2006). We identified two tissue-specific promoters for this gene; one has enriched Polr2a binding in mES, while the other shows enriched binding in kidney (Fig. 2
Additionally, in characterizing the genomic distribution of the CAGE-matched sites, we validate estimates of exonic transcription initiation activity based on CAGE data (Carninci et al. 2006). The majority (62%) of the CAGE-matched sites reside within known gene boundaries (exonic and intronic) (Supplemental Figs. S3, S4). A substantial fraction is tissue-specific (37%), and the prevalence of these sites underscores the role of transcription initiation, along with splicing, in defining the complexity of transcript populations even from within known gene loci. A previous study based on CAGE tag frequency has correlated this exonic promoter activity with tissue-specific genes (Carninci et al. 2006). By examining the co-localization of H3K4me3, an epigenetic mark associated with 5′ ends of active genes from yeast to human (Pokholok et al. 2005; Heintzman et al. 2007), we predict 382 sites not near known TSS or CAGE tag clusters as putative promoters. This fraction (1.6%) of our catalog suggests that only a small number of transcription initiation sites are still missed by extensive 5′ end sequencing efforts to annotate the mouse transcriptome (Supplemental Fig. S3). A large fraction (37%) of these putative promoters appears to be tissue specific. These putative promoters are primarily from mES (67%) and kidney (18%). Further investigations are necessary to determine the matching transcripts for these uncharacterized promoters.

Assessing promoter Polr2a occupancy across different tissues

In order to characterize the relative Pol II occupancy at each promoter across a number of tissues, we used the Polr2a ChIP-chip log2 ratio enrichment and defined an index of tissue activity for each promoter by adapting a Shannon entropy measure previously applied to microarray gene expression and EST data (Schug et al. 2005).
We defined the relative Polr2a binding in a tissue t for a given site s as $p_{t/s} = B_{t,s} / \sum_{1 \le t \le N} B_{t,s}$, where $B_{t,s}$ is the average ChIP-chip log2 ratio in the 1-kbp neighborhood centered at the midpoint of Polr2a binding site s, and N is the total number of tissues surveyed. The entropy of a site’s Polr2a binding distribution across tissues is then defined as $H_s = -\sum_{1 \le t \le N} p_{t/s} \log_2 p_{t/s}$. The measure $H_s$ has units of bits, and, as in its use with expression data, the value of $H_s$ ranges from zero for sites bound by Polr2a in a single tissue to $\log_2(N)$ for sites bound uniformly in all tissues surveyed. We also adapted the companion measure of “categorical tissue-specificity” to characterize the bias of a Polr2a binding site for a particular tissue, defined as $Q_{s/t} = H_s - \log_2(p_{t/s})$. This index also has units of bits and, as before, has a minimum of zero when a site is bound by Polr2a exclusively in that tissue and grows without bound as the relative binding of Polr2a in that tissue goes to zero.

We used these measures of entropy and categorical tissue-specificity to assess the usage of all Polr2a binding sites across tissues. When applied to sites not matched to known mRNAs but near known microRNAs (miRNAs), 10 of 19 matched miRNAs were classified as tissue-specific. Recent studies have provided evidence that miRNAs play a pivotal role in defining tissue- and cell-specific expression patterns (Table 2; Ambros 2004; Lim et al. 2005). Indeed, seven of the 10 promoters we defined as tissue-specific for the miRNA were cloned from the corresponding tissue source or the closely related tissue source, in the case of mES and testis (Griffiths-Jones et al. 2006). Two of these tissue-specific miRNAs have been shown to down-regulate a large number of mRNAs in human: miR-124 transfection in HeLa cells shifted the expression profile toward that of brain, while miR-1 shifted the expression profile of HeLa cells toward heart and skeletal muscles (Lim et al. 2005).
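The entropy and categorical tissue-specificity measures defined in this section can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and the clipping of negative log2 ratios to zero is an assumption made here so that the relative binding values form a valid distribution (the text does not say how negative values were handled).

```python
import math


def relative_binding(b):
    """Relative binding per tissue: each tissue's share of the summed signal for one site."""
    # Assumption: negative average log2 ratios are treated as no binding.
    clipped = [max(x, 0.0) for x in b]
    total = sum(clipped)
    if total == 0:
        raise ValueError("site has no positive binding signal in any tissue")
    return [x / total for x in clipped]


def entropy(p):
    """Shannon entropy H_s in bits; terms with zero relative binding contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def categorical_specificity(p):
    """Q for every tissue: H_s - log2(p); infinite where relative binding is zero."""
    h = entropy(p)
    return [h - math.log2(x) if x > 0 else math.inf for x in p]


# Toy values for five tissues (not real data).
uniform = relative_binding([1.0, 1.0, 1.0, 1.0, 1.0])
single = relative_binding([3.0, 0.0, 0.0, -0.5, 0.0])
print(entropy(uniform), math.log2(5))   # uniform binding reaches the maximum log2(N)
print(entropy(single))                  # binding in one tissue gives zero entropy
print(categorical_specificity(single))  # zero for the bound tissue, unbounded elsewhere
```

With N = 5 tissues the maximum entropy is log2(5), about 2.32 bits, which gives a scale for reading entropy cutoffs applied to these sites.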
Overall, the majority of transcript-matched promoters have ubiquitous activity (H ≥ 2) by the Polr2a binding entropy across the tissues surveyed (Fig. 3
Tissue-specific Polr2a binding and expression

To further characterize the relationships among promoter Polr2a binding, active chromatin modifications, and transcript levels, we focused the remainder of our analysis on 9% of the gene promoters (937) with Polr2a binding enriched in a specific tissue and profiled the Polr2a, H3ac, and H3K4me3 ChIP-chip log2 ratios 2 kbp upstream of and downstream from a reference start site. To initially validate our classification, we also assessed the normalized expression signal across tissues (Fig. 5
The panels in Figure 5

Comparison of genes defined as tissue-specific based on binding and expression allows the identification of a high-confidence set of genes with tissue-enriched activity. Conversely, examining the genes defined as tissue-specific by Polr2a binding but not supported by expression data can be useful in identifying possible misassignment of Polr2a binding to a gene based on the nearest 5′ end assumption or the transcript to gene mapping annotation. Alternatively, this minority might represent tissue-specific promoters for genes which might be regulated at steps beyond initiation (Ambros 2004; Saunders et al. 2006). For instance, two genes with enriched Polr2a binding and histone modifications at their promoter region have no enrichment in mES based on our expression profiling data: 4930511H11Rik appears to be more highly expressed, albeit at low levels, in adult tissues, while Tmcc3 is called absent across the tissues we surveyed. Based on the GNF expression atlas, 4930511H11Rik appears to be selectively expressed in testis, while Tmcc3 is selectively expressed in the oocyte and fertilized egg (Supplemental Fig. S7).

Tissue-specific Polr2a binding and chromatin modifications

Across tissues, tissue-specific Polr2a enrichment matches the enrichment of epigenetic marks generally associated with transcriptional activity (Fig. 5
ChIP with quantitative PCR (qPCR) for Polr2a, H3K4me3, and H3ac at four genes from each mES category confirm the Polr2a enrichment at these promoters specific to mES. We also verify the partitioning of these two categories by the relative enrichment of histone modifications, in particular of H3K4me3, in adult tissues for mES c2 (Fig. 7
To examine the extent to which H3K4me3 generally occurs without Polr2a enrichment at promoters, we performed individual H3K4me3 ChIP-chip for brain, heart, kidney, and liver using an array covering a nearly 60-Mbp stretch of chromosome 11. Since chromatin modification data do not conform to the peak-finding model assumptions, we used an adaptive promoter-focused hit calling strategy to define both Polr2a and H3K4me3 enrichment at these promoters (Supplemental Methods). From this analysis, 20%–38% of the promoters enriched with H3K4me3 in adult tissues have no detectable Polr2a binding (Table 3). This suggests that the observation of H3K4me3 enrichment at promoters without detectable Polr2a binding for mES c2 promoters in adult tissues may be a special case of a more general phenomenon.
Functional annotation of tissue-specific genes

To compare our grouping of genes based on tissue-enriched Polr2a promoter binding with existing functional annotation, we determined the enriched GO biological process (GO-BP) categories in each group (Zhang et al. 2004; Gene Ontology Consortium 2006). We found that the most enriched GO-BP categories correspond to the known physiological roles of the tissue and cell type (Supplemental Table S3). In mES, we observed that the two classes of gene promoters have a subtle difference in the ranking of the most enriched GO-BP categories. The mES c2 class is most enriched in genes related to cell cycle and cell division, while mES c1 is most enriched in genes related to cell proliferation and pattern specification. Among the genes in mES c2 are those which may not have restricted expression in mES but clearly have enriched activity, such as a host of cell-cycle–related genes (Ube2c, Sgol2, Bub1, Bub1b, Aurkb, Cdc2a, Cdca2, Cdca7, Cdc25c) and DNA replication genes (Mcm3, Mcm8). Among genes in mES c2 with reported roles in development are Gli zinc finger transcription factors (Gli1, Gli2, Zic3) activated through the Sonic hedgehog (Shh) signal-transduction pathway as well as a hedgehog receptor gene, Ptch2 (Ruiz i Altaba et al. 2002). Gli1 and Gli2, both of which mediate Hh signals, have been implicated in tumorigenesis and are reported to be found among precursor cells in adult tissues (Ruiz i Altaba et al. 2002). Additionally, the lymphoid enhancer factor 1 (Lef1) gene, which mediates the effects of the Wnt signaling pathway, belongs in this class (Reya and Clevers 2005). Among the mES c1 genes, we find the majority of genes that have known roles in stem-cell renewal and pluripotency such as Pou5f1 and Nanog (Boyer et al. 2005; Loh et al. 2006), as well as additional stem-cell markers such as Dppa4, Nr0b1, Utf1, Tdgf1, and Zfp42 (Wei et al. 2005; Niakan et al. 2006).
We also identified previously reported ES-enriched genes in the TGF-β signaling pathway such as Lefty1, Lefty2, and Nodal (Besser 2004; Wei et al. 2005) as well as fibroblast growth factors such as Fgf4, Fgf15, and Fgf17. Among these FGFs, Fgf4 has a reported role in trophoblast stem-cell proliferation (Tanaka et al. 1998). Because the comparison of Polr2a binding in mES is relative to adult tissues, genes with reported roles in development were also found in mES c1. These may not necessarily be ES-specific transcription factors but may have poised promoters marked by Polr2a binding and H3K4me3 or basal transcriptional activity (Bernstein et al. 2006). Gbx2 has reported roles in nervous system development (Joyner et al. 2000); Pitx2, heart development (Kioussi et al. 2002); and Six6os, eye development (Alfano et al. 2005).

Sequence motifs at tissue-specific promoters

Nearly half (45%) of the promoters in mES c2 overlap CpG islands. This proportion is more than twofold higher than the overlap of promoters in mES c1 with CpG islands (20%). Among the adult tissues, brain appears to have the largest overlap (24%) between tissue-specific gene promoters and CpG islands compared with heart (10%), kidney (14%), and liver (9%). This is in agreement with a previous observation that, among transcripts with specific expression patterns, promoters associated with the central nervous system were exceptionally CpG-rich (Carninci et al. 2006).

In order to define discriminating sequence motifs within each tissue-specific promoter set, we used two complementary motif-finding strategies. The first strategy measures motif enrichment in each tissue-promoter set relative to a background set based on a balanced error measure which equally weighs a motif’s ability to identify promoters in the set (sensitivity) and to correctly discriminate against promoters not in the set (specificity) (Smith et al. 2005, 2006, 2007).
Using this strategy, we characterized the enrichment of known vertebrate motifs from TRANSFAC (Matys et al. 2006) and JASPAR (Sandelin et al. 2004) in each tissue-specific promoter set relative to two types of background promoter sets: (1) a random set of mouse promoters from CSHLMPD (Xuan et al. 2005), and (2) the relative complement of the tissue-specific promoter set in the set of all tissue-specific promoters (Table 4). To identify novel motifs in each tissue-specific promoter set, we used a previously described de novo motif finder, DME (Smith et al. 2005, 2006, 2007). We evaluated the significance of these novel motifs using the same misclassification metric and report the novel motifs for each set (Table 4). To complement this strategy, we also used relative overrepresentation of conserved occurrences to define characteristic motifs for each tissue set. By these methods, we identified binding sites for transcription factors with previously reported roles in the specific tissue or cell type, as well as others whose roles remain unclear or whose binding domains appear similar to those of transcription factors with reported roles in that tissue (Table 4).
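The balanced error idea behind the first motif strategy can be illustrated as follows. The exact formula used by the cited method is not given in this text, so the sketch assumes the standard balanced error rate, one minus the mean of sensitivity and specificity; the function name and the counts in the example are hypothetical.

```python
def balanced_error(fg_hits, fg_total, bg_hits, bg_total):
    """Balanced error of a motif used as a classifier of foreground vs. background promoters.

    fg_hits / fg_total: foreground (tissue-specific) promoters containing the motif.
    bg_hits / bg_total: background promoters containing the motif.
    Assumption: error = 1 - (sensitivity + specificity) / 2.
    """
    sensitivity = fg_hits / fg_total        # foreground promoters correctly identified
    specificity = 1 - bg_hits / bg_total    # background promoters correctly rejected
    return 1 - (sensitivity + specificity) / 2


# Hypothetical counts: a motif in 80 of 100 foreground and 20 of 100 background promoters.
print(balanced_error(80, 100, 20, 100))
# A motif that occurs nowhere is no better than chance under this measure.
print(balanced_error(0, 100, 0, 100))
```

Because sensitivity and specificity are weighted equally, the measure is not dominated by the larger of the two promoter sets, which matters when a small tissue-specific set is compared against a large background.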
Discussion

One of the first steps toward a comprehensive understanding of the mechanisms of cell diversity is to define and profile the active promoters in different cell types. Here we describe an integrated approach for profiling the epigenetic and sequence features of active promoters in mouse embryonic stem cells and four adult organs. We defined 24,363 Pol II binding sites that include 16,976 annotated 5′ ends of known transcripts and 5153 TSS previously supported by CAGE evidence alone. We confirmed widespread usage of alternative promoters by mammalian genes, and identified over four thousand promoters as tissue-specific. These tissue-specific promoters led to the identification of transcription factor motifs for genes with tissue-specific expression.

Our results also reveal complex relationships among Polr2a binding, chromatin modifications, and gene expression in different tissues. We showed that most CpG island promoters are associated with Polr2a and active chromatin marks in nearly all the tissues, but non-CpG island promoters are accompanied by the active chromatin marks and Polr2a in a highly tissue-restricted manner. For most tissue-specific promoters, there is a general correspondence between Polr2a binding and presence of active chromatin marks at the promoters. However, a subset of ES cell gene promoters are persistently marked by active chromatin modifications even in the absence of detectable Polr2a binding in adult tissues. Therefore, distinct mechanisms of gene regulation appear to be involved in CpG and non-CpG promoters and at different classes of tissue-specific promoters.

To characterize the tissue-specificity of factor binding by ChIP-chip at promoters, we adapted a quantitative index based on Shannon entropy (Schug et al. 2005). This strategy overcomes some of the limitations associated with ChIP-chip technology.
The current emphasis on “bound” versus “unbound” sites in ChIP-chip analysis sacrifices sensitivity for specificity in defining sites associated with a particular factor. This naïve binary classification becomes especially problematic, however, when comparing factor occupancy at genomic sites across cell types or conditions. Further development of quantitative measures of relative ChIP-enrichment for a factor’s genomic localization across samples or conditions will be critical in circumventing these issues. We used two complementary approaches—classification and conservation—to define the sequence motifs associated with tissue-specific promoters based on our entropy measure. Although we identified known motifs previously associated with these tissue-specific promoters, none of the novel motifs defined based on classification ability was significantly enriched based on the strict conservation metric. In particular, conservation did not support the novel motif, which was the only motif identified in mES c1. In general, promoters with mES-enriched activity were characterized by a dearth of significant motifs, known and novel, relative to adult tissues. Although our limited motif results in mES cells may reflect the bias of existing motif databases and the limitations of our motif-analyses strategies, we posit that long-range or distal regulatory elements might play a more critical role in regulating the expression of enriched transcripts in ES cells. Although in general there are close associations among Polr2a binding, histone modifications, and transcript levels at most tissue-specific promoters, we showed H3K4me3 enrichment at a substantial fraction of promoters with weak to undetectable Polr2a occupancy in adult tissues. This trend is striking for roughly half of the promoters defined as mES-specific based on Polr2a binding and gene expression (mES c2). 
These promoters with enriched activity in mES remain epigenetically marked by H3ac and H3K4me3 in adult tissues even without detectable Polr2a binding. Modifications associated with transcriptional activity, in particular H3K4me3, have been suggested to play additional roles as markers of recent transcription or poised activation at promoters, directly or indirectly inhibiting other forms of chromatin-mediated repression (Kouskouti and Talianidis 2005; Bernstein et al. 2006; Roh et al. 2006; Ruthenburg et al. 2007; Weber et al. 2007). Subtle differences in the known function and identity of genes between the two mES classes reveal more known mouse embryonic stem-cell markers within mES c1 (Nanog, Pou5f1, Dppa4, Nr0b1, Utf1, Tdgf1). Promoters in mES c2 might be associated with a unique set of genes, such as the Gli zinc finger transcription factors, expressed at low levels, or in a small subset of cell types, within adult tissues (Ruiz i Altaba et al. 2002). The mES c2 category, relative to its complement among promoters with mES enriched activity, is distinguished by a twofold higher overlap with CpG islands (45%). This sequence distinction might provide a clue to understanding this class and its regulation (Roh et al. 2006; Weber et al. 2007). Further work is underway to more precisely characterize this phenomenon and its extent.

Our approach toward understanding tissue-specific gene expression integrates Polr2a binding, chromatin modifications, and sequence features of promoters with measurements of relative transcript abundance. The genomic maps of Polr2a binding and chromatin modifications will be valuable resources that complement profiles of transcript levels and abundance for unraveling the layers of control governing gene expression patterns across cell types.
Mapping these features in additional cell types at various developmental stages will likely provide further insight as to how cell-specific programs of expression are specified by sequence and epigenetic features across development.

Methods

Sample preparation

R1 ES cells (a gift from Dr. Don Cleveland, Ludwig Institute for Cancer Research, San Diego) were maintained on top of feeder cells in a cell culture dish with DMEM high-glucose medium supplemented with 15% FBS, 0.1 mM nonessential amino acids, 1 mM sodium pyruvate, 1 μM β-mercaptoethanol, 2 mM L-glutamine, 50 μg/mL pen/strep, and LIF. Cells were passaged once on 0.1% gelatin without feeder cells before being harvested. Cells were harvested and cross-linked with 1% formaldehyde for 20 min when they reached ~80% confluence on the plates. Mouse tissues were dissected from a female BL6 mouse at 10–12 wk, chopped into small pieces (~1 mm³) with a razor blade in cold 1× PBS, and cross-linked with 1% formaldehyde for 30 min at room temperature. All samples were then sonicated according to previously described protocols (Li et al. 2003).

Chromatin immunoprecipitation with microarrays (ChIP-chip)

Chromatin immunoprecipitation was performed as previously described (Li et al. 2003). Briefly, 2 mg of sonicated chromatin (OD260) was incubated with 10 μg of antibody (anti-RNA polymerase II, MMS-126R, Covance; anti-AcH3, 06-599, Upstate; anti-Me3H3K4, 07-473, Upstate) coupled to the IgG magnetic beads (Dynal Biotech). The magnetic beads were washed eight times with RIPA buffer (50 mM HEPES at pH 8.0, 1 mM EDTA, 1% NP-40, 0.7% DOC, and 0.5 M LiCl, supplemented with Complete protease inhibitors from Roche Applied Science), and washed once with TE (10 mM Tris at pH 8.0, 1 mM EDTA). After washing, the bound DNA was eluted at 65°C in elution buffer (10 mM Tris at pH 8.0, 1 mM EDTA, and 1% SDS). The eluted DNA was incubated at 65°C overnight to reverse the cross-links.
Following incubation, the immunoprecipitated DNA was treated sequentially with Proteinase K and RNase A and was then desalted using the QIAquick PCR purification kit (Qiagen). The purified DNA was blunt ended using T4 polymerase (New England Biolabs) and ligated to the linkers (oJW102, 5′-GCGGTGACCCGGGAGATCTGAATTC-3′, and oJW103, 5′-GAATTCAGATC-3′). The ligated DNA was subjected to ligation-mediated PCR, labeled with Cy3 and Cy5 dCTP using a BioPrime DNA labeling kit (Invitrogen), and hybridized to the mouse genome tiling microarray. The 37-array genome-scan tiling set, containing 14.3 million 50-mer oligonucleotides positioned at every 100 bp, was designed and fabricated using the maskless array synthesis technology (MAS) by NimbleGen Systems. These arrays were designed to contain all the non-repetitive sequences throughout the mouse genome (NCBIv33, mm5).

Initial identification of Polr2a binding sites in five tissues

After scanning and image extraction, Cy5 (ChIP DNA) and Cy3 (input) signal values for each of the 37 genome tiling arrays were normalized by intensity-dependent Loess using the R package limma (Gentleman et al. 2004; Smyth 2005). Median filtering (window size = 3 probes) was used to smooth log2(Cy5/Cy3) data across the tiled regions. For each array, ChIP-enriched probe clusters were defined as regions with a minimum of four probes separated by a maximum of 500 bp with filtered log2R greater than 2.5 standard deviations from the mean log ratio, as used in our previous study of TAF1 binding in the human genome (Kim et al. 2005b). The application of the analysis above for each genome-scan tiling set corresponding to Polr2a ChIP-chip for each tissue resulted in five sets (brain, heart, kidney, liver, embryonic stem cells) of putative Polr2a binding regions in the mouse.

Condensed array ChIP-chip

We designed a condensed array by combining the five sets of putative Polr2a binding regions from the five Polr2a genome-wide scans.
Each binding region was extended by 2000 bp upstream and downstream and overlapping regions from the Polr2a ChIP-chip of different tissues were merged to yield a set of 32,482 putative Polr2a binding regions for condensed array design. NimbleGen Systems used the same probe designs from the genome-scan tiling set overlapping the 32,482 regions to synthesize the condensed scan array set containing 1.5 million probes in four arrays. We performed 15 ChIP-chip experiments over the condensed array design for three factors (Polr2a, H3ac, H3K4me3) across five mouse tissues. After scanning and image extraction, Cy5 (ChIP DNA) and Cy3 (input) signal values for each of the four condensed-scan tiling arrays (in each set) were normalized by applying either intensity-dependent Loess or median-scaling normalization with the correction based only on the intensities of 14,572 control probes (designated RANDOM_GC11_GC34). The R package limma was used to implement the normalization (Gentleman et al. 2004; Smyth 2005).

Final catalog of Polr2a binding sites

To define a final catalog of Polr2a binding sites we applied an improved version of the peak-finding algorithm which we previously used to define TAF1 binding in human IMR90 cells (Kim et al. 2005b; Zheng et al. 2005). This algorithm predicts a binding site for a factor at the probe-level resolution. The P value for significant peaks is based on the following test-statistic:
As a second step in defining a catalog of Polr2a binding sites, we pooled the confirmed peaks in each tissue and merged all sites within 1000 bp of each other. This cutoff was based on the distribution of nearest-neighbor distances between confirmed peaks. Sites were then merged across tissues if they overlapped by at least one base pair. Each Polr2a binding site is then defined as the range of the confirmed peaks merged across tissues.

Expression analysis

To complement the Polr2a mapping strategy, we defined the set of genes with transcripts relatively enriched in each tissue. We identified these genes by analyzing the genome-wide expression profile of each tissue using the Affymetrix GeneChip Mouse Genome 430 2.0 array, which represents >39,000 mouse transcripts. Total RNA from each mouse tissue was extracted using Trizol reagent (Invitrogen, Carlsbad, CA) and further purified using the RNeasy Mini Kit (Qiagen, Valencia, CA) according to the manufacturer’s recommendations. The purified total RNA was submitted to the UCSD Cancer Center Microarray Resource for GeneChip RNA expression analysis using Mouse Genome 430 2.0 arrays. The resulting hybridization data were analyzed using Affymetrix GeneChip Operating Software (GCOS) v. 2.0 to determine the detection call as present (P), marginal (M), or absent (A) at significance level P < 0.05. We used annotation from the Affymetrix library file Mouse430_2.cdf to match probe sets to their corresponding Entrez Gene identifiers. Probe sets with the identifier extension “x_at” were removed from the analysis. A total of 20,827 Entrez genes were mapped to the remaining probe sets. We performed quantile normalization on the probe set signals across tissues using the R package affy (Bolstad et al. 2003; Gentleman et al. 2004). To assign a signal to a gene in each tissue, we selected the maximum normalized expression signal among all probe sets matched to that gene when a gene had multiple probe sets.
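The probe-set-to-gene step just described (drop “x_at” probe sets, then keep the maximum normalized signal per gene) can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout, and the probe-set and gene identifiers in the example are hypothetical, and the signals are assumed to be quantile-normalized already.

```python
def gene_signals(probe_signals, probe_to_gene):
    """Collapse probe-set signals to one value per gene for a single tissue.

    probe_signals: {probe_set_id: normalized signal}
    probe_to_gene: {probe_set_id: Entrez Gene ID}; unmapped probe sets are skipped.
    """
    best = {}
    for probe_id, signal in probe_signals.items():
        if probe_id.endswith("x_at"):     # removed from the analysis, as in the text
            continue
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue
        if gene not in best or signal > best[gene]:
            best[gene] = signal           # keep the maximum signal per gene
    return best

# Hypothetical identifiers, for illustration only.
signals = {"1000_at": 5.0, "1001_a_at": 7.5, "1002_x_at": 9.0, "1003_at": 2.0}
mapping = {"1000_at": 11, "1001_a_at": 11, "1002_x_at": 11, "1003_at": 22}
print(gene_signals(signals, mapping))     # {11: 7.5, 22: 2.0}
```

Taking the maximum favors the most responsive probe set for each gene; an average would instead dilute the signal when some probe sets target poorly expressed isoforms.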
Tissue-specific measures of entropy and categorical tissue-specificity based on expression were computed as previously described (Schug et al. 2005).

Promoter-focused ChIP-chip hit calling

H3K4me3 ChIP-chip for each tissue was performed using the array covering chr11:36,912,182–99,375,819. To circumvent issues in identifying sites of H3K4me3 enrichment, we developed a promoter-focused strategy. We took the set of known promoters surveyed (refGene, knownGene, and ensGene) and merged them into a non-redundant set of 1265 nonoverlapping promoter regions, each 1 kbp wide ([−500, +500] relative to the TSS). This set excludes bidirectional promoters to prevent potential mismatching of H3K4me3 and Polr2a enrichment at head-to-head promoters. Each array for a given tissue and marker (H3K4me3, Polr2a) combination was normalized using MA2C, a recently reported sequence/GC-based normalization method (Song et al. 2007). We reanalyzed the corresponding Polr2a ChIP-chip array data for each tissue to make the results directly comparable. For each experiment, the average ChIP-chip log ratio in the 1-kbp window spanned by each promoter was defined as its ChIP-chip enrichment index. For both H3K4me3 and Polr2a, the distribution of average ChIP-chip log ratios over all promoters is clearly bimodal in every tissue (a mixture of two Gaussian distributions). We used an expectation-maximization (EM) strategy to estimate the parameters of a mixture of two Gaussians (http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=8636). A score cutoff for promoter ChIP enrichment was determined for each factor and tissue combination from the estimated parameters of the null distribution centered near 0. This cutoff is defined as two standard deviations above the mean of that null distribution.
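The hit-calling rule above (fit two Gaussians by EM, then set the cutoff two standard deviations above the mean of the component centered near 0) can be sketched as follows. The study used a MATLAB implementation; this pure-Python version is a stand-in written for illustration, with hypothetical function names, a simple quartile-based initialization, a fixed number of iterations, and simulated enrichment indices.

```python
import math
import random

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_two_gaussians(xs, n_iter=100):
    """EM for a 1-D mixture of two Gaussians; returns [(weight, mean, sd), ...]."""
    xs = sorted(xs)
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_sd = max(math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / n), 1e-6)
    # Crude initialization: lower and upper quartiles as starting means.
    params = [(0.5, xs[n // 4], overall_sd), (0.5, xs[3 * n // 4], overall_sd)]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in xs:
            p0 = params[0][0] * normal_pdf(x, params[0][1], params[0][2])
            p1 = params[1][0] * normal_pdf(x, params[1][1], params[1][2])
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weight, mean, and sd of each component.
        new_params = []
        for r in (resp, [1 - v for v in resp]):
            total = max(sum(r), 1e-12)
            mu = sum(ri * x for ri, x in zip(r, xs)) / total
            var = sum(ri * (x - mu) ** 2 for ri, x in zip(r, xs)) / total
            new_params.append((total / n, mu, max(math.sqrt(var), 1e-6)))
        params = new_params
    return params

def enrichment_cutoff(xs, n_sd=2.0):
    """Cutoff = mean + n_sd * sd of the component centered nearest 0 (the null)."""
    _, mu, sd = min(fit_two_gaussians(xs), key=lambda p: abs(p[1]))
    return mu + n_sd * sd

# Simulated enrichment indices: 700 unenriched promoters near 0, 300 enriched near 2.
random.seed(0)
scores = ([random.gauss(0.0, 0.3) for _ in range(700)]
          + [random.gauss(2.0, 0.5) for _ in range(300)])
cutoff = enrichment_cutoff(scores)
enriched = [s for s in scores if s > cutoff]
```

Deriving the cutoff from the fitted null component, rather than from the whole distribution, keeps the threshold from drifting upward in tissues where many promoters are enriched.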
Motif analysis

Classification

We identified motifs for each set of tissue-specific gene promoters by examining the relative over-representation of known vertebrate transcription factor binding site (TFBS) matrices from TRANSFAC (Matys et al. 2006) and JASPAR (Sandelin et al. 2004) (673) in each set compared to two types of background sets: (1) a random set of mammalian promoters or (2) the relative complement of the set within the set of all tissue-specific gene promoters. The mES c2 set was excluded from the relative complement sets of tissue-specific promoters because its pattern of histone modification enrichment was not tissue-specific. A previously described enumerative strategy, DME, was also used to determine the highest-ranked de novo discriminative motifs of different widths (w = 6, 8, 10, 12, 14) in each tissue-specific set compared to each of the two types of background sets (Smith et al. 2005, 2006). For both known and de novo motifs, a motif’s ability to distinguish foreground sequences from background sequences is measured by the balanced misclassification error rate. This error rate is defined as:
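The balanced misclassification error rate is conventionally the average of the false-negative rate among foreground sequences and the false-positive rate among background sequences. A minimal sketch under that assumption follows; the function name and the boolean encoding of motif occurrences are illustrative rather than taken from the study.

```python
def balanced_error_rate(fg_hits, bg_hits):
    """Average of the two class-wise error rates.

    fg_hits / bg_hits: one boolean per sequence, True if the motif occurs in it.
    A foreground sequence without the motif is a false negative; a background
    sequence with the motif is a false positive.
    """
    false_negative_rate = sum(1 for hit in fg_hits if not hit) / len(fg_hits)
    false_positive_rate = sum(1 for hit in bg_hits if hit) / len(bg_hits)
    return 0.5 * (false_negative_rate + false_positive_rate)

# Motif found in 8 of 10 foreground and 10 of 100 background promoters.
foreground = [True] * 8 + [False] * 2
background = [True] * 10 + [False] * 90
error = balanced_error_rate(foreground, background)   # 0.5 * (0.2 + 0.1)
```

Averaging the two rates keeps a large background set from dominating the score, so a motif that simply never matches still scores 0.5 rather than appearing accurate.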
The significance of the balanced misclassification error rate for a motif (P value) is determined by estimating the expected distribution of the error rates for a given comparison.

Conservation

Given the set of known vertebrate TFBS matrices from TRANSFAC and JASPAR (678), the best occurrence of each motif was mapped in every orthologous pair of mouse and human promoters in each tissue-specific set using the utility storm from CREAD (http://rulai.cshl.edu/cread/index.shtml). Promoter occurrences for all motifs were filtered to those scoring above a functional depth threshold of 0.85:
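Functional depth is commonly understood as a motif score rescaled between the lowest and highest scores that a matrix can attain, so that 1.0 is the best possible match. The sketch below filters occurrences at 0.85 under that assumption; it does not reproduce storm itself, and the function names and the two-column scoring matrix are hypothetical.

```python
def score_site(matrix, site):
    """Sum the per-position scores of a site under a scoring matrix."""
    return sum(column[base] for column, base in zip(matrix, site))

def functional_depth(matrix, site):
    """Rescale a site's score to [0, 1] between the matrix's worst and best scores."""
    lowest = sum(min(column.values()) for column in matrix)
    highest = sum(max(column.values()) for column in matrix)
    return (score_site(matrix, site) - lowest) / (highest - lowest)

def filter_sites(matrix, sites, threshold=0.85):
    """Keep only occurrences scoring above the functional depth threshold."""
    return [site for site in sites if functional_depth(matrix, site) > threshold]

# Toy two-column scoring matrix (one dict of base scores per motif position).
toy_matrix = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 1, "G": 0, "T": -1},
]
kept = filter_sites(toy_matrix, ["AC", "AG", "TT"])   # depths 1.0, 0.8, 0.0
```

Rescaling makes one threshold comparable across matrices of different widths and information content, which raw scores are not.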
Please see Supplemental Methods for additional methods and detailed explanations. For the software used in the expectation-maximization (EM) strategy for estimating the parameters of a mixture of two Gaussians, see http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=8636. For the CREAD utility storm, see http://rulai.cshl.edu/cread/index.shtml.

Acknowledgments

We thank Dr. Keith Ching for bioinformatics advice and Grace Liu for style suggestions. This research was supported in part by a Ford Foundation Pre-Doctoral Fellowship (L.O.B.); Ludwig Institute for Cancer Research (B.R.); R33CA105829 (B.R.), R21CA116365-01 (R.D.G.), and HG001696 (M.Q.Z.) from NIH; and EIA-0324292 (M.Q.Z.) from NSF.

Footnotes

[Supplemental material is available online at www.genome.org. The sequence data from this study have been submitted to the Gene Expression Omnibus under accession no. GSE7688.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6654808

References