Copyright © 2006, Cold Spring Harbor Laboratory Press

Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression

1 McGill Centre for Bioinformatics, Montreal, Quebec, Canada, H3A 2B4; 2 Institut de Recherches Cliniques de Montréal, Montreal, Quebec, Canada H2W 1R7; 3 Molecular Oncology Group Department of Medicine, Oncology and Biochemistry, McGill University, Montreal, Quebec, Canada H3A 1A1; 4 McGill University and Genome Quebec Innovation Center, Montreal, Quebec, Canada H3A 1A4

5 Corresponding authors. E-mail blanchem/at/mcb.mcgill.ca; fax (514) 398-3387. E-mail francois.Robert/at/ircm.qc.ca; fax (514) 987-5743.

Received October 31, 2005; Accepted March 2, 2006.

Abstract

The identification of regulatory regions is one of the most important and challenging problems toward the functional annotation of the human genome. In higher eukaryotes, transcription-factor (TF) binding sites are often organized in clusters called cis-regulatory modules (CRM). While the prediction of individual TF-binding sites is a notoriously difficult problem, CRM prediction has proven to be somewhat more reliable. Starting from a set of predicted binding sites for more than 200 TF families documented in Transfac, we describe an algorithm relying on the principle that CRMs generally contain several phylogenetically conserved binding sites for a few different TFs. The method allows the prediction of more than 118,000 CRMs within the human genome. A subset of these is shown to be bound in vivo by TFs using ChIP-chip. Their analysis reveals, among other things, that CRM density varies widely across the genome, with CRM-rich regions often being located near genes encoding transcription factors involved in development. Predicted CRMs show a surprising enrichment near the 3′ end of genes and in regions far from genes.
We document the tendency for certain TFs to bind modules located in specific regions with respect to their target genes and identify TFs likely to be involved in tissue-specific regulation. The set of predicted CRMs, which is made available as a public database called PReMod (http://genomequebec.mcgill.ca/PReMod), will help analyze regulatory mechanisms in specific biological systems. The regulation of gene expression is at the core of many important biological processes such as cell growth, division, differentiation, and adaptation to the extracellular environment. Gene expression is regulated in large part at the transcription level, with transcription factors (TFs) binding their specific DNA regulatory elements and activating or repressing transcription. The identification and characterization of these DNA regulatory elements are among the most important and challenging tasks for molecular biologists in the post-genome era. TFs typically have an affinity for short, 5–15 bp, degenerate DNA sequences. Decades of work in many laboratories have led to the identification of consensus-binding motifs for hundreds of these TFs. These binding motifs are generally represented by position-weighted matrices (PWM). In principle, examination of the human genome with these PWM should allow for the identification of TF-binding sites (TFBSs), and hence, regulatory regions; but the size of the genome, combined with the fact that TF-binding motifs are short and degenerate, complicates this task enormously. Indeed, these motifs can be found everywhere in the genome and experiments have shown that only an extremely small proportion represent bona fide TFBSs. The binding of a TF is thus not simply a function of the theoretical affinity for a DNA site, but also of a number of other factors like the chromatin environment and the cooperation or competition with other DNA-binding proteins. 
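The difficulty described above can be made concrete with a minimal sketch of PWM scanning. The toy count matrix, the uniform background frequency, the pseudocount of 1.0, the threshold, and the function names are all assumptions chosen for illustration; none of them is taken from Transfac or from the method used in this study.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds scores (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, score) for every window whose summed score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical 4-bp motif built from 10 made-up sites (consensus TGAC).
TOY_COUNTS = [
    {"A": 0, "C": 1, "G": 0, "T": 9},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
]

toy_pwm = log_odds_pwm(TOY_COUNTS)
toy_hits = scan("AATGACGGTGACC", toy_pwm, threshold=4.0)
```

Even in this 13-base toy sequence the short motif matches twice, which hints at why a short, degenerate motif matches at a very large number of positions in a genome-sized sequence.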
In higher eukaryotes, TFs rarely operate by themselves, but rather bind to DNA in cooperation with other DNA-binding proteins. The DNA footprint of this set of factors is called a cis-regulatory module (CRM), which consists of a set of TFBSs located in a DNA region of up to a few hundred bases located in the vicinity of the gene being regulated. These modules have been the focus of much work recently (Davidson 2001), particularly in the context of the gene regulation during development (Howard and Davidson 2004), and are believed to be key features of most transcriptional regulatory processes in mammals. Several features of known CRMs can be used to recognize new modules as follows: (1) CRMs are generally composed of several binding sites for a few different TFs; (2) CRMs, and in particular the binding sites they contain, are generally more evolutionarily conserved than their flanking intergenic regions, and (3) genes regulated by a common set of TFs tend to be coexpressed. Different combinations of those characteristics have been used, often in conjunction with PWM information, to predict regulatory elements for specific TFs. However, very few existing methods are designed to be applied on a genome-wide scale without prior knowledge about sets of interacting TFs or sets of coregulated genes (the main exception being the regulatory potential analysis of Kolbe et al. [2004] and King et al. [2005]). To date, the general properties of human nonpromoter regulatory regions indeed remain largely unexplored. Here, we describe an algorithm that allows the identification of about 118,000 putative CRMs, based on predicted sites of 229 families of human TFs (represented by 481 PWMs). We refer to these regions as ‘‘predicted cis-regulatory modules’’ (pCRMs). Together with the regions predicted for regulatory potential by Kolbe et al. 
(2004), this constitutes the first genome-wide, nonpromoter-centric set of human cis-regulatory modules, although related studies have been reported for yeast (Segal et al. 2003) and for human promoters (Bajic et al. 2004; Segal and Sharan 2005; Robertson et al. 2006). More importantly, the analysis of our set of pCRMs yields a number of novel insights into the mechanisms of gene regulation. After experimental validation of some of our predictions using a combination of chromatin immunoprecipitation and DNA microarrays (ChIP-chip), we used these predictions to explore the regulatory potential of the human genome. We show that, despite the fact that our pCRMs undoubtedly contain a significant number of false positives, the whole-genome approach provides sufficient statistical power to formulate specific biological hypotheses. For example, (1) the CRM density is unexpectedly high downstream of the 3′ end of genes, hinting at a possible involvement in regulating antisense transcription; (2) the regions that are the densest in CRMs are associated with developmental TFs; (3) different TF families have binding sites that are enriched in different regions relative to their target genes; (4) certain TFs or combinations of TFs are associated with tissue-specific regulation. The Web-accessible database that accompanies this study will prove useful to experimental biologists interested in the regulation of specific genes, and will allow further bioinformatics and data-mining efforts.

Results and Discussion

Existing methods for cis-regulatory module prediction

The problem of computationally predicting cis-regulatory modules has been extensively studied in the last few years. Most predictive methods are based exclusively on sequence data (see below), but some attempt to take advantage of gene expression data (Segal et al. 2003; Ihmels et al. 2004; Kloster et al. 2005; Wang et al. 2005) or DNaseI hypersensitivity data (Noble et al. 2005).
Sequence-based algorithms have been developed along several lines. In the most studied case, the promoters of a set of (presumably) coregulated genes obtained from some prior experiments are analyzed to identify overrepresented motif combinations likely to be responsible for the genes’ coregulation (Wasserman and Fickett 1998; Krivan and Wasserman 2001; Aerts et al. 2003, 2004; Sharan et al. 2004; Thompson et al. 2004; Zhou and Wong 2004; Gupta and Liu 2005; Segal and Sharan 2005). Other approaches assume that the user provides a small set of transcription-factor PWMs that are expected to co-occur in modules, and identify genomic regions densely populated in putative sites for these TFs (Bailey and Noble 2003; Frith et al. 2003; Johansson et al. 2003; Sinha et al. 2003, 2004; Alkema et al. 2004). Neither of these two families of approaches is applicable in our setting, where we do not have sets of coregulated genes to train on, and where we have little prior knowledge about combinations of factors that are likely to co-occur to form modules. To our knowledge, the only computational approach that has been used for de novo, genome-wide prediction of regulatory regions is the method of regulatory potential estimation from Hardison’s group (Kolbe et al. 2004; King et al. 2005). This method is trained to recognize sequence features and interspecies conservation patterns that distinguish known regulatory regions from nonfunctional sequences. A comparison of the results obtained by this approach and ours is given below.

A new algorithm for prediction of cis-regulatory modules

We designed a computational method with the goal of (1) identifying the DNA regions within the human genome that are likely to be important for regulating gene expression and (2) predicting what TFs are likely to bind these regions.
Because our interest does not lie in any specific TF or specific system, but rather in having a global map of the regulatory elements of the entire genome, we exploited the fact that PWMs representing binding sites for a few hundred TFs have been described in databases such as Transfac (Matys et al. 2003) and JASPAR (Sandelin et al. 2004). Our algorithm takes advantage of the fact that regulatory regions often consist of clusters of binding sites for a few different TFs and that they are more conserved than their flanking intergenic DNA (Davidson 2001; Bulyk 2003; Levine and Tjian 2003). Our approach, based on the detection of statistically significant clusters of phylogenetically conserved TFBSs, shares some of the features of algorithms previously proposed by Sharan et al. (2004) and Aerts et al. (2004), but differs in that it allows the detection of modules without prior knowledge regarding which TFs are likely to be involved together in modules of interest. Our method also shares some similarities with the word-based approach of Philipakis et al. (2005), but uses a very different approach to module scoring. Our algorithm involves two steps (see Fig. 1).
Our algorithm was used to scan the regions of the human genome that were alignable to the mouse and rat genome using the MULTIZ program (Blanchette et al. 2004; these regions cover 34% of the human genome). This resulted in the identification of 118,402 predicted modules, covering 2.88% of the human genome. Taken as a whole, this set of pCRMs, although likely to contain a non-negligible fraction of false positives, reveals a number of properties of human gene regions. Although we considered putative modules of size up to 2000 bp, 58% of the pCRMs are less than 500 bp long, with an overall average length of 635 bp per CRM (see Supplemental Fig. S1A for a size histogram). This size distribution is quite close to that of the experimentally verified modules contained in the TRRD database (Kolchanov et al. 2002). However, we cannot exclude the possibility that some of the larger pCRMs are in fact made of more than one biological CRM. Modules have, on average, 3.1 tags (see Supplemental Fig. S1B), with shorter modules usually built from fewer tags than larger ones. While the total number of individual sites predicted in phase (1) of our algorithm varies significantly from one PWM to another (see Supplemental Table S1), our procedure for correcting for low-specificity matrices ensures that no PWM is chosen as a tag too frequently. Supplemental Table S2 shows that tags are not seriously biased toward particular matrices, a sign that our algorithm for tag selection is sufficiently robust to avoid PWMs with low specificity. The PWM chosen as a tag the most often (5401 times, of 118,402 modules) is that for E2F, while the median PWM is selected as a tag in 704 modules. The PWMs that are the most often chosen for tags fall under two categories. The first is that of general promoter-associated factors, like E2F, ZF5, and TBP, which are indeed expected to bind a large number of regulatory regions. 
The second set of common tags consists of homeobox TFs (e.g., NKX family, POU family, etc.).

In silico validation of predicted modules

We evaluated the biological relevance of the pCRMs by measuring the extent to which they overlap known regulatory elements such as those compiled in the TRRD (Kolchanov et al. 2002), Transfac (Matys et al. 2003), and GALA (Giardine et al. 2003) databases. We also measured the overlap between the pCRMs and other putative regulatory elements, such as “promoter” regions (defined as the 1-kb region upstream of the transcription start sites [TSS] of all known genes), CpG islands (based on the UCSC Genome Browser annotation [Karolchik et al. 2003]), and DNaseI hypersensitive sites (Dorschner et al. 2004; Sabo et al. 2004) from the Encode regions (Thomas et al. 2003). See Figure 2A.
By definition, the sensitivity of our method for detecting annotated regulatory regions increases with the number of modules that are predicted. This increase is very rapid for the first ~20,000 modules predicted, but the sensitivity for most indicators then increases more slowly. This observation is likely due to the fact that the modules that are the easiest to detect are those located in promoter regions. These also turn out to be the regions where most regulatory modules have been studied. However, the fact that our most reliable indicators of performance (TRRD modules, GALA modules, and, to a lesser extent, hypersensitive sites) continue to grow steadily after the first 20,000 pCRMs indicates that nonproximal modules can still be identified, and justifies considering a much larger set of modules.

Comparison to other genome-wide predictions

The ability of our algorithm to take advantage of interspecies TFBS conservation contributes in good part to the accuracy of the predictions. Indeed, the 34% of the human genome that lies within an alignment block with the mouse and rat genomes contains 90% of bases within Transfac sites, 67% of those within TRRD modules, and 87% of those within GALA regulatory regions. Nonetheless, the sensitivity obtained by our pCRMs on these indicators remains three to five times higher than what would be obtained if modules were randomly predicted within the alignment blocks. To measure more accurately the extent to which sequence conservation alone can be used to predict known regulatory modules, sensitivity curves were computed based on the noncoding interspecies conserved regions identified by the PhastCons program (Siepel et al. 2005) (see Supplemental Fig. S2). The sensitivity of pCRMs is consistently 30%–70% higher than that of PhastCons elements for 1-kb “promoter” regions and TRRD and GALA modules, while it is comparable for Transfac and DNaseI hypersensitive sites.
The advantage of pCRMs over PhastCons is most marked when only the highest-scoring half of each set of predictions is considered, in which case, the pCRMs sensitivity is at least twice that of PhastCons for all indicators.6 Overall, 41% of the bases within pCRMs lie within a PhastCons region (and 31% of PhastCons bases are within a pCRM), an 11-fold enrichment over what would be expected by chance. Kolbe et al. (2004) and King et al. (2005) have developed a method called “regulatory potential,” which has been applied to the complete human genome to yield a set of CRM predictions. The method is trained to identify sequence features and interspecies conservation patterns that allow one to distinguish between a set of known regulatory regions and a set of nonfunctional regions. The overlap between the regulatory regions predicted by King et al. and our pCRMs is very significant: when a score threshold is chosen that results in about the same number of predicted bases as we get in our pCRMs (2.88% of the genome), more than 25% of the bases in pCRMs are also in King’s regions (nine times more than would be expected by chance). The accuracy of the two sets of predictions was compared based on the set of known regulatory regions used above, and neither of the two methods appears significantly better than the other (see Supplemental Fig. S2), despite the fact that King’s method was trained on some of the specific regulatory regions used here for validation.

Experimental validation of predicted modules

In order to further validate our pCRMs, we took advantage of a technique called genome-wide location analysis (or ChIP-chip) (Ren et al. 2000; Iyer et al. 2001). This method allows for the large-scale identification of protein–DNA interactions as they occur in vivo.
Briefly, proteins are cross-linked to DNA by treating live cells with formaldehyde and specific protein–DNA complexes are enriched by immunoprecipitation of fragmented chromatin using antibodies directed against a protein of interest. After reversal of the cross-links, the enriched DNA fragments are identified by hybridization onto DNA microarrays. We selected modules predicted to be bound by the estrogen receptor (ER), the E2F transcription factor 4 (E2F4), the signal transducer and activator of transcription 3 (STAT3), and the hypoxia-inducible factor 1 (HIF1) to print a DNA microarray. The microarray contains 758, 1370, 860, and 1882 modules predicted to be bound by ER, E2F4, STAT3, and HIF1, respectively. In the current study, the microarray was then probed by ChIP-chip for ER and E2F4 (see Methods for experimental details). After statistical analysis and experimental validation of the data (see Methods and Supplemental Table S3), we have identified 55 and 433 modules bound by ER and E2F4, respectively (see Supplemental Tables S4 and S5, respectively, and Table S6 for full ChIP-chip results). Approximately 3% of the 758 ER-predicted pCRMs on the microarray actually proved to be bound by ER, while 17% of the 1370 E2F4-predicted pCRMs on the microarray were bound by E2F4. These numbers need to be considered as an underestimation of the actual specificity of the algorithm, since the protein–DNA interactions were tested in a single cell type, while TFs are known to regulate different sets of genes in different cell types, physiological conditions, and time in development (Zeitlinger et al. 2003; Hartman et al. 2005). For example, ER was tested in MCF-7, a breast cancer-derived cell line, due to its importance in breast cancer. ER, however, also plays important roles in many tissues such as ovaries, bone, brain, liver, and more. It is very likely that ER binds many pCRMs in some of these tissues, but not in MCF-7. 
In addition, the experiment was conducted under a single set of conditions (concentration of estradiol, time of treatment, etc.). For all of these reasons, it is difficult to determine the real accuracy of the algorithm. Because our microarray contains predicted modules for four different TFs, the data can be used to assess the specificity of our TFBS predictions, i.e., to evaluate whether our prediction of which TFs should bind to each module is accurate. Among the 55 modules bound by ER, 44% (24/55, whereas 8/55 would be expected by chance) had indeed been selected for their ER-binding sites, and among the 433 modules bound by E2F4, 54% (236/433, whereas 147/433 would be expected by chance) had been selected for that factor. In addition to false-positive ChIP-chip signals or the failure of the algorithm to detect some binding sites, it is likely that binding of TFs through alternative mechanisms such as protein–protein interactions contributes to this result. For example, ER has been shown to be recruited to DNA by interaction with AHR to repress AHR-dependent gene regulation in an ER-responsive element-independent manner (Beischlag and Perdew 2005). It is important to note that our algorithm can only predict the binding of TFs through direct DNA-binding interactions. It is likely that other TFs, in addition to those predicted here, may play roles in these modules. Of note, while 87% of the validated pCRMs for E2F4 were located in promoter regions, only 20% of those for ER were in these regions, confirming that our nonproximal pCRMs are also highly enriched for functional CRMs. Finally, Carroll et al. (2005) have used ChIP-chip on a tiling array to identify ER-binding sites on human chromosomes 21 and 22. Of the 57 regions they found to be bound by ER in MCF-7 cells, 14 overlap our predicted modules (five times more than expected by chance).
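The counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function and key names are hypothetical, and reading the "approximately 3%" and "17%" figures as 24/758 and 236/1370 is an assumption that happens to agree with the reported numbers, not something the text states explicitly.

```python
def summarize(bound, bound_and_predicted, expected_by_chance, printed):
    """Derive the ratios discussed in the text from the raw module counts."""
    return {
        # share of ChIP-bound modules that had been selected for this TF
        "fraction_of_bound_that_were_predicted": bound_and_predicted / bound,
        # observed count relative to the count expected by chance
        "fold_over_chance": bound_and_predicted / expected_by_chance,
        # share of the modules printed for this TF that were bound (assumed reading)
        "fraction_of_predicted_that_were_bound": bound_and_predicted / printed,
    }

er = summarize(bound=55, bound_and_predicted=24, expected_by_chance=8, printed=758)
e2f4 = summarize(bound=433, bound_and_predicted=236, expected_by_chance=147, printed=1370)

for name, stats in (("ER", er), ("E2F4", e2f4)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these counts, ER modules are recovered at about three times the chance level and E2F4 modules at about 1.6 times.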
Despite the fact that the goal of this study is not to discuss specific interactions, we would like to highlight an interesting result that came out of the ChIP-chip experiments. While it is well known that the expression of the progesterone receptor gene PGR is up-regulated in breast cancer cells in response to estradiol, the absence of consensus estrogen response elements (ERE) in the two promoters driving its expression led to the suggestion that ER binds via other TFBSs (Petz et al. 2004). However, our data show that ER binds pCRMs present both ~35 kb upstream of the TSS and ~5 kb downstream of the 3′ end. Functional characterization of these pCRMs may reveal important clues about the molecular mechanisms implicated in long-range regulation by ER and other nuclear receptors (Carroll et al. 2005; Laganière et al. 2005).

A global view of the gene regulatory landscape

Having validated our predictions, we went on to use them to study different global aspects of gene regulation. The genome-wide distribution of predicted modules is exemplified by Figure 3.
The genomic locations that are the densest in predicted modules (measured over 100-kb windows) are listed in Table 1. Most of these are located upstream, in the introns, or downstream of genes that are themselves TFs often involved in development. Among the 15 densest regions, we find parts of all four HOX clusters that operate differential genetic programs along the anterior–posterior axis of animal bodies (Alonso 2002), and regions near the EBF3, ZFHX1B, NR2F2, BCOR, MEIS2, and DLX5-6 genes, all of which are characterized TFs. The pCRMs in these regions have the unusual property of often being significantly conserved back to zebrafish and fugu, an indication that they may be part of the core regulatory mechanism of vertebrate development. There are 137 100-kb regions in which at least 20% of the bases are covered by CRMs, and these regions contain the TSSs of 115 genes with GO annotations (Harris et al. 2004). These genes are very strongly enriched for involvement in the regulation of transcription (79 genes, P-value 10^-89), morphogenesis (24 genes, P-value 10^-13), organogenesis (17 genes, P-value 3 × 10^-5), and neurogenesis (10 genes, P-value 4 × 10^-4), based on the Gostat program (Beissbarth and Speed 2004). We conjecture that genes involved in these processes often require very tight regulation, which in turn requires an elaborate set of regulatory modules. Notably, the presence in that group of ZBTB20, a poorly characterized gene encoding a predicted zinc finger TF, suggests the intriguing possibility that this TF may have a critical biological role, perhaps in regulating development.
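The density measure used above (fraction of each 100-kb window covered by predicted modules) can be sketched as follows. The half-open interval convention, the fixed non-overlapping windows, the merging of overlapping modules, and all names are assumptions for illustration; the text does not specify whether windows were fixed or sliding.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

def window_coverage(intervals, chrom_length, window=100_000):
    """Fraction of each fixed-size window covered by the given intervals."""
    n_windows = -(-chrom_length // window)  # ceiling division
    covered = [0] * n_windows
    for start, end in merge_intervals(intervals):
        pos = start
        while pos < end:
            idx = pos // window
            stop = min((idx + 1) * window, end)
            covered[idx] += stop - pos
            pos = stop
    fractions = []
    for idx, bases in enumerate(covered):
        size = min(window, chrom_length - idx * window)  # last window may be short
        fractions.append(bases / size)
    return fractions

# Made-up module coordinates on a hypothetical 250-kb sequence.
toy_modules = [(0, 30_000), (20_000, 50_000), (190_000, 210_000)]
toy_fractions = window_coverage(toy_modules, chrom_length=250_000)
dense_windows = [i for i, f in enumerate(toy_fractions) if f >= 0.20]
```

The final line applies the 20% coverage cutoff mentioned in the text to the toy data.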
There also exist regions that are very sparsely populated in predicted modules. One of the most striking examples is a 4-Mb region of chromosome 2 (chr2:123,000,001–127,000,000), of which <0.1% is covered by predicted modules. The region is somewhat of a gene desert, containing only one annotated gene, the large hypothetical gene CNTNAP5. Other gene deserts, in contrast, are quite rich in pCRMs. Many of those appear to be located in the vicinity of developmental TFs. For example, the homeobox gene MEIS1 is surrounded by a 1-Mb region that is devoid of any other TSS but contains >130 kb of pCRMs.

Regulatory modules are preferentially located in specific regions relative to genes

We studied the position of pCRMs with respect to their closest gene. The genome was divided into several types of noncoding regions, i.e., upstream of a gene, 5′ UTR, 1st intron, internal introns, last intron, 3′ UTR, and downstream region. Within each type of region, we computed the fraction of bases included in a pCRM as a function of the distance to a reference point for each type of region (e.g., for upstream regions and 5′ UTR, the reference point is the TSS; see legend of Figure 4).
Specific TFs target different regions relative to their target genes

As described above, our predictions, when taken altogether, are enriched in the 5′ and 3′ regions of known genes. When broken down into predictions for individual TFs, however, a great variability is observed. For example, our predictions of ER modules (i.e., modules predicted to contain at least one high-scoring ER-binding site) are enriched in regions located more than 10 kb upstream of known genes, while our predictions for E2F4 are enriched in the proximal 5′ region of known genes. This suggests that ER functions mainly through distal, enhancer-like elements, while E2F4 regulates gene transcription via promoter-proximal elements. Notably, evidence in the literature supports this hypothesis (see Blais and Dynlacht 2005; Carroll et al. 2005). Importantly, our ChIP-chip data also support this model. Indeed, despite the fact that pCRMs printed on the array were uniformly distributed with respect to genes, only 20% of the pCRMs bound by ER in our ChIP-chip experiments were within 1 kb on either side of the TSS, while the proportion is 87% for the pCRMs bound by E2F4. Based on this observation, we have computed the location preferences of each of the 229 TF families represented by the PWMs used in our predictions (see Figure 5).
A second set of TFs preferentially binds within 1 kb of the TSSs. This set is enriched for leucine zipper TFs and factors from the Ets family. Notably, most of these factors, contrary to what is observed for those binding distal sites, are involved in basic cellular functions. Among the best-known examples we found NF-Y, E2F, CREB, ATF, and others. Interestingly, and much to our surprise, most of these TFs show a clear preference for either the 1 kb upstream or the 1 kb downstream of the TSS, but not both. The most striking example is Nuclear Factor Y (NF-Y), which is highly enriched 1 kb upstream, but highly depleted 1 kb downstream of the TSS. This preference may reflect a mechanistic characteristic of these TFs. Finally, note that when we computed enrichment statistics based on all genome-wide predicted TFBSs instead of based only on those located in modules, far fewer TFs obtained significant enrichment in any given type of region, indicating that our pCRMs are effective at reducing the false-positive rate in TFBS predictions.

Long-range correlation of TFBS predictions

We observe that the closer together two modules are on the genome, the more likely they are to contain predicted binding sites for the same factors. Part of this is simply due to isochores, those broad variations of GC content along the genome (International Human Genome Sequencing Consortium 2001). However, even after correcting for this factor (see Methods), a number of TFs show significant long-range correlation between their predicted sites (Supplemental Fig. S3; Supplemental Table S8). This is likely to be due to the fact that if several regulatory modules regulate a gene, they are likely to be bound by a similar set of TFs. Not surprisingly, most of the TFs that exhibit long-range correlation are those that show preferences for binding sites located more than 10 kb upstream of the TSS.
The sets of nearby pCRMs that contain binding sites for similar TFs tend to be located in large intergenic or intronic regions and near genes encoding TFs.

Predicted TFBSs induce correlated tissue-specific gene expression

Comparison of TF-binding data with gene expression data in yeast showed that genes bound by a common set of TFs tend to be coregulated (Lee et al. 2002). Such a correlation is expected to occur in mammalian cells as well, but has never been thoroughly tested because of the lack of genome-wide data for TF binding. Our predicted module data allow us to investigate this question. For each TF family in our study, a set of putatively regulated genes was identified as those with at least one predicted high-scoring site in a pCRM located within 10 kb upstream of the TSS. We computed the average pairwise Pearson correlation coefficient between tissue-specific expression levels of the genes of the set using expression data from 79 human cell types or tissues from the GNF Atlas 2 (Su et al. 2004). A total of 27 of the 229 TF families are associated with a significant expression correlation (P-value < 0.01, false-discovery rate (FDR) = 8%; see Supplemental Table S9). We repeated our correlation analysis, this time measuring the expression correlation for genes sharing binding sites for pairs of TFs. Of the 26,106 pairs of TF families considered, 595 are associated with a significant expression correlation (P-value < 0.01, FDR = 43%) (see Supplemental Table S10 for a complete list). For example, most of the 20 genes that have a pCRM containing OCT-1 and BACH1-binding sites are highly expressed in various brain tissues, excluding the cerebellum and the olfactory bulb, and in the pituitary gland. While the role of OCT-1 in brain cells has already been characterized (Givens et al. 2004), its association with BACH1 has not been reported before.
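The coexpression statistic described above, the average pairwise Pearson correlation over a gene set, can be sketched as below. The gene names and expression vectors are made up, and the helper names are hypothetical; real profiles would have one value per tissue (79 in the GNF Atlas 2 data used here).

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all unordered pairs of genes."""
    pairs = list(combinations(sorted(profiles), 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

# Made-up expression profiles across four hypothetical tissues.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
toy_mean_r = mean_pairwise_correlation(toy_profiles)

# The number of TF-family pairs quoted in the text is the number of
# unordered pairs among 229 families.
n_family_pairs = math.comb(229, 2)
```

How significance was assigned to the observed average is not detailed in this section, so the sketch stops at the statistic itself.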
Since most TFs are only expressed in a subset of the 79 cell types considered, they are unlikely to induce significant coexpression when measured over all 79 cell types. In order to identify transcription factors regulating expression in specific cell types, we analyzed each pair of TF and cell type. For each pair, the average expression level of the genes associated with predicted binding sites for the TF was computed and its significance assessed by a permutation test. Of the 229 × 79 = 18,091 possible (TF-cell type) pairs, we found 119 where genes are overexpressed (P-value < 0.001, FDR = 15%), and 78 where genes are underexpressed (P-value < 0.001, FDR = 23%). Table 2 lists the pairs with the most significant associations (see Supplemental Table S11 for the complete list). For example, the genes associated with pCRMs for MyoD tend to be highly expressed in skeletal muscle cells, while those associated to Ets are highly expressed in white blood cells. Both the role of MyoD in skeletal muscles and that of Ets in blood cells are very well characterized, thereby validating the approach.
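A permutation test of the kind mentioned above can be sketched as follows. The text does not give the details of the test that was actually used, so the null model here (random gene sets of the same size drawn from all genes), the one-sided direction, the add-one correction, and every name are assumptions for illustration.

```python
import random

def permutation_pvalue(expression, gene_set, n_permutations=2000, seed=0):
    """One-sided p-value for the mean expression of gene_set being unusually high.

    expression: dict mapping gene name to expression level in one cell type.
    gene_set:   genes associated with predicted sites for one TF.
    """
    rng = random.Random(seed)
    genes = list(expression)
    k = len(gene_set)
    observed = sum(expression[g] for g in gene_set) / k
    at_least_as_high = 0
    for _ in range(n_permutations):
        sample = rng.sample(genes, k)  # random gene set of the same size
        if sum(expression[g] for g in sample) / k >= observed:
            at_least_as_high += 1
    # add-one correction so the estimate is never exactly zero
    return (at_least_as_high + 1) / (n_permutations + 1)

# Made-up data: 100 hypothetical genes with expression levels 0..99.
toy_expression = {f"gene{i}": float(i) for i in range(100)}
high_set = [f"gene{i}" for i in range(95, 100)]
low_set = [f"gene{i}" for i in range(5)]

p_high = permutation_pvalue(toy_expression, high_set)
p_low = permutation_pvalue(toy_expression, low_set)

# Number of (TF family, cell type) pairs tested, as stated in the text.
n_tests = 229 * 79
```

With 18,091 such tests, some correction for multiple testing is needed, which is why the text reports a false-discovery rate alongside each P-value threshold.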
We also discovered associations that are not well characterized. For instance, we found that genes around pCRMs for NF-Y tend to have low expression in the ciliary and superior cervical ganglia and high expression in thymus and lymphoblasts. NF-Y binds an element called the CCAAT box, which has been reported to be present within promoters of genes activated during peptide presentation in antigen presenting cells (APC) (Mach et al. 1996) and within the promoters of housekeeping genes such as those regulated during the cell cycle (Mantovani 1999). From this literature, one would not have predicted a role for NF-Y in the brain and the thymus, but the fact that ciliary and ganglia cells are not (or only slowly) dividing and that some APC originate from thymus (Choi et al. 2005) is nevertheless consistent with our findings. The average expression levels were also computed for the set of genes associated with each pair of TFs. Of the roughly 2 million triplets (TF1, TF2, cell-type) tested, 5242 triplets show significant overexpression (P-value < 0.001, FDR < 39%), while 6407 triplets show significant underexpression (P-value < 0.001, FDR < 31%; see Supplemental Table S12).

A searchable public database of predicted regulatory modules

The modules predicted by the algorithm were stored in a database with a Web-based interface (http://genomequebec.mcgill.ca/PReMod). The database supports a variety of queries and contains hyperlinks pointing to the NCBI Entrez of the closest gene. The module information includes its genomic position as well as its TFBS content. A graphical view of the TFBS distribution of the highest scoring matrices is also provided (see, for example, Fig. 3C).

Conclusions

Using the literature as a guideline, we have identified a set of rules describing the architecture of DNA regulatory elements and used them to build an algorithm allowing us to explore the regulatory potential of the human genome.
Although the error rate in CRM predictions is likely to be relatively high, the statistical power obtained through a large-scale, genome-wide approach revealed new insights into the biology of transcriptional regulation. Among other things, we observe a strong enrichment for pCRMs in regions at the 3′ end of genes. By concentrating on predicted TF-binding sites within pCRMs, we are able to improve the specificity of individual TFBS predictions, which allows the detection of signals that could not be seen otherwise. For example, we noted that a significant number of TFs have a strong bias for regulating genes either from a great distance or from promoter-proximal binding sites. Notably, most TFs that preferentially work from a large distance are involved in development, while those predicted to work from promoter-proximal sites tend to regulate genes involved in basic cellular processes. We have identified a set of TFs that are predicted to play important roles in specific tissues, including cells and tissues derived from tumors and metastases. Finally, our data provide a starting point for the elaboration of human gene networks.

In a bootstrap-like fashion, several of the features derived from our pCRMs could be used to design improved CRM prediction algorithms. For example, the fact that specific TFs prefer binding at specific locations with respect to genes and that CRMs tend to organize in larger and looser clusters often containing binding sites for similar sets of factors could allow improved predictions. We expect that the database containing the modules predicted in this study may speed up the discovery and experimental validation of CRMs. Finally, deeper data-mining approaches are likely to yield a plethora of specific testable biological hypotheses.

Methods

Transfac position weight matrices

A set of 481 vertebrate PWMs from Transfac 7.2 (Matys et al. 2003) was used for the analysis.
Pseudocounts were introduced to regularize matrices based on few known sites (Durbin et al. 1998). Many PWMs represent the same or very similar factors. This poses no problem for our CRM prediction algorithm (since it excludes overlapping sites), but it is undesirable for downstream analyses of individual TF properties, e.g., localization with respect to the genes and tissue-specific expression. For these sections of the study, PWMs were grouped into 229 families based on the following rule: If many related TFs had individual PWMs, but Transfac also contained a generic PWM for the family, then only that generic matrix was used.

Module prediction algorithm

The outline of our module prediction algorithm is provided in Figure 1.

Species-specific scores are then mapped onto the alignment and for each alignment column p and PWM m, we compute:

$$\text{hitScore}_{\text{aln}}(m,p) = \text{hitScore}_{\text{Hum}}(m,p) + \tfrac{1}{2}\max\bigl(0,\ \text{hitScore}_{\text{Mou}}(m,p) + \text{hitScore}_{\text{Rat}}(m,p)\bigr).$$

Thus, $\text{hitScore}_{\text{aln}}(m,p)$ will be high if all three species have a high-scoring site at position p. Notice that if the hit score of human is very high, the resulting $\text{hitScore}_{\text{aln}}$ may be relatively good even if mouse and/or rat do not have high-scoring hits at that position. This allows us to predict human-specific binding sites, provided that they are very good matches to the PWM considered. Once the alignment scan is completed, only positions with $\text{hitScore}_{\text{aln}}(m,p) > 10$ are retained to construct modules. This results in a total number of predicted sites that varies from 1.5 million for E2F (M00103) to about 8000 for Hogness (M00316), many of which are expected to be false positives (see Supplemental Table S1).

We now discuss how to compute moduleScore(p1..p2) for the alignment region going from position p1 to p2 of human. We first define TotalScore(m, p1..p2) to be the sum of the $\text{hitScore}_{\text{aln}}$ values of all nonoverlapping hits for m in the region p1..p2.
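As a small illustration of the alignment hit score defined above, the following sketch combines the three species-specific scores and applies the retention threshold of 10 stated in the text. The function names and the example score values are hypothetical; in particular, the assumption that species-specific scores can be negative (as log-odds scores typically can) is ours and is not stated in the source.

```python
HIT_THRESHOLD = 10.0  # retention threshold on the alignment hit score (from the text)


def hit_score_aln(human: float, mouse: float, rat: float) -> float:
    """Alignment hit score for one PWM at one alignment column.

    The mouse + rat contribution is floored at zero, so poor rodent scores
    never lower the human score.
    """
    return human + 0.5 * max(0.0, mouse + rat)


def is_retained(human: float, mouse: float, rat: float) -> bool:
    """True if the position would be kept for module construction."""
    return hit_score_aln(human, mouse, rat) > HIT_THRESHOLD


if __name__ == "__main__":
    # Conserved site: a moderate human score is lifted above the threshold.
    print(hit_score_aln(8.0, 3.0, 3.0), is_retained(8.0, 3.0, 3.0))
    # Human-specific site: a very strong human score passes on its own.
    print(hit_score_aln(12.0, -5.0, -5.0), is_retained(12.0, -5.0, -5.0))
    # Moderate human score without rodent support is dropped.
    print(hit_score_aln(8.0, -5.0, -5.0), is_retained(8.0, -5.0, -5.0))
```

The floor at zero is what makes the score asymmetric: conservation can only add evidence, while its absence leaves the human score unchanged.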
Formally, letting $H_m$ be the set of all hits for matrix m in region p1..p2, we have

$$\text{TotalScore}(m, p_1..p_2) = \max_{\substack{H \subseteq H_m \\ \text{hits in } H \text{ do not overlap}}} \ \sum_{h \in H} \text{hitScore}_{\text{aln}}(m, p_h),$$

where $p_h$ denotes the position of hit h. The optimization problem of choosing the best set of nonoverlapping hits is solved heuristically, using a greedy algorithm that iteratively selects the hit with the maximal score that does not overlap with the other hits previously chosen.

For each matrix and each region, a P-value is assigned to the TotalScore observed, measuring the probability that a random region of the human–mouse–rat alignment would have a total score exceeding the observed one. This P-value takes into consideration the length and GC-content of the region considered, as well as the overall frequency and score distribution of hits predicted for that matrix in the genome. This allows for a region dense in hits for a rare matrix (i.e., one with few hits in the genome) to obtain a higher score than a region equally dense in hits for a more common matrix. Matrices that tend to have a large number of hits throughout the genome are thus penalized. More precisely, for each matrix m, GC-content g, and window length l, the distribution of TotalScore is estimated empirically through simulation, repeating 10 million times the following procedure: (1) choose l random positions from alignment regions with GC-content g and (2) compute the TotalScore of the set of positions selected, assuming that the l positions chosen form a contiguous region.

The score of a candidate module is computed based on one to five PWMs called tags. The first tag for region p1..p2 is the matrix with the most significant TotalScore, i.e., $\text{tag}_1 = \arg\min_{m \in \text{PWMs}} \text{pValue}(\text{TotalScore}(m, p_1..p_2))$. The regions belonging to the hits selected for tag1 are then masked out and the TotalScores for each matrix are recomputed, excluding hits overlapping those of tag1.
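The greedy heuristic used for TotalScore can be sketched as below. This is a minimal sketch under stated assumptions: hits are represented as hypothetical `(start, end, score)` tuples over half-open intervals, and ties are broken by input order; neither detail comes from the source.

```python
from typing import List, Tuple

Hit = Tuple[int, int, float]  # (start, end, score); half-open interval [start, end)


def greedy_total_score(hits: List[Hit]) -> Tuple[float, List[Hit]]:
    """Greedy approximation of TotalScore for one matrix in one region.

    Repeatedly keeps the highest-scoring hit that does not overlap any hit
    already kept, then returns the summed score and the kept hits.
    """
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        start, end, _ = hit
        # Two half-open intervals are disjoint if one ends before the other starts.
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append(hit)
    return sum(h[2] for h in chosen), chosen


if __name__ == "__main__":
    hits = [(0, 10, 12.0), (5, 15, 15.0), (20, 30, 11.0)]
    total, kept = greedy_total_score(hits)
    print(total, kept)

    # The heuristic is not always optimal: the middle hit blocks two others
    # whose combined score would have been higher.
    print(greedy_total_score([(0, 10, 12.0), (8, 18, 13.0), (16, 26, 12.0)])[0])
```

The second call shows the trade-off of the heuristic: it returns 13 where the optimal nonoverlapping set scores 24. The same selection routine could be reused after masking the hits of a chosen tag, by filtering out any hit that overlaps the masked intervals before calling it.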
The second tag is then the matrix that achieves the most significant TotalScore, and its occurrences are masked out. The process is repeated until five tags are selected, if possible. Finally, we define

$$\text{moduleScore}(p_1..p_2) = \max_{k = 1..5} \ -\log\Bigl(\text{pValueMaxUnif}\Bigl(k,\ 481,\ \prod_{i=1}^{k} \text{pValue}\bigl(\text{TotalScore}(\text{tag}_i, p_1..p_2)\bigr)\Bigr)\Bigr),$$

where pValueMaxUnif(k, 481, a) is the probability that the product of k random variables, each defined as the maximum of 481 uniform(0,1) random variables, is smaller than a (see footnote 7). A module can thus consist of one to five tags, depending on which number of tags yields the highest statistical significance.

The above procedure was used to search for modules of maximal length 100, 200, 500, 1000, and 2000 bp (see footnote 8). For each window size, regions with moduleScore > 10 (i.e., P-value < $e^{-10}$) were identified. This choice of threshold is somewhat arbitrary, but results in a total number of bases predicted in pCRMs of ~2.88% of the genome, a reasonable upper bound for the fraction of bases in regulatory regions. To address the fact that many of these modules overlap each other, a greedy algorithm was used to repeatedly select the highest-scoring module not overlapping any of the previously selected higher-scoring modules. This resulted in the set of 118,402 nonoverlapping modules studied in this work. Predictions were then mapped onto the latest human assembly (hg17) using the liftOver program (Karolchik et al. 2003); <0.1% of modules could not be mapped onto the new assembly and were discarded.

Microarray design and production

A subset of the pCRMs was selected to build a microarray to be used for ChIP-chip validation experiments. For each TF among ER, HIF1, STAT3, and E2F4, at most 50 pCRMs were randomly selected for each combination of the following categories: (1) module score: High vs. non-high; (2) TotalScore for the given TF: High vs.
non-high; (3) genomic location with respect to closest TSS: 10–100 kb upstream, 800 bp–10 kb upstream, -800 to +200 bp, +200 bp to +1000 bp, +1 kb to +10 kb, 0–10 kb downstream of 3′ UTR, or other. Most combinations could not be filled up to their quota. Each pCRM selected was extended symmetrically to a size of 1 kb, excluding repetitive regions. Primer pairs were designed for each region, using the Primer3 algorithm (Rozen and Skaletsky 2000), and the specificity was tested in silico by using a virtual PCR algorithm (Lexa et al. 2001). When the primer pair gave no satisfactory virtual PCR results, a new primer pair was designed by using Primer3 and tested again. The process was iterated three times to generate primer pairs predicted to efficiently amplify regions from human genomic DNA for almost all of our selected pCRMs. This primer design pipeline allowed us to design primer pairs to amplify pCRMs from human genomic DNA with a success rate of ≈85%.

ChIP-chip assay and data analysis

ER ChIP-chip experiments were performed as described previously (Laganière et al. 2005). E2F4 ChIP-chip experiments were performed as follows: T98G cells (ATCC) were grown in DMEM containing 10% FBS and arrested through contact inhibition by allowing cells to reach confluence. Medium was changed after the second day of confluence and cells were harvested on the third day. Confluent T98G cells were fixed with 1% formaldehyde, rinsed twice with PBS, and harvested. The cell pellet was lysed and sonicated to obtain DNA fragments of 600 bp on average. ChIP was performed using anti-E2F4 antibody (sc-1082, Santa-Cruz) and Dynabeads (Dyna). ChIP samples and nonimmunoprecipitated fragments were blunted with T4 DNA polymerase and ligated to unidirectional linkers. The DNA was then amplified by LM-PCR, and labeling was carried out post-PCR by incorporation of Cy5 or Cy3-dUTP using a Klenow polymerase reaction. A detailed protocol can be found at http://www.ircm.qc.ca/microsites/francoisrobert/en.
Data were normalized and triplicates were combined using a weighted average method as described previously (Ren et al. 2000). The P-value threshold used for the analysis was established by testing the enrichment of 10 targets for each of the following P-value intervals for both ER and E2F4 ChIPs using quantitative PCR with SYBR Green: <0.001, 0.001–0.005, 0.005–0.01, 0.01–0.05, 0.05–0.1, 0.1–0.5, 0.5–1. The results of this validation process are shown in Supplemental Table S1. Using P < 0.01 (ER) and P < 0.1 (E2F4), virtually all targets are bona fide binding sites (see Supplemental Tables S2 and S3). All microarray data will be deposited to ArrayExpress.

Statistical significance of TF location preferences and spatial correlation

We used a permutation test to estimate the statistical significance of the observed number of binding sites predicted in each type of region of the genome. Given the set of all predicted sites for all TFs, we first removed from consideration all but one of the hits of a TF within a given module. Each module thus contains at most one binding site for a given TF. To perform our permutation test, we repeatedly chose two sites for two different factors at random and exchanged their labels (but kept the original positions), provided they both lie in regions of the same GC-content (within 1% difference, measured over 1 kb). The scrambling procedure was repeated sufficiently often to reach a random distribution, at which point the number of sites in each region was counted. The experiment was repeated 100 times, from which the expectation and variance of the count of each TF in each region were estimated and the Z-score calculated. Notice that this procedure preserves the varying density of binding sites across the genome (since only labels, but not positions, are modified), as well as the local GC-content preferences of each TF.
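The label-swapping permutation test above can be sketched as follows. This is an illustrative sketch only: the `(tf, region_type, gc_fraction)` site representation, the function names, the default of 20 swap attempts per site, and the use of the population standard deviation are assumptions made here, not details from the source. The 1% GC tolerance and the 100 repetitions follow the text.

```python
import random
from statistics import mean, pstdev
from typing import List, Optional, Tuple

Site = Tuple[str, str, float]  # (TF label, region type, local GC fraction)


def shuffle_labels(sites: List[Site], n_swaps: int, rng: random.Random,
                   gc_tol: float = 0.01) -> List[str]:
    """Swap TF labels between random site pairs with matching GC-content.

    Positions (and hence region types) stay fixed; only labels move.
    """
    labels = [s[0] for s in sites]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(sites)), rng.randrange(len(sites))
        same_gc = abs(sites[i][2] - sites[j][2]) <= gc_tol
        if labels[i] != labels[j] and same_gc:
            labels[i], labels[j] = labels[j], labels[i]
    return labels


def z_score(sites: List[Site], tf: str, region: str, n_rounds: int = 100,
            n_swaps: Optional[int] = None, seed: int = 0) -> float:
    """Z-score of the observed count of `tf` sites in `region`."""
    rng = random.Random(seed)
    swaps = n_swaps if n_swaps is not None else 20 * len(sites)  # assumed default
    observed = sum(1 for s in sites if s[0] == tf and s[1] == region)
    counts = []
    for _ in range(n_rounds):
        labels = shuffle_labels(sites, swaps, rng)
        counts.append(sum(1 for lab, s in zip(labels, sites)
                          if lab == tf and s[1] == region))
    sd = pstdev(counts)
    return (observed - mean(counts)) / sd if sd > 0 else float("nan")


if __name__ == "__main__":
    # Toy data: TF "A" sits only in promoters, TF "B" only in distal regions.
    toy = [("A", "promoter", 0.5)] * 20 + [("B", "distal", 0.5)] * 20
    print(z_score(toy, "A", "promoter"))
```

Because labels are only exchanged between sites of similar GC-content, a TF that favors GC-rich sequence keeps that preference in the permuted data, so a high Z-score reflects positional bias rather than base composition.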
To estimate the significance of the long-range spatial correlations observed between sites of a given TF, a similar permutation test was applied and the observed number of co-occurrences within a given distance was compared with those obtained in the permuted data sets, allowing us to compute a Z-score for each TF and distance interval.

Correlation between predicted TFBS and tissue-specific gene expression

For each TF, a set of putative target genes was defined as the genes with at least one high-scoring predicted site for that TF within a pCRM and within 10 kb of the TSS. The average expression level of these genes in each of 79 tissues (GNF Atlas II) was calculated and its significance was estimated using a permutation test. Tissues showing overexpression or underexpression with Z-score > 5 are reported in Table 2.

Acknowledgments

This work was funded by grants from Génome Québec and Génome Canada (M.B., V.G., B.C., and F.R.) and by the Canadian Institutes for Health Research (V.G.). A.R.B. is a recipient of a doctoral fellowship from the IRCM/CIHR Cancer Research Program. X.C. is a recipient of a Génome Québec Comparative and Integrative Genomics Program. J.L. is a recipient of a U.S. Department of Defense Breast Cancer Research Program Predoctoral Traineeship Award (#W8IWXH-04-1-0399). F.R. holds a new investigator award from the CIHR. We thank Adam Siepel for his PhastCons data, the UCSC Genome Browser group for their support, and John Stamatoyannopoulos for the DNaseI hypersensitive regions data.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4866006

6. Since PhastCons was designed to detect any type of region under selective pressure, many of its noncoding predictions are likely to have other nonregulatory functions.
7. Note that the formula for moduleScore is actually an approximation of the true P-value, for the following reasons: (1) Since competition for space between different tags is not modeled, the computed P-values of the total scores of the second, third, fourth, and fifth tags are slightly conservative; (2) since the TotalScores are discrete variables (but with a very large number of possible values), the approximation with a continuous uniform distribution introduces a small error; (3) since the moduleScore is obtained by selecting the best of five P-values, a multiple hypothesis testing correction should be applied. However, since we are mostly interested in the ranking of modules, this correction would make no difference.

8. Only a small number of maximal lengths could be tried, as the calculation of the TotalScore P-values is computationally expensive and depends on that length.

References