pmc logo image
Logo of pnasPNAS Home page.Reference to the article.PNAS Info for AuthorsPNAS SubscriptionsPNAS About

Formats:

Proc Natl Acad Sci U S A. 2007 June 12; 104(24): 10116–10121.
Published online 2007 May 29. doi: 10.1073/pnas.0704028104.
PMCID: PMC1891267
Genetics
Identifying pattern-defined regulatory islands in mammalian genomes
Tom H. Cheung,* Kristen K. B. Barthel, Yin Lam Kwan, and Xuedong Liu
Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO 80309
To whom correspondence should be addressed. E-mail: xuedong.liu/at/colorado.edu
Communicated by Marvin H. Caruthers, University of Colorado, Boulder, CO, May 1, 2007.
Author contributions: T.H.C. and K.K.B.B. contributed equally to this work; T.H.C., K.K.B.B., and X.L. designed research; K.K.B.B. performed research; T.H.C. and Y.L.K. contributed new reagents/analytic tools; T.H.C., K.K.B.B., and X.L. analyzed data; and K.K.B.B. and X.L. wrote the paper.
*Present address: Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA 94305.
Present address: Dharmacon Research, Lafayette, CO 80026.
Received March 15, 2007.
Identifying cis-regulatory regions in mammalian genomes is a key challenge toward understanding transcriptional regulation. However, identification and functional characterization of those regulatory elements governing differential gene expression has been hampered by the limited understanding of their organization and locations in genomes. We hypothesized that genes that are conserved across species will also display conservation at the level of their transcriptional regulation and that this will be reflected in the organization of cis-elements mediating this regulation. Using a computational approach, clusters of transcription factor binding sites that are absolutely conserved in order and in spacing across human, rat, and mouse genomes were identified. We term these regions pattern-defined regulatory islands (PRIs). We discovered that these sequences are frequently active sites of transcriptional regulation. These PRIs occur in ≈1.1% of the half-billion base pairs covered in the search and are located mainly in noncoding regions of the genome. We show that the premise of PRIs can be used to identify previously known and novel cis-regulatory regions controlling genes regulated by myogenic differentiation. Thus, PRIs may represent a fundamental property of the architecture of cis-regulatory elements in mammalian genomes, and this feature can be exploited to pinpoint critical transcriptional regulatory elements governing cell type-specific gene expression.
Keywords: cis-regulatory elements, combinatorial regulation, computational prediction, myogenesis, transcription
Analysis of the human genome unexpectedly revealed that close to 99% of the DNA base pairs do not code for proteins (1). Although the functions of this noncoding sequence remain obscure, an important function associated with noncoding regions is cis-acting regulation. Cis-regulatory sequences are sequences of DNA to which regulatory molecules, such as transcription factors and microRNAs, can bind to modify gene expression. There is speculation that one-third of the human genome may represent cis-regulatory sequences, and it is hypothesized that the degree of complexity of cis-regulatory sequences partially accounts for the complexity of the organism (2). Therefore, identifying and characterizing functional cis-regulatory elements is critical to understanding regulation of gene expression.
The sequencing of many genomes has enabled comparative genomics approaches to annotate functions of genomic sequences. Indeed, exploiting phylogenetic conservation has proved to be a productive approach for finding cis-regulatory elements in less organismically complex species where the regulatory regions tend to be short and compact (310). Unfortunately, identifying cis-regulatory elements in vertebrate genomes is difficult because of the increased size and complexity of their genomes. For example, functional cis-regulatory elements typically consist of short stretches of DNA sequences (<500 bases) and are often scattered in the noncoding regions hundreds or thousands of base pairs away from the transcriptional start site (11). In addition, the increased size of the genome implies that an increased number of known binding-site sequence motifs will occur by random chance, even though only a fraction of them are likely to figure prominently in transcription regulation. Finally, although genomewide location analysis using ChIP-on-chip may ultimately elucidate many transcription factor binding sites in mammalian genomes, identification of cis-elements by this approach is impeded by the daunting size of the genomes and the one-by-one nature of probing transcription factor binding. Therefore, unraveling functional cis-regulatory elements buried in the genome remains a challenge because of the insufficient knowledge of the organization and architecture of regulatory DNA.
A defining feature for mammalian transcriptional regulation is combinatorial control, in which multiple transcription factors are required simultaneously to regulate gene expression (2, 12, 13). However, it is unclear what characteristics of functional cis-regulatory regions will facilitate their identification in the context of combinatorial control. A number of studies suggest that combinatorial regulation often relies on direct interactions between participating transcription factors and that these interactions depend on proper spatial orientation of transcription factors with respect to each other (2, 12, 13). In addition, there is significant conservation at the level of transcriptional regulation among human, mouse, and rat species and functional cis-regulatory elements frequently cluster in small regions of the genome (2, 12, 13). An offshoot of this observation is that the architecture of cis-regulatory sequence motifs will likewise be conserved. These fundamental features prompted us to develop a motif-centric computational algorithm to search for regions in mammalian genomes that harbor clusters of multiple transcription factor binding sites that are absolutely conserved with respect to order and spacing across three mammalian genomes (human, mouse, and rat).
Development of a Method to Identify Conserved Cis-Regulatory Regions.
We expect to find clusters of conserved binding sites to permit short-range interactions between transcription factors cooperating to regulate transcription. Moreover, we assume that there is selective pressure on the orientation of transcription factors with respect to each other. Consequently, we expect that binding-site sequences and the spacing between them are under purifying selection but that the identity of the intervening sequence is not. By similar reasoning, we also expect that the order in which the binding sites occur within a cluster is conserved. We term the genomic regions that satisfy these criteria pattern-defined regulatory islands (PRIs).
The algorithm for finding PRIs includes the following steps [supporting information (SI) Fig. 5]. First, we developed a custom curated version of the transcription factor database (TFD) (14, 15). The TFD contains experimentally characterized transcription factor binding sites. Our custom curation minimized redundancy and eliminated many of the poorly defined binding sites that could potentially affect the robustness of the algorithm. Second, we located all putative transcription factor binding sites from the TFD in the same orientation as the gene feature within 20 kb on either side of the start codon of all annotated gene orthologs in all three species [determined by the participation of a given gene in a homologene group as defined by the National Center for Biotechnology Information (16)]. This amounted to 13,520 sets of sequences. Third, we identified PRIs for each gene with a pattern-matching approach that locates clusters of binding sites whose order and spacing are conserved across the three genomes. Spacing conservation was determined by calculating the difference in genomic position of a given binding site in the human genome with respect to its position in the mouse and rat genomes. We called this set of numbers the genomic position offset (GPO). Binding sites with the same GPO in one 40-kb region were grouped into a PRI.
To define a minimum number of binding sites within a PRI in an effort to ensure that the PRIs found are statistically significant, we evaluated the likelihood of different numbers of binding sites occurring in concert in random sequences. Specifically, we selected the 1,000 sequence sets in our database with PRIs containing the greatest number of binding sites and extracted the 40-kb region centered about the start codon of each of these genes in each genome. Each of these sequences was then randomized while maintaining GC distribution and subjected to the PRI search algorithm. This procedure was repeated 100 times per gene. In total, we obtained 100,000 sets of randomized sequences that were searched for PRIs (see SI Text for further details).
A histogram depicting the distributions of PRIs in the scrambled and natural sequences is shown in Fig. 1Fig. 1.. We found that PRIs from the natural sequences generally had more binding sites than those from the scrambled sequences. Only 1% of PRIs from scrambled sequences contained seven or more binding sites, implying that we could be 99% confident that PRIs with seven or more binding sites from natural sequences are unlikely to occur by chance and likely represent a cis-regulatory region. Although it is quite likely that functional PRIs exist with fewer binding sites, we wanted to set a conservative threshold to account for any residual redundancy in the TFD. Therefore, the number of binding sites should be thought of as a computational index and one that could change as the customized TFD expands. With the current database, we computationally define a PRI as a region in the genome in which at least seven distinct binding sites are conserved in order and spacing across the three mammalian genomes.
Fig. 1.
Fig. 1.
Fig. 1.
In silico analysis to determine the minimum number of transcription factor binding sites required to define a PRI with 99% confidence. The results are displayed in a histogram of the number of PRIs versus the number of binding sites per PRI.
We then preprocessed 13,520 orthologous genes and found 7,453 genes with at least one PRI. The search corresponded to a coverage of ≈540,800,000 bases, or about one-sixth, of the human genome. Of this, we estimate that ≈6,154,852 bases, equal to 1.1% of the searched regions, are covered by PRIs. The mean PRI size is 335 bp, even though we placed no constraints on PRI size but did constrain number of binding sites. It is important to note that any apparent clustering of binding sites is a natural feature of the genomic organization and is not imposed artificially.
To ease exploration of the PRI database, we created a web-based interface (http://barcode.colorado.edu/pri/). (For more information relating to the web site, please see SI Text, and see SI Table 1 for a catalog of all binding sites conserved within PRIs to be examined in this study). Using the interface, we recovered many cis-regulatory regions in our database that have been reported in the literature, including well characterized regions from IL-2, myogenin, and CDC2.
To make a broader assessment of the biological relevance of PRIs, we surveyed 100 regulatory regions reported in the literature to ask how frequently PRIs in our database overlap with regions defined by functional studies. As shown in SI Table 2, documented functional transcription regulatory elements can be matched with the PRI database by using the default search on the web site in 54 of the 100 cases examined. There are several reasons that could explain why the PRI algorithm did not find all regions. First, nine of the genes controlled by these regions are not found in a National Center for Biotechnology Information Homologene group. In addition, it is probable that not all genes are subject to combinatorial regulation. It is also important to note that any very distal elements would not have been covered by the search. Finally, binding sites that have yet to be discovered or that do not conform to a canonical consensus sequence would not be in the TFD. However, new binding sites can be added to the TFD as they are discovered. Overall, our analysis suggests that the PRI database is highly enriched with experimentally defined functional regulatory regions.
In an effort to assess the uniqueness of the PRI approach, we evaluated the performance of other programs dedicated to genomewide identification of cis-regulatory regions across species, PReMod (17) and EEL (18). We used the default search parameters for each database. Interestingly, EEL did not predict any of the same 100 literature-defined regions tested with PRI. Conversely, PReMod finds 63 of 100; 43 of these overlap with those found by PRI (SI Table 2). Curiously, of 12 regions heretofore unexplored with respect to transcriptional regulation but predicted by PRI, PReMod predicts only 4, yielding overall totals of 66 of 112 regions for PRI and 67 of 112 for PReMod. As detailed below, we tested these regions for their potential as genomic sites of transcriptional regulation. (For a more detailed comparison of these two predictive programs, please see SI Text.)
Identifying Skeletal Muscle-Specific PRIs.
A corollary to the above survey is that the PRI approach can be used to discover novel, biologically relevant cis-regulatory regions on a genomewide scale. To test this proposition, we experimentally investigated as yet uncharacterized regulatory regions of genes that may be involved in myogenic differentiation. Myogenesis is principally orchestrated by a series of transcriptionally controlled events governed by myogenic regulatory factors (MRFs) (19, 20). A key MRF is MYOD, which binds to a short, degenerate consensus binding site (CANNTG) called an E-box (20). The probability of finding an E-box in the genome is 1/256, only a few of which are likely to be functional. However, MYOD often acts in concert with other transcription factors (20); therefore, we expect that functional E-boxes occur within PRIs. Examples of transcription factors known to cooperate with MYOD are members of the MEF2 family (19, 20) and the SRF family (21). Thus, as a result of the characteristics of the E-box and the combinatorial nature of MYOD-mediated transcriptional regulation, finding functional PRIs in myogenesis poses an excellent challenge to the PRI hypothesis.
To test this hypothesis, we developed a systematic approach with several criteria to sift through the PRI database. We first extracted all PRIs with E-boxes and/or MEF2 binding sites. Next, we crossed this list with a list of genes up-regulated during myogenesis as revealed by several microarray studies (2224). We deemed this step necessary because many other transcription factors in addition to MYOD are known to bind E-boxes and MEF2 family members can regulate other processes, such as neurogenesis (25). We chose 12 genes, of which the regulatory elements that mediate responses to myogenic signals in mice are unidentified (TNNI2, AQP1, EDA2R, PGCP, PKIA, CPA1, FHL3, MAX, RDH5, SOX6, TPM2, and USP2), for further study. Schematic representations of the organization of associated PRIs are displayed in SI Fig. 6. Some of the PRIs display unexpected features. For example, the PRI for FHL3 is found in the intron of an upstream gene (UTP11L) in the opposite orientation, and the PRI for EDA2R encompasses an exon. Although it is generally believed that transcriptional regulatory regions do not reside within exons, we still were interested in pursuing this region.
We then proceeded to verify that these genes are up-regulated during myogenesis with real-time PCR. We chose to study these candidate PRIs in the C2C12 mouse myoblast cell line as it expresses MYOD and can recapitulate skeletal myotube formation. Fig. 2Fig. 2.A tabulates the fold change in transcript level after either 1 or 4 days of serum deprivation. We included the genes SERPINE1 and CUEDC2 as negative controls as the expression of SERPINE1 is known to be regulated in response to TGF-β through an E-box that does not lie within a PRI (26) and CUEDC2 harbors a MEF2 binding site in its upstream region that also is not covered by a PRI. MYOG serves as a positive control for this study as it is a well studied muscle-specific gene whose documented regulatory region (27, 28) overlaps with a PRI. Of the 12 candidate genes, 10 were up-regulated in at least one time point tested. The remaining two, AQP1 and MAX, were not amenable to real-time PCR analysis. However, AQP1 is known to be strongly expressed in adult cardiac and skeletal muscle (29), and this is corroborated by the expression pattern of AQP1 on a microarray following muscle regeneration after injury (23). In addition, it can be demonstrated by microarray (22) that transcript levels of MAX are induced as an early myogenic event.
Fig. 2.
Fig. 2.
Fig. 2.
PRIs with MEF2 binding sites and/or E-boxes from genes regulated during myogenesis are bound by these transcription factors. (A) Real-time PCR analysis of genes with PRIs selected for study. Fold change in transcript level of the indicated genes relative (more ...)
Next, using ChIP assays, we qualitatively assessed whether binding sites within PRIs are actually bound during C2C12 myogenic differentiation. Briefly, we immunoprecipitated lysate from growing myoblasts (GM) and differentiating myotubes (DM) after 1 day of serum deprivation with antibodies specific to MYOD and MEF2 (30). We then analyzed the PRIs in question through the use of traditional PCR. Fig. 2Fig. 2.B reveals that all 12 PRIs are bound at least by day 1, with a subset also bound in the GM, according to which binding sites are conserved in the region. Importantly, the non-PRI MEF2 binding site in the upstream region of CUEDC2 is not bound by the MEF2 protein, and the TGF-β-responsive region (3APP) of SERPINE1 is not bound by MYOD at the E-box known to be critical for TGF-β signaling. Additionally, the positive control region from MYOG is bound by both MYOD and MEF2 as expected.
To test the capacity of candidate PRIs to drive transcription during myogenic differentiation, six of these PRIs were cloned into a luciferase reporter gene vector upstream of a minimal TATA box-containing promoter and subsequently transfected into C2C12 cells. We induced cell differentiation and measured luciferase activity in GM and differentiated myotubes 1 day after induction. As shown in Fig. 3Fig. 3.A, minTATA, which is the parental construct for all of our cloned PRIs, has minimal basal activity and just 1.8-fold induction. In contrast, mMYOG −184 → +11, a construct that harbors the same region probed in the ChIP experiments, showed robust transcriptional activity and induction upon myogenic differentiation. The 3APP region of SERPINE1 displays minimal activity during myogenesis and is induced only 1.3-fold, suggesting that an E-box alone without the proper combination of other transcription factor binding sites is insufficient for mediating gene activation during myogenic differentiation. The six PRIs showed robust basal transcriptional activity as well as induction greater than the minTATA construct upon myogenic differentiation.
Fig. 3.
Fig. 3.
Fig. 3.
Predicted PRIs can be sites of cis-regulation in myogenesis. (A) C2C12 cells were transfected with luciferase reporter plasmids containing PRIs upstream of a TATA box and normalized with a renilla reporter plasmid driven by the thymidine kinase promoter. (more ...)
To demonstrate that the binding sites used to identify the PRIs contribute to transcriptional regulation, we made single point mutations in E-boxes (CANNTG to CANNTA), MEF2 binding sites (YTWWAAATAR to YAAWAAATAR), or serum response elements (CCW6GG to CCW6GA) and tested the resulting reporter construct. Fig. 3Fig. 3.B shows the effects of mutated PRIs from the genes AQP1 and FHL3. In general, when compared with their wild-type counterparts, the transcriptional output mediated by the mutant promoters was significantly decreased for both PRIs tested. Moreover, in the case of the FHL3 PRI, both conserved E-boxes appear to contribute to driving transcription. In addition, mutation of the SRE alone or in combination with the E-boxes also impairs the luciferase response governed by the FHL3 PRI, which suggests combinatorial regulation of FHL3 by MYOD and SRF.
We were therefore able to successfully mine our PRI database by asking candidate PRIs to fulfill the following criteria: (i) the PRI contains binding sites for transcription factors known to regulate myogenesis; (ii) the associated gene is differentially expressed during myogenesis as demonstrated by microarray and real-time PCR; (iii) the transcription factors for which there are conserved binding sites in the PRI bind this region as assessed by ChIP; and (iv) the PRIs can drive transcription during myogenesis and depend on the integrity of conserved binding sites as determined by reporter gene assays.
Assessing the Importance of Spacing Between Conserved Binding Sites in PRIs.
We next explored the idea that the identity of the intervening bases between binding sites is not critical. An objective variation in spacer base identity can be found in the form of orthologous human sequences. As an illustration of this idea, SI Fig. 7 depicts the conservation of binding sites and the differences between spacing base pairs of human and mouse versions of PRI 8 of AQP1. While there is 100% conservation at the level of the conserved binding sites, only 61% of the intervening sequence is conserved, which is satisfying because it is a good illustration of what we expect to find given the PRI hypothesis is true. Fig. 4Fig. 4.A demonstrates that the human ortholog of mouse AQP1 PRI 8 is quite active in our experimental C2C12 system, supporting the notion that the identity of spacer bases between conserved binding sites in PRIs is not crucial.
Fig. 4.
Fig. 4.
Fig. 4.
Spacing between binding sites is important for PRI-mediated transcriptional regulation, whereas spacer identity is not. (A) Orthologous human regions of PRI 8 of AQP1 can drive luciferase expression in mouse C2C12 cells. Luciferase assay was performed (more ...)
To test the assumption that the size of the spacer sequence is important, we deleted the bases between conserved binding sites and reintroduced the resulting reporter construct into our experimental C2C12 system. Fig. 4Fig. 4.B reveals that the deletion severely impairs reporter gene transcription in PRIs from mAQP1 and mCPA1, which further supports the notion that the spatial organization of binding sites is critical to mediating gene activation. Although we cannot definitively rule out the possibility that these deletions would inadvertently delete uncharacterized binding sites, the fact that we observe decreased activity in the cases of two PRIs with different intervening sequences does in part validate our assumption that the spacing constraint is informative in computational identification of PRIs. Taken together, these results suggest that by following the logic that the organization of cis-elements will be conserved with respect to transcription factor binding site identity, order, and spacing across human, mouse, and rat genomes, the PRI algorithm can reliably predict the regulatory regions targeted in the actions of transcription factors of interest and lends support to the combinatorial hypothesis of mammalian transcriptional regulation.
The richness of genomic sequence data now available has prompted exponential growth in comparative genomics in the last several years. A number of predictive programs for identifying cis-regulatory regions, including PRI as well as EEL and PReMod, have taken advantage of the power of evolutionary conservation in an effort to reduce noise in the predictions. We have presented here the successful application of the PRI algorithm to pinpointing regulatory regions involved in regulating the transcriptional program of myogenesis. The algorithm is especially powerful for predicting regions engaged in combinatorial transcriptional regulation. As stated above, we expect orthologous genes to be similarly regulated, and we expect conservation of the organization of the regulatory region at the level of the DNA to reflect this. The fact that we can find thousands of PRIs that display conservation of binding-site order and spacing across three mammalian species lends credence to this assumption. It also suggests that approaches that look for simple clustering of binding sites with no cross-species comparison will suffer understandably from high false positive rates. This does appear to be the case when examining earlier generation predictive programs. For example, an early program developed by Wasserman and Fickett (31) and dedicated to identifying muscle-specific regulatory regions looks for clusters of binding sites for muscle-specific transcription factors as informed by the organization of already reported myogenic regulatory regions. As there was not yet enough sequence data available from multiple species, the authors could not incorporate a multigenome comparison. As a result, while they were able to retrieve 60% of validated regulatory regions from a set of genes known to be expressed in muscle, they report a 79% false positive rate when searching a database of eukaryotic promoters not restricted to muscle-specific genes. In addition, a program generated by Berman et al. (9) that looks for clusters of binding sites for early developmental transcription factors in Drosophila predicted 37 regions. Of these regions, 18 displayed no enhancer-like properties, yielding a false positive rate of 48.6% (32). Although it is generally acknowledged that Drosophila display less complex transcriptional regulation than mammals, it is apparent that even in this case a single-species, simple clustering approach does not fully suffice.
The experiments presented here have verified the idea that binding-site order and spacing are defining features of regulatory regions, particularly those that serve as sites of combinatorial control. What we do not know is how stringent these criteria should be. As enhancer properties are not well enough defined to be certain how relaxed the criteria can be, we allowed no tolerance for any deviations in binding-site order and spacing. Programs such as PReMod are more flexible from this point of view. Although PReMod also performs well, it is possible that this relaxation with respect to conservation could result in a higher false positive rate. At the same time, there is no existing program that can guarantee identification of all regulatory regions, as clearly evidenced by the fact that PRI and PReMod each find literature-defined regulatory regions that the other does not. Thus, it is distinctly possible that there are key features of regulatory regions that are not yet known but that could improve the algorithm and allow it to be more inclusive.
An interesting observation is that there are frequently multiple PRIs associated with a given gene. It is possible that this is correlated with the complex expression patterns observed for many of the genes of higher eukaryotes. For example, different PRIs could be used in different scenarios. This also establishes the potential for long-range interactions between regulatory regions, which could allow fine-tuning of the expression level of a given gene in different biological settings. As a result, we anticipate that there could be cases where a PRI cloned out of its genomic context and thereby isolated from any potential interactions could not recapitulate the expression pattern of the associated gene. Conversely, we encountered a few cases of PRIs from genes not normally differentially expressed in muscle that could unexpectedly drive reporter gene expression in a C2C12 differentiation time course. This aberrant behavior could be a good example of chromatin structure playing an important role in determining DNA accessibility to transcription factors as the plasmid-borne PRI would escape much of the influence of chromatin modifications normally present. In the future, perhaps overlaying our database with databases that contain information about chromatin structure, such as DNA methylation or histone posttranslational modifications, could allow even more confident forecasting of which PRIs are directing transcription in different scenarios.
There are many potential applications of the PRI algorithm to answering diverse questions. For instance, by examining a list of genes containing PRI-associated binding sites for a given transcription factor, it could be possible to predict the biological process in which this transcription factor is involved. This, in effect, would constitute a functional annotation of cis-regulatory elements. We performed such an analysis on all of the genes in the PRI database with conserved MEF2 binding sites. We compared these genes with all genes in the genome and looked for enrichment of specific functional categories by using Gene Ontology (33). As a control, we also performed the same analysis on genes that contain MEF2 binding sites in their upstream regions but that are not conserved in PRIs. The analysis done on genes with MEF2 binding sites in PRIs returned significantly low (<0.05) binomial P values for functional categories compatible with myogenesis and neurogenesis (data not shown). No such enrichment was observed with genes whose MEF2 consensus binding sites were not within PRIs. We anticipate that one could apply this approach to uncharacterized binding sites to predict what system might use them.
In addition to discovering new cis-regulatory regions, the computational framework developed to support our analysis could be exploited to perform global analyses of transcriptional networks. One obvious question one can ask is what is the network of genes controlled by two or more transcription factors suspected of working together to achieve combinatorial transcriptional regulation? The first search tool on the PRI web site can serve as a launch pad for such an analysis. Further analysis could be done to functionally categorize the PRIs or determine whether key combinations of transcription factors might cooperate to regulate gene expression under different conditions, for example, in response to cytokines, in governing tissue-specific expression patterns, or in coordinating developmental timing events.
Finally, PRIs can be viewed from an evolutionary perspective. It has already been proposed that the complexity of gene regulatory mechanisms is proportional to the complexity of the organism (2), more so than gene number or gene conservation. Could it be that natural mutations within the PRI region that change spacing, binding-site sequence, or orientation alter the transcriptional output and drive the phenotypic variations that distinguish species?
In summary, identification of PRIs coupled with the experimental validation presented here has revealed heretofore unexplored characteristics of the architecture of cis-regulatory regions. This analysis underscores the importance of the order and spacing of transcription factor binding sites that serve as the platform for transcriptional regulation of gene expression in mammalian genomes.
Cloning of Luciferase Reporters and Mutagenesis.
The reporter constructs p3TP-Lux and p3APP-Lux have been described (26). We cloned some of the representative PRIs into the KpnI–PstI site of p3APP-lux by substituting the existing TGF-β-responsive elements cloned previously between the two restriction sites. Point mutations and deletions were generated with the QuikChange II kit (Stratagene, La Jolla, CA) per the manufacturer's instructions.
Cell Lines and Luciferase Assay.
C2C12 mouse myoblasts were a gift from Leslie Leinwand (University of Colorado) and maintained in DMEM supplemented with l-glutamine, penicillin, streptomycin, and 20% FBS. Myogenic differentiation was induced by serum deprivation. Briefly, cells were washed with PBS, and the culture media were replaced with DMEM supplemented with 5% horse serum. For luciferase assays, C2C12 cells were plated at 5,000 cells per cm2 in triplicate and 24 h after were transfected with FuGENE6 transfection reagent (Roche, Indianapolis, IN) with 1 μg of the indicated plasmids and 10 ng of pRL-TK for normalization according to the manufacturer's conditions. Twenty-four hours after transfection, GM were harvested and differentiation was induced to yield DM. DM cells were harvested at the time points indicated. Reporter gene assays were performed by using the Dual Luciferase kit (Promega, Madison, WI) according to the manufacturer's instructions and read with a Dynex luminometer.
Real-Time PCR.
Total RNA was isolated from C2C12 cells with an RNeasy kit (Qiagen, Valencia, CA) following the manufacturer's instructions. The total RNA was then reverse-transcribed with a Transcriptor First Strand cDNA synthesis kit using Anchored oligo-(dT)18 (Roche). The real-time PCR experiment was performed by using LightCycler FastStart DNA Master(PLUS) SYBR Green I following the manufacturer's instructions on a Light Cycler 2.0 (Roche). Expressions were normalized to the control gene EPB7.2, which was selected from a microarray for its high expression level and minimal fold change during myogenic differentiation. For oligo sequences used in this analysis, please refer to SI Text.
ChIP.
ChIP assays were performed essentially according to Lambert and Nordeen (34) except that antibodies against MYOD (sc-760; Santa Cruz Biotechnology, Santa Cruz, CA) and MEF2 (sc-313; Santa Cruz Biotechnology) were used. The results were analyzed by PCR on a Mastercycler Gradient (Eppendorf, Hamburg, Germany). For more information, please refer to SI Text.
Supplementary Material
Supporting Information
Acknowledgments
We thank Natalie Ahn, Jim Goodrich, Rob Knight, Leslie Leinwand, and David Clarke for critical reading of the manuscript and members of the Leinwand and Knight laboratories at the University of Colorado for discussion and reagents. K.K.B.B. was supported by National Institute of General Medical Sciences Predoctoral Training Grant T32GM08759. T.H.C. was supported by National Institutes of Health Predoctoral Training Grant 5T32HL07851. This work was supported by National Institutes of Health Grants CA107098-01 and R01CA95527 (to X.L.).
Abbreviations
PRIpattern-defined regulatory island
TFDtranscription factor database
GMgrowing myoblasts
DMdifferentiating myotubes.

Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0704028104/DC1.
1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. Nature. 2001;409:860–921. [PubMed]
2. Levine M, Tjian R. Nature. 2003;424:147–151. [PubMed]
3. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Nature. 2003;423:241–254. [PubMed]
4. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M. Science. 2003;301:71–76. [PubMed]
5. Pritsker M, Liu YC, Beer MA, Tavazoie S. Genome Res. 2004;14:99–108. [PubMed]
6. Wang T, Stormo GD. Proc Natl Acad Sci USA. 2005;102:17400–17405. [PubMed]
7. Papatsenko D, Levine M. Nat Methods. 2005;2:529–534. [PubMed]
8. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al. Nature. 2004;431:99–104. [PubMed]
9. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Proc Natl Acad Sci USA. 2002;99:757–762. [PubMed]
10. Bejerano G, Siepel AC, Kent WJ, Haussler D. Nat Methods. 2005;2:535–545. [PubMed]
11. Kleinjan DA, van Heyningen V. Am J Hum Genet. 2005;76:8–32. [PubMed]
12. Remenyi A, Scholer HR, Wilmanns M. Nat Struct Mol Biol. 2004;11:812–815. [PubMed]
13. Arnone MI, Davidson EH. Development (Cambridge, UK). 1997;124:1851–1864.
14. Ghosh D. Nucleic Acids Res. 1992;(20) Suppl:2091–2093. [PubMed]
15. Cheung TH, Kwan YL, Hamady M, Liu X. Genome Biol. 2006;7:R97. [PubMed]
16. Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, et al. Nucleic Acids Res. 2001;29:11–16. [PubMed]
17. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, et al. Genome Res. 2006;16:656–668. [PubMed]
18. Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E, Taipale J. Cell. 2006;124:47–59. [PubMed]
19. Molkentin JD, Olson EN. Proc Natl Acad Sci USA. 1996;93:9366–9373. [PubMed]
20. Tapscott SJ. Development (Cambridge, UK). 2005;132:2685–2695.
21. Groisman R, Masutani H, Leibovitch MP, Robin P, Soudant I, Trouche D, Harel-Bellan A. J Biol Chem. 1996;271:5258–5264. [PubMed]
22. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH. FASEB J. 2004;18:403–405. [PubMed]
23. Zhao P, Iezzi S, Carver E, Dressman D, Gridley T, Sartorelli V, Hoffman EP. J Biol Chem. 2002;277:30091–30101. [PubMed]
24. Chen IH, Huber M, Guan T, Bubeck A, Gerace L. BMC Cell Biol. 2006;7:3. [PubMed]
25. McDermott JC, Cardoso MC, Yu YT, Andres V, Leifer D, Krainc D, Lipton SA, Nadal-Ginard B. Mol Cell Biol. 1993;13:2564–2577. [PubMed]
26. Hua X, Liu X, Ansari DO, Lodish HF. Genes Dev. 1998;12:3084–3095. [PubMed]
27. Buchberger A, Ragge K, Arnold HH. J Biol Chem. 1994;269:17289–17296. [PubMed]
28. Edmondson DG, Cheng TC, Cserjesi P, Chakraborty T, Olson EN. Mol Cell Biol. 1992;12:3665–3677. [PubMed]
29. Au CG, Cooper ST, Lo HP, Compton AG, Yang N, Wintour EM, North KN, Winlaw DS. J Mol Cell Cardiol. 2004;36:655–662. [PubMed]
30. Blais A, Tsikitis M, Acosta-Alvear D, Sharan R, Kluger Y, Dynlacht BD. Genes Dev. 2005;19:553–569. [PubMed]
31. Wasserman WW, Fickett JW. J Mol Biol. 1998;278:167–181. [PubMed]
32. Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE. Genome Biol. 2004;5:R61. [PubMed]
33. Gene Ontology Consortium. Nucleic Acids Res. 2006;34:D322–D326. [PubMed]
34. Lambert JR, Nordeen SK. Methods Mol Biol. 2001;176:273–281. [PubMed]

See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph
See more articles cited in this paragraph