Copyright © 2006 The Author(s)

A comparative analysis of genome-wide chromatin immunoprecipitation data for mammalian transcription factors

Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, MA 02138, USA
1Department of Molecular and Cellular Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
2Department of Statistics, Stanford University, 390 Serra Mall, Stanford, CA 94305, USA
*To whom correspondence should be addressed. Tel: +1 650 725 2915; Fax: +1 650 725 8977; Email: whwong/at/stanford.edu
Received July 13, 2006; Revised September 26, 2006; Accepted September 28, 2006.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Genome-wide location analysis (ChIP-chip, ChIP-PET) is a powerful technique to study mammalian transcriptional regulation. In order to obtain a basic understanding of the location data generated for mammalian transcription factors and potential issues in their analysis, we conducted a comparative study of eight independent ChIP experiments involving six different transcription factors in human and mouse. Our cross-study comparisons, to the best of our knowledge the first to analyze multiple datasets, revealed the importance of carefully chosen genomic controls in the de novo identification of key transcription factor binding motifs, raised issues about the interpretation of ubiquitously occurring sequence motifs, and demonstrated the clustering tendency of protein-binding regions for certain transcription factors.
INTRODUCTION Genome-wide chromatin immunoprecipitation (also known as location analysis) is a powerful technique to identify and locate mammalian transcription factor binding regions at a resolution of 0.5–2 kb (1–4). Combined with downstream sequence analysis (5–9), this technology has the potential to provide a detailed characterization of structures [i.e. identities and locations of 6–30 bp long transcription factor binding sites (TFBS)] and functions of mammalian cis-regulatory elements. Applying this technology to mammalian genomes is still at an early stage; hence, systematic understandings about the data itself are limited, as is our knowledge about potential issues in their analysis and interpretation. Because of this, it is still not clear as to which types of computational analyses are generally useful, how they should be performed and how the results should be interpreted. To help clarify these issues, we performed a comparative analysis of eight independent ChIP experiments in human and mouse (Table 1). These experiments involve six different transcription factors [Gli, Estrogen Receptor (ER), p53, Oct4, Sox2 and Nanog] and three different technological platforms (ChIP-chip on Agilent tiling arrays, ChIP-chip on Affymetrix tiling arrays and ChIP-PET). Through the cross-study comparisons, we show that
The study here represents the first cross-platform, multi-factor analysis of genome-wide ChIP data. Our results revealed common characteristics of such data for mammalian transcription factors and provide guidelines for their future analysis. MATERIALS AND METHODS Data preparation To conduct the comparative study, we collected ChIP-chip data for Gli (mouse, GEO accession no. GSE5683) (S.A. Vokes, H. Ji, S. McCuine, T. Tenzen, S. Giles, S. Zhong, W.J.R. Longabaugh, E.H. Davison, W.H. Wong and A.P. McMahon, submitted for publication), estrogen receptor (human) (2), Oct4, Sox2 and Nanog (human) (1), and ChIP-PET data for p53 (human) (3), Oct4 and Nanog (mouse) (4). The five ChIP-chip datasets were generated using three different platforms. The Gli ChIP-chip was generated using a custom array produced by Agilent Technology. 50–150 kb regions surrounding promoters and 3′-untranslated regions (3′-UTRs) of a selected set of genes were surveyed by 60mer oligo probes at a density of one probe per 125 bp. The Oct4, Sox2 and Nanog ChIP-chip were based on Agilent promoter arrays which surveyed −8 to +2 kb promoter regions of all human genes using 60mer probes with an estimated probe spacing of 280 bp. The ER ChIP-chip was produced using Affymetrix chromosome 21 and 22 tiling arrays where 25mer probes were tiled at a density of 1 probe per 35 bp. These data were summarized in Table 1 and were analyzed using a unified protocol as described below. General data analysis protocol We applied the ChIP-chip peak detection tool TileMap (15) to define potential protein-binding regions for the five ChIP-chip datasets (Gli, ER, Oct4, Sox2 and Nanog). Enriched sequence patterns in high-quality binding regions were then identified through de novo motif discovery (8,9). In order to identify the key motif that may mediate sequence-specific protein binding, we compared different motifs' relative enrichment levels in high-quality binding regions versus control genomic regions. 
Then, the key motif's relative enrichment levels were used to refine the cutoff for defining binding regions which in turn were subject to further analysis of GC-content, phylogenetic conservation and physical distribution. For the three ChIP-PET datasets (p53, Oct4 and Nanog), our analysis followed the order of de novo motif discovery, key motif ascertainment, analysis of GC-content, conservation and distributional properties. Initial definition of binding regions For Gli, Oct4, Sox2 and Nanog ChIP-chip (on Agilent arrays), we applied TileMap (15) to compute a moving average (MA) statistic for each probe. Probes with the MA statistic three standard deviations away from the global mean were used to define potential binding regions, resulting in 65 initial regions in Gli, 1262 initial regions in Oct4, 1220 initial regions in Sox2 and 1842 initial regions in Nanog. For ER ChIP-chip (on Affymetrix arrays), we applied a hidden Markov model (HMM) using TileMap to detect binding regions. We detected 107 initial regions using a posterior probability cutoff value of 0.9. The rationale for choosing the algorithms and cutoffs is explained in Supplementary Data S1 and S2. For p53, Oct4 and Nanog ChIP-PET, all regions reported by the original authors were included in our subsequent analysis. For all datasets, the number of initial regions and the criteria used to define them are summarized in Table 2.
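The thresholding step for the Agilent datasets can be illustrated with a minimal sketch. It assumes that one score per probe is already available, and it uses a plain centered moving average in place of TileMap's MA statistic, which is built on probe-level test statistics and is not reproduced here. The window half-width, the `max_gap` used to merge neighboring probes, the use of the upper tail only and the population standard deviation are all illustrative assumptions, not settings taken from the study.

```python
from statistics import mean, pstdev

def moving_average(values, half_window=2):
    """Centered moving average; windows are truncated at the array ends."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def call_regions(positions, ma, n_sd=3.0, max_gap=500):
    """Merge probes whose MA exceeds mean + n_sd * SD into (start, end) regions."""
    cutoff = mean(ma) + n_sd * pstdev(ma)
    regions = []
    for pos, value in zip(positions, ma):
        if value <= cutoff:
            continue
        if regions and pos - regions[-1][1] <= max_gap:
            regions[-1][1] = pos          # extend the current region
        else:
            regions.append([pos, pos])    # open a new region
    return [tuple(r) for r in regions]

# Toy data (hypothetical): 200 probes spaced 125 bp apart, one enriched stretch.
positions = [i * 125 for i in range(200)]
scores = [0.0] * 200
for i in range(100, 106):
    scores[i] = 5.0

ma = moving_average(scores)
regions = call_regions(positions, ma)
print(regions)
```

Raising `n_sd` from 3 to 4 in the same function corresponds to the stricter rule used later to select high-quality regions for motif discovery.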
The genomic coordinates of all human regions were converted into coordinates based on NCBI build 35 (hg17). All mouse regions were converted into NCBI build 34 (mm6) coordinates. Repeat-masked sequences of these two assemblies were downloaded from the UCSC genome browser (http://genome.ucsc.edu) and were used for all subsequent sequence analyses. De novo motif discovery A subset of high-quality regions from each dataset were selected for de novo motif discovery (Table 2). For Gli, Oct4, Sox2 and Nanog ChIP-chip, the high-quality regions were defined as regions with at least one probe whose MA statistic is four standard deviations away from the global mean. This resulted in 30 Gli regions, 388 Oct4 regions, 477 Sox2 regions and 728 Nanog regions. All 107 initial regions were used for de novo motif discovery in the ER ChIP-chip dataset. For p53 ChIP-PET, high-quality regions were defined as 323 PET3+ regions (i.e. regions with ≥3 overlapping paired-end ditags) as suggested by Wei et al. (3). For Oct4 ChIP-PET, all 1052 regions were used. For Nanog ChIP-PET, since the identity of the Nanog motif has not yet been clearly established, we only included 602 PET8+ regions in the de novo motif discovery to assure the high quality of input sequences. De novo motif discovery was performed by running a Gibbs motif sampler (8,9) (Supplementary Data S3) three times independently. Each time, 10 motifs were sampled simultaneously. An initial motif length (L = 9, 12, 15) was specified for all motifs at the beginning of the sampling, and the motif lengths were then adjusted during the sampling procedures as described previously (16). A position-specific weight matrix (PWM) was reported for each motif and a motif score was computed as follows:
ni = niA + niC + niG + niT, where nij (j = A, C, G, T) is the number of occurrences of nucleotide j at the i-th position of the motif. A pseudocount of 0.5 was added to each nij to avoid zero counts. pij = nij/ni. qj is the occurrence frequency of nucleotide j in the background sequences (derived from all input sequences). W is the length of the motif. This score is essentially the motif score used by MDSCAN under a zeroth-order Markov background model (5). It reflects both the information content of the motif and the strength of the evidence (i.e. the number of TFBSs). Define max_score to be the maximum score of all motifs discovered in the three independent runs of the Gibbs motif sampler. Motifs with a score less than max{0.4*max_score, 1.5} were considered to be of low quality and were excluded from further analysis. A number of motifs reported by the three independent runs were redundant, i.e. they had almost the same sequence pattern. After visual inspection of their sequence logos (17), only one copy (the one with the highest motif score) of each set of redundant motifs was kept. The complete results of de novo motif discovery are shown in Supplementary Figures S1–S8.
Mapping transcription factor binding motif to sequences
When mapping a motif PWM to DNA, background sequences were modeled as a third-order Markov chain. At each position, the likelihood ratio (LR) between the motif model (PWM) and the background model was computed. A site with an LR greater than a certain cutoff was declared a TFBS. TFBSs can be filtered further by cross-species conservation. In this paper, LR > 500 was used to define a TFBS. This cutoff represents a balance between the sensitivity and specificity of the analysis (Supplementary Data S4). Several other cutoff values were also tried but did not change the conclusions drawn in the paper. A conserved TFBS was defined as a TFBS that resides within the top 10% most conserved genomic regions.
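As an illustration of the likelihood-ratio mapping described above, the following sketch scans a sequence with a PWM against a third-order Markov background and reports windows with LR > 500. It is a simplified stand-in for the procedure in Supplementary Data S4: only the forward strand is scanned, the background is trained on a toy sequence with an additive pseudocount, unseen contexts fall back to a uniform distribution, and masked bases such as N are not handled. The helper names (`train_background`, `scan`, `toy_pwm`) and the toy motif are hypothetical.

```python
from collections import defaultdict

BASES = "ACGT"

def train_background(seq, order=3, pseudo=1.0):
    """Estimate P(base | previous `order` bases) with additive pseudocounts."""
    counts = defaultdict(lambda: {b: pseudo for b in BASES})
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    uniform = {b: 1.0 / len(BASES) for b in BASES}
    table = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        table[ctx] = {b: row[b] / total for b in BASES}
    # Contexts never seen in training fall back to a uniform distribution.
    return lambda ctx, base: table.get(ctx, uniform)[base]

def scan(seq, pwm, bg_prob, order=3, cutoff=500.0):
    """Return (start, LR) for forward-strand windows whose LR exceeds cutoff."""
    width = len(pwm)
    hits = []
    for s in range(order, len(seq) - width + 1):
        lr = 1.0
        for i in range(width):
            base = seq[s + i]
            lr *= pwm[i][base] / bg_prob(seq[s + i - order:s + i], base)
        if lr > cutoff:
            hits.append((s, lr))
    return hits

def toy_pwm(consensus, hit=0.97):
    """Build an illustrative PWM that strongly prefers the consensus base."""
    miss = (1.0 - hit) / 3
    return [{b: (hit if b == c else miss) for b in BASES} for c in consensus]

# Toy example (hypothetical motif and sequences).
background = train_background("ACGT" * 50)
pwm = toy_pwm("TTGGCCAA")
sequence = "ACGTACGTAC" + "TTGGCCAA" + "ACGTACGTAC"
hits = scan(sequence, pwm, background)
print([pos for pos, _ in hits])
```

In practice the products would usually be accumulated as sums of logarithms to avoid underflow for long motifs; the direct product is kept here so that the LR > 500 cutoff reads exactly as in the text.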
Conservation is evaluated through conservation scores, either a phastCons score (for human hg17) (18) or a score based on a window's percent identity measure (for mouse mm6) (Supplementary Data S5). Examination of motif's relative enrichment levels Three statistics, r1, r2 and r3, were defined to characterize relative enrichment levels of a motif in ChIP-binding regions compared to control regions. Assume that n1B counts how many times a motif occurs in ChIP-binding regions, n2B is the total length of non-repeat sequences in binding regions, n1C counts how many times the motif occurs in control regions and n2C is the total length of non-repeat sequences in control regions. We define r1 = (n1B/n2B)/(n1C/n2C) as the relative enrichment level of the motif. Similarly, let n3k (k = B or C) count the number of phylogenetically conserved motif sites in specified genomic regions, and let n4k count phylogenetically conserved non-repeat base pairs in the regions. r2 = (n3B/n4B)/(n3C/n4C) then defines the motif's relative enrichment level in phylogenetically conserved ChIP-binding regions. Note that n3k/n1k is the percentage of motif sites that are conserved and n4k/n2k is the percentage of genomic sequences that are conserved. Finally, r3, defined as (n3B/n2B)/(n3C/n2C), characterizes the relative enrichment level of phylogenetically conserved sites in general ChIP-binding regions (not necessarily conserved). Note that r3/r2 = (n4B/n2B)/(n4C/n2C) characterizes whether or not ChIP-binding regions tend to be more phylogenetically conserved than control regions. Random genomic controls, design-based controls and matched genomic controls To evaluate a motif's relative enrichment level, three types of control regions were prepared. ‘Random genomic controls’ were regions randomly chosen from the genome. Each region was 2 kb in length and 5000 regions were chosen for each dataset. 
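The enrichment statistics r1, r2 and r3 defined earlier in this section are ratios of site densities, and a short sketch makes the bookkeeping explicit. The field names follow the n1–n4 counts in the text; the `RegionCounts` container and the toy numbers are hypothetical, and the sketch assumes all control counts are non-zero.

```python
from dataclasses import dataclass

@dataclass
class RegionCounts:
    n1: int  # motif sites in the regions
    n2: int  # non-repeat base pairs in the regions
    n3: int  # phylogenetically conserved motif sites
    n4: int  # phylogenetically conserved non-repeat base pairs

def enrichment(binding, control):
    """Return (r1, r2, r3) for binding regions relative to control regions."""
    r1 = (binding.n1 / binding.n2) / (control.n1 / control.n2)
    r2 = (binding.n3 / binding.n4) / (control.n3 / control.n4)
    r3 = (binding.n3 / binding.n2) / (control.n3 / control.n2)
    return r1, r2, r3

# Toy counts (hypothetical, not taken from any dataset in the study).
binding = RegionCounts(n1=60, n2=60_000, n3=30, n4=12_000)
control = RegionCounts(n1=1_000, n2=10_000_000, n3=100, n4=1_000_000)

r1, r2, r3 = enrichment(binding, control)
# r3 / r2 compares how conserved the binding regions are relative to controls.
conservation_ratio = r3 / r2
print(r1, r2, r3, conservation_ratio)
```

With these toy counts the binding regions are twice as conserved as the controls, so r3 is twice r2, which is the identity r3/r2 = (n4B/n2B)/(n4C/n2C) noted in the text.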
‘Design-based Controls’ were regions randomly chosen from part of the genome that was surveyed in the original ChIP study. For Oct4, Sox2 and Nanog ChIP-chip, these were 5000 segments (each 2 kb long) randomly chosen from −8 to +2 kb promoter regions around transcription start sites (TSS). For Gli, all regions tiled in the custom array were used as the design-based control. For ER, we used human chromosomes 21 and 22. For p53, Oct4 and Nanog ChIP-PET, design-based controls were the same as the random genomic controls. ‘Matched genomic controls’ were control regions carefully chosen to match the physical distributions of ChIP-binding regions. To choose the matched controls, for each dataset, we first annotated binding regions by their closest RefSeq genes. We then computed the distance between the centers of ChIP-binding regions and the neighboring genes' TSS. The center of a binding region is defined as its middle point, i.e. (region start + region end)/2. Next, we randomly chose genes from the RefSeq database. For each chosen gene, a 2 kb region was picked up so that the distance between the gene TSS and the region center followed the same empirical distribution of distances between ChIP regions and their closest genes. The number of matched control regions was chosen to be a multiple of the number of high-quality binding regions. This resulted in 5010 regions for Gli-chip, 5029 for ER-chip, 4845 for p53-PET, 5044 for Oct4-chip, 4770 for Sox2-chip, 5096 for Nanog-chip, 5260 for Oct4-PET and 4816 for Nanog-PET. Refining the cutoff for defining binding regions After the key motif was ascertained from the relative enrichment level r1, r2 and r3, binding regions initially defined by TileMap were binned according to their raw ChIP signal strength (defined as MA or HMM statistics). The relative enrichment level r1, r2 and r3 of the key motif were computed for each bin. 
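The matched-control sampling described above can be sketched as resampling from the empirical distribution of region-to-TSS offsets. This is a simplified illustration under several assumptions that the text does not specify: offsets are treated as signed and strand-aware, a gene and an offset are drawn independently and uniformly, and all function and variable names, as well as the toy coordinates, are hypothetical.

```python
import random

def matched_controls(binding_offsets, genes, n_controls, width=2000, seed=0):
    """Sample control regions whose center-to-TSS offsets mimic the ChIP regions.

    binding_offsets: signed offsets (region center minus TSS, oriented along
                     the gene strand) observed for the ChIP-binding regions.
    genes:           list of (chrom, tss, strand) tuples, e.g. from RefSeq.
    """
    rng = random.Random(seed)
    controls = []
    for _ in range(n_controls):
        chrom, tss, strand = rng.choice(genes)
        offset = rng.choice(binding_offsets)  # draw from the empirical distribution
        center = tss + offset * (1 if strand == "+" else -1)
        start = max(0, center - width // 2)
        controls.append((chrom, start, start + width))
    return controls

# Toy inputs (hypothetical coordinates).
offsets = [-5000, -1200, 300, 40000]
genes = [("chr1", 1_000_000, "+"), ("chr2", 2_500_000, "-")]
controls = matched_controls(offsets, genes, n_controls=50)
print(controls[:3])
```

Because the offsets are resampled directly, the controls inherit any promoter-proximal bias of the binding regions, which is the property that distinguishes them from the random genomic controls.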
In general, the enrichment levels decreased as the raw ChIP signals went down (Supplementary Figure S10). We chose a cutoff to define our final binding regions by simultaneously requiring r1 > 2, r2 > 2, r3 > 2, and [(the number of motif sites in a bin) > 0.25*(the number of binding regions in the bin)]. This results in 30 Gli regions, 80 ER regions, 600 Oct4-chip regions, 900 Sox2-chip regions and 600 Nanog-chip regions (Table 2). For p53-PET, Oct4-PET and Nanog-PET, all regions reported by the original authors were used as our final regions, resulting in 542 p53-PET regions, 1052 Oct4-PET regions and 2947 Nanog-PET regions after conversion to human hg17 or mouse mm6 assembly (Table 2). These final regions were used for subsequent GC-content, conservation and peak clustering analysis. Clustering of binding regions To test if binding regions have a clustering tendency, we constructed a null distribution as follows. We first mapped the key transcription factor binding motif to the genome (or part of the genome on which the original ChIP study was performed); we then simulated binding regions by randomly choosing them from the mapped TFBSs. The chosen TFBSs were assumed to be the center of the simulated binding regions and their distances define the random peak-to-peak distances. The number of selected regions was set equal to the number of observed ChIP-binding regions. The distance between simulated regions was then used to construct the null, to which distance distribution of observed ChIP-binding regions was compared. Here, observed ChIP-binding regions are defined as binding regions that were obtained from the ChIP study and that contained at least one mapped TFBS of the key transcription factor. We used MATLAB gamfit and expfit functions to fit gamma and exponential distributions that describe the observed and simulated peak-to-peak distance, respectively. Maximum-likelihood estimates for the parameters are shown in Figure 4E–G. 
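The null construction for the clustering test can be sketched as follows. The study used MATLAB's gamfit and expfit; the sketch below only reproduces the simulation step and the exponential fit, relying on the fact that the maximum-likelihood estimate of an exponential mean is the sample mean. It assumes a single chromosome, and the mapped TFBS positions, the observed peaks and the number of simulations are toy values. A gamma fit is not shown.

```python
import random
from statistics import mean, median

def neighbor_distances(positions):
    """Distances between consecutive positions along one chromosome."""
    pts = sorted(positions)
    return [b - a for a, b in zip(pts, pts[1:])]

def simulate_null(tfbs_positions, n_regions, n_sim=200, seed=0):
    """Pool peak-to-peak distances from regions drawn at random from mapped TFBSs."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_sim):
        sampled = rng.sample(tfbs_positions, n_regions)
        pooled.extend(neighbor_distances(sampled))
    return pooled

def exponential_mle_mean(distances):
    """Maximum-likelihood estimate of the mean of an exponential distribution."""
    return mean(distances)

# Toy data (hypothetical): mapped TFBSs every 1 kb on a 10 Mb chromosome,
# and 15 observed peaks arranged in three tight clusters.
tfbs = list(range(0, 10_000_000, 1000))
observed = [c + k * 5000 for c in (1_000_000, 4_000_000, 7_000_000) for k in range(5)]

observed_d = neighbor_distances(observed)
null_d = simulate_null(tfbs, n_regions=len(observed))
print(median(observed_d), median(null_d), exponential_mle_mean(null_d))
```

A median observed distance far below the median of the simulated distances is the kind of signal that indicates clustering; a formal comparison would contrast the two fitted distributions as in the study.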
RESULTS Transcription factor binding motif can be unambiguously recovered by de novo motif discovery In order to determine if current genome-wide ChIP technology allows unambiguous recovery of the DNA motif responsible for sequence-specific protein-binding, we applied Gibbs motif sampler (8,9) to the high-quality binding regions identified from the raw ChIP intensity data (see Materials and Methods). Gibbs motif sampler is a de novo motif discovery algorithm that searches for enriched sequence patterns in a collection of DNA sequences. The motifs are assumed to be unknown before the search. In all six cases where the genuine transcription factor binding motifs were indeed known from previous biological studies (Gli, ER, p53, Oct4-chip, Sox2-chip, Oct4-PET), the genuine motifs were successfully recovered (Figure 1
Certain sequence patterns tend to be detected in binding regions of multiple transcription factors Certain motifs were found by Gibbs motif sampler in multiple unrelated datasets. The two most prominent examples are a GC-rich pattern GGGG[A/C/T]GGGG (denoted by G[n]HG[n] thereafter) and a AT-rich sequence pattern TTTTTTT (or T[n]). The former was reported in all datasets except for ER-chip, whereas the latter was found ubiquitously (Figure 1 Besides the low-complexity patterns above, a non-trivial motif CCCAG occurs frequently too. This motif is part of the core Gli-binding consensus (Figure 2A
Properly chosen control regions are important for ascertaining key motifs Motifs non-specifically enriched in ChIP-binding regions may obscure the identification of the motif that is of main interest in each individual study. For example, the strongest motifs discovered by Gibbs motif sampler in Oct4-Sox2-Nanog-chip and Oct4-Nanog-PET were the ubiquitously occurring G[n]HG[n] and T[n] sequences (Supplementary Figures S4–S8). Likewise, in ER-chip, a GC-rich pattern (M2) and the AT-rich pattern T[n] (M3) both had higher motif scores than the ER-binding element (Supplementary Figure S2). Many de novo motif discovery methods [e.g. Gibbs motif sampler and MEME (7)] only rely on the use of positive sequences in which the motif is expected to be enriched. Recent studies suggest that negative sequences (i.e. genomic control regions where the motif is expected not to be enriched) may be used as an additional source of information to refine the motif discovery (6,29,30). Ideally, by comparing motif occurrence rate in positive sequences (i.e. ChIP-binding regions) with its occurrence rate in negative sequences, one would hope that the key motif can stand out as the motif that has the highest relative enrichment level. To check if this is indeed the case in mammalian genomes, we did the following comparisons. We first defined three statistical variables, r1, r2 and r3, to characterize three different aspects of relative enrichment level (Materials and Methods). r1 characterizes relative enrichment level of a motif in ChIP-binding regions as compared with control regions; r2 characterizes relative enrichment level of phylogenetically conserved motif sites in phylogenetically conserved ChIP-binding regions; and r3 characterizes relative enrichment level of phylogenetically conserved motif sites in general ChIP-binding regions (not necessarily conserved). We then prepared three sets of control regions for each dataset (Materials and Methods). 
‘Random genomic controls’ are regions randomly chosen from the genome. ‘Design-based controls’ are regions randomly chosen from the part of the genome that was surveyed by the original ChIP study. ‘Matched genomic controls’ are control regions carefully chosen to match the physical distributions of ChIP binding, so that the distance between the simulated regions and the TSS of their neighboring genes has the same distribution as the distance between real binding regions and their neighboring TSS. When high-quality binding regions were compared to random genomic controls, the motifs of main interest stood out as the one with the highest relative enrichment level in some but not all datasets (Figure 1 When we compared the key motif's r1, r2 and r3, in most cases (except for p53-PET), r3 > r2 (Figure 1 Given what was observed in ER, Gli, p53, Oct4 and Sox2 data, we revisited the question of ‘what is the Nanog motif’. Until now, this motif has not been well established. By using matched genomic controls, the motif with the highest r1, r2 and r3 in Nanog-chip data was the composite Oct-Sox motif (Figure 1F GC-content and cross-species conservation of ChIP-binding regions One critical factor in all genome-wide ChIP studies, which seek to define a comprehensive cataloguing of target sites, lies in how to define the threshold values for these sites. We defined the final ChIP-binding regions by choosing cutoffs using the key motif enrichment level (Materials and Methods). This resulted in 30 Gli regions, 80 ER regions, 600 Oct4-chip regions, 900 Sox2-chip regions, 600 Nanog-chip regions, 542 p53-PET regions, 1052 Oct4-PET regions and 2947 Nanog-PET regions. In all eight ChIP experiments, transcription factor binding regions had a higher GC-content than the genome-wide level (Figure 3A
In the five ChIP-chip datasets, the binding regions are more conserved phylogenetically than both the genome-wide conservation level and the baseline conservation level determined by the experimental design (Figure 3B
Binding regions of some transcription factors have a clustering tendency
We noticed that certain signature genes in each individual study contained multiple binding regions identified by ChIP. For example, as a membrane receptor, Ptch1 plays a key role in transducing the sonic hedgehog (SHH) signal to Gli (31). The Gli-chip study identified five GLI-binding peaks within a 20 kb region of the Ptch1 promoter (see Figure 2A
This observation led us to hypothesize that some transcription factor binding regions have a clustering tendency. To check whether this is the case, for each of the three genome-wide or chromosome-wide datasets (ER-chip, Oct4-PET, p53-PET) and the Gli dataset, in which individual genes were covered substantially, we selected binding regions that contained at least one occurrence of the key transcription factor binding motif. The distance between neighboring binding regions was computed and its distribution was compared with that expected by chance (Materials and Methods). Compared to the simulated random distance distributions, binding regions of Gli, ER and Oct4 showed a clear clustering tendency (Figure 4D–F
Previously it was shown that 6–30 bp long transcription factor binding sites may be clustered within a cis-regulatory module (CRM) 0.5–2 kb in length, and that this type of clustering can be used to improve the performance of motif discovery and CRM prediction (10–14). In contrast, our study emphasizes a higher-level clustering of discrete cis-regulatory modules at a scale of 1–100 kb. The existence of this clustering tendency for certain transcription factors raises the question of whether this higher-level clustering can be used as an additional source of information to improve the prediction of CRMs and target genes of transcription factors. A detailed study of this issue is beyond the scope of this paper, but a simple comparison in the Gli case, provided in Supplementary Data S7, illustrates the potential. Briefly, we extracted the Gli-binding motif from TRANSFAC (28) and mapped it to conserved non-coding segments that are >200 bp long in the mouse genome. When the segments were ranked by their combined strength of Gli motif sites (i.e. the sum of log-likelihood ratios of all TFBSs within a segment), the top 50 segments contained no known Gli targets.
We then tried to combine Gli strengths from multiple clustered modules associated with each individual gene. When we rank all mouse genes by their combined strength of Gli sites, the top 50 genes surprisingly contained two well-known Gli targets, Ptch1 and Hhip, as well as two additional genes that showed SHH-responsiveness in neural tube development, including Gpc3 and Robo2, both were verified to be bound by GLI in subsequent biological validations (S.A. Vokes, H. Ji, S. McCuine, T. Tenzen, S. Giles, S. Zhong, W.J.R. Longabaugh, E.H. Davison, W.H. Wong and A.P. McMahon, submitted for publication). Because these four genes were not used to construct the TRANSFAC GLI matrix, the comparison here suggests that by combining information from multiple clustered modules, it is possible to obtain higher discriminating power to predict target genes of certain transcription factors. A potential strategy to improve CRM prediction for some transcription factors would be to generate predictions at the gene level first and then predict CRMs from their corresponding genomic regions. DISCUSSION To summarize, we have performed a comparative analysis of multiple independent ChIP experiments. Although the data we analyzed were generated from three different technological platforms, common characteristics of mammalian location analysis emerge from the cross-study comparisons and these commonalities have implications in analyzing future genome-wide location data. First, we demonstrated that by combining de novo motif discovery with the examination of relative enrichment level using matched genomic control regions, we were able to recover the transcription factor binding motif in all six cases where the key motifs were indeed known. This gives us confidence in applying ChIP technology to identify unknown mammalian transcription factor binding motif in future studies. We note that in many original ChIP studies, these motifs were also reported to be found by de novo discovery. 
Our results here, however, showed that not only can these motifs be recovered, but also they can rank at the top among the many candidate motifs reported by a de novo motif discovery algorithm, provided that their enrichment levels are examined against matched genomic controls. Second, we showed that certain motifs occur ubiquitously in transcription factor binding regions. The real functions of these motifs need to be clarified by future biological experiments. Before this, we suggest that these simple sequence patterns should be interpreted with caution. In particular, for computational biologists, these ubiquitous motifs may not be a good choice for assessing sensitivity and specificity of de novo motif discovery algorithms especially when these algorithms are tested on mammalian ChIP-chip data. The result also emphasizes the need to develop better sequence generating models for modeling genomic background in motif discovery when these patterns are not of our main interest. Third, the GC-content bias emphasizes the importance of choosing good genomic controls for ascertaining sequence patterns that are of main interest. Although several previous studies showed that negative sequences can help to improve the performance of motif discovery, the fact that the results can be affected greatly by the method used to choose negative sequences is less well-known. Fourth, the clustering tendency of binding regions is a new piece of information that may be exploited in future to improve the computational prediction of cis-regulatory elements from mammalian genome. Although not used further in our current study, the observation that the peak-to-peak distance distributions follow a gamma distribution can be potentially incorporated into future statistical models for ChIP-chip data analysis (e.g. a model that reprioritizes peaks based on their clustering tendency). 
Besides its computational use, perhaps it is more interesting to understand the mechanisms behind the clustering of multiple binding regions. One possible explanation is that a gene may need multiple enhancers to confer tissue and context specific response to a single transcription factor. Another possibility is that this clustering can raise the local concentration of TFBSs to a sufficiently high level so that the regulatory regions can be easily recognized by a transcription factor. The eventual clarification of the mechanisms warrants future investigation. With these new observations, it is our hope that the comparative study here will facilitate a better understanding of the genome-wide location data for mammalian transcription factors, and help us to use future data more efficiently. SUPPLEMENTARY DATA Supplementary Data are available at NAR Online. [Supplementary Data]
Acknowledgments
We thank the three anonymous reviewers for their helpful comments. This work is partially supported by NIH grant GM67250 (H.J. and W.H.W.) and the Helen Hay Whitney Foundation (S.A.V.). Funding to pay the Open Access publication charges for this article was provided by NIH.
Conflict of interest statement. None declared.
REFERENCES
1. Boyer L.A., Lee T.I., Cole M.F., Johnstone S.E., Levine S.S., Zucker J.P., Guenther M.G., Kumar R.M., Murray H.L., Jenner R.G., et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122:947–956.
2. Carroll J.S., Liu X.S., Brodsky A.S., Li W., Meyer C.A., Szary A.J., Eeckhoute J., Shao W., Hestermann E.V., Geistlinger T.R., et al. Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell. 2005;122:33–43.
3. Wei C.L., Wu Q., Vega V.B., Chiu K.P., Ng P., Zhang T., Shahab A., Yong H.C., Fu Y., Weng Z., et al. A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006;124:207–219.
4. Loh Y.H., Wu Q., Chew J.L., Vega V.B., Zhang W., Chen X., Bourque G., George J., Leong B., Liu J., et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nature Genet. 2006;38:431–440.
5. Liu X.S., Brutlag D.L., Liu J.S. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 2002;20:835–839.
6. Hong P., Liu X.S., Zhou Q., Lu X., Liu J.S., Wong W.H. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics. 2005;21:2636–2643.
7. Bailey T.L., Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology; Menlo Park, CA: AAAI Press; 1994. pp. 28–36.
8. Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., Wootton J.C. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262:208–214.
9. Liu J.S. The collapsed Gibbs sampler with applications to a gene regulation problem. JASA. 1994;89:958–966.
10. Wasserman W.W., Fickett J.W. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol. 1998;278:167–181.
11. Frith M.C., Hansen U., Weng Z. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics. 2001;17:878–889.
12. Zhou Q., Wong W.H. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl Acad. Sci. USA. 2004;101:12114–12119.
13. Thompson W., Palumbo M.J., Wasserman W.W., Liu J.S., Lawrence C.E. Decoding human regulatory circuits. Genome Res. 2004;14:1967–1974.
14. Gupta M., Liu J.S. De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl Acad. Sci. USA. 2005;102:7079–7084.
15. Ji H., Wong W.H. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics. 2005;21:3629–3636.
16. Jensen S.T., Liu X.S., Zhou Q., Liu J.S. Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Statist. Sci. 2004;19:188–204.
17. Crooks G.E., Hon G., Chandonia J.M., Brenner S.E. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–1190.
18. Siepel A., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050.
19. Wasserman W.W., Palumbo M., Thompson W., Fickett J.W., Lawrence C.E. Human–mouse genome comparisons to locate regulatory sites. Nature Genet. 2000;26:225–228.
20. Liu Y., Liu X.S., Wei L., Altman R.B., Batzoglou S. Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res. 2004;14:451–458.
21. Moses A.M., Chiang D.Y., Eisen M.B. Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac. Symp. Biocomput. 2004:324–335.
22. Sinha S., Blanchette M., Tompa M. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004;5:170.
23. Siddharthan R., Siggia E.D., van Nimwegen E.J. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput. Biol. 2005;1:e67.
24. Li X., Wong W.H. Sampling motifs on phylogenetic trees. Proc. Natl Acad. Sci. USA. 2005;102:9481–9486.
25. Cawley S., Bekiranov S., Ng H.H., Kapranov P., Sekinger E.A., Kampa D., Piccolboni A., Sementchenko V.I., Cheng J., Williams A.J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004;116:499–509.
26. Kinzler K.W., Vogelstein B. The GLI gene encodes a nuclear protein which binds specific sequences in the human genome. Mol. Cell. Biol. 1990;10:634–642.
27. Hallikas O., Palin K., Sinjushina N., Rautiainen R., Partanen J., Ukkonen E., Taipale J. Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell. 2006;124:47–59.
28. Matys V., Kel-Margoulis O.V., Fricke E., Liebich I., Land S., Barre-Dirrie A., Reuter I., Chekmenev D., Krull M., Hornischer K., et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34:D108–D110.
29. Bussemaker H.J., Li H., Siggia E.D. Regulatory element detection using correlation with expression. Nature Genet. 2001;27:167–171.
30. Conlon E.M., Liu X.S., Lieb J.D., Liu J.S. Integrating regulatory motif discovery and genome-wide expression analysis.
Proc. Natl Acad. Sci. USA. 2003;100:3339–3344. [PubMed] 31. Hooper J.E., Scott M.P. Communicating with Hedgehogs. Nature Rev. Mol. Cell Biol. 2005;6:306–317. [PubMed] 32. Berry M., Nunez A.M., Chambon P. Estrogen-responsive element of the human pS2 gene is an imperfectly palindromic sequence. Proc. Natl Acad. Sci. USA. 1989;86:1218–1222. [PubMed] 33. Yuan H., Corbi N., Basilico C., Dailey L. Developmental-specific activity of the FGF-4 enhancer requires the synergistic action of Sox2 and Oct-3. Genes Dev. 1995;9:2635–2645. [PubMed] 34. Botquin V., Hess H., Fuhrmann G., Anastassiadis C., Gross M.K., Vriend G., Scholer H.R. New POU dimer configuration mediates antagonistic control of an osteopontin preimplantation enhancer by Oct-4 and Sox-2. Genes Dev. 1998;12:2073–2090. [PubMed] 35. Nishimoto M., Fukushima A., Okuda A., Muramatsu M. The gene for the embryonic stem cell coactivator UTF1 carries a regulatory element which selectively interacts with a complex composed of Oct-3/4 and Sox-2. Mol. Cell. Biol. 1999;19:5453–5465. [PubMed] |