Copyright © The Author 2009. Published by Oxford University Press on behalf of Kazusa DNA Research Institute

Exhaustive Search for Over-represented DNA Sequence Motifs with CisFinder

Developmental Genomics and Aging Section, Laboratory of Genetics, National Institute on Aging, NIH, Baltimore, MD 21224, USA

*To whom correspondence should be addressed. Tel. +1 410-558-8359. Fax. +1 410-558-8331. E-mail: kom/at/mail.nih.gov

Received May 16, 2009; Accepted July 27, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We present CisFinder software, which generates a comprehensive list of motifs enriched in a set of DNA sequences and describes them with position frequency matrices (PFMs). A new algorithm was designed to estimate PFMs directly from counts of n-mer words with and without gaps; then PFMs are extended over gaps and flanking regions and clustered to generate non-redundant sets of motifs. The algorithm successfully identified binding motifs for 12 transcription factors (TFs) in embryonic stem cells based on published chromatin immunoprecipitation sequencing data. Furthermore, CisFinder successfully identified alternative binding motifs of TFs (e.g. POU5F1, ESRRB, and CTCF) and motifs for known and unknown co-factors of genes associated with the pluripotent state of ES cells. CisFinder also showed robust performance in the identification of motifs that were only slightly enriched in a set of DNA sequences.

Keywords: algorithm, software, transcription factor binding site, ChIP-seq, embryonic stem cells

1. Introduction

Transcription factor (TF) binding motifs in eukaryotes have been identified by examining binding sequences of purified TFs (e.g.
SELEX1 and Protein Binding Microarrays2) and by carrying out chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq3–5) and microarray (ChIP-chip).6 The ChIP methods can account for biological context of TF binding7–10 because many TFs require co-factors for sequence-specific binding to DNA, which are not present in in vitro assays. On the other hand, TF binding sites identified in the ChIP methods will include not only direct binding sites but also binding sites indirectly associated with the TF through the protein–protein interaction of other TFs that binds directly to DNA. Furthermore, ChIP-seq data often include several million sequence tags and >10 000 binding locations.4,9,11 These features of high-throughput genome-wide ChIP technology make the bioinformatic task of identifying TF binding motifs a great challenge. Various software tools have been developed to identify over-represented DNA sequence motifs (reviewed in Das and Dai,12 Sandve et al.,13 and Tompa et al.14). For example, traditional probabilistic methods include expectation maximization (MEME15), Gibbs sampling,7,16 genetic algorithms (GAME17), integrated Bayesian models,18 neural networks, support vector machines, Bayesian additive regression trees,19 and approximate maximum a posteriory (MAP) scoring functions.20 These methods work well when data sets are small, and thus, only a small fraction of top-scored binding sites is usually processed with these algorithms.10 Weeder, which is based on counting matching patterns with a certain maximum number of mismatches, has been reported to outperform many other software tools.14 However, most existing algorithms are limited to searching only for a single motif at a time. To find additional motifs, the software has to be run again after removing the first motif from the sequence.15 With this approach, results may be different depending on the order in which motifs are processed. 
For example, a composite motif that supports binding of two TFs (TF1 and TF2) may be lost if a more abundant motif (TF1) is processed first and then removed from the sequence. Machine-learning algorithms, such as Gibbs sampler and neural networks, tend to fall into local maxima7 and often fail to differentiate between similar motifs. In this paper, we present a new algorithm for de novo identification of over-represented DNA motifs, which is implemented as the online software tool CisFinder (http://lgsun.grc.nia.nih.gov/CisFinder). It is a complementary method to existing probabilistic algorithms and has advantages in the exploratory analysis of large input files typical for ChIP-chip or ChIP-seq data sets. CisFinder can effectively process large sequences (up to 50 Mb), extract a comprehensive list of over-represented motifs in a single run, and analyze data with poor enrichment of DNA-binding motifs. Because of high processing speed (<1 min for complete data analyses), the software can be used in an interactive manner to test many different parameter sets. The software has been tested using available ChIP-seq data on TFs expressed in ES cells.9 2. Materials and methods 2.1. Estimating position frequency matrices from n-mer word counts The proposed algorithm is based on estimating position frequency matrices (PFMs) directly from n-mer word counts in the test set and control set of sequences. To explain the algorithm, we first describe a numerical example and then present the formal justification of the method. Consider a specific n-mer word W (e.g. W = ‘ATGCAAAT’), which has T(W) = 200 matches (instances) in the set of test sequences and C(W) = 50 matches in the set of control sequences. For simplicity, we count only a total number of instances as if all sequences in a set are concatenated. (However, the CisFinder has an option to count only one match of each word per sequence.) In this example, the total length of both test and control sequences is 3 Mb. 
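The two counting modes mentioned above (total instances across a set, or at most one match per sequence) can be illustrated with a minimal Python sketch. The function name `count_word` and the toy sequences are hypothetical and not part of CisFinder; the sketch assumes matches are counted on one strand only and that overlapping matches of the same word count once.

```python
def count_word(sequences, word, once_per_sequence=False):
    """Count matches of `word` in a collection of DNA strings.

    Assumptions (not taken from CisFinder itself): only the given strand is
    scanned, and overlapping matches count once (str.count is non-overlapping).
    """
    word = word.upper()
    total = 0
    for seq in sequences:
        n = seq.upper().count(word)
        # Optionally cap the contribution of each sequence at one match.
        total += min(n, 1) if once_per_sequence else n
    return total


# Toy data for illustration only.
test_seqs = ["ATGCAAATGGATGCAAAT", "CCATGCAAATCC", "GGGGGGGG"]
control_seqs = ["ATGCAAATCC", "TTTTTTTT"]

t_count = count_word(test_seqs, "ATGCAAAT")
c_count = count_word(control_seqs, "ATGCAAAT")
t_once = count_word(test_seqs, "ATGCAAAT", once_per_sequence=True)
print(t_count, c_count, t_once)
```

Counting totals treats the set as if it were one concatenated sequence, whereas the per-sequence mode limits the influence of sequences that contain many copies of the same word.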
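For a word W, we define a nucleotide substitution matrix [Wpi], which contains words that are derived from W by placing a nucleotide i in a position p (Fig. 1). The PFM is then estimated from the difference between the counts of these words in the test and control sequences:

$$\hat{f}_{pi} = \frac{T_{pi} - C_{pi}}{\sum_{j}\left(T_{pj} - C_{pj}\right)} \qquad \text{(e1)}$$

where $\hat{f}_{pi}$ is the estimate of the PFM element, and Tpi and Cpi are the counts of word Wpi in the test and control sequences, respectively. Because word counts are random variables, they may appear smaller in the test sequences than in the control sequences by chance, resulting in a negative PFM element. To avoid this, negative differences are replaced with zero and then normalized as shown in Fig. 1. The final estimate of the PFM element ($\hat{f}_{pi}$) can now be presented as follows:

$$\hat{f}_{pi} = \frac{\max\left(0,\; T_{pi} - C_{pi}\right)}{\sum_{j}\max\left(0,\; T_{pj} - C_{pj}\right)} \qquad \text{(e2)}$$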
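The clip-and-normalize estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CisFinder implementation: the function name `estimate_pfm`, the `control_scale` parameter (used to rescale control counts when the two sets differ in total length), and the uniform fallback for positions where no nucleotide is enriched are all hypothetical choices.

```python
def estimate_pfm(test_counts, control_counts, control_scale=1.0):
    """Estimate a PFM from substitution-matrix word counts.

    test_counts / control_counts: one row per motif position, each row holding
    the counts of the substituted word for nucleotides A, C, G, T.
    control_scale: test length / control length (1.0 when lengths are equal).
    """
    pfm = []
    for t_row, c_row in zip(test_counts, control_counts):
        # Negative differences are replaced with zero before normalizing.
        diffs = [max(0.0, t - control_scale * c) for t, c in zip(t_row, c_row)]
        total = sum(diffs)
        if total == 0:
            # Assumption: treat a position with no enrichment as uninformative.
            pfm.append([0.25, 0.25, 0.25, 0.25])
        else:
            pfm.append([d / total for d in diffs])
    return pfm


# Toy counts for two positions (A, C, G, T); values are illustrative only.
T = [[200, 30, 20, 10], [120, 60, 20, 40]]
C = [[50, 40, 20, 10], [40, 20, 20, 50]]
pfm = estimate_pfm(T, C)
print(pfm)
```

In the first toy position only A is enriched, so the estimated column is concentrated on A; in the second, A and C share the weight in proportion to their count differences (80 and 40).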
If test and control sequence sets have different total lengths, then the number of word counts in the control sequences is adjusted by the total sequence length. This method is justified by the following model. Let us assume that a TF binds to a set of locations in the genome where corresponding DNA sequences can be aligned together. Using this alignment, we can estimate the frequency, fpi, of each nucleotide i in each position of aligned sequences, p, which is the element of the PFM. We further assume that binding strength of the TF is additive with no interaction between positions. This simplification is justified by the fact that all existing databases use PFMs to describe TF motifs, and this strategy works reasonably well. Consider a word W with a sequence of nucleotides that corresponds to the maximum values of the PFM at each position. This word is then used to generate frequency substitution matrices [Tpi] and [Cpi] for the test and control sets of sequences, respectively. Each instance of word Wpi in the test or control sequences can either correspond to a true binding site of the TF (we call it functional) or not (non-functional). Factors determining the functionality of different instances of the same DNA word are largely unknown and may include sequence context and chromatin status. Because the probability of TF binding is proportional to PFM elements at each position (based on the assumption of additive contribution of each position to TF binding), the number of functional instances, FT(Wpi), of word Wpi in the test sequences is proportional to fpi:
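$$F_T(W_{pi}) \propto f_{pi}$$

The total number of instances of word Wpi in the test sequences, Tpi, equals the sum of functional, FT(Wpi), and non-functional, NT(Wpi), instances. Similarly, the total number of instances of word Wpi in control sequences, Cpi, equals the sum of functional, FC(Wpi), and non-functional, NC(Wpi), instances. Although the functional instances are enriched in the test sequences compared with control sequences, some functional instances may be present in the control sequences, because of possible false negatives in ChIP data. Because we can assume that non-functional instances of the word are not affected by the ChIP procedure, their counts are equal in the test and control sets of sequences: NT(Wpi) = NC(Wpi). Then, the numerator in Equation (e1) is

$$T_{pi} - C_{pi} = \left[F_T(W_{pi}) + N_T(W_{pi})\right] - \left[F_C(W_{pi}) + N_C(W_{pi})\right] = F_T(W_{pi}) - F_C(W_{pi})$$

so the non-functional counts cancel and only the difference between functional instances remains.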
Because the PFM is estimated as a difference between word counts in the test and control sets of sequences [Equations (e1) and (e2)], the variance of PFM elements is equal to the sum of variances of word counts in the test and control sequences. The variance of word counts is very close to the mean, which is expected from the Poisson distribution. This was also checked using pseudo-random sequences generated with a third-order Markov process. For example, if word counts are 120 in the test set of sequences and 40 in the control set (i.e. 3-fold over-representation), then the relative error (accuracy) is equal to sqrt(120 + 40)/(120 − 40) = 0.158.

2.2. Implementation of the method for PFM estimation

A successful ChIP-seq experiment generates a set of genome locations that are enriched in TF binding sites. For a test sequence set, we usually extract 200 bp sequence segments centered at a peak of projected TF binding sites. For a control sequence set, we usually extract 500 bp sequence segments starting from nucleotide positions 400 bp away from both ends of 200 bp test sequence segments. (However, the CisFinder allows users to choose different sequence lengths.) The CisFinder identifies binding motifs of TFs using direct counts of all possible 8-mer words with and without gaps in both test and control sequence sets (Fig. 1).

Over-representation of word counts in the test sequences compared with the control sequences is then evaluated using a z-score, which is estimated based on the hypergeometric probability distribution. Let us first consider the case where only one instance of each word is counted per sequence and all sequences are of equal (or approximately equal) length.
The proportion of sequences, q, with a given word in the set of test sequences is compared with the proportion of sequences, p, with the same word in the combined set of test and control sequences (if the null hypothesis is true, then test and control sets of sequences can be combined) with z-score:

$$z = \frac{q - p}{\sqrt{\dfrac{p(1 - p)}{n} \cdot \dfrac{N - n}{N - 1}}}$$

where n is the number of test sequences, and N is the number of combined test and control sequences.

If multiple instances of each word are counted per sequence, then the method is modified as follows. The set of test sequences with the total length T is split into T/m segments of length m, where m is the actual length of the word including gaps. Each instance of the word is then associated with the segment where it starts. Because overlapping instances of the same word are counted as one instance, there is not more than one instance associated with the same segment. Similarly, the set of control sequences with the total length C is split into C/m segments of length m. We use the same equation for the z-score (see above) where q is the proportion of test segments with the word, p is the proportion of test and control segments with the word, n = T/m, and N = (T + C)/m. Although occurrences of word instances in adjacent segments may be weakly correlated, the hypergeometric distribution gives a reasonable approximation of the z-score.

To fill the gaps and extend the length of PFMs, the test and control sequences are searched again for the exact match to each word with z > 1.643 (to satisfy the condition of P < 0.05 for one-tail z-test). Each match of the word in the test (or control) sequence is then examined for nucleotides in the gaps and flanking sequences (2 bp at each side) that are not included in the word. In this way, we can count nucleotide frequencies in gaps and flanking regions and estimate the PFM for these positions using Equation (e2) (Fig. 1).

The frequency distribution of nucleotides in the flanking regions and in gaps of a certain word may differ substantially between the test and control sets of sequences. In such a case, this difference can be used to increase the statistical power for identification of significant motifs. To incorporate this factor into the statistical evaluation of motif significance, we compared frequency distributions of nucleotides (counted for each nucleotide and each flanking/gap position) in the test and control sequences using the G-test.21 Assuming that this test is independent from the z-test for over-representation of word counts (see above), we combined P-values from these tests using Fisher's method.22 Finally, we used the false discovery rate (FDR) to account for simultaneous testing of multiple hypotheses.23 We designed the program to generate at least 100 top-scored motifs and additional motifs, if they satisfy the criterion of FDR < 0.05.

PFMs are then clustered based on similarity and/or co-occurrence (Fig. 1). We use single-linkage clustering, and then each cluster is checked for homogeneity. If the cluster is not homogeneous, it is separated into subclusters using the second round of clustering. Subclustering is done iteratively starting from a pair of seed motifs, adding sequentially most similar motifs, and re-estimating the combined PFM for the subcluster. Each pair of motifs is characterized by the score = r m1 m2, where r is the correlation between PWMs, and m1 and m2 are the numbers of linked members for motifs 1 and 2, respectively. Then the pair with the highest score is selected as a seed for the subcluster. This procedure is different from the single-linkage clustering because motifs are added to the subcluster based on the similarity to the combined PFM of all motifs that are already included into the subcluster, whereas the single-linkage clustering is based on the similarity between individual (non-combined) motifs.
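The statistical scoring steps described earlier in this section (the z-score for word over-representation, Fisher's method for combining two P-values, and the FDR adjustment) can be sketched in Python as follows. All function names are illustrative. The z-score uses the standard hypergeometric variance of a sample proportion, which is an assumption about the exact form used by CisFinder, and the FDR step uses the Benjamini-Hochberg procedure. The G-test is not implemented; its P-value would be supplied as an input to `fisher_combine`.

```python
import math


def z_score(q, p, n, N):
    """z-score of proportion q in n test units vs. p in N combined units."""
    # Hypergeometric variance of a sample proportion (finite-population form).
    var = p * (1.0 - p) * (N - n) / (n * (N - 1.0))
    return (q - p) / math.sqrt(var)


def one_tailed_p(z):
    """Upper-tail P-value of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def fisher_combine(p_values):
    """Fisher's method: -2*sum(ln p) follows chi-square with 2k d.f."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X/2
    # Closed-form chi-square survival function for even degrees of freedom.
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def benjamini_hochberg(p_values):
    """Return FDR-adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers: a word found in 10% of 1000 test sequences and 5% of all 4000.
z = z_score(0.10, 0.05, 1000, 4000)
p_word = one_tailed_p(z)
p_combined = fisher_combine([p_word, 0.2])  # 0.2 stands in for a G-test P-value
print(z, p_word, p_combined)
```

Fisher's method is only valid here under the independence assumption stated in the text; if the two tests were correlated, the combined P-value would be too optimistic.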
Motifs are added until no motif within the cluster can be added to the subcluster using the given threshold of similarity. If all elements in the cluster appear to be in the same subcluster, then the cluster is considered homogeneous. Otherwise, the elements of the subcluster are removed from the cluster, and the same algorithm is applied to the remaining elements. The advantage of clustering PFMs compared with clustering words (as in RSAT26) is that PFMs contain more information than words alone. Words differ qualitatively (the nucleotide either matches or mismatches), whereas PFMs differ quantitatively (i.e. the probability of each nucleotide correlates between two PFMs). Motifs within the same cluster are then arranged using the hierarchical clustering with cluster flips to place similar motifs near each other.27 Then, the PFM for the entire cluster is estimated as the weighted average of member PFMs using local information content at each position p
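One way to realize an information-weighted average of member PFMs is sketched below in Python. The exact weighting used by CisFinder is not recoverable from the text above, so this sketch makes two assumptions: the weight of each member at position p is its information content in bits (2 + Σ f log2 f for a four-letter alphabet, as used in sequence logos), and the member PFMs have already been aligned to the same length. The names `information_content` and `combine_pfms` are hypothetical.

```python
import math


def information_content(column):
    """Information content (bits) of one PFM column over A, C, G, T."""
    return 2.0 + sum(f * math.log2(f) for f in column if f > 0)


def combine_pfms(pfms):
    """Weighted average of aligned PFMs, weighting by local information.

    pfms: list of PFMs; each PFM is a list of [A, C, G, T] frequency columns,
    and all PFMs are assumed to be aligned and of equal length.
    """
    combined = []
    for p in range(len(pfms[0])):
        weights = [information_content(m[p]) for m in pfms]
        wsum = sum(weights)
        if wsum == 0:
            # Assumption: no member is informative here, so keep it uniform.
            combined.append([0.25, 0.25, 0.25, 0.25])
            continue
        combined.append([
            sum(w * m[p][i] for w, m in zip(weights, pfms)) / wsum
            for i in range(4)
        ])
    return combined


# Toy members: the first is sharp at position 0 and uniform at position 1.
pfm1 = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
pfm2 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
cluster_pfm = combine_pfms([pfm1, pfm2])
print(cluster_pfm)
```

With this weighting, a member that is uninformative at a position (for example, a flank that extends beyond the motif) contributes little or nothing to the cluster PFM at that position.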
As an alternative criterion for clustering, the CisFinder also uses co-occurrence of word instances in the test sequences. This method is generally less accurate than the similarity-based method because of the limited number of word pairs in the sequence set. We, therefore, designed the program to use the correlation-based clustering method as a default. However, the co-occurrence method may help to cluster PFMs with a high level of self-similarity (after shifting a position by 1–4 bp), because their relative positions cannot be uniquely identified based on the correlation. Further details of the algorithm implementation are available in Supplementary Text S1 and online (http://lgsun.grc.nia.nih.gov/CisFinder).29 2.3. Implementation of additional tool to search for motifs that match to PFMs Once PFMs are estimated by the CisFinder or given in the literature, it is often necessary to find DNA sites that match a specific PFM in a given sequence (e.g. in promoters of certain genes). This task is computationally intensive if a matching score is estimated sequentially at each position of the sequence as in MatInspector or MATCH.30,31 We implemented as additional tool in the CisFinder website, a faster method to identify DNA sites, which is based on a lookup table. For each motif represented by a PFM, we selected the most informative stretch of eight nucleotides, which is used as a core. Then, a lookup table is generated that specifies all PFMs from the list whose cores match sufficiently well to each possible 8-mer word. The length of 8 bp is selected for the core because the number of all 8-mers is small enough to keep the lookup table in the computer memory, and 8-mer words are specific enough to be linked with only a few PFMs that match them. A match score is defined as log likelihood that a specific sequence matches a matrix and is equal to the sum of those elements of the PWM (log-transformed PFM) that corresponds to nucleotides at each position of the sequence. 
The match score for the full matrix, Tfull = T8 + Tresid, where T8 is the match score for 8-mer core and Tresid is the match score for the residual of the matrix. The program finds the threshold value R8 for the match score T8, which ensures that the match score for the full matrix exceeds the given threshold Rfull with probability 0.999 if T8 > R8:
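The decomposition of the match score into a core part and a residual part can be illustrated with a short Python sketch. It does not reproduce the threshold calculation for R8. The helper names, the uniform background of 0.25, the pseudocount of 0.01, and the toy PFM are all assumptions made for illustration; the core is chosen here as the 8-column window with the highest total information content.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, background=0.25, pseudo=0.01):
    """Log-transform a PFM; background and pseudocount are assumed values."""
    pwm = []
    for col in pfm:
        adj = [(f + pseudo) / (1.0 + 4 * pseudo) for f in col]
        pwm.append([math.log(f / background) for f in adj])
    return pwm


def match_score(pwm, site):
    """Sum of PWM elements for the nucleotide at each position of `site`."""
    return sum(col[BASES.index(b)] for col, b in zip(pwm, site))


def column_information(col):
    return 2.0 + sum(f * math.log2(f) for f in col if f > 0)


def best_core_start(pfm, core_len=8):
    """Start of the most informative window of `core_len` columns."""
    scores = [sum(column_information(c) for c in pfm[s:s + core_len])
              for s in range(len(pfm) - core_len + 1)]
    return scores.index(max(scores))


def split_score(pwm, site, start, core_len=8):
    """Return (T8, Tresid) for a site aligned to the full matrix."""
    end = start + core_len
    t8 = match_score(pwm[start:end], site[start:end])
    t_resid = (match_score(pwm[:start], site[:start])
               + match_score(pwm[end:], site[end:]))
    return t8, t_resid


# Toy 12-column PFM: weak flanks around a strong ATGCAAAT core.
strong = lambda b: [0.85 if x == b else 0.05 for x in BASES]
weak = lambda b: [0.4 if x == b else 0.2 for x in BASES]
toy_pfm = ([weak("G"), weak("G")] + [strong(b) for b in "ATGCAAAT"]
           + [weak("C"), weak("C")])
toy_pwm = pfm_to_pwm(toy_pfm)

core_start = best_core_start(toy_pfm)
site = "GGATGCAAATCC"
t_full = match_score(toy_pwm, site)
t8, t_resid = split_score(toy_pwm, site, core_start)
print(core_start, t_full, t8, t_resid)
```

Because the score is additive over positions, a lookup keyed on the 8-mer core can discard most candidate sites cheaply, and the residual columns need to be scored only for sites whose core score passes its threshold.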
2.4. Data sets used in this study CisFinder was tested using published ChIP-seq data on binding of 14 TFs [CTCF, ESRRB, KLF4, MYC, POU5F1 (also known as OCT4 or OCT3/4), SMAD1, SOX2, STAT3, TCFCP2L2, ZFX, P300, NMYC, NANOG, E2F1] in ES cells9 (Supplementary Table S1). We also used a deliberately selected low-quality subset of ChIP-PET data10 on binding of POU5F1 to test if CisFinder can process sequences with low enrichment of binding motifs. We used genome locations with 2 (N = 19 803) and 3 ditags (N = 3361) from POU5F1 ChIP-PET that did not include loci with additional NANOG ditags to avoid indirect binding effects (Supplementary Table S2). All binding regions were mapped to the latest mouse genome (mm9, NCBI/NIH) using the UCSC coordinate conversion tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver). 3. Results and discussion 3.1. CisFinder algorithm and its main features The proposed CisFinder algorithm, which is implemented as an online software tool,29 is described in detail in Section 2. In brief, CisFinder has the following features.
3.2. CisFinder algorithm accurately identifies PFMs of TF binding motifs

To test the performance of the new algorithm, we used ChIP-seq data for 12 TFs associated with the pluripotent state of ES cells.9 We extracted 200 bp sequence segments centered at TF binding locations identified with ChIP-seq and compared them with control sequences (i.e. 500 bp sequence segments starting from nucleotide positions 400 bp away from both ends of 200 bp test sequence segments). Clustering of PFMs generated highly consistent TF binding motifs that were independent of the correlation threshold used for clustering (Fig. 2).
Computation time for all steps of the CisFinder algorithm ranged from 5 to 120 s (Supplementary Table S1), with a median time of 38 s needed to process a 7.5 Mb sequence (sum of test and control sequences). Our estimate is that our software works >1000 times faster than both MEME15 and Weeder36 and >100 times faster than RSAT.26 MDscan20 works fast; however, it is designed to process a small number of sequences (from 20 to 400, see http://ai.stanford.edu/~xsliu/MDscan), and the online version of the software accepts only 200 sequences. Taken together, the data indicate that CisFinder works faster than existing tools without sacrificing sensitivity. It is, however, difficult to make a fair comparison between tools for sensitivities and calculation speeds because each tool was designed to process different types of data. CisFinder was developed to process ChIP-seq or ChIP-chip data, which typically include several thousands of sequences (i.e. a few megabases) and cannot be effectively processed by probabilistic methods (e.g. MEME and Weeder). On the other hand, the probabilistic methods work efficiently on relatively small numbers of sequences (e.g. a typical benchmark sequence set is <32 kb13), as they are designed to process a small data set by selecting only high-scoring sequences. However, the reduction in the data set often leads to the loss of useful information, as we describe below (Section 3.3).

3.3. CisFinder algorithm detects alternative binding motifs

Eukaryotic transcription regulation is extremely complex, and most TFs have multiple binding motifs, which correspond to direct binding of single TFs, tandems of identical TFs in various orientations and spacings, binding with various co-factors, and finally, indirect binding via protein–protein interactions with other TFs.
Analysis of 50 to 200 high-score binding sites (which is a typical data size for MEME or Weeder) is usually sufficient to extract the main motif, but it is often not sufficient to examine alternative motifs. For example, Chen et al.9 used Weeder36 and reported only a single motif for each TF. In contrast, using the same data set, CisFinder was able to find multiple motifs for each TF, e.g. POU5F1 (also known as OCT4 or OCT3/4), ESRRB, and CTCF (Fig. 2). For POU5F1, predicted alternative binding motifs included several palindromes (Fig. 2). For the estrogen-related receptor beta (ESRRB), CisFinder predicted 12 alternative binding motifs (Fig. 2). For CTCF (an insulator in the regulation of transcription42), several alternative binding motifs were detected. The main DNA motif enriched in ChIP–CTCF loci (Fig. 2)

3.4. CisFinder algorithm detects binding motifs of potential co-factors

Genome locations identified with ChIP for a specific TF often do not carry the primary or alternative binding motifs, but are enriched with binding motifs for other TFs (co-factors). The most likely interpretation of this phenomenon is that the TF used for the immunoprecipitation binds to DNA indirectly through binding to a co-factor that directly binds to DNA. Thus, the analysis of co-factor binding motifs may help to infer potential mechanisms of transcription regulation. To explore this issue, we first selected 22 motifs that were over-represented in ChIP loci for single or multiple TFs reported by Chen et al.9 and used the corresponding PFMs generated by CisFinder to search for these motifs in 200 bp DNA segments centered at ChIP loci (Fig. 3).
Next, we compared the abundance of these motifs in the 200 bp DNA segments centered at ChIP loci with the control sequences (i.e. 500 bp sequence segments starting from nucleotide positions 400 bp away from both ends of 200 bp test sequence segments). To obtain a homogeneous data set, we used only the ChIP loci that were located >500 bp away from the transcription start sites of genes (distal ChIP loci). Another reason to focus on the distal ChIP loci was that pluripotency-related TFs, such as POU5F1 and NANOG, are active mostly at distal locations rather than at proximal promoters.10 We then tabulated the motif abundance data and found that TFs and corresponding binding motifs formed three distinctive groups (Fig. 3). We also noticed that some DNA motifs were negatively associated with binding of some TFs, which may indicate the inhibition of DNA binding. For example, the major OCT4 palindrome motifs, OCT4-GCGC and OCT4-MORE, were strongly under-represented in many ChIP loci including binding sites of pluripotency-related TFs (NANOG, SOX2, STAT3, KLF4) (green color), except for POU5F1 that bound to these motifs (Fig. 3).

3.5. CisFinder algorithm can find motifs with a low level of enrichment

We tested whether the CisFinder algorithm was robust enough to identify motifs that were only slightly enriched in the set of DNA sequences. According to Loh et al.,10 ChIP loci with at least 4 ditags (ChIP-PET data) were reliable enough to infer binding of POU5F1 and NANOG. Thus, we used ChIP loci with 2 or 3 ditags for POU5F1 as examples of data with a low level of motif enrichment. To evaluate the over-representation of binding motifs, we searched for the OCT–SOX motif in 200 bp test DNA segments centered at ChIP loci and in control sequences (i.e. two 500 bp sequence segments starting from nucleotide positions 400 bp away from both ends of 200 bp test sequence segments).
To avoid a circular reference, we took the PFM for the OCT–SOX motif from an independent source, where the PFM was estimated on the basis of ChIP-PET loci with at least 4 ditags for POU5F1.10 The over-representation ratios of the OCT–SOX motif density were only 1.57 and 0.99 in ChIP-PET data sets with 3 and 2 ditags, respectively (Supplementary Fig. S3). They were substantially lower than the over-representation ratio (7.10) of the OCT–SOX motif in the ChIP-seq data, which confirms the low level of motif enrichment. The CisFinder algorithm was successful in finding the OCT–SOX composite motif ATTGTTATGCAAAT as the top-scored consensus sequence for the set of 3361 ChIP loci with 3 ditags. Similarly, in the set of 19 803 genome loci with 2 ditags for POU5F1, CisFinder identified a canonical POU-motif ATGCAAAT.50 However, this motif was not the top-scored one (rank = 11), which may be the result of a large proportion of false positives in the data set. The OCT–SOX composite motif was not found, which can be explained by the lack of enrichment of this motif (over-representation ratio = 0.99) (Supplementary Fig. S3). Thus, we hypothesized that the weak binding of POU5F1 does not require SOX2 as a co-factor. Top-scored motifs over-represented in ChIP loci with 2 ditags were also meaningful: they corresponded to NRF1 and KLF motifs, which were associated with POU5F1 binding as shown above (Fig. 3).

3.6. Other potential applications and limitations of CisFinder

Although CisFinder was designed specifically for the analysis of ChIP experiments on TF binding, it can be used for other purposes. For example, it can be used to find over-represented motifs in promoters of co-regulated genes, in introns of alternatively spliced genes, or in 3′-untranslated regions of genes with high or low rates of mRNA degradation.
The search for over-represented motifs can be improved by limiting the search to evolutionarily conserved regulatory regions because functional sequences have a tendency to be conserved during evolution.52 [However, recent findings indicate that many regulatory regions are located in transposable elements, which are usually not conserved.32] Because of its high processing speed, CisFinder can be used interactively by adjusting parameters of motif detection. Also, it can be utilized effectively as a component of systems for reconstructing gene regulatory networks. For example, Reiss et al.53 used de novo motif discovery in promoters of co-regulated genes, which were clustered using the data on gene expression in various conditions. Because the identification of motifs is repeated many times in this analysis, the use of CisFinder algorithm can increase the processing speed. The main limitation of CisFinder algorithm is that its performance decreases if the input sequence is too short. For example, if the length of sequence is 32 kb, then it contains only one 8-mer word on average (based on the random model). In this case, the CisFinder can detect only highly over-represented motifs (e.g. with >10-fold enrichment) and, thus, other software (e.g. MEME15) should be used instead. 3.7. Conclusion CisFinder implements an express method for de novo identification of over-represented DNA motifs and is specifically designed to process ChIP-chip and ChIP-seq data. It is a complementary method to existing motif-finding tools, which are highly efficient in processing short input sequences. 
Unique features of CisFinder are: (i) it extracts all over-represented motifs in a single run and describes them with PFMs; (ii) it can effectively process large sequences (up to 50 Mb); (iii) because of its high processing speed, it can be used in an interactive manner by running the analyses multiple times after re-adjusting parameters; and (iv) it can process data with a low-level enrichment of DNA motifs. Funding This research was supported entirely by the Intramural Research Program of the NIH, National Institute on Aging. [Supplementary Data]
Acknowledgements
We thank Dawood Dudekula for help with configuration of web server and program testing. We thank Dr Huck-Hui Ng at the Genome Institute of Singapore for kindly providing raw data of their published ChIP-PET study.

Footnotes
Edited by Kenta Nakai

References
1. Stoltenburg R., Reinemann C., Strehlitz B. SELEX–a (r)evolutionary method to generate high-affinity nucleic acid ligands. Biomol. Eng. 2007;24:381–403.
2. Badis G., Berger M.F., Philippakis A.A., et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–3.
3. Barski A., Cuddapah S., Cui K., et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–37.
4. Johnson D.S., Mortazavi A., Myers R.M., Wold B. Genome-wide mapping of in vivo protein–DNA interactions. Science. 2007;316:1497–502.
5. Robertson G., Hirst M., Bainbridge M., et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods. 2007;4:651–7.
6. Lieb J.D. Genome-wide mapping of protein–DNA interactions by chromatin immunoprecipitation and DNA microarray hybridization. Methods Mol. Biol. 2003;224:99–109.
7. Xie D., Cai J., Chia N.Y., Ng H.H., Zhong S. Cross-species de novo identification of cis-regulatory modules with GibbsModule: application to gene regulation in embryonic stem cells. Genome Res. 2008;18:1325–35.
8. Berger M.F., Badis G., Gehrke A.R., et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008;133:1266–76.
9. Chen X., Xu H., Yuan P., et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–17.
10. Loh Y.H., Wu Q., Chew J.L., et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat. Genet. 2006;38:431–40.
11. Bock C., Lengauer T. Computational epigenetics. Bioinformatics. 2008;24:1–10.
12. Das M.K., Dai H.K. A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007;8(Suppl 7):S21.
13. Sandve G.K., Abul O., Walseng V., Drablos F. Improved benchmarks for computational motif discovery. BMC Bioinformatics. 2007;8:193.
14. Tompa M., Li N., Bailey T.L., et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 2005;23:137–44.
15. Bailey T.L., Williams N., Misleh C., Li W.W. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006;34:W369–73.
16. Thompson W., Rouchka E.C., Lawrence C.E. Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res. 2003;31:3580–5.
17. Wei Z., Jensen S.T. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics. 2006;22:1577–84.
18. Li S.M., Wakefield J., Self S. A transdimensional Bayesian model for pattern recognition in DNA sequences. Biostatistics. 2008;9:668–85.
19. Zhou Q., Liu J.S. Extracting sequence features to predict protein–DNA interactions: a comparative study. Nucleic Acids Res. 2008;36:4137–48.
20. Liu X.S., Brutlag D.L., Liu J.S. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 2002;20:835–9.
21. Sokal R.R., Rohlf F.J. Biometry. The Principles and Practice of Statistics in Biological Research. New York: Freeman; 2001.
22. Hess A., Iyer H. Fisher's combined P-value for detecting differentially expressed genes using Affymetrix expression arrays. BMC Genomics. 2007;8:96.
23. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57:289–300.
24. Habib N., Kaplan T., Margalit H., Friedman N. A novel Bayesian DNA motif comparison method for clustering and retrieval. PLoS Comput. Biol. 2008;4:e1000010.
25. Gupta S., Stamatoyannopoulos J.A., Bailey T.L., Noble W.S. Quantifying similarity between motifs. Genome Biol. 2007;8:R24.
26. van Helden J., Andre B., Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 1998;281:827–42.
27. Eisen M.B., Spellman P.T., Brown P.O., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 1998;95:14863–8.
28. Schneider T.D., Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–100.
29. Sharov A.A., Ko M.S.H. 2008. CisFinder. http://lgsun.grc.nia.nih.gov/CisFinder
30. Kel A.E., Gossling E., Reuter I., Cheremushkin E., Kel-Margoulis O.V., Wingender E. MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003;31:3576–9.
31. Quandt K., Frech K., Karas H., Wingender E., Werner T. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995;23:4878–84.
32. Bourque G., Leong B., Vega V.B., et al. Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res. 2008;18:1752–62.
33. Bryne J.C., Valen E., Tang M.H., et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 2008;36:D102–6.
34. Sharov A.A., Dudekula D.B., Ko M.S. CisView: a browser and database of cis-regulatory modules predicted in the mouse genome. DNA Res. 2006;13:123–34.
35. Karolchik D., Baertsch R., Diekhans M., et al. The UCSC genome browser database. Nucleic Acids Res. 2003;31:51–4.
36. Pavesi G., Zambelli F., Pesole G. WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences. BMC Bioinformatics. 2007;8:46.
37. Tomilin A., Remenyi A., Lins K., et al. Synergism with the coactivator OBF-1 (OCA-B, BOB-1) is mediated by a specific POU dimer configuration. Cell. 2000;103:853–64.
38. Botquin V., Hess H., Fuhrmann G., et al. New POU dimer configuration mediates antagonistic control of an osteopontin preimplantation enhancer by Oct-4 and Sox-2. Genes Dev. 1998;12:2073–90.
39. Tantin D., Gemberling M., Callister C., Fairbrother W. High-throughput biochemical analysis of in vivo location data reveals novel distinct classes of POU5F1(Oct4)/DNA complexes. Genome Res. 2008;18:631–9.
40. Yuan H., Corbi N., Basilico C., Dailey L. Developmental-specific activity of the FGF-4 enhancer requires the synergistic action of Sox2 and Oct-3. Genes Dev. 1995;9:2635–45.
41. Akter M.H., Chano T., Okabe H., Yamaguchi T., Hirose F., Osumi T. Target specificities of estrogen receptor-related receptors: analysis of binding sequences and identification of Rb1-inducible coiled-coil 1 (Rb1cc1) as a target gene. J. Biochem. 2008;143:395–406.
42. Bell A.C., West A.G., Felsenfeld G. The protein CTCF is required for the enhancer blocking activity of vertebrate insulators. Cell. 1999;98:387–96.
43. Moon H., Filippova G., Loukinov D., et al. CTCF is conserved from Drosophila to humans and confers enhancer blocking of the Fab-8 insulator. EMBO Rep. 2005;6:165–70.
44. Szabo P.E., Tang S.H., Silva F.J., Tsark W.M., Mann J.R. Role of CTCF binding sites in the Igf2/H19 imprinting control region. Mol. Cell. Biol. 2004;24:4791–800.
45. Xie X., Mikkelsen T.S., Gnirke A., Lindblad-Toh K., Kellis M., Lander E.S. Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. Proc. Natl Acad. Sci. USA. 2007;104:7145–50.
46. Xie X., Lu J., Kulbokas E.J., et al. Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature. 2005;434:338–45.
47. Matys V., Fricke E., Geffers R., et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–8.
48. Kim J.D., Faulk C., Kim J. Retroposition and evolution of the DNA-binding motifs of YY1, YY2 and REX1. Nucleic Acids Res. 2007;35:3442–52.
49. Kim J. YY1's longer DNA-binding motifs. Genomics. 2009;93:152–8.
50. Scholer H.R., Balling R., Hatzopoulos A.K., Suzuki N., Gruss P. Octamer binding proteins confer transcriptional activity in early mouse embryogenesis. EMBO J. 1989;8:2551–7.
51. Bruce S.J., Gardiner B.B., Burke L.J., Gongora M.M., Grimmond S.M., Perkins A.C. Dynamic transcription programs during ES cell differentiation towards mesoderm in serum versus serum-freeBMP4 culture. BMC Genomics. 2007;8:365.
52. Zhang Z., Gerstein M. Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements. J. Biol. 2003;2:11.
53. Reiss D.J., Baliga N.S., Bonneau R. Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks. BMC Bioinformatics. 2006;7:280.