Genome-wide de novo prediction of cis-regulatory binding sites in prokaryotes

1Department of Bioinformatics and Genomics, Bioinformatics Research Center, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA and 2College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

*To whom correspondence should be addressed. Tel: +1 704 678 7996; Fax: +1 704 678 6610; Email: zcsu/at/uncc.edu

Received November 25, 2008; Revised March 16, 2009; Accepted April 2, 2009.

Copyright © 2009 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes. In this article, we describe a novel algorithm for genome-wide de novo prediction of CRBSs with high accuracy. We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure. When operon structures are correctly predicted, our algorithm can predict 81% of known individual binding sites belonging to 94% of known cis-regulatory motifs in the Escherichia coli K12 genome, while achieving high prediction specificity. 
Our algorithm has also achieved similar prediction accuracy in the Bacillus subtilis genome, suggesting that it is very robust, and thus can be applied to any other sequenced prokaryotic genome. When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity. INTRODUCTION While a biological function of a cell is the result of specific interactions of a set of gene products—proteins and RNAs expressed in the cell under certain physiological and environmental conditions, the controlling programs that specify when, where, how much, and how fast a specific set of proteins and RNAs should be expressed are mainly defined in the non-coding functional sequences, in particular, the cis-regulatory binding sites (CRBSs) through their interactions with specific transcription factors (TFs). In prokaryotes, several adjacent genes on the same strand of DNA often form an operon and are co-transcribed as a polycistronic mRNA. Genes in an operon generally share the same transcription initiation and termination control signals and machinery. In eubacteria, gene transcription initiation is controlled by the σ-factor of the RNA polymerase together with other specific TFs that respectively bind to the promoter and CRBSs located in the upstream region of an operon. Typically, a genome encodes far fewer TFs than the number of operons, therefore each TF usually regulates multiple operons (for the convenience of this discussion, we also call a singleton gene an operon). The collection of the operons that are regulated by a TF is called the regulon of the TF. As some operons are regulated by more than one TF, an operon can belong to different regulons. The set of similar CRBSs recognized by a TF is called its cis-regulatory motif, or binding site motif. 
Although great advances have been made in identifying the coding sequences in prokaryotic genomes using computational methods alone, it remains an unsolved task for both the experimental and computational biology communities to efficiently and accurately identify all the CRBSs in a genome. Therefore, no single organism has so far had most of its cis-regulatory systems characterized; and even for the most well-studied prokaryotic model organism E. coli K12, researchers have only characterized partial CRBSs for 125 of the ~314 estimated TFs in its genome through decades of research (1). As a result, except for a handful of strains, such as E. coli K12 (1) and B. subtilis (2), we know very little about the cis-regulatory systems in all sequenced prokaryotic genomes (3). The lack of a holistic understanding of the cis-regulatory systems in these organisms has hindered our understanding of many important biological processes such as development, differentiation, evolution, disease and specialized biological functions of many organisms. Hence, there is an urgent need in the biological research community for an efficient and accurate computational method for predicting all possible cis-regulatory systems in sequenced prokaryotic genomes. Prediction of CRBSs has been a consequence of the development of computational methods for modeling CRBSs over the past almost three decades (4,5). The early attempts to predict new CRBSs started by compiling known binding sites of interest, and then the sequence profile of these known CRBSs was used to search for additional ones in the genome of interest (6,7). With the advent of microarray gene expression profiling technologies and availability of increasing numbers of sequenced genomes, numerous motif-finding algorithms have been developed to identify overrepresented segments of sequences as potential CRBSs from a set of regulatory regions of a few co-regulated genes (8,9). Later, Gelfand et al. 
introduced the phylogenetic footprinting technique (10) to predict CRBSs of a TF whose regulon members are at least partially known, in a group of related genomes (11,12). This method and its variants have been widely adapted to predict the CRBSs and the regulon of a TF in related bacterial or archaeal species (13–22). However, these methods cannot be scaled up at genome scale for all possible CRBSs because the regulon information is largely unknown for most of sequenced genomes. Structure-based algorithms have also been developed to predict new CRBSs for a TF whose structure is known (23,24); nevertheless, these methods have only had limited application, since accurate structures of most TFs are not available yet. To our knowledge, the first genome-wide CRBS and regulon prediction was carried by van Nimwegen et al. (25). They used Monte Carlo sampling of the putative binding sites to partition thousands of short conserved DNA sequences into clusters, which were identified by phylogenetic footprinting methods and each cluster was predicted as a cis-regulatory motif. However, this approach only predicted ~100 motifs/regulons in E. coli K12 (25). Later, Qin et al. (26) used a Bayesian clustering algorithm to group similar putative binding sites predicted in E. coli K12 by phylogenetic footprinting in an earlier work (27), and predicted 192 motifs covering only 438 operons. More recently, Alkema et al. (28) proposed yet another phylogenetic footprinting-based algorithm using a rather simple algorithm to cluster putative CRBSs into clusters or motifs which were then used to scan the genome for additional ones. One of the major problems of all these algorithms is that they assume that the input motifs predicted by motif-finding algorithms are true binding sites. 
However, this assumption may not be valid, as recent studies have shown that, of the surveyed popular motif-finding programs including those used in these studies, the best predicted at most 40% known binding sites in the input intergenic sequences, with high false positive prediction rates (29,30). This would partially explain their generally low prediction accuracy. Although Pritsker et al. (31) have used multiple motifs predicted by a motif-finding tool from pooled orthologous intergenic sequences in fungi strains to predict CRBSs in a genome scale, their coverage was not high either, because a limited number of genomes were used, and a rather simple motif clustering method was employed. In another early study, Li et al. (32) attempted to identify clusters of overrepresented bipartite patterns in the intergenic sequences in E. coli K12 as possible cis-regulatory motifs, but this method is limited as not all binding sites are bipartite, and the power of comparative genomics was not explored. As a result, only one third of known CRBSs were predicted by this method (32). The PhyloNet algorithm is probably the most recent development for genome-wide de novo prediction of CRBSs in simple eukaryotic (33) and prokaryotic (34) genomes. PhyloNet finds binding site motifs through clustering multiple motifs identified by a motif finder in the orthologous intergenic sequences of closely related genomes. However, to speed up the motif comparison process, PhyloNet reduces the continuous motif profile space to a discrete one (33), which would sacrifice the sensitivity to detect highly degenerate CRBSs. Furthermore, sub-motifs of the same TF are not effectively clustered by PhyloNet to form a unique motif (33,34). In our opinion, the difficulty of genome-wide de novo prediction of CRBSs has three causes. 
First, CRBSs are short with a length of 6–30 base pairs (bp), and the sequences are degenerate (9), while residing in usually long non-coding sequences where the chance for the random occurrence of a sequence similar to a CRBS is high. Second, there is no general pattern in CRBSs; any segment of a sequence can be potentially a CRBS as long as there is a TF that can specifically bind it. Third, these sequence-based cis-regulatory motif prediction algorithms attempt to model the 3D protein–DNA interaction events with a sequence pattern finding problem, which cannot possibly capture all of the biophysical aspects of the protein–DNA interactions, thus a rather high false positive prediction rate is almost unavoidable. This might explain why all the surveyed motif-finding programs can only predict at most 40% known binding sites, although different algorithms may have complementary predictions (29,30). In this article, we have developed a novel algorithm called ‘GLECLUBS’ (GLobal Ensemble CLUsters of Binding Sites) for the genome-wide de novo prediction of CRBSs in prokaryotic genomes by circumventing these difficulties. We have applied it to the E. coli K12 and B. subtilis genomes, where a relatively large number of CRBSs are known for validation purposes. The algorithm has achieved rather high prediction accuracy and robustness, and it outperforms the prior algorithms compared. The software package is freely available upon request. MATERIALS AND METHODS Materials The protein sequences, genome sequences and their annotation files of a total of 139 γ-proteobacteria and 124 firmicutes were downloaded from the NCBI RefSeq database at (ftp://ftp.ncbi.nih.gov/genomes). The known CRBSs of E. coli K12 and B. subtilis were downloaded from RegulonDB v6.0 (35) and DBTBS release 5 (2), respectively. Known and predicted TFs were downloaded from the DBD database (36). A compendium of microarray dataset from E. 
coli K12 collected under 380 experiments using the Affymetrix® platform was downloaded from the M3D database (37).
The design of the algorithm
The GLECLUBS algorithm is based on a comparative genomics approach, and its flowchart is shown in Figure 1.
Predictions of orthologs and operons
Orthologous proteins and their genes between two genomes were predicted by the bidirectional best hits method (38) using the BLASTP algorithm with an E-value cutoff of 10⁻²⁰ for both searches. Operons in each genome were predicted by the algorithm described in (39), which performed best among the predictors evaluated by Brouwer et al. (40).
Selection of reference genomes
Since gene transcription regulatory networks tend to evolve very rapidly (41,42), we selected a genome as a reference genome not only based on its evolutionary relationship to the target genome, but also based on its shared gene transcription regulatory networks with the target genome. To this end, we represented the distribution of a total of n known and predicted TFs of the target genome (downloaded from the DBD database (36)) in a related genome Gj by a bit vector Gj(b1, b2, …, bi, …, bn), where bi = 1 if the i-th TF of the target genome has an ortholog in Gj, and bi = 0 otherwise. The Hamming distance between each pair of these TF distribution vectors was used to construct a neighbor-joining tree. We typically selected as reference genomes a monophyletic group in this tree that included the target genome, with too closely related genomes removed, and required each selected genome to have orthologs of at least 25% of the TFs in the target genome. We selected this cutoff because it generally includes moderately related genomes according to other phylogenetic information (e.g. 16S rRNA gene trees, data not shown), but excludes closely related parasitic genomes that are known to have simpler gene transcriptional regulatory networks due to their extensive genome reduction. Using this criterion, we selected 55 from 139 sequenced γ-proteobacteria, and 17 and 33 from 124 sequenced firmicutes as the reference genomes for the CRBS predictions in E. coli K12 (Figure S1), B. subtilis (Figure S2) and S. 
oneidensis (Figure S3), respectively.
Prediction of input motifs
Let Gt be our target genome and GR be the set of m reference genomes G1, G2, …, Gm. Each gene g in Gt and its orthologs in all of the genomes in GR form an orthologous group Og. For each Og, we extract up to 800 bases of the upstream inter-operonic region of each gene in Og to form a sequence set Jg according to the predicted operon structures in Gt and GR, and we say that Jg is associated with gene g. The union of all sets Jg associated with the genes g in an operon o of the target genome forms a larger inter-operonic sequence set Io if it contains at least five sequences, and we say that Io is associated with the operon o. That is, $I_o = \bigcup_{g \in o} J_g$. We apply k motif-finding tools to each Io, and the i-th tool returns its top mi predicted motifs. All the tools return motifs of the same length at this point. Thus, there are $\sum_{i=1}^{k} m_i$ motifs identified from each Io. If there are n operons in the target genome, we will end up with a total of $n \sum_{i=1}^{k} m_i$ input motifs. To distinguish the true motifs from the spurious ones among these input motifs, we assume that a true motif is likely to have multiple similar copies among the input motifs, whereas a spurious one is not; we therefore need to compute the similarity between each pair of input motifs. Although several metrics have been developed previously to quantify the similarity of sequence motifs (43–46), none of them gave satisfactory results for our purpose (see ‘Results’ section); we therefore define the following metric.
Computing the similarity between two sequence motifs
Let M be a sequence motif containing n sequences of length L, and let $F_M = (f_M(b, i))_{4 \times L}$ be the base frequency matrix of the motif M. The profile matrix of M is defined as,
i) is the probability of base b appearing at position i of the motif M, and q(b) is the probability of base b appearing in the background sequences. A pseudo-count is added when computing these probabilities.The information content of column i of the profile matrix PM is defined as,
A1, 2, is defined as the alignment of the columns of F2 and P1 that maximize the number of columns {i} satisfying Σb (f2 (b, s(i))Prf1 (b, i) ≥ 0 (Figure S4). We define the likelihood score for P1 to generate F2 as
We computed the similarity between any two motifs from different inter-operonic sequence sets. For the motifs from the same inter-operonic sequence set, we only calculated the similarity between the pair of motifs whose sequences from the target genome have a large overlap (≥50%). To compute the similarity scores between sub-motifs of a known motif that has n known binding sites (we only consider the motifs that have at least three binding sites), we randomly selected (n − k + 1) sub-sets (sub-motifs) of size k with replacement from the n binding sites, k = 1, … , n. Therefore, there are n(n + 1)/2 sub-motifs for each known motif. Pair-wise similarity scores among these sub-motifs were then computed for each known motif. Prediction of cis-regulatory motifs We predicted all possible CRBSs in the target genome through the following algorithm. Step 1. Construct the motif similarity graph Given the computed similarity scores between pairs of input motifs, we constructed the motif similarity graph using the input motifs as the nodes. We connected any two nodes if the similarity score between their corresponding motifs was greater than a preset cutoff β, and assigned the similarity score as the weight of the edge. Step 2. Cut the motif similarity graph into smaller subgraphs The above constructed motif similarity graph was usually very large. To efficiently cut this graph into smaller condensed subgraphs, we applied the Markov clustering (MCL) algorithm (47) to the graph. MCL iteratively computes random walks determined by a Markov chain through alternately executing two operators (expansion and inflation) on a stochastic matrix. We kept the resulting clusters that contained at least three input motifs for further analysis, and discarded the rest. Step 3. Find cliques from each of the resulting subgraphs obtained by MCL For each node in a subgraph obtained by MCL, we found a clique associated with it. 
This can be done by repeatedly deleting its neighbor node with the minimum degree until a clique is formed (Figure S5). If at least two nodes have the same minimum degree, we break the tie by deleting the node with the minimum sum of weights of its incident edges. Although finding all the cliques with the maximum number of nodes in a large graph is computationally intractable because the Maximum Clique Problem is NP-hard (48), this greedy algorithm searches for exactly one clique associated with each node, and thus is rather fast (for a node v with degree dv, its time complexity is O(dv²); and since the graph is sparse, dv is usually small). We discarded the nodes/motifs that were not included in a clique. Note that a node could appear in multiple different cliques identified by this algorithm.
Step 4. Construct quasi-cliques by merging cliques
We noted that cliques were too strict for clustering the binding sites of the same motif, as many known binding sites of the same motif were separated into different cliques due to their low similarity; therefore we needed to combine them. To this end, we first deleted the redundant cliques, and computed the overlapping rate of two cliques Ca and Cb, defined as
> δ and Rab > ε, where δ and ε are two preset cutoff values, and δ > ε, then we merged Ca and Cb into a so-called quasi-clique Qab (δ = 0.9 and ε = 0.7 in our current applications). Notably, a node could appear in different quasi-cliques due to its appearance in multiple cliques.
Step 5. Construct target genome specific non-overlapping sequence sets
For each quasi-clique, we extracted the sequences from the target genome, and merged the overlapping sequences to form a target genome-specific sequence set.
Step 6. Predict renewed motifs
We applied a motif-finding tool (MEME) to each of the constructed target genome specific sequence sets, and kept the best motif, which we call a renewed motif, and discarded the remaining sequences in the set.
Step 7. Cluster the renewed motifs
We computed the similarity scores between pairs of renewed motifs and repeated Steps 1 and 2 to group these motifs into new clusters.
Step 8. Merge and extend sequences in each cluster
We first merged the sequences in each new cluster into a new non-overlapping sequence set. To compensate for the fixed motif length used in our motif-finding processes so far, we then extended each sequence on both ends by a fixed length (10 bases) by padding it with its flanking genome sequences.
Step 9. Repeat Steps 6 and 7
For each extended non-overlapping sequence set, we used the motif-finding tool MEME to find the best motif with the motif length being automatically determined in the range 6–22 bp, and then grouped these motifs into clusters by repeating Steps 6 and 7.
Step 10. Refine clusters
We applied MEME again to each cluster obtained in Step 9 with the motif length being automatically determined in the range 6–22 bp. The sequences recovered by the top 10 motifs found by MEME in each cluster formed our final predicted motifs in that cluster, since we noted that MEME and other motif-finding tools tended to find different parts of the same binding site motif in their different top-ranked predictions.
Step 11. 
Rank the predicted motifs/clusters
The resulting clusters from Step 10 varied in the quality of the putative motif that each contained, and thus in the likelihood that they corresponded to a true cis-regulatory binding motif. In addition, because the same sequence could appear in different clusters, we needed to determine the most probable cluster/motif to which it should belong. To this end, we ranked the clusters according to the similarity of the sequences in a cluster. For this purpose, we computed a cluster quality score for each cluster defined as
RESULTS
The accuracy of operon predictions is a constraint on phylogenetic footprinting based cis-regulatory motif predictions
To ensure the robustness of our algorithm and to make it applicable to other less well-studied genomes, which usually lack ample experimental data, we did not use the experimentally determined operon structures in E. coli K12. Instead, we predicted a total of 2396 operons including 1556 singleton and 840 multi-gene operons in the E. coli K12 genome, which cover 84.6% of the known operon structures (39). Based on these operon predictions as well as those in the 55 reference genomes, we constructed 2313 inter-operonic sequence sets {Io} associated with the same number of operons in E. coli K12, each containing at least five sequences (see ‘Materials and Methods’ section). To evaluate the effect of the accuracy of operon predictions on the extraction of inter-operonic sequences, and thus on the CRBS prediction, we used all of the 1642 known CRBSs (the 30 known binding sites of the RNA genes were excluded from the analysis) in RegulonDB (v 6.0) to scan the predicted inter-operonic sequences in the E. coli K12 genome. We found that 1411 (86%) known CRBSs could be mapped to the predicted inter-operonic sequences (Table 1), suggesting that under the current state-of-the-art operon prediction accuracy, about 14% of possible true binding sites will be missed simply because of incorrect operon predictions. This estimate is in agreement with the operon prediction accuracy, as well as with the finding of a recent survey study that current operon prediction algorithms can only predict about 80% of known operon structures in E. coli K12 (40). Therefore, the accuracy of operon predictions is a limiting factor for identifying all possible CRBSs in a prokaryotic genome using phylogenetic footprinting techniques.
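The mapping of known sites onto the extracted inter-operonic sequences can be illustrated with a small interval check. The sketch below is not the authors' implementation: it assumes that both extracted regions and known sites are given as inclusive (start, end) genome coordinates, that strand can be ignored, and that a site counts as mapped only if it lies entirely within one region or within a run of overlapping or adjacent regions. The function name `count_mapped_sites` is hypothetical.

```python
from bisect import bisect_right


def count_mapped_sites(regions, sites):
    """Count sites that lie entirely inside the extracted regions.

    regions, sites: lists of inclusive (start, end) coordinates.
    Assumption: strand is ignored and overlapping or adjacent
    regions are merged before the containment test.
    """
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    starts = [region[0] for region in merged]
    mapped = 0
    for site_start, site_end in sites:
        # Index of the last merged region starting at or before the site.
        i = bisect_right(starts, site_start) - 1
        if i >= 0 and site_end <= merged[i][1]:
            mapped += 1
    return mapped


# Toy coordinates, invented for illustration only.
toy_regions = [(100, 900), (850, 1200), (5000, 5800)]
toy_sites = [(120, 135), (890, 905), (1195, 1210), (5790, 5800), (3000, 3015)]
toy_mapped = count_mapped_sites(toy_regions, toy_sites)

# Coverage reported in the text: 1411 of 1642 known CRBSs.
reported_percent = round(100 * 1411 / 1642)
```

Merging the regions first keeps each lookup to a single binary search, which matters little for a toy example but would for the thousands of regions in a genome.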
Optimization of the combination and outputs of motif-finding tools
Based on recent survey studies of the performance of the available motif-finding tools (29,30,50), on our preliminary experiments with more than a dozen of these tools for their complementarity and efficiency, and on the type of algorithm that they are based upon, we selected six well-regarded ones for further evaluation of their performance in recovering the 1411 known binding sites in the extracted 2313 inter-operonic sequence sets {Io} of E. coli K12: CUBIC (51), BioProspector (49), MotifSampler (52), MEME (53), CONSENSUS (54) and MDscan (55). Although different motifs may have different lengths, most of these tools require specifying the length of motifs to be predicted. To find the optimal motif length for these programs, we tested different lengths from 8 to 22 bp with all of them, and found that the motif length of 16 bp performed best in recovering the known CRBSs/motifs in our extracted inter-operonic sequence sets {Io} (Figure S6). However, the other motif lengths in the range 14–22 bp performed almost equally well (Figure S6). Thus, the motif length parameter is rather robust in the range of 14–22 bp. We also noted that the distributions of the motif lengths of the known CRBSs in RegulonDB and DBTBS were rather similar (Figure S7), with 12–22 bp being the predominant lengths. We thus selected 16 bp as the fixed motif length for these motif-finding tools in all our applications.
Note that we did not use any of the more recently developed motif-finding tools that incorporate phylogenetic information of the input inter-operonic sequences, because these algorithms are mainly designed to predict CRBSs in eukaryotic genomes (56–60), and they require multiple sequence alignments or co-regulated genes as the inputs in addition to a phylogenetic tree of the input intergenic sequences. 
However, all these three pieces of information are not easily obtained for prokaryotes, because orthologous intergenic sequences from most related prokaryotic genomes cannot be reliably aligned, co-regulated genes are usually unknown for most sequenced prokaryotic genomes, and it is difficult to construct a phylogenetic tree that describes the evolution of all the inter-operonic sequences in prokaryotes due to massive horizontal gene transfer events during the course of their evolution. As shown in the second column of Table 2, of the 1411 known CRBSs that were correctly extracted in the inter-operonic sequences, only 168–355 (12–25%) could be identified by these six programs as their best predictions. However, these programs did show complementary prediction effect, as 731 (52%) of these 1411 CRBSs could be jointly predicted by these six programs as their best predictions, even though this coverage was still not high enough. However, when multiple top motifs found by each tool were considered, the coverage of the 1411 CRBSs increased remarkably (Table 2). For instance, if each tool returned its top 25 motifs, then 1389 (98.4%) of these 1411 CRBSs could be recovered. Clearly, the more predicted motifs each tool returns, the more these 1411 CRBSs can be recovered. Nevertheless, too many motifs returned by each tool would also tremendously increase the spurious predictions, thus complicating the sequential steps of the algorithm. Furthermore, the number of the recovered known CRBSs actually entered a saturation phase when each tool returned more than 15 motifs (Table 2). We also noted that although these tools were in general complementary to one another, they did not perform equally well (Table 2). 
Considering all of these factors and by comparing different combinations of the number of output motifs for each tool as shown in Table 3, we selected a total of 40 motifs from the outputs of five of the six tools for each inter-operonic sequence set Io, which included the top 15 of MEME, the top 10 of BioProspector, and the top five each of CUBIC, MDscan and MotifSampler. The results from the CONSENSUS program were not used since almost all of its predictions were covered by other programs (Table 3). Therefore, we had a total of 2313 (Io) × 40 = 92 520 input motifs for the E. coli K12 genome, which contained 1316 (93%) of the 1411 known CRBSs in the extracted inter-operonic sequences (Table 3). These 1316 identified known binding sites belong to 119 motifs (Table 1). Obviously, most of the 92 520 input motifs were spurious predictions; thus, the objective of our algorithm was to separate the true binding sites from the spurious ones. We used these 1316 known CRBSs identified by the five tools in the whole set of 92 520 input motifs containing a very large number of sequences (~10⁶) to evaluate the performance of our algorithm.
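The pooling of top-ranked outputs and the resulting counts can be sketched as follows. This is a minimal illustration, not the authors' code: the per-tool quotas are taken from the text, while the data layout (each tool's output as a best-first list of motifs, each motif a set of site identifiers) and the function names `pool_input_motifs` and `recovered_sites` are assumptions made for the example.

```python
# Per-tool quotas stated in the text (CONSENSUS was dropped).
QUOTAS = {
    "MEME": 15,
    "BioProspector": 10,
    "CUBIC": 5,
    "MDscan": 5,
    "MotifSampler": 5,
}


def pool_input_motifs(ranked_by_tool, quotas=QUOTAS):
    """Keep the top-ranked motifs of each tool for one sequence set Io.

    ranked_by_tool: dict mapping a tool name to its motifs, best first.
    Assumption: each motif is a set of site identifiers.
    """
    pooled = []
    for tool, quota in quotas.items():
        pooled.extend(ranked_by_tool.get(tool, [])[:quota])
    return pooled


def recovered_sites(pooled, known_sites):
    """Known sites that appear in at least one pooled motif."""
    predicted = set().union(*pooled)
    return set(known_sites) & predicted


motifs_per_set = sum(QUOTAS.values())            # motifs kept per Io
total_input_motifs = 2313 * motifs_per_set       # over all sequence sets
reported_recovery = round(100 * 1316 / 1411)     # percent of mapped CRBSs
```

Taking the union over tools is what makes complementary predictions count: a site missed by one tool is still recovered if any other tool reports it within its quota.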
Our motif similarity metric outperforms the existing metrics in separating relevant motifs from irrelevant ones
There are typically about 10⁵ putative motifs in the set of input motifs. In order to facilitate the separation of true motifs from spurious ones in the motif similarity graph, we need a motif similarity metric that not only accurately measures the similarity between pairs of input motifs, but also can be efficiently computed. Specifically, we sought a motif similarity metric that gives a high score for two relevant motifs, i.e. two sub-motifs of the motif of a TF, but a low score for two irrelevant motifs, i.e. two motifs for evolutionarily unrelated TFs or two spurious motifs. To this end, we designed a metric, and compared it with six existing metrics for their capability of differentiating between relevant motifs and irrelevant ones. These existing metrics include the Pearson correlation coefficient (PCC), average Kullback–Leibler (AKL, or relative entropy), average log-likelihood ratio (ALLR), 1 − P-value of Chi-square (pCS), sum of squared distances (SSD) [for a survey of these metrics, see (45)] and asymptotic covariance (AC) (46) (see Supplementary Method for the calculation of these metrics). As shown in Figure 3, with our metric the graph constructed with the cutoff β = 0.05 contained only 1.6% of all possible edges of the motif similarity graph. In contrast, with the PCC metric, 85% of the sub-motifs of the known motifs have a raw similarity score greater than 0.35, but the graph constructed with this cutoff β = 0.35 contained 6.5% of all possible edges. Therefore, the motif similarity graph constructed using our metric facilitates our goal of separating true motifs from spurious ones, as it is easier to identify the highly connected subgraphs as possible true binding site motifs in a sparsely connected graph than in a densely connected one. 
Notably, although Mahony and coworkers (45) found that PCC and SSD were more efficient than the others at detecting the similarities between familial binding motifs, they are clearly not suitable for our purpose of separating true motifs from spurious ones. This is because these two metrics, in addition to AKL, are biased toward the correlation between the columns of the two compared motifs, whereas most of our predicted true motifs in the input motifs were partial, and thus these metrics tend to score them low.
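The role of graph sparsity can be made concrete with a short sketch of the thresholded similarity graph and of the greedy clique search described in Step 3 of Materials and Methods. It is an illustration under stated assumptions, not the published implementation: pairwise scores are assumed to be precomputed and supplied as a dictionary, degrees during the greedy search are counted within the current candidate set (the text does not say which degree is meant), and any remaining ties are resolved by sorted node order. The names `build_similarity_graph`, `edge_fraction` and `greedy_clique` are hypothetical.

```python
def build_similarity_graph(scores, beta=0.05):
    """Connect two motifs when their similarity score exceeds beta.

    scores: dict {(u, v): similarity}, assumed precomputed.
    Returns an adjacency dict {u: {v: weight}}.
    """
    graph = {}
    for (u, v), score in scores.items():
        graph.setdefault(u, {})
        graph.setdefault(v, {})
        if u != v and score > beta:
            graph[u][v] = score
            graph[v][u] = score
    return graph


def edge_fraction(graph):
    """Fraction of all possible edges that are present in the graph."""
    n = len(graph)
    possible = n * (n - 1) // 2
    edges = sum(len(nbrs) for nbrs in graph.values()) // 2
    return edges / possible if possible else 0.0


def greedy_clique(graph, seed):
    """Find one clique containing seed by pruning its neighbors.

    Repeatedly removes the neighbor with the minimum degree, breaking
    ties by the minimum sum of incident edge weights. Assumption: both
    quantities are measured inside the current candidate set.
    """
    members = set(graph[seed]) | {seed}

    def inner(u):
        nbrs = [w for w in graph[u] if w in members]
        return (len(nbrs), sum(graph[u][w] for w in nbrs))

    while True:
        others = sorted(members - {seed})
        # A clique is reached when every member touches all the others.
        if all(inner(u)[0] == len(members) - 1 for u in others):
            return members
        members.remove(min(others, key=inner))


# Toy scores, invented for illustration only.
toy_scores = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7,
    ("a", "d"): 0.2, ("d", "e"): 0.3, ("b", "e"): 0.01,
}
toy_graph = build_similarity_graph(toy_scores, beta=0.05)
toy_clique = greedy_clique(toy_graph, "a")
```

Because only the neighbors of the seed are ever examined, the cost per node depends on its degree, which is why a sparse graph keeps the search fast.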
Our method for selecting reference genomes facilitates the separation of relevant motifs from irrelevant ones As shown in Figure 3 Selection of the motif similarity score cutoff for the construction of motif similarity graphs However, even using our motif similarity metric and these 55 selected reference genomes, the distribution of the similarity scores among the input motifs for the E. coli K12 genome still has a considerable overlap with that of the similarity scores of the sub-motifs of a known motif (Figure 3 < 0 due to the too high density of the resulting graph, which is defined as the number of nodes divided by the number of edges in the graph. This observation is not surprising, because as shown in Figure S8, the density of the similarity graph increases rapidly when β becomes less than 0. Nonetheless, our algorithm performed almost equally well with β in the range of [0, 0.1], though the recovered known CRBSs dropped sharply when β > 0.1 (Figure S8), suggesting that the parameter β is also very robust in the range of [0, 0.1]. Accordingly, we chose β = 0.05 in this study to construct the initial motif similarity graph, which is a rather low cutoff, since it includes 99.7% of input motifs with at least one neighbor in the similarity graph of E. coli K12 (Figure S8). However, as mentioned earlier, the graph constructed contains only 1.6% of all possible edges, thus, is rather sparse.Prediction of CRBSs in E. coli K12—sensitivity and specificity of the algorithm The final output of our algorithm is a list of ranked clusters of putative CRBSs. Each cluster presumably corresponds to a cis-regulatory motif recognized by a TF encoded in the target genome. Operons that are presumably regulated by the binding sites in each cluster are predicted to form the regulon of the TF. Ideally, the higher the rank of a cluster/motif/regulon, the higher confidence we have for the prediction. 
Furthermore, if the target genome encodes a total of T TFs, then the top T clusters/motifs of our prediction should largely cover the binding sites of these T TFs. In order to evaluate our algorithm according to these criteria, we first applied it to the E. coli K12 genome using the 55 reference genomes (Figure S1). We first computed the recovery of the 1316 known binding sites in the input motifs by our top-ranked clusters. As shown in Figure 4
To estimate the specificity of our predictions, we plotted the cumulative number of unique predicted binding sites (any two of which overlap by fewer than eight bases) as a function of the rank of our top 1000 clusters. As shown in Figure 4 520 putative binding sites from the E. coli K12 genome in the input motifs and its saturation indicate that most sequences in our input motifs have been filtered out by our algorithm, and that most of them are likely spurious predictions. Therefore, our algorithm could effectively separate the true binding sites from the spurious ones. Furthermore, the fact that these 6662 unique putative binding sites recovered 1065 of the 1316 known binding sites among the 92 520 putative binding sites from the E. coli K12 genome (P < 10⁻¹³, according to a hypergeometric distribution) strongly suggests that our algorithm has likely achieved high prediction specificity, although it is difficult to estimate this number accurately. The clusters that were ranked after 400 were generally small in size as shown in Figure 4
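The hypergeometric enrichment quoted above can be checked with a short calculation. The sketch below is only an illustration: it assumes the test treats the 92 520 putative sites as the population, the 1316 known sites as the "successes", and the 6662 unique predicted sites as the sample, of which 1065 are known. The source does not spell out this parameterization, and the function names are hypothetical. The tail probability is computed in log space because it is far too small for ordinary floating-point arithmetic.

```python
from math import exp, lgamma, log


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_hypergeom_pmf(k, population, successes, draws):
    """log P(X = k) for a hypergeometric random variable X."""
    return (log_comb(successes, k)
            + log_comb(population - successes, draws - k)
            - log_comb(population, draws))


def log10_hypergeom_tail(k_obs, population, successes, draws):
    """log10 P(X >= k_obs), summed with the log-sum-exp trick."""
    upper = min(successes, draws)
    logs = [log_hypergeom_pmf(k, population, successes, draws)
            for k in range(k_obs, upper + 1)]
    peak = max(logs)
    total = peak + log(sum(exp(value - peak) for value in logs))
    return total / log(10)


# Counts quoted in the text; their mapping onto the test is an assumption.
log10_p = log10_hypergeom_tail(k_obs=1065, population=92520,
                               successes=1316, draws=6662)
expected_hits = 6662 * 1316 / 92520  # hits expected by chance alone
print(f"expected by chance: {expected_hits:.1f} known sites")
print(f"log10(P) = {log10_p:.1f}")
```

Under these assumptions fewer than 100 known sites would be expected by chance, so recovering 1065 is consistent with the very small P-value reported.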
To evaluate the correspondence of the top-ranked clusters and the known binding site motifs, we first counted the number of known motifs that have their binding sites recovered by a top-ranked cluster. As shown in Figure 6 − 125 = 189 TFs, for which we still do not know the binding sites. Based on the performance of our algorithm on the known binding site motifs, we further argue that the majority of these 238 of the top 400 clusters that contained no known binding sites are likely to correspond to new true binding site motifs, which is supported by gene expression data shown later.
We then counted the number of the top-ranked clusters that contain the binding sites of a known motif. As shown in Figure 6
Lastly, we analyzed the distribution of the predicted binding sites of the top 400 clusters in the predicted inter-operonic regions. Figure 4 ~ 10 different TF binding sites (Figure 4F) except that the latter is left-shifted by one binding site relative to the former. However, these two distributions are likely to overlap more as more binding sites of more TFs are characterized. Furthermore, the putative binding sites of the top 400 clusters are distributed in the upstream regions of 2224 (96%) of the 2313 operons; thus we have predicted CRBSs for most of the predicted operons in the genome, which is the largest coverage of operons achieved so far. Validation of the CRBS and regulon predictions in E. coli K12 using a compendium of microarray gene expression datasets To further validate our predicted CRBSs and regulons in E. coli K12, we computed the PCC score between the expression vectors for each pair of genes in each of the top 400 clusters/regulons, as well as for each pair of genes in each of 400 randomly selected gene groups with the same numbers of genes as the corresponding top 400 clusters/regulons (see ‘Materials and Methods’ section), using a compendium of 380 microarray gene expression datasets in E. coli K12 (37). As shown in Figure 3 < 10⁻¹⁵, χ² test). Moreover, the distribution of the absolute values of PCC scores of the genes in the clusters that do not contain known binding sites is almost the same as that of genes in the clusters that contain known binding sites (Figure 3 Prediction of CRBSs in B. subtilis: robustness of the algorithm To further test the robustness of our algorithm, we applied it to the B. subtilis genome with exactly the same parameter settings as used for the E. coli K12 genome. In this case, we selected 17 firmicutes as the reference genomes that form a monophyletic clade in the tree of the 124 sequenced firmicutes (Figure S2). There are 568 known CRBSs in B. subtilis as documented in DBTBS (2), belonging to 99 motifs. 
We extracted a total of 2400 inter-operonic sequence sets according to the predicted operon structures in B. subtilis and the reference genomes. Of the 568 known binding sites in B. subtilis, 481 (85%) are located in the inter-operonic sequences according to our operon predictions, belonging to 98 motifs; and 450 (94%) of the 481 binding sites were correctly predicted by the motif-finding tools in the 96 000 (2400 × 40) input motifs, belonging to 98 motifs (Table 1). Interestingly, the similarity scores among these 96 000 input motifs and those of the sub-motifs of the known motifs in B. subtilis have similar distributions to those of the input motifs and sub-motifs of the known motifs in E. coli K12, respectively (Figure 3
Comparison of our algorithm with other state-of-the-art methods To further evaluate our algorithm, we have compared the performance of GLECLUBS with that of other prior state-of-the-art methods that have been applied to E. coli K12 or B. subtilis, where enough CRBSs are known for more objective evaluations. As shown in Table S1 in Supplementary Data, GLECLUBS clearly recovered more known motifs and covered more operons than any of these algorithms in both genomes. Furthermore, as we have mentioned earlier, PhyloNet (33), which was designed to predict CRBSs at a genome scale in simple eukaryotic (33) and prokaryotic (34) genomes, is probably the algorithm most comparable to GLECLUBS in terms of the scope of predictions that both can achieve. However, the output of PhyloNet is a set of redundant motifs that need to be further clustered in an ad hoc manner (33,34), and it has not been applied to either E. coli K12 or B. subtilis by its authors. Therefore, to compare the two algorithms, we applied GLECLUBS to S. oneidensis using 33 reference genomes selected from the 124 sequenced firmicutes (Figure S3). The similarity scores of the 91 000 input motifs from 2275 predicted operons in S. oneidensis have a similar distribution to those of the input motifs from E. coli K12 and B. subtilis (Figure 3 As shown in Figure 4 DISCUSSION Our algorithm has achieved high prediction accuracy and robustness Genome-wide experimental characterization of CRBSs in all the sequenced genomes remains an open problem due to the tedious and laborious work required by even the most high-throughput experimental methods such as the ChIP-chip technique (67). Furthermore, the ChIP-chip technique is also limited by the conditions that allow the TF to bind to its binding sites, as well as by its low resolution, as it can only locate the possible binding sites within a region hundreds to thousands of bp long. 
With the availability of increasing numbers of sequenced prokaryotic genomes, comparative genomics-based computational methods will become more and more powerful in deciphering the CRBSs in all the sequenced prokaryotic genomes. In this study, we have developed an algorithm called GLECLUBS for genome-wide de novo prediction of CRBSs in prokaryotic genomes based on the principles of comparative genomics. We have designed several novel features into our algorithm to address the three identified difficulties associated with the CRBS prediction problem as follows. First, since any sequence segment can potentially be a binding site, motif-finding tools generally work by identifying the overrepresented sequences in a set of input sequences as the possible binding sites. Therefore, the quality of the inter-operonic sequences greatly affects the performance of motif-finding tools. An ideal high-quality inter-operonic sequence set should contain as many sequences as possible that contain the binding sites of the orthologous TFs, and these binding sites should be conserved enough, yet their flanking sequences divergent enough, that the binding sites can be readily identified. In order to increase the quality of the input inter-operonic sequences for the phylogenetic footprinting procedure, we have designed a new method for selecting reference genomes. When compared with the reference genomes selected by the conventional method, those selected by our method are more likely to facilitate the separation of true CRBSs from spurious ones, as indicated by the left-shifted distribution curve of the similarity scores based on our method (Figure 3 o}, which also facilitates the separation of true binding sites and spurious ones as indicated by the left-shifted distribution curve of the motif similarity scores based on the sets {Io} compared to that based on {Jg} (Figure 3 When applied to the E. coli K12 and B. 
subtilis genomes with the same parameter settings, GLECLUBS rapidly recovers, in its top-ranked predictions, ~81% of the known CRBSs identified by the five motif-finding tools in both genomes. More importantly, the recovery rates of known binding sites as well as the number of unique putative CRBSs saturated around the top 400 and 300 motifs for E. coli K12 and B. subtilis, respectively. These saturation points are in excellent agreement with the numbers of TFs possibly encoded in the genomes. Further validation of our predictions in E. coli K12 using a compendium of microarray gene expression datasets indicates that we have achieved the same level of accuracy for the predicted new motifs as for those that contain known binding sites. Therefore, GLECLUBS was neither overtrained on the known binding sites, nor biased to the E. coli K12 genome. Taken together, our algorithm has achieved high sensitivity as well as high specificity in both genomes in identifying the true binding sites in the input motifs predicted by multiple motif-finding tools. It is also very robust, and can therefore be applied to any prokaryotic genome. One possible explanation for the robustness of GLECLUBS is that it contains only two parameters that need to be optimized, i.e. the motif length L as one of the inputs of the motif-finding tools and the motif similarity cutoff β for the construction of motif similarity graphs. Moreover, both the motif length and the similarity between two sub-motifs recognized by the same TF are mainly governed by the physical and chemical principles of protein-DNA interactions, and thus they are not likely to be species specific. In other words, the range of motif lengths and the level of similarity of binding sites of a motif should be very similar, at least in bacterial genomes. This conclusion is strongly supported by the similar distributions of the length of known CRBSs in E. coli K12 and B. 
subtilis (Figure S7), as well as the similar distributions of the similarity scores of the sub-motifs of the known motifs in E. coli K12 and B. subtilis (Figure 3 When compared with other state-of-the-art genome-wide CRBS prediction algorithms that have been applied to E. coli K12 and/or B. subtilis, GLECLUBS outperformed all of them in terms of the number of known motifs recovered and the number of operons covered in both genomes (Table S1). In addition, GLECLUBS outperformed the more recently developed PhyloNet algorithm (33) in terms of motif clustering efficiency, and prediction sensitivity and specificity when evaluated on the S. oneidensis genome. Furthermore, PhyloNet seems to require intergenic sequences from very closely related genomes (e.g., genomes from the same genus) for better performance (34); however, such a requirement cannot always be met. Therefore, our algorithm is more widely applicable, as its output requires no further processing, and it only needs moderately related reference genomes for accurate predictions. With the exponentially increasing number of sequenced prokaryotic genomes, it is highly likely that a sufficient number of such reference genomes can be identified for any sequenced prokaryotic genome. The bottlenecks of genome-wide CRBS predictions The performance of our algorithm depends on two pieces of information: (i) the operon structures in the target genome as well as in the reference genomes; and (ii) the input motifs found by multiple motif-finding tools. In the foreseeable future, operon structures in sequenced genomes will be mainly provided by computational predictions instead of experimental determination, due to the enormous amount of work that the latter would entail. Therefore, we did not use the known operon structures even in E. coli K12 and B. subtilis, to make sure that our algorithm is robust enough to be applied to other less well-studied genomes. 
However, we have found that the accuracy of operon prediction is a major limiting factor for the performance of our algorithm. Even with the most accurate operon prediction algorithm developed so far (39,40), only about 84.6% and 83.3% of the known operons in E. coli K12 and B. subtilis, respectively, can be correctly predicted (40). Based on such predictions, only 85–86% of the known CRBSs in both genomes fall within the extracted inter-operonic regions, and thus can be potentially identified by the motif-finding tools in the phylogenetic footprinting procedure (Table 1). The remaining 14–15% of the known CRBSs were missed by the procedure simply because they were not in the extracted inter-operonic regions. Further improvement of operon prediction algorithms or the development of other strategies will certainly increase the sensitivity of our algorithm. Another limiting factor for the performance of our algorithm is the prediction accuracy of the motif-finding tools used in our algorithm to predict the input motifs. Although our algorithm was designed to overcome the problem of the low prediction sensitivity and specificity of current motif-finding tools by using an optimized combination of multiple outputs of multiple tools, for the sake of computational efficiency, the number of input motifs cannot be too large. This requirement affects the performance of our algorithm to some extent. Therefore, future improvements in motif-finding tools and in their combination are likely to increase the sensitivity of our algorithm further, as our algorithm can readily incorporate any new motif-finding tool. Furthermore, our graph clustering algorithm also has room for further improvement, although it has achieved ~81% sensitivity and possibly high specificity in predicting true binding sites in a large number of input motifs in both the E. coli K12 and B. subtilis genomes. 
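The three limiting factors discussed above compound. As a worked illustration, the sketch below chains the B. subtilis counts reported earlier (568 known sites, 481 of them in the extracted inter-operonic regions, 450 of those found by the motif-finding tools) with the ~81% recovery rate of the clustering step. Treating the stages as a simple product is an assumption made here for illustration, and the variable names are hypothetical.

```python
# B. subtilis counts as reported in the text.
known_sites = 568            # known CRBSs documented in DBTBS
in_interoperonic = 481       # located in the extracted inter-operonic sequences
found_by_tools = 450         # present among the input motifs
clustering_recovery = 0.81   # approximate rate recovered by top-ranked clusters

# Per-stage retention rates.
operon_stage = in_interoperonic / known_sites
motif_stage = found_by_tools / in_interoperonic

# Overall sensitivity if the three stages are simply multiplied (assumption).
overall = operon_stage * motif_stage * clustering_recovery

print(f"operon prediction stage: {operon_stage:.1%}")
print(f"motif-finding stage:     {motif_stage:.1%}")
print(f"clustering stage:        {clustering_recovery:.1%}")
print(f"overall sensitivity:     {overall:.1%}")
```

The product comes to roughly 64% of all known binding sites, which shows how three individually high rates still leave about a third of the known sites unrecovered.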
Therefore, when all of these factors are considered, and the results are evaluated on all known binding sites, we can only predict ~64% of them in both E. coli K12 and B. subtilis (Table 1). In order to achieve higher prediction sensitivity and specificity and to identify all possible CRBSs encoded in a genome, all three of these bottlenecks in our prediction pipeline need to be addressed in the future. Lastly, the binding sites of some different TFs were grouped into the same cluster due to the large overlap or high similarity of these binding sites; on the other hand, some binding sites of the same TF were grouped into different clusters due to the dissimilarity of these distinct sub-motifs. These phenomena had also been noted in an earlier study (26), suggesting that gene regulation is a far more complicated problem than previously imagined, in that the same TF can bind to distinct motifs by adopting different configurations, and different TFs of the same family can bind to very similar binding sites. These observations might indicate the limitations of sequence-based genome-wide motif-finding algorithms, which assume that the same TF recognizes similar binding sites, and different TFs recognize distinct motifs. Solving these problems might require additional information such as the 3-dimensional structures of TFs encoded in the genomes. Biological insights of our predictions into the cis-regulatory systems in E. coli K12 and B. subtilis E. coli K12 and B. subtilis are the most extensively studied model organisms for Gram-negative and Gram-positive bacteria, respectively, for all aspects of bacterial biology, including gene transcription regulation. Although much of what is known about gene transcription machinery has been derived from studies of these two organisms, so far we know fewer than half of the cis-regulatory systems in both organisms after decades of research (1–3). 
Hence, we are still far away from having a holistic view of the gene transcription regulatory networks in any of the studied organisms. In this study, we have provided the most extensive lists so far of high-quality candidates for new cis-regulatory binding motifs as well as regulons in both E. coli K12 and B. subtilis for further experimental characterization. Intriguingly, in both genomes, the number of predicted new cis-regulatory binding motifs is close to the number of (putative) TFs whose binding sites are unknown. As our algorithm has likely achieved high prediction specificity, it is reasonable to believe that most of these predictions are likely to be the binding site motifs of these (putative) TFs whose binding sites as well as regulons are largely unknown. Furthermore, since our predictions have tremendously narrowed down the candidates of CRBSs in the voluminous genome sequences, it becomes feasible to experimentally verify these predictions and, at the same time, to map each predicted motif to its cognate binding TF. For instance, one can use a double-stranded oligo-DNA containing the consensus sequence of a predicted motif to pull down the cognate TF from pooled lysates of bacterial cells cultured under different conditions, presumably at least one of which can activate the TF; and then identify the bound protein using mass spectrometry analysis (69). Thus, combining our predictions with a high-throughput DNA affinity capture and a protein identification technique can greatly facilitate the elucidation of the entire gene transcription regulatory networks in these model organisms in particular, and in any other sequenced prokaryotic genome in general. Implications for current genome annotation efforts One of the major objectives of current genome annotation is to define all of the functional sequence elements in the sequenced genomes. For practical reasons, this can only be done by highly accurate computational predictions. 
However, due to the aforementioned reasons, current genome annotation efforts are mainly focused on coding sequences, and little has been achieved on the annotation of CRBSs in most sequenced prokaryotic genomes (3). The relatively high prediction accuracy and robustness of our algorithm imply that it can be used to annotate the CRBSs in any sequenced prokaryotic genome as long as a few moderately related reference genomes are available. With more prokaryotic genomes sequenced, this restriction will no longer exist for any sequenced genome in the near future. Of course, to apply our algorithm to all the sequenced prokaryotic genomes, we need to further improve its computational efficiency, which we believe is feasible. First, our current algorithm focuses only on the target genome; the information about the CRBSs in dozens of reference genomes is not fully utilized. Full utilization of this information will possibly lead to the prediction of the CRBSs not only in the target genome, but also in all of the reference genomes. This will speed up the algorithm dozens of times. Second, the program can be easily parallelized, which can speed up the algorithm further. We are in the process of constructing a relational database to store our predicted CRBSs from the genomes to which our algorithm has been or will be applied. We hope that the database will become a valuable resource to the community to elucidate the CRBSs in all sequenced prokaryotic genomes. FUNDING University of North Carolina at Charlotte grant and CMC-UNCC Collaborative Research Fund (to Z.S.). Funding for open access charge: The University of North Carolina at Charlotte. Conflict of interest statement. None declared. Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS We would like to thank Drs. Larry Mays, Dennis Livesay and Michael Hudson for their critical reading of this manuscript and suggestions. We would also like to thank the two anonymous reviewers for their comments and suggestions that greatly improved the quality of this manuscript. REFERENCES 1. Martinez-Antonio A, Collado-Vides J. Identifying global regulators in transcriptional regulatory networks in bacteria. Curr. Opin. Microbiol. 2003;6:482–489. [PubMed] 2. Sierro N, Makita Y, de Hoon M, Nakai K. DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 2007;36 (Database issue):D93–96. [PubMed] 3. Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJ. ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics. 2006;22:637–640. [PubMed] 4. Stormo GD, Hartzell GW., III Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA. 1989;86:1183–1187. [PubMed] 5. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23. [PubMed] 6. Stormo GD, Schneider TD, Gold LM. Characterization of translational initiation sites in E. coli. Nucleic Acids Res. 1982;10:2971–2996. [PubMed] 7. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A. Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 1982;10:2997–3011. [PubMed] 8. Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007;8 (Suppl. 7):S21. [PubMed] 9. GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 2006;34:3585–3598. [PubMed] 10. Tagle DA, Koop BF, Goodman M, Slightom JL, Hess DL, Jones RT. 
Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol. 1988;203:439–455. [PubMed] 11. Gelfand MS. Recognition of regulatory sites by genomic comparison. Res. Microbiol. 1999;150:755–771. [PubMed] 12. Mironov AA, Koonin EV, Roytberg MA, Gelfand MS. Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes. Nucleic Acids Res. 1999;27:2981–2989. [PubMed] 13. Gerdes SY, Scholle MD, Campbell JW, Balazsi G, Ravasz E, Daugherty MD, Somera AL, Kyrpides NC, Anderson I, Gelfand MS, et al. Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J. Bacteriol. 2003;185:5673–5684. [PubMed] 14. Rodionov DA, Vitreschak AG, Mironov AA, Gelfand MS. Comparative genomics of the methionine metabolism in Gram-positive bacteria: a variety of regulatory systems. Nucleic Acids Res. 2004;32:3340–3353. [PubMed] 15. Vitreschak AG, Rodionov DA, Mironov AA, Gelfand MS. Regulation of riboflavin biosynthesis and transport genes in bacteria by transcriptional and translational attenuation. Nucleic Acids Res. 2002;30:3141–3151. [PubMed] 16. Panina EM, Mironov AA, Gelfand MS. Comparative analysis of FUR regulons in gamma-proteobacteria. Nucleic Acids Res. 2001;29:5195–5206. [PubMed] 17. Laikova ON, Mironov AA, Gelfand MS. Computational analysis of the transcriptional regulation of pentose utilization systems in the gamma subdivision of Proteobacteria. FEMS Microbiol. Lett. 2001;205:315–322. [PubMed] 18. Rodionov DA, Mironov AA, Gelfand MS. Transcriptional regulation of pentose utilisation systems in the Bacillus/Clostridium group of bacteria. FEMS Microbiol. Lett. 2001;205:305–314. [PubMed] 19. Makarova KS, Mironov AA, Gelfand MS. Conservation of the binding site for the arginine repressor in all bacterial lineages. Genome Biol. 2001;2 RESEARCH0013. 20. 
Tan K, Moreno-Hagelsieb G, Collado-Vides J, Stormo GD. A comparative genomics approach to prediction of new members of regulons. Genome Res. 2001;11:566–584. [PubMed] 21. Bulyk ML, McGuire AM, Masuda N, Church GM. A motif co-occurrence approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. Genome Res. 2004;14:201–208. [PubMed] 22. McGuire AM, Hughes JD, Church GM. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res. 2000;10:744–757. [PubMed] 23. Kono H, Sarai A. Structure-based prediction of DNA target sites by regulatory proteins. Proteins. 1999;35:114–131. [PubMed] 24. Robertson TA, Varani G. An all-atom, distance-dependent scoring function for the prediction of protein-DNA interactions from structure. Proteins. 2007;66:359–374. [PubMed] 25. van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED. Probabilistic clustering of sequences: inferring new bacterial regulons by comparative genomics. Proc. Natl Acad. Sci. USA. 2002;99:7323–7328. [PubMed] 26. Qin ZS, McCue LA, Thompson W, Mayerhofer L, Lawrence CE, Liu JS. Identification of co-regulated genes through Bayesian clustering of predicted regulatory binding sites. Nat. Biotechnol. 2003;21:435–439. [PubMed] 27. McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res. 2001;29:774–782. [PubMed] 28. Alkema WB, Lenhard B, Wasserman WW. Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res. 2004;14:1362–1373. [PubMed] 29. Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33:4899–4913. [PubMed] 30. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. 
Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 2005;23:137–144. [PubMed] 31. Pritsker M, Liu YC, Beer MA, Tavazoie S. Whole-genome discovery of transcription factor binding sites by network-level conservation. Genome Res. 2004;14:99–108. [PubMed] 32. Li H, Rhodius V, Gross C, Siggia ED. Identification of the binding sites of regulatory proteins in bacterial genomes. Proc. Natl Acad. Sci. USA. 2002;99:11772–11777. [PubMed] 33. Wang T, Stormo GD. Identifying the conserved network of cis-regulatory sites of a eukaryotic genome. Proc. Natl Acad. Sci. USA. 2005;102:17400–17405. [PubMed] 34. Liu J, Xu X, Stormo GD. The cis-regulatory map of Shewanella genomes. Nucleic Acids Res. 2008;36:5376–5390. [PubMed] 35. Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, et al. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res. 2008;36:D120–124. [PubMed] 36. Kummerfeld SK, Teichmann SA. DBD: a transcription factor prediction database. Nucleic Acids Res. 2006;34:D74–81. [PubMed] 37. Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008;36:D866–870. [PubMed] 38. Mushegian AR, Koonin EV. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl Acad. Sci. USA. 1996;93:10268–10273. [PubMed] 39. Dam P, Olman V, Harris K, Su Z, Xu Y. Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res. 2007;35:288–298. [PubMed] 40. Brouwer RW, Kuipers OP, Hijum SA. The relative value of operon predictions. Brief Bioinform. 2008;9:367–375. 
[PubMed] 41. Madan Babu M, Teichmann SA. Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res. 2003;31:1234–1244. [PubMed] 42. Lozada-Chavez I, Janga SC, Collado-Vides J. Bacterial regulatory networks are extremely flexible in evolution. Nucleic Acids Res. 2006;34:3434–3445. [PubMed] 43. Schones DE, Sumazin P, Zhang MQ. Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics. 2005;21:307–313. [PubMed] 44. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24. [PubMed] 45. Mahony S, Auron PE, Benos PV. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput. Biol. 2007;3:e61. [PubMed] 46. Pape UJ, Rahmann S, Vingron M. Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics. 2008;24:350–357. [PubMed] 47. van Dongen S. A cluster algorithm for graphs. 2000. National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam. 48. Garey MR, Johnson DS. Computers and Intractability: A Guide to the Theory of NP-Completeness. Gordonsville, VA: W H Freeman & Co; 1979. 49. Liu X, Brutlag DL, Liu JS. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 2001:127–138. [PubMed] 50. Hu J, Yang YD, Kihara D. EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics. 2006;7:342. [PubMed] 51. Olman V, Xu D, Xu Y. CUBIC: identification of regulatory binding sites through data clustering. J. Bioinform. Comput. Biol. 2003;1:21–40. [PubMed] 52. Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001;17:1113–1122. [PubMed] 53. 
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1994;2:28–36.
54. Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–577.
55. Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat. Biotechnol. 2002;20:835–839.
56. Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S. Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res. 2004;14:451–458.
57. Sinha S, Blanchette M, Tompa M. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004;5:170.
58. Li X, Wong WH. Sampling motifs on phylogenetic trees. Proc. Natl Acad. Sci. USA. 2005;102:9481–9486.
59. Siddharthan R, Siggia ED, van Nimwegen E. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput. Biol. 2005;1:e67.
60. Newberg LA, Thompson WA, Conlan S, Smith TM, McCue LA, Lawrence CE. A phylogenetic Gibbs sampler that yields centroid solutions for cis-regulatory site prediction. Bioinformatics. 2007;23:1718–1727.
61. Perez-Rueda E, Collado-Vides J. The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res. 2000;28:1838–1847.
62. Shen-Orr SS, Milo R, Mangan S, Alon U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 2002;31:64–68.
63. Gelfand MS. Evolution of transcriptional regulatory networks in microbial genomes. Curr. Opin. Struct. Biol. 2006;16:420–429.
64. Sandelin A, Wasserman WW. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J. Mol. Biol. 2004;338:207–215.
65. Moreno-Campuzano S, Janga SC, Perez-Rueda E. Identification and analysis of DNA-binding transcription factors in Bacillus subtilis and other Firmicutes – a genomic approach. BMC Genomics. 2006;7:147.
66. Tan K, McCue LA, Stormo GD. Making connections between novel transcription factors and their DNA motifs. Genome Res. 2005;15:312–320.
67. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Genome-wide location and function of DNA binding proteins. Science. 2000;290:2306–2309.
68. Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003;19:2369–2380.
69. Forde CE, Gonzales AD, Smessaert JM, Murphy GA, Shields SJ, Fitch JP, McCutchen-Maloney SL. A rapid method to capture and screen for transcription factors by SELDI mass spectrometry. Biochem. Biophys. Res. Commun. 2002;290:1328–1335.