• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of genoresGenome ResearchCSHL PressJournal HomeSubscriptionseTOC AlertsBioSupplyNet
Genome Res. Apr 2001; 11(4): 566–584.
PMCID: PMC311042

A Comparative Genomics Approach to Prediction of New Members of Regulons


Identifying the complete transcriptional regulatory network for an organism is a major challenge. For each regulatory protein, we want to know all the genes it regulates, that is, its regulon. Examples of known binding sites can be used to estimate the binding specificity of the protein and to predict other binding sites. However, binding site predictions can be unreliable because determining the true specificity of the protein is difficult because of the considerable variability of binding sites. Because regulatory systems tend to be conserved through evolution, we can use comparisons between species to increase the reliability of binding site predictions. In this article, an approach is presented to evaluate the computational predicitions of regulatory sites. We combine the prediction of transcription units having orthologous genes with the prediction of transcription factor binding sites based on probabilistic models. We augment the sets of genes in Escherichia coli that are expected to be regulated by two transcription factors, the cAMP receptor protein and the fumarate and nitrate reduction regulatory protein, through a comparison with the Haemophilus influenzae genome. At the same time, we learned more about the regulatory networks of H. influenzae, a species with much less experimental knowledge than E. coli. By studying orthologous genes subject to regulation by the same transcription factor, we also gained understanding of the evolution of the entire regulatory systems.

The number of complete microbial genome sequences is increasing at an unprecedented rate. To date, 29 bacterial genomes have been determined, 11 more are in annotation stage, and 83 are in progress. This surge of sequence information provides an enormous amount of data for comparative genomics analysis. During the earlier stage of genomic analysis, most of the effort was devoted to analyses of protein-coding regions because, in the course of evolution, protein-coding sequences change much slower than the noncoding sequences (Koonin et al. 1997, 1998). These comparative genomics studies have proved highly informative, allowing functional assignments for many putative proteins in poorly studied organisms (Overbeek et al. 1999). One surprising result from these analyses was the lack of long-range conservation of gene order in bacterial genomes, with the exception of species within the same genus (Tatusov et al. 1996; Himmelreich et al. 1997). For species of intermediate phylogenetic distance, such as in Escherichia coli and Haemophilus influenzae, many clusters are conserved, but their orders are less conserved (Dandekar et al. 1998). However, a more recent study shows a clear conservation of pairs of orthologs to genes within an operon, as opposed to genes at the boundaries of transcription units (TU) (G. Moreno-Hagelsieb et al. 2001).

Besides knowing individual protein functions, knowledge about the transcriptional regulatory network is an indispensable prerequisite for an adequate understanding of cellular functions. The computational identification of regulatory proteins from a bacterial genome sequence is more solid given the limited number of transcription factor families and the conservation of the helix–turn–helix motif in bacteria (Perez-Rueda and Collado-Vides 2000). For the set of regulatory proteins, we want know the entire set of genes whose expression is regulated by each of these regulators, its regulon (Salgado et al. 2000a). The first step in this direction is the identification of transcription factor binding sites, which then can help to predict transcription units regulated by these proteins. Although the problem of regulatory site prediction has been studied for >20 years, it is still far from being solved (Gelfand 1995; Thieffry et al. 1998a). The major reasons for this are small training set size (often <20 sequences) and poor understanding of the biophysics of protein–DNA interaction, making it very difficult to deduce a proper set of rules for pattern recognition algorithms.

Besides identifying regulatory sites, the other key to predicting new members of a regulon is to have a good estimate of transcription units in a given genome. However, even for an organism as extensively studied as E. coli, the set of known TUs is far from complete (Salgado et al. 2000a). Also, predicting TUs is a nontrivial problem that has not been studied extensively. Recently, three groups have published new methods to predict TUs (Yada et al. 1999; Craven et al. 2000; Salgado et al. 2000b). These studies represent a promising first step toward a more accurate prediction of TUs. In this article, transcription units are defined as sets of genes (one or more) that are cotranscribed. Operons are defined as the polycistronic subset (more than one gene) of all transcripition units.

We have adopted a combined approach to identifying new members of regulons. We find that high scoring matches to binding patterns for transcription factors are likely to represent real regulatory sites based on the distribution of such sites. The predictions of lower scoring sites are less reliable, so we add evidence from a comparative analysis with other species, based on the premise that regulons tend to be conserved. If we find that orthologous genes in two or more species appear to be controlled by the same factor, that provides added confidence in the prediction of even the lower scoring sites. However, because many prokaryotic genes are transcribed as operons, the transcriptional control regions may be far removed from a particular gene. Therefore, the analysis of TUs is essential to the identification of pairs of orthologous genes belonging to common regulons. Therefore, the overall approach combines the prediction of TUs in each species, the identification of orthologous genes, and the prediction of transcription factor binding sites based on probabilistic models, such as weight matrices.

In this article, we predict new members of the cAMP receptor protein (CRP) and fumarate and nitrate reduction regulatory protein (FNR) regulons in E. coli and H. influenzae. We chose these two genomes because E. coli transcription regulation is by far the best understood among all bacterial species, and H. influenzae is the only complete genome (as of this writing) that is close enough so that many TUs are conserved. The CRP and the FNR are two global transcriptional regulators that occur in many bacteria. Genes regulated by them (CRP and FNR regulons) have a wide range of functions. Our overall strategy is shown in Figure Figure1.1. Briefly, binding patterns derived from known E. coli CRP- and FNR-binding sequences are used to predict novel binding sites for these two proteins. Predicted binding sites are combined with our knowledge of orthologous genes and predictions of TUs in both genomes. This combined information is used to predict novel members of CRP and FNR regulons.

Figure 1
Flowchart depicting our overall strategy for predicting additional members of CRP and FNR regulons. The approach is divided into three stages. In the first stage (I), raw data sets from RegulonDB are filtered for strong binding sites, and weight matrices ...

Other groups previously have used comparative analyses to predict new sets of regulated genes. McGuire et al. (2000) recently examined 17 completely sequenced microbial genomes to identify regulatory sites for groups of related genes. They used a pattern discovery approach to find putative sites and then used various filtering techniques to diminish the number of false predictions. They used known E. coli regulons as positive controls and showed that the method worked well to identify known sites. They even showed that the method could be applied to archaebacterial species, as in another article by Gelfand et al. (2000). However, they did not use the patterns for E. coli regulatory sites to expand the set of genes likely to be regulated by specific factors, which is the main purpose of this article. Mironov et al. (1999) also used a comparative analysis to predict regulatory sites in other species for a few regulons in E. coli. In addition, they did predict a few new sites in E. coli for the PurR and ArgR regulatory proteins. Our approach in this study was similar, but by incorporating TU prediction and using two well-studied regulons, we were able to predict many more new members of the CRP and FNR regulons.


Conserved Recognition Patterns by CRP and FNR in Both Genomes

The cocrystal structure of an E. coli CRP–DNA complex has been solved at 2.5 Å resolution (Parkinson et al. 1996). The principal specificity-conferring interactions are those between the first two residues of the recognition helix, R180 and E181, and the two G : C pairs in the deduced consensus half-site TGTGA (Ebright et al. 1989; Gunasekera et al. 1992). Residue R185 in the recognition helix also contributes to binding specificity although to a lesser extent (Fig. (Fig.2).2). We aligned 10 CRP orthologs from various bacterial genomes (Fig. (Fig.3).3). The first two residues of the specificity-conferring motif, RE—R are 100% conserved across the species. The third residue in the motif is conserved except for CRP_MTB in which the second arginine is replaced by a lysine. This high level of conservation in the DNA-binding domain implies a CRP-binding pattern similar to that of E. coli. CRP also exists in these bacterial genomes with a cognate CRP protein.

Figure 2
Schematic representation of the specificity-conferring interactions between the recognition helices (helix 2) of E. coli CRP and FNR proteins and their consensus half-site binding motifs. CRP, cAMP receptor protein; FNR, fumarate and nitrate reduction ...
Figure 3
Multiple sequence alignment of CRP and FNR proteins from various bacterial genomes. Only sequences around the second helix of the helix–turn–helix motif are shown. The boundaries of the second helix are labelled with a solid line. The ...

Both CRP and FNR belong to the CRP/FNR helix–turn–helix transcription factor superfamily. In E. coli, the two proteins are 23% identical and 36% similar, with the conservation concentrated in the domain containing the HTH motif in which they have 27% identity and 43% similarity (using BLAST and BestFit). The FNR consensus half-site motif (Spiro et al. 1990), TTGAT, is analogous to that of CRP half-site (TGTGA). In fact, a common site that can bind both FNR and CRP has been reported (Jennings and Beacham 1993). In E. coli, the proposed specificity-conferring interactions for FNR are those between E209 and the G-C base pair common to both core motifs and a discriminatory interaction between S212 and the first T-A base pair in the FNR site, which replaces that between R180 and the common G-C base pair in the CRP site. Another conserved interaction involves R213 and the common G-C base pair (Fig. (Fig.2).2). From the multiple alignment of eight FNR orthologs (Fig. (Fig.3),3), we see that the first and third residues of the specificity-conferring motif, E–SR, are absolutely conserved across the species whereas the second residue is highly but not absolutely conserved. Again, this high degree of sequence conservation implies a conserved recognition pattern for FNR binding to its operators.

CRP and FNR Weight Matrices Obtained by Aligning Characterized Binding Sequences in E. coli

Using the program CONSENSUS (Hertz and Stormo 1999), we aligned the training set sequences to generate weight matrices used by the program PATSER. Specifically, a mononucleotide matrix was used to represent the binding specificity of a transcription factor. The assumption in using such a matrix is that contributions to binding specificity are additive across all positions of the site. We tested this assumption by using the program MIXY (Gutell et al. 1992) that can identify covariation(s) between any two positions across the binding sites. No significant covariation was observed between positions for CRP and FNR. Thus, we believe that a mononucleotide matrix is a valid representation of the binding specificities of CRP and FNR.

CONSENSUS calculates a P value for an ungapped multiple alignment, so different alignments can be compared and the most significant one identified (Hertz and Stormo 1999). For each protein, we compared site lengths ranging from 14 to 28 nucleotides and compared symmetric models with asymmetric ones. Symmetric models are clearly more significant than asymmetric ones, with expectation values at least 102 times lower at all even lengths tested. The expectation values for different lengths were not very different over the entire range of lengths tested, consistent with the proteins having a core conserved region of 16 or 14 bp, for CPR and FNR, respectively, surrounded by more weakly conserved sequences. We used 22 nucleotides for the length of each protein's binding site based on previous work (Kolb et al. 1993) and for consistency with previous analyses (Salgado et al. 2000a). The CRP protein has the half-site consensus of TGTGA with a separation of six nucleotides between the two half-sites. The FNR protein has a half-site consensus of TTGAT with a separation of four nucleotides between the half-sites (Fig. (Fig.4;4; Table Table1).1).

Figure 4
Sequence logos for the CRP- and FNR-binding motifs. It was generated based on the multiple sequence alignments by CONSENSUS by using the training sequences. (horizontal axis) Position in the binding motif; (vertical axis) information content in bits. ...
Table 1
Positional Weight Matrices for CRP and FNR Binding Motifs

Determination of Cutoff Scores for CRP and FNR Sites

To determine the appropriate weight matrix for each transcription factor and the cutoff scores to be used for strong and weak predicted sites, we needed to identify a trusted set of example sites and the score distribution for those sites as well as potential sites in the genome. One of the difficulties arises because transcription factors may bind to DNA cooperatively so that a particular experimentally determined site would not, in fact, be a high-affinity site for the factor in another context without a neighboring site. To eliminate such potential artifacts, we picked only the highest scoring site for each transcription unit (see Methods), assuming that at least one of the sites should be high affinity on its own. This still may result in a few intrinsically low affinity sites in our training set but should minimize that number. We then set thresholds for high scoring sites based on the scores of the training sets and taking into account the distribution of scores in the background (i.e., genomic) sequence.

To determine cutoff scores for CRP sites, we used the following procedure. First, we determined the range and mean score for the following two sets of sequences: (1) training sequences; and (2) all 22 mers in the E. coli genome. As shown in Table Table2,2, training sequences scored between 8.77 and 20.62 bits with a mean of 14.4 bits and a standard deviation of 3.6 bits. The mean score and the standard deviation of all 22 mers in the E. coli genome were −15.84 and 8.53 bits, respectively. Such a negative mean score is expected because most of the genomic sequence contains no CRP–binding sites. Next, for each site with a score between 7 and 23 bits from the whole-genome scan, we determined its location relative to the TUs downstream from or encompassing it. Functional regulatory sites usually are located upstream of TUs (in the regulatory region) although there are a few known cases where the sites are located within TUs (8 of 361 in RegulonDB). Given this observation, we can approximate the false-positive rate of our site predictions based on the fraction of predicted sites that are located within transcription units. Figure Figure5a5a shows the fraction of CRP-binding sites located either upstream of or within a TU in the E. coli genome. At low cutoff scores, almost all sites are located within transcription units, indicating a high false-positive rate. The size of all upstream regions is ~1.23 Mb, ~27% of the genome size of E. coli (4.63 Mb). Thus, sites with random localization occur ~73% of the time within TUs. Raising the cutoff score decreases the fraction of predicted sites located within TUs and thus decreases false-positive rates. Using a cutoff of 17 bits, only 6% of all sites are located within transcription units, indicating a low false-positive rate at this cutoff. Thus, we used 17 as the cutoff score for strong sites. To increase the sensitivity of our search, we also chose a cutoff score for weak sites. We decided to use a cutoff at which greater than half of all sites are located upstream of TUs. As shown in Figure Figure5a,5a, at a cutoff of 10 bits, 56% of all sites are located upstream of rather than within TUs. Thus, we chose 10 as the cutoff score for weak sites. Using this weak site cutoff, we only missed two training sequences (glnALG and rpoS; Table Table3).3).

Table 2
Site Score Distributions of Training Sequences and All 22-Mers in Both Genomes Scanned by PATSER
Figure 5
Fraction of sites located either upstream of or within TUs in E. coli. All sites in E. coli genome above certain cutoff are divided into two groups according to their locations relative to TUs: upstream of or within. (a) CRP sites; (b) FNR sites. TU, ...
Table 3
Training Set TUs Regulated by CRP

We applied the same criteria described above to determine the cutoff scores for FNR-binding sites. The training sequences had a score range of 12 to 25.84 bits and a mean of 19.8 bits with a standard deviation of 4.5 bits (Table (Table2).2). Based on Figure Figure5b,5b, we chose 20 as the cutoff for strong sites. At this cutoff, only 9% of all sites are located within TUs. As for the weak site cutoff, we chose 14 because at this cutoff greater than half of all sites (57%) are located upstream of rather than within TUs (Fig. (Fig.5b).5b). Using the weak site cutoff, we only missed one training sequences (dmsA; Table Table4).4).

Table 4
Training Set TU in Escherichia coli Regulated by FNR

New Members of the CRP Regulon

The sets of upstream sequences from both genomes were scanned by PATSER by using the CRP weight matrix. Putative sites were filtered using the two cutoffs for CRP sites described above. For each CRP site scored above 10 bits, we predicted the TU downstream from it. Orthologs (if any) to all genes in a predicted TU were identified. Based on the two cutoffs for CRP-binding sites, we first partitioned our predictions into the following two categories: (1) TUs having at least one strong site; and (2) TUs having only weak site(s). Because the cutoff for strong sites is 2.6 bits higher than the mean score of training sequences and are likely to have few false-positives (Fig. (Fig.5a),5a), we were confident of those category I predictions even without orthology information. Predictions in category II have only weak binding sites and are less reliable than those in category I. However, for some category II predictions, additional evidence exists to support them. The first type of evidence is orthology information. If a category II TU shares orthologous member(s) with a TU from the other genome and the latter also has CRP-binding site(s) (either weak or strong), we put such a TU in category IIA. The second type of evidence is the presence of two or more weak binding sites in the regulatory region of a TU. The probability that two or more sites occur in close proximity by chance is fairly low. We examined all weak CRP sites in the E. coli genome. For all sites located upstream of TUs, 12% are within 100 nucleotides apart. Conversely, only 2% of all sites located within TUs are within 100 nucleotides apart. Thus, closely positioned tandem sites in the regulatory region are more likely to be true binding sites than a single weak site in the regulatory region. We put all category II predictions with two or more sites but without orthology information in category IIB. The rest of category II predictions, TUs having only one weak binding site and no orthology information, are labeled category IIC. This category has the least evidence to support them. Thus, we expect a high false-positive rate among category IIC predictions.

For clarity, the 46 training set TUs and H. influenzae TUs having orthologs to genes in the training set were put in a separate category. For the 46 E. coli TUs, our predictions of CRP-binding sites largely agreed with the data in RegulonDB except for a few cases in which our method predicted extra binding sites (Table (Table3).3). We identified 23 H. influenzae TUs that have orthologs to genes in the training set TUs. Of these 23 TUs, only seven contain CRP-binding sites in their upstream regions (Fig. (Fig.6;6; Table Table5)5)

Figure 6
Training set TUs regulated by CRP and FNR and H. influenzae TUs containing orthologs to genes in the training set. Genes in a TU are represented by rectangular boxes. Binding sites are represented by square boxes with gray box representing a weak site ...
Table 5
Haemophilus influenzae TUs Predicted to Belong to the CRP Regulon

In category I, we predicted 62 and 49 TUs in E. coli and H. influenzae, respectively. In category IIA, we predicted 30 and 21 TUs in E. coli and H. influenzae, respectively. For both categories, predicted CRP sites, their scores, and locations relative to the transcription start are tabulated in Table Table66 (E. coli) and Table Table55 (H. influenzae). Category IIB contains 25 and 12 TUs in E. coli and H. influenzae, respectively. This is a total of 117 and 82 new TUs in E. coli and H. influenzae, respectively, that we are reasonably confident belong to the CRP regulon. Category IIC contains 319 and 150 TUs in E. coli and H. influenzae, respectively. These predictions are less reliable but probably contain some true regulated TUs. Because of space limitation, we are unable to display results in categories IIB and IIC in this article. These data are available as supplementary material at http://www.genome.org.

Table 6
Escherichia coli TUs Predicted to Belong to the CRP Regulon

In Figure Figure7,7, we depict structures of predicted TUs that share orthologous members. They are from categories I and IIA in both genomes. Strong and weak binding sites are represented by black and gray squares, respectively. Thus, one can identify the category to which a TU belongs by the colors of binding site squares.

Figure 7Figure 7Figure 7
Orthologous TU pairs from categories I and IIA that are predicted to be regulated by CRP and FNR. CRP-regulated TUs are shown first followed by FNR-regulated TUs. Symbols and drawing schemes are as described for Figure Figure6.6. CRP, cAMP receptor ...

New Members of the FNR Regulon

The same procedures (FNR-binding site predictions, predictions of downstream TUs, and categorization) were performed to identify new members of the FNR regulon. We have nine training set TUs (Table (Table4)4) and four H. influenzae TUs that have ortholgs to genes in the training set. Among these four H. influenzae TUs, only one still maintains FNR regulation (Fig. (Fig.6;6; Table Table7).7). The other five E. coli TUs do not have detectable orthologs in H. influenzae.

Table 7
Haemophilus influenzae TUs Predicted to Belong to the FNR Regulon

Category I contains 10 and eight TUs in E. coli and H. influenzae, respectively, each with at least one strong site. Category IIA contains 0 and 2 TUs in E. coli and H. influenzae, respectively. For both categories, predicted FNR sites, their scores, and distances relative to the transcription start are tabulated in Table Table77 (H. influenzae) and Table Table88 (E. coli). We did not find any TU in category IIB in E. coli. In H. influenzae, category IIB contains 2 TUs. Thus, this is a total of 10 and 12 new TUs in E. coli and H. influenzae, respectively, that we are fairly confident belong to the FNR regulon. In category IIC, we predicted 70 E. coli and 79 H. influenzae TUs, all of which have only one weak binding site and no orthology information. Categories IIB and IIC are available as supplementary material at http://www.genome.org. Again, the structures of predicted TUs that share orthologous members are depicted in Figure Figure7.7.

Table 8
Escherichia coli TUs Predicted to Belong to the FNR Regulon


We have described a method to systematically search for additional members of bacterial regulons based on information both intrinsic and extrinsic to a given genome. The intrinsic information consists of transcription factor binding sites and structures of downstream TUs. The extrinsic information is the orthology relationship between TUs obtained by comparing the respective complete sets of gene products. Our comparative approach consists of the following three major steps: (1) obtaining DNA recognition pattern for a given regulatory protein; in this study, we used weight matrices to represent binding site patterns; (2) prediction of transcription factor binding sites using the recognition pattern obtained in step one; and (3) prediction of TUs downstream from binding sites from step two and identification of any orthologs to members of the predicted TUs. At low thresholds, transcription factor binding site predictions by any present-day computer algorithm are expected to have a relatively high false-positive rate due to small training set size and poor conservation of noncoding sequences. However, incorporation of orthology information in step three increases the reliability of our inferences. Another reinforcement to the prediction of regulatory sites is the use of information on TUs. The correspondence between predicted TUs and the assignment of putative regulatory sites will help to establish other means to score the predictions and make them as more reliable. Certainly, we do not have a statistical model to evaluate how much the probability of a site increases when the site is present in front of orthologous TUs. However, qualitatively, our confidence does increase in the presence of orthology information. In this way, we are at least confident that predictions in categories IIA and IIB have a lower false-positive rate compared with those in category IIC.

The sensitivity and specificity of our predictions are difficult to determine because we do not know the complete set of genes that are regulated by CRP and FNR in either species. In E. coli we have a set of genes that are known to be regulated by each protein, based on genetic and biochemical criteria, but that set is certainly incomplete. For most genes, we simply do not know whether they are regulated by CRP or FNR, and the primary purpose of this article was to identify new TUs that are likely to be regulated by these factors. We can estimate sensitivity and specificity measures by using both the known set of regulated genes and some assumptions about the distribution of sites with various scores. For example, we know that functional regulatory sites usually occur in the region we have defined as the upstream region, between −400 and +50 bp of the start of translation of the first gene in the TU. Rarely, although occasionally, functional sites occur either farther upstream of or within the TU. We also assume that if a binding site for a regulatory protein occurs within that upstream region then it is very likely to be involved in the regulation of the adjacent TU. We set the threshold based on those assumptions for strong sites to be such that >90% of the sites occur in the upstream regions, and therefore we expect very few false-positive among category I predictions. This gives us confidence in the new predicted category I CRP-regulated TUs, 62 and 49 in E coli and H. influenzae, respectively, even without additional evidence. The category I new predictions for FNR are 10 and 8. However, of the known E. coli TUs regulated by these proteins, only 9 and 6 have strong sites, so the sensitivity based on strong site cutoffs alone is only 16.1% and 46.2% for CRP and FNR, respectively.

The threshold for weak sites was chosen such that >50% occur in the upstream regions. Remember that even these weak sites have much higher scores than the average background site (Table (Table2),2), and that most upstream regions do not have them, and any randomly chosen sites would occur only 27% of the time in the upstream regions, based on the sizes of the two sequence sets. Therefore, even such weak sites, category II predictions, are likely to contain many functional sites but undoubtedly contain false-positives as well. Therefore, we look for additional evidence before considering them reliable. One type of additional evidence is if TUs in H. influenzae that contain orthologous genes also appear to be regulated by the factor with either a strong or a weak site, which we call category IIA. Another type is if there are two weak sites near each other, which we call category IIB. We know that CRP can bind cooperatively, so nearby weak sites may have a combined affinity comparable to single strong sites. Furthermore, nearby pairs of weak sites occur infrequently within TUs, but relatively frequently in the upstream regions, as is expected of functional regulatory sites. Combining categories IIA and IIB, we predict 55 and 33 new CRP regulated TUs in E. coli and H. influenzae. An additional 319 and 150 TUs are in category IIC, some of which are probably real and some false. For FNR, we predict 0 and 4 new TUs in E. coli and H. influenzae from categories IIA and IIB, and there are an additional 70 and 79 category IIC TUs.

We can estimate the sensitivity of our approach by scoring the known TUs for each regulon. For the 56 TUs in the CRP regulon that we extracted from RegulonDB, only nine of them score in category I. An additional 41 have weak sites and therefore are put in category II, resulting in a combined sensitivity of 89.3% (50/56). However, among the 41 TUs with only weak sites, only 16 are in categories IIA and IIB (six and 10, respectively), with the remaining 25 in category IIC. Therefore, our confident predictions, combining categories I, IIA, and IIB, account for only 25 known sites, a sensitivity of only 44.6%. If the same proportions exist in the whole genome, then many of the category IIC sites will be functional CRP regulatory sites; however, we cannot determine which are true and which are false from the current data. Similar results are obtained for FNR, in which categories I and II together account for 11 of the 13 known TUs, for a senstivity of 84.6%, but five of those are in category IIC.

The net result of our analysis is the prediction of 116 and 10 new CRP and FNR TUs in E. coli that we consider highly reliable because they fall into categories I, IIA, and IIB. These are clearly not all of the genes regulated by these factors because some of the known TUs are missing from such predictions. Functional sites may be missed because these factors bind cooperatively with some other factor that is not included in the analysis, or because the weight matrix is not a good enough descriptor of the proteins' binding specificity to get all of the functional sites. Many of the missing sites can be found in category IIC predictions, but those predictions probably also contain many false predictions, and we do not include them in our reliable set. Nonetheless, the computational approach we have applied in this article has greatly increased the set of TUs likely to be regulated by these factors in E. coli, with high but not perfect sensitivity. In addition, we make 82 and 12 reliable predictions of TUs regulated by CRP and FNR in H. influenzae, most of which had not been previously identified as members of those regulons.

Interestingly, one experimentally verified CRP site exists in RegulonDB for the E. coli operon glpABC. However, this is a weak site (6.2 bits). In this study, we detected another strong CRP site for this operon (17.25 bits; Table Table6).6). Another interesting case is the E. coli operon fucAO. Before this study, only genetic evidence existed to support the regulation of this operon by CRP. In our study, we identified three CRP-binding sites upstream of fucA (Fig. (Fig.7;7; Table Table6),6), providing further evidence for previous observations.

In E. coli, the gene ansB is under the dual regulation of CRP and FNR (Scott et al. 1995). This joint regulation by both transcription factors might be important in achieving optimal gene expressions. Based on our analysis, ansB may be regulated only by CRP in H. influenzae because the highest scored FNR site in the regulatory region of H. influenzae ansB was only 9.74 bits. This is much lower than the weak site cutoff for FNR but might still be a functional site. Two other TUs, ung and yfiD (its ortholog in H. influenzae is HI0017), seem to be dually reguated in both genomes. Interestingly, for both TUs, the CRP site is the same as the FNR site in both genomes with E. coli TUs having an additional FNR site. It is possible that those sites are true only for one of the regulators and are false-positives for the other regulator. Conversely, we cannot rule out the possibility that those sites are truly recognized by both regulators, because some sites that can bind CRP also can bind FNR (Sawers et al. 1997).

Negative autoregulation is quite dominant in E. coli, and it can be viewed as playing a homeostatic role for the regulatory genes (Thieffry et al. 1998b). Based on our results, CRP seems not to be autoregulated in H. influenzae (the highest scoring CRP site had a score of 4.6 bits). Conversely, FNR does seem to be autoregulated in H. influenzae.

Based on our comparative analysis of the CRP and FNR regulons in the two genomes, we noticed three types of structural changes in operons that are subject to the same mode of regulation. The first type involves insertion or deletion of individual genes in otherwise conserved operons. Examples in E. coli includes operons glpTQ (glpT in H. influenzae, Fig. Fig.6),6), fnr (fnr-HI1426 in H. influenzae, Fig. Fig.6),6), and b2736-b2737 (HI1010-HI1011-HI1012-HI1013 in H. influenzae, Fig Fig77.).

The second type of change involves breakup of an operon in one genome into several smaller ones in the other genome. Not all of the smaller operons retain their regulation by the same regulator. For instance, the E. coli xylFGHR operon is broken in H. influenzae into two operons, xylFGH and xylR (Fig. (Fig.7).7). Only xylFGH maintains CRP regulation in H. influenzae. The protein products of genes xylF, G, and H constitute the high-affinity xylose transport system in both genomes and that of xylR encodes a regulatory protein (Sumiya et al. 1995). In E. coli, xylR acts as a transcriptional activator for the xylFGHR operon and the expression of itself is regulated by CRP (Song and Park 1997). In H. influenzae, the regulation of xylR might be taken over by a different regulator. Alternatively, it could be autoregulated. If this is the case, it is another example of uncoupled versus coupled transcription regulations in two bacteria, an organization with different dynamic consequences (Hlavacek and Savageau 1996). Another example of this second type of change involves the E. coli galETKM operon. The same operon is broken up into two pieces in H. influenzae : galE and galTKM (Fig (Fig6).6). Again, only galTKM still is regulated by CRP in H. influenzae.

Third, also the most common type of change during regulon evolution is the loss of E. coli regulon members in the H. influenzae genome. Examples include operons caiTABCDE, malEFG, and narGHJI. Tatusov et al. (1996) suggested that the common ancestor of E. coli and H. infuenzae could have a genome of intermediate size. The subsequent evolution may have proceeded in opposite directions—toward the reduction of the genome size by deletion of genes and entire transcription units in the Haemophilus lineage and toward the diversification of regulatory and transport functions via gene duplication in the E. coli lineage (Tatusov et al. 1996). As a result, the decrease in CRP and FNR regulon members may be the result of degenerative evolution of H. influenzae. The parasitic lifestyle of H. influenzae might require a less complicated metabolism to cope with enviromental changes. However, as a fraction of the total number of genes, both species appear to have similar sized regulons.

The location of regulatory sites along the genome has a clear influence on how regulation through these sites occurs (Gralla and Collado-Vides 1996). An interesting question to ask is whether regulatory sites of orthologous genes have identical or close positions, that is, whether the distance between regulatory sites and their regulated promoters remain more or less unchanged between bacterial species. To obtain such information, we would need to have a reasonably accurate method to predict promoters in those organisms. Unfortunately, current promoter prediction methods are not satisfactory in this regard. Future work is needed to address this very interesting question.

We noticed that some of our predicted TUs have quite distal binding site(s). Because we report the position of a binding site relative to the translation start of the first downstream gene, these large distances could simply result from the existence of a long 5′ untranslated region. Conversely, they could be true distal sites even if our measurement were based on transcription start. Because of the global nature of regulatory functions, CRP and FNR regulated TUs often have another local, dedicated regulator, such as LacI for the lac operon and GalR for the gal operon. Thus, we suspect TUs predicted here with distal sites will show regulation by additional proteins.

The approach we have used in this article has identified many new genes that we predict are regulated by the CRP and FNR proteins in E. coli and H. influenzae. Combined evidence from site scores and comparative analyses gives us high confidence in many of these predictions. But this is clearly just a first step. More bacterial species can be included and many more regulons can be studied, although regulons with few known members are more problematic because of the small sample size. The accurate prediction of transcription units is critical to the success of such an approach, as operons are often rearranged in evolution and common regulatory sites may be located at long and variable distances from orthologous pairs of genes. In this work, many steps were performed manually, in that careful examination of some results were used to constrain further analyses. Experience gained from this work will allow us to develop more fully automated procedures that can be applied to more regulatory systems in more species in a rapid and reliable approach.


Sequence Data and Programs

Experimentally characterized (mostly by DNA footprinting technique) E. coli CRP- and FNR-binding sequences were extracted from the RegulonDB (Salgado et al. 2000a) database. Complete genome sequences of E. coli and H. influenzae were downloaded from GenBank (Benson et al. 1999). Weight matrices were constructed by CONSENSUS (Hertz and Stormo 1999), which generates optimal ungapped multiple sequence alignments with predefined width. In addition, the program reports the statistical significance of the generated multiple sequence alignment. Given a weight matrix, searches for transcription factor binding sites were performed using PATSER (Hertz et al. 1990). PATSER scores each possible binding site position in a sequence by using the designated weight matrix and returns the scores and positions of all sites above a user-defined threshold. Multiple alignments of protein sequences were constructed using the program CLUSTALX (Thompson et al. 1997). Protein sequence database searches were performed using the gapped BLASTP program (Altschul et al. 1997). All searches were performed against the National Center for Biotechnology Information nonredundant protein sequence database. Sequence comparisons between E. coli CRP and FNR were performed using the BestFit program (Wisconsin Package Version 10.0; Genetics Computer Group). Sequence logos were constructed using the web interface (S.E. Brenner, http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi) to the MAKELOGO program by Schneider (Schneider and Stephens 1990). The rest of the analysis was performed by using ad hoc PERL scripts (Wall et al. 1996).

Preparation of Training Set Sequences

The current version of the RegulonDB database (version 3.0) contains 80 experimentally verified E. coli CRP-binding sequences from 56 TUs (because some of these 56 TUs have multiple CRP-binding sites the number of sites exceeds the number of TUs). We expect that some of these 80 sites are weak CRP-binding sites. Presumably, CRP binds these weak sites through cooperativity with other regulatory proteins. Weak sites were filtered out from our training set by using the following procedures. In step one, we ran CONSENSUS on the 80 binding sequences and generated an initial weight matrix. PATSER then was used to score the original 80 binding sequences by using the weight matrix generated in step one. After this initial step, we only chose the highest-scoring sequence from each TU for further processing. This gave us 48 sites representing 53 TUs (all sites from three of the 56 TUs were rejected by CONSENSUS and thus not included). Because of the existence of divergent TUs, the number of sites is less than the number of TUs. The mean and standard deviation of the scores of these 48 sites were 13.1 and 3.6 bits, respectively. For our final training set, we excluded, from the 48 sites, any sites with scores that are more than one standard deviation below the mean, that is, 9.5 bits. We ended up with 42 sequences in the training set, representing 46 TUs.

The current version of RegulonDB database contains 17 experimentally verified E. coli FNR-binding sequences from 13 TUs. We applied the same procedures to these 17 sequences to generate the training set. We ended up with nine sequences in our training set, representing nine TUs. The mean and standard deviation of these nine sequences were 19.8 and 4.5 bits, respectively.

Prediction of Transcription Factor Binding Sites

During the first step of our analysis, weight matrices for both CRP and FNR binding sites were generated by CONSENSUS by using our training set sequences (42 for CRP and nine for FNR). Subsequently, the published annotations of all the open reading frames (ORFs) in E. coli (Blattner et al. 1997) and H. influenzae (Fleischmann et al. 1995) were used to generate two sets of putative regulatory sequences (one for each genome), covering 400 nt upstream of and 50 nt downstream from the beginning of each ORF. This length was chosen from the known distribution of a large collection of regulatory sites in ς70 promoters (Gralla and Collado-Vides 1996). Then, PATSER was used to scan the sets of regulatory sequences to identify potential binding sites by using the weight matrices generated in step one (Hertz and Stormo 1999; Hertz et al. 1990). Potential binding sites scored above the chosen cutoffs were reported. Eventually, binding site information was combined with orthology relationship between TUs to predict new members of the CRP and FNR regulons. We classified binding sites into two categories based on their locations relative to the TUs downstream from or encompassing it (1) sites located in the regulatory region of a TU; and (2) sites located within a TU. The latter category includes two cases: within genes of a TU and within the upstream region of an internal gene.

Determination of Orthology between E. coli and H. influenzae Genes

Fitch first introduced the term ortholog for genes derived from speciation events (Fitch 1970). At present, there is not a simple and perfect method for detecting orthology relationship because of complicating events during genome evolution, such as gene duplication, gene loss, and horizontal gene transfer (Huynen and Bork 1998). For our study, we used the minimal definition of orthology described by Huynen and Bork (1998): (1) orthologous ORFs between two genomes compared must be the most similar ORF reciprocally; (2) sequence similarity between the ORFs has to be statistically significant. In this article, sequence similarity was calculated by the BLASTP program (version 2.0; Altschul et al. 1997). Any alignment with an E-value of 1e-15 was considered significant for our purpose. and (3) sequence similarity extends to at least 60% of one of the genes.

Prediction of Transcription Units

The prediction of TUs was described for E. coli by Salgado et al. (2000b). The method is based on the differences between pairs of adjacent genes in operons and pairs of adjacent genes at the borders of TUs. The differences studied were distances between genes and their functional relationships, the latter ones being an update of the functional classification described by Monica Riley (Riley 1993; Riley and Labedan 1996). Here, to apply the method to H. influenzae, we inherited the functional classification for E. coli genes and then applied the prediction method to the whole H. influenzae genome, dividing it into putative TUs. In this way, we obtained sets of TUs that can be compared between organisms when a regulatory site was found close to orthologous genes that may in turn lie inside analogous TUs.


We thank members of the Stormo and Collado-Vides labs for insightful discussions. We thank three anonymous reviewers for their comments. This work was supported by Grant HG-00249 from National Institutes of Health (G.D.S.), Grant 0028 from Conacyt (J.C.-V.), and Grant DE-FG02-98ER62558 from U.S. Department of Energy (J.C.-V.).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


E-MAIL ude.ltsuw.laru@omrots; FAX (314) 362-7855.

Article and publication are at www.genome.org/cgi/doi/10.1101/gr.149301.


  • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Benson DA, Boguski MS, Lipman DJ, Ostell J, Francis Ouellette BF, Rapp BA, Wheeler DL. GenBank. Nucleic Acids Res. 1999;27:12–17. [PMC free article] [PubMed]
  • Blattner FR, Plunkett G, III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. The complete genome sequence of Escherichia coli K-12. Science. 1997;277:1453–1474. [PubMed]
  • Craven M, Page D, Shavlik J, Bockhorst J, Glasner J. A probabilistic learning approach to whole-genome operon prediction. Proc Int Conf Intell Syst Mol Biol. 2000;8:116–127. .. [PubMed]
  • Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem Sci. 1998;23:324–328. [PubMed]
  • Ebright RH, Ebright YW, Gunasekera A. Consensus DNA site for the Escherichia coli catabolite gene activator protein (CAP): CAP exihibits a 450-fold higher affinity for the consensus than for the E. coli lac DNA site. Nucleic Acids Res. 1989;17:10295–10305. [PMC free article] [PubMed]
  • Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970;19:99–100. [PubMed]
  • Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512. [PubMed]
  • Gelfand MS. Prediction of function in DNA sequence analysis. J Comput Biol. 1995;2:87–115. [PubMed]
  • Gelfand MS, Koonin EV, Mironov AA. Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Res. 2000;28:695–705. [PMC free article] [PubMed]
  • Gralla JD, Collado-Vides J. Organization and function of transcription regulatory elements. In: Neidhardt FC, editor. Cellular and molecular biology: Escherichia coli and Salmonella. 2nd ed. Washington, DC.: American Society for Microbiology; 1996. pp. 1232–1245.
  • Gunasekera A, Ebright YW, Ebright RH. DNA sequence determinants for binding of the Escherichia coli catabolite gene activator protein. J Biol Chem. 1992;267:14713–14720. [PubMed]
  • Gutell PR, Power A, Hertz GZ, Putz E, Stormo GD. Identifying constraints on the higher-order structure of RNA: Continued development and application of comparative analysis methods. Nucleic Acids Res. 1992;20:5785–5795. [PMC free article] [PubMed]
  • Hattori T, Takahashi K, Nakanishi T, Ohta H, Fukui K, Taniguchi S, Takigawa M. Novel FNR homologs identified in four representative oral facultative anaerobes: Capnocytophaga Ochracea, Capnocytophaga sputigena, Haemophilus aphrophilus, and Actinobacillus actinomycetemcomitans. FEMS Microbiol Lett. 1996;137:213–220. [PubMed]
  • Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–577. [PubMed]
  • Hertz GZ, Hartzell GW, III, Stormo GD. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci. 1990;6:81–92. [PubMed]
  • Himmelreich R, Plagens H, Hilbert H, Reiner B, Herrmann R. Comparative analysis of the genomes of the bacteria Mycoplasma pneumoniae and Mycoplasma genitalium. Nucleic Acids Res. 1997;25:701–712. [PMC free article] [PubMed]
  • Hlavacek MS, Savageau MA. Rules for coupled expression of regulator and effector genes in inducible circuits. J Mol Biol. 1996;255:121–139. [PubMed]
  • Huynen MA, Bork P. Measuring genome evolution. Proc Natl Acad Sci USA. 1998;95:5849–5856. [PMC free article] [PubMed]
  • Jennings MP, Beacham IR. Co-dependent positive regulation of the ansB promoter of Escherichia coli by CRP and the FNR protein: A molecular analysis. Mol Microbiol. 1993;9:155–164. [PubMed]
  • Kolb A, Busby S, Buc H, Garges S, Adhya S. Transcriptional regulation by cAMP and its receptor protein. Annu Rev Biochem. 1993;62:749–795. [PubMed]
  • Koonin EV, Galperin MY. Prokaryotic genomes: The emerging paradigm of genome-based microbiology. Curr Opin Genet Dev. 1997;7:757–763. [PubMed]
  • Koonin EV, Tatusov RL, Galperin MY. Beyond complete genomes: From sequence to structure and function. Curr Opin Struct Biol. 1998;8:355–363. [PubMed]
  • McGuire AM, Hughes JD, Church GM. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res. 2000;10:744–757. [PubMed]
  • Mironov AA, Koonin EV, Roytberg MA, Gelfand MS. Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes. Nucleic Acids Res. 1999;27:2981–2989. [PMC free article] [PubMed]
  • Moreno-Hagelsieb, G., Trevino, V., Perez-Rueda, E., Smith, T.F., and Collado-Vides, J. 2001. Transcription unit conservation in the three domains of life: A perspective from Escherichia coli. Trends Genet. (in press). [PubMed]
  • Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci. 1999;96:2896–2901. [PMC free article] [PubMed]
  • Parkinson G, Wilson C, Gunasekera A, Ebright YW, Ebright RE, Berman HM. Structure of the CAP-DNA complex at 2.5 A resolution: A complete picture of the protein–DNA interface. J Mol Biol. 1996;260:395–408. [PubMed]
  • Perez-Rueda E, Collado-Vides J. The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res. 2000;28:1838–1847. [PMC free article] [PubMed]
  • Riley M. Functions of the gene products of Escherichia coli. Microbiol Rev. 1993;57:862–952. [PMC free article] [PubMed]
  • Riley M, Labedan B. Escherichia coli gene products: Physiological functions and common ancestries. In: Neidhardt FN, et al., editors. Escherichia coli and Salmonella: Cellular and Molecular Biology. Washington, DC.: American Society for Microbiology; 1996. pp. 2118–2202.
  • Salgado H, Santos-Zavaleta A, Gama-Castro S, Millán-Zárate D, Blattner FR, Collado-Vides J. RegulonDB (version 3.0): Transcriptional regulation and operon organization in Escherichia coli. Nucleic Acids Res. 2000a;28:65–67. [PMC free article] [PubMed]
  • Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. Operons in Escherichia coli: Genomic analyses and predictions. Proc Natl Acad Sci. 2000b;97:6652–6657. [PMC free article] [PubMed]
  • Sawers G, Kaiser M, Sirko A, Freundlich M. Transcriptional activation by FNR and CRP: Reciprocity of binding site recognition. Mol Microbiol. 1997;23:835–845. [PubMed]
  • Schneider TD, Stephens RM. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100. [PMC free article] [PubMed]
  • Scott S, Busby S, Beacham I. Transcriptional co-activation at the ansB promoters: Involvement of the activating regions of CRP and FNR when bound in tandem. Mol Microbiol. 1995;18:521–531. [PubMed]
  • Song S, Park C. Organization and regulation of the D-xylose operons in Escherichia coli K-12: xylR acts as a transcriptional activator. J Bacteriol. 1997;179:7025–7032. [PMC free article] [PubMed]
  • Spiro S, Gaston KL, Bell AI, Robers RE, Busby SJ, Guest JR. Interconvention of the DNA-binding specificities of two related transcription regulators, CRP and FNR. Mol Microbiol. 1990;4:1831–1838. [PubMed]
  • Sumiya M, Davis EO, Packman LC, McDonald TP, Henderson PJF. Molecular genetics of a receptor protein for D-xylose, encoded by the gene xylF in Escherichia coli. Receptors Channels. 1995;3:117–128. [PubMed]
  • Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, Koonin EV. Curr Biol. 1996;6:279–291. [PubMed]
  • Thieffry D, Salgado H, Huerta AM, Collado-Vides J. Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12. Bioinformatics. 1998a;14:391–400. [PubMed]
  • Thieffry D, Huerta AM, Perez-Rueda E, Collado-Vides J. From specific gene regulation to genomic networks: A global analysis of transcriptional regulation in Escherichia coli. Bioessays. 1998b;20:433–440. [PubMed]
  • Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. The CLUSTAL_X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997;25:4876–4882. [PMC free article] [PubMed]
  • Wall L, Christiansen T, Schwartz RL. Programming Perl. Sebastopol, CA: O'Reilly and Associates; 1996.
  • Yada T, Nakao M, Totoki Y, Nakai K. Modeling and predicting transcription units of Escherichia coli genes using hidden Mokov models. Bioinformatics. 1999;15:987–993. [PubMed]

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press
PubReader format: click here to try


Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...


Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...