Copyright: © 2007 Faith et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles

1 Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
2 Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
3 Boston University School of Medicine, Boston, Massachusetts, United States of America
4 Department of Computer and Systems Science A. Ruberti, University of Rome, La Sapienza, Rome, Italy
5 Cellicon Biotechnologies, Boston, Massachusetts, United States of America

Andre Levchenko, Academic Editor, Johns Hopkins University, United States of America
#Contributed equally.
* To whom correspondence should be addressed. E-mail: tgardner/at/bu.edu
Received May 3, 2006; Accepted November 7, 2006.

Abstract

Machine learning approaches offer the potential to systematically identify transcriptional regulatory interactions from a compendium of microarray expression profiles. However, experimental validation of the performance of these methods at the genome scale has remained elusive. Here we assess the global performance of four existing classes of inference algorithms using 445 Escherichia coli Affymetrix arrays and 3,216 known E. coli regulatory interactions from RegulonDB. We also developed and applied the context likelihood of relatedness (CLR) algorithm, a novel extension of the relevance networks class of algorithms. CLR demonstrates an average precision gain of 36% relative to the next-best performing algorithm. 
At a 60% true positive rate, CLR identifies 1,079 regulatory interactions, of which 338 were in the previously known network and 741 were novel predictions. We tested the predicted interactions for three transcription factors with chromatin immunoprecipitation, confirming 21 novel interactions and verifying our RegulonDB-based performance estimates. CLR also identified a regulatory link providing central metabolic control of iron transport, which we confirmed with real-time quantitative PCR. The compendium of expression data compiled in this study, coupled with RegulonDB, provides a valuable model system for further improvement of network inference algorithms using experimental data. Author Summary Organisms can adapt to changing environments—becoming more virulent, for example, or activating stress responses—thanks to a flexible gene expression program controlled by the dynamic interactions of hundreds of transcriptional regulators. To unravel this regulatory complexity, multiple computational algorithms have been developed to analyze gene expression profiles and detect dependencies among genes over different conditions. It has been difficult to judge whether these algorithms can generate accurate global maps of regulatory interactions, however, because of the absence of a model organism with both a compendium of gene expression data and a corresponding network of experimentally determined regulatory interactions. To address this issue, we assembled 445 Escherichia coli microarrays, applied four classes of inference algorithms to the dataset, and validated the predictions against 3,216 experimentally determined E. coli interactions. The top-performing algorithm identifies 1,079 regulatory interactions at a confidence level of 60% or higher. Of these predicted interactions, 741 are novel and illuminate the regulation of amino acid biosynthesis, flagella biosynthesis, osmotic stress response, antibiotic resistance, and iron regulation. 
By defining the capabilities and limitations of network inference algorithms for large-scale mapping of prokaryotic regulatory networks, our work should facilitate their application to the mapping of novel microbes. Introduction High-throughput genome sequencing and bioinformatics technologies have dramatically eased the task of genomic annotation, producing parts lists of living organisms as simple as Mycoplasmas and as complex as mammals. Further progress in the understanding of an organism's biology requires development and refinement of techniques to determine the dynamic interactions among an organism's molecular parts [1]. A major difficulty of this task is the context-specific nature of gene regulation. The total space of possible transcriptional regulatory interactions for an organism is the number of transcription factors multiplied by the number of genes multiplied by the number of environmental contexts in which the cell might find itself. Methods to identify regulatory interactions must efficiently determine the thousands of true regulatory interactions out of the billions of possible ones. Pioneering efforts to identify regulatory interactions on a genome scale have used machine-learning algorithms to identify cis-regulatory motifs or transcription factor target genes using a large set of expression arrays [2–18], genome-wide location analysis chromatin immunoprecipitation (ChIP-Chip) [19,20], or a combination of these and other high-throughput methods [21–26]. In general, the precision of these methods has been evaluated by testing for functional enrichment of co-regulated genes, experimental confirmation of a few selected regulatory relationships, or cross-validation within the training dataset. However, experimental validation of the precision of these methods at the genome scale has remained elusive due to the lack of a model organism with both a known regulatory structure and compatible experimental data. 
Therefore, the relative merits and broader utility of these approaches remain difficult to judge. Here we demonstrate an unsupervised network inference method, context likelihood of relatedness (CLR), which uses transcriptional profiles of an organism across a diverse set of conditions to systematically determine transcriptional regulatory interactions. We take advantage of the extensive knowledge of transcriptional regulation in Escherichia coli to assess the performance of the CLR algorithm and several other algorithms on a genome scale. In E. coli, a set of 3,216 experimentally confirmed regulatory interactions among 1,211 genes have been curated in the RegulonDB database [27], which can be used for performance assessment. We assembled a compendium of 445 new and previously published E. coli Affymetrix Antisense2 microarray expression profiles collected under various conditions including pH changes, growth phases, antibiotics, heat shock, different media, varying oxygen concentrations, and numerous genetic perturbations (Figure 1
Results/Discussion The CLR Algorithm The CLR algorithm is an extension of the relevance networks approach [28,31] for identifying transcriptional regulatory interactions. Although the relevance networks algorithm was written primarily for clustering, its authors also suggest its utility for identifying regulatory networks [29]. The original relevance networks method used mutual information for scoring the similarity between the expression levels of two genes in a set of microarrays. A gene and a transcription factor are predicted to interact if the mutual information between the expression levels of the gene and its potential regulator is above some set threshold. Like correlation, mutual information is a metric that detects statistical dependence between two variables. But unlike correlation, it does not assume linearity, continuity, or other specific properties of the dependence [32,33]. As such, mutual information possesses the flexibility to detect regulatory interactions that might be missed by linear measures such as the correlation coefficient. In the relevance networks algorithm, there are tradeoffs between true positive and false positive rates in choosing a threshold for the identification of significant regulatory interactions. A high threshold results in a smaller network with fewer false positives, but it also eliminates potential novel interactions. Conversely, a low threshold will often capture false positive interactions due to a number of factors, including background correlation and misinterpretation of indirect dependence as direct interaction. The CLR algorithm builds upon the relevance network but applies an adaptive background correction step to eliminate false correlations and indirect influences (Figure 2
We applied CLR to the 4,345 genes on the E. coli Antisense2 microarray using the 445 profiles in the compendium to identify the gene targets of the E. coli transcription factors. For comparison, we ran several variants of commonly used network inference algorithms on the compendium data; the top-performing variant of each algorithm is shown in Figure 2 To score our results, we compared our predicted interactions with the set of 3,216 interactions in the RegulonDB database [27]. We computed two measures: recall, which is the fraction of the 3,216 known E. coli interactions that CLR successfully identified; and precision, which is the fraction of identified interactions that are true positives (Figure 2 CLR outperforms all other algorithms run on the compendium (Figure 2
Although we identified hundreds of known regulatory interactions correctly at high precision, this represents only a fraction of known interactions in E. coli. The recall of the algorithm depends on several factors, including the number and diversity of expression profiles. As discussed in Protocol S1, the CLR algorithm can achieve maximum recall and precision using as few as 60 expression profiles selected for maximum diversity (Figure 2 Discovery of Novel Regulatory Pathways We used the regulatory network predicted by CLR to explore and validate newly inferred regulatory interactions. We confirmed the algorithm's performance using two additional approaches (Figure 1 Further validation of CLR by sequence analysis of regulatory motifs. Using the set of gene targets predicted for each transcription factor, we applied sequence analysis algorithms to infer the sequence motif bound by each regulator. Not all transcription factors have enough targets to allow reliable motif detection, but for those that do, the motif provides a specific location for the regulatory interaction. A significant sequence motif for a group of genes provides an additional level of validation, as it is unlikely that the group of genes would share a common motif but not a common transcription factor regulator. To detect sequence motifs, we selected all transcription factors predicted to regulate five or more operons with at least a 60% confidence (61 total). For each group of operons regulated by the same transcription factor, we analyzed approximately 150 base pairs upstream of the transcription start site with the MEME multiple alignment system [34]. LexA, a major regulator of DNA repair, is known to have a single well-conserved DNA-binding motif. It is one of the best-perturbed regulators in the microarray compendium due to the compendium's emphasis on DNA-damaging conditions. 
Consequently, the LexA protein has a large set of correctly predicted targets and exhibits a highly significant motif almost identical to the known canonical LexA motif (Figure S3A). Five out of eight promoters containing the LexA motif in Figure S3A are known LexA targets according to RegulonDB. The other three promoters for dinI, dinP, and yebG are confirmed LexA targets [35] but are not catalogued in RegulonDB. Figures S3C and S3D illustrate this approach applied to two putative regulators, YmfN and YnaE. YmfN is a putative DNA-binding protein homologous to a phage terminase. We found a strong motif (p value ≈ 0.0061) in all six of the operons inferred for this transcription factor by CLR (Figure S3D). The gene ymfN attains its highest levels of expression in our compendium upon exposure to norfloxacin, a DNA-damaging bactericidal agent, and its inferred targets show enrichment in prophage and DNA repair categories (Table S1). YnaE (Rac prophage) is another putative DNA-binding protein. The latest computational annotation for YnaE available in EcoCyc (http://ecocyc.org/) suggests that its function is also phage-related [36]. There is enrichment for cold-shock response proteins in the predicted YnaE regulon (Table S2). Also present are rhsE, a stationary-phase survival-related protein, and b1374, a putative transposon resolvase. In the compendium, ynaE was highly expressed when Lon protease or YoeB toxin was genetically up-regulated, and when either norfloxacin antibiotic or mussel defensin protein was present. Based on our analysis, YnaE may control a small, specialized stress response network in E. coli. Overall, we were able to detect a significant (one-tailed p value < 0.05) binding motif for 28 out of the 61 transcription factors (Table S3). Of these regulators, 13 had a known motif in PRODORIC (http://prodoric.tu-bs.de/). We compared the predicted motifs for the 13 regulators to all known E. coli motifs. 
For seven of the 13 MEME-predicted motifs (54%), we identified the known motif in the top ten best matches (Table S3). Some correctly reconstructed regulators made a relatively poor match to their correct motif. In the majority of these six cases, including Lrp (rank = 15) (Figure S3B), this was due to the presence of combinatorial or conditional regulation. For Lrp, two motifs in the database (± leucine) were incorrectly collapsed into one by motif analysis. In other cases, a gene was regulated by multiple transcription factors, and the MEME motif analysis picks up the stronger motif in the sequence, which may not be the motif of the transcription factor we are looking for. This ambiguity in cases of combinatorial regulation is a limitation of motif analysis, because it only looks for statistical enrichment of regions of sequence and does not consider the actual binding properties of the transcription factor. In vivo confirmation of new regulatory interactions. We performed ChIP with quantitative PCR (ChIP-qPCR) to obtain physical confirmation for many of the regulatory interactions inferred by CLR and to verify that performance estimates based on the known subset of interactions in RegulonDB extrapolate beyond this subset. In particular, we studied three transcription factors (Lrp, PdhR, and FecI) with substantial connectivity in the network mapped by CLR. For each transcription factor, 26–35 operons with at least one inferred interaction were tested by ChIP-qPCR for a total of 93 tested operons (244 genes). We tested 24 known regulatory interactions as a positive control to verify that known interactions were detectable by ChIP for a total of 268 tested interactions. Network inference results are typically verified for a few hand-picked samples that are studied in detail. Thus, the biological intuition of the experimenter may play a role in the success of a hand-picked verification. RegulonDB provides a way to overcome this problem in E. coli. 
Because we have no control over the interactions present in RegulonDB, our algorithm performance estimates are unbiased by our selective validation. Likewise, when performing ChIP validation experiments, we wanted to be as unbiased as possible. Although choosing the three transcription factors is a form of hand-picking, the targets tested for those transcription factors were selected in a systematic way, by choosing approximately 30 of the highest-scoring targets of each of the three transcription factors. Choosing this many targets results in many interactions with confidence levels below 20% in our ChIP-qPCR experiments, allowing us to verify that our confidence estimates are reliable across the entire range of precision. Figure 5
A combinatorial link between central metabolism and iron transport. The inferred regulatory network revealed new combinatorial regulation at many promoters. We explored these combinatorial regulation schemes, first across the entire network (Protocol S1) and second by detailed real-time quantitative PCR analysis of the novel PdhR-fecA interaction, which is an interaction that links central metabolism to the control of iron import—a link of potential importance in bacterial virulence and stress protection. The presence of iron is essential for the survival of most organisms, because it plays a critical role in the tricarboxylic acid cycle, in electron transport, in reducing oxygen radicals, in DNA synthesis, and in amino acid synthesis [37]. Iron, however, is scarce in many environments because of the low solubility of its ferric form. Consequently, many organisms have developed elaborate mechanisms for scavenging soluble forms of the element. In E. coli K12, there are six different siderophore receptors, each representing a different chelator that is capable of capturing extracellular iron and converting it to a soluble form that may be transported into the cell [38]. Excess iron can be toxic to cells; iron uptake must therefore be carefully dictated by the need for cellular iron. fecABCDE is an operon that encodes a ferric citrate transporter and plays a central role in the import of cellular iron. Existing literature described only two regulators of fecABCDE—FecI and Fur. The Fur regulation is not apparent in the compendium (Figure 6
Inspection of the compendium data suggested that FecI and PdhR might regulate the fecA operon using AND-like logic, where both proteins must be activated for expression of the fecA operon (Figure 6 Such an explicit regulatory link between central metabolism and iron transport has not, to our knowledge, been previously identified in microbes. This link is perhaps not surprising, given that iron is a critical component of several proteins involved in both the tricarboxylic acid cycle (aconitase and succinate dehydrogenase) and electron transport (cytochromes and ferredoxin); the magnitude of carbon/electron flux through the citric acid cycle and electron transport chain thus plays a major role in determining the cellular need for iron. It is possible that an increase in intracellular pyruvate, which is the inducer for PdhR, may signal the need for increased flow through respiratory pathways. This novel role for pyruvate is plausible given that pyruvate serves as a common catabolite for a diverse collection of carbon sources and stands just one enzymatic step away from entering the tricarboxylic acid cycle itself. A Platform for Mapping Transcriptional Regulation This work rigorously assesses the genome-scale performance of multiple network inference algorithms using an experimental compendium of 445 expression profiles and a “gold standard” of known regulatory interactions. Using the top-performing CLR algorithm, we predicted 1,079 regulatory interactions in E. coli with 60% precision or higher (and 426 interactions at 80% precision or higher). ChIP experiments confirmed 21 CLR-inferred regulatory interactions, and real-time PCR analysis combined with the inferred regulatory map suggested that iron regulation and central metabolism are linked at the level of transcriptional regulation. 
We also showed that CLR could infer an equally precise network map using as few as 60 expression profiles, and our results help to address persistent questions concerning the optimal design of experiments for network mapping based on machine learning (Protocol S1). In recent years, ChIP techniques, particularly ChIP-Chip, have offered hope for systematic characterization of transcription factor binding in vivo. ChIP is particularly prone to errors in prokaryotes, necessitating a large number of expensive replicates [39]. Moreover, the results are condition-dependent, i.e., inactive transcription factors may not be identified because they may not bind to DNA. Finding the appropriate conditions for ChIP-Chip can be costly and time-consuming, making the comprehensive mapping of microbial transcriptional networks difficult. By generating a compendium of microarrays, we show that it is possible to infer a high-precision regulatory map and simultaneously obtain rich data on condition-specific regulation. With this conditional regulatory information, we can also make a more informed decision about when a transcription factor might be active in any follow-up ChIP, mass-spectrometry, or real-time PCR experiments. We suggest that E. coli, a long-standing model organism for the detailed study of small-scale regulatory circuits, can become a valuable model organism for large-scale regulatory network studies, by virtue of the availability of (1) a large and curated set of experimentally determined regulatory interactions; (2) a tested expression compendium; and (3) a reliable platform for the acquisition of additional expression data. Materials and Methods Microarray profiling. 
To explore pathways of particular importance to antibiotic resistance, we assayed 121 conditions using 266 microarrays, including more than 50 genetic perturbations (overexpression or knockout) during norfloxacin-induced DNA damage response, overexpression of the ccdB toxin, and growth to stationary phase on low and high glucose (Table S4). Bacterial strains. Fifty-three E. coli genes of interest were overexpressed in E. coli strain MG1655 (E. coli Genetic Stock Center, CGSC 6300) using a modified pBAD30 vector, pBADx53 [40]. pBADx53 has a low copy SC101 origin of replication, does not contain araC, and yields low and consistent levels of expression, generally increasing gene expression 2- to 10-fold above native expression levels. The 53 genes were PCR amplified from MG1655 genomic DNA. A ribosomal binding site was included at the start of the forward primer. The cloned genes were transformed into strain MG1655. Gene deletions were constructed from E. coli strain MG1655 by replacing the coding sequence from start codon to stop codon [41]. Gene deletion strains and overexpression plasmids were confirmed by DNA sequencing. Note that there is a known deletion around the fnr gene in strain CGSC 6300 [42]. Steady-state experiments. Gene deletion strains and pBADx53 overexpression strains were grown in 96 square-well plates containing 1.6 ml LB (Miller). LB media for the overexpression strains contained 0.125% arabinose to induce cloned gene expression and appropriate antibiotics to maintain the plasmid. Plates were incubated at 37 °C with shaking at 300 revolutions per minute (rpm). DNA damage responses were induced by growing perturbation strains for 3 h in norfloxacin (25–100 ng/ml). Cells were harvested when the optical density (OD600) for the cultures was between 0.25 and 0.40. Time-course experiments. For an antibiotic time-course experiment, cultures were grown in 250-ml flasks at 37 °C with shaking at 250 rpm. Each culture was grown in 75 ml of LB to 0.4 OD600. 
DNA damage was induced with 10 μg/ml of norfloxacin. Samples were taken before and 12, 24, 36, and 60 min after the addition of norfloxacin. For the glucose time series, E. coli EMG2 were diluted 1:1,000 into 150 ml LB (Miller) in 1-l baffled flasks supplemented with 0.2% or 0.4% glucose. Samples were taken 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, and 8 h post-incubation. To examine the effect of overexpression of the F-plasmid encoded toxin CcdB, we used a plasmid-borne riboregulation system that enables precise control of gene expression through highly specific RNA-RNA interactions [43]. A riboregulation system overexpressing LacZ was included as a control. Cells were diluted 1:1,000 in 50 ml LB (Miller) with appropriate antibiotics to maintain the plasmid. Samples were taken immediately before induction and then 30, 60, and 90 min after induction of CcdB or LacZ expression. Preparation of RNA and hybridization. RNA was prepared using Qiagen RNeasy kits (Valencia, California, United States). For time-course experiments, cultures were immediately added to 2 volumes of Qiagen RNAprotect reagent. For steady-state experiments, 1.5-ml cultures in multiwell plates were centrifuged at 3,000 g for 5 min at 4 °C. Media was poured off, and 500 μl of RNA protect was then immediately added to each cell pellet. cDNA was prepared and hybridized to the Affymetrix Antisense2 microarrays according to the standard Affymetrix prokaryotic sample and array processing protocol. External data. A literature search was performed to locate microarray datasets to expand the phenotypic diversity of the compendium. Preference was given to larger datasets (>10 chips). In total, raw Affymetrix CEL files for 179 microarrays were compiled from nine different publications (Figure 1 Microarray normalization. Raw probe intensities were normalized to gene expression levels using MAS5 (Affymetrix), RMA [44], GCRMA [45], and Dchip PM [46]. All methods were run using the default parameters. 
For GCRMA, the ad hoc algorithm was used instead of the full empirical Bayes method due to memory constraints arising from the size of the dataset. In our experience, RMA was the single best normalization method of the four that we tried, and the results presented for all algorithms use this normalization unless otherwise indicated. Data availability. The 212 Affymetrix CEL files generated from our own experiments have been submitted to Gene Expression Omnibus (GEO), the NCBI microarray database (http://ncbi.nlm.nih.gov/geo/). Raw and normalized data for all 445 microarrays are available at M3D (http://m3d.bu.edu). M3D provides a web interface for visualizing heat plots, histograms, and scatterplots for any subset of the genes and experiments in the compendium using any of the four normalization methods mentioned above. It also allows the download of the CLR-inferred network in several different formats at both 60% and 80% precision. Network inference algorithms. We adapted several existing methods suitable for whole-genome network mapping from expression data. These methods were relevance networks [29], ARACNe [2], and Bayesian networks [7]. In addition, we developed a novel method, CLR, which constitutes a background-corrected approach to relevance networks. All tested algorithms are available at the M3D Web site. Relevance networks, ARACNe, and CLR use mutual information as a metric of similarity between the expression profiles of two genes. The mutual information for two discrete random variables X and Y is defined as:
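In standard notation (assumed here, since the display equation is not reproduced in this text), with P(x_i, y_j) the joint probability that X takes its i-th value and Y its j-th value, and P(x_i) and P(y_j) the corresponding marginal probabilities, the definition reads:

```latex
\[
I(X;Y) \;=\; \sum_{i}\sum_{j} P(x_i, y_j)\,\log\frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}
\]
```

The sum is zero when X and Y are independent, because each ratio inside the logarithm is then equal to one, and it grows as the joint distribution departs from the product of the marginals.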
The performance of the ARACNe, Bayesian, and linear regression network algorithms improved if we inferred four networks, each network calculated from the compendium data that was normalized with a different one of the four normalization methods. For each edge, we then computed the mean of scores resulting from the four networks (Protocol S1). This averaging approach did not improve the results for relevance networks or CLR. Relevance networks. The relevance networks algorithm identifies a potential biological association as any regulator-target gene pair with a mutual information score between their expression profiles that is above a set threshold. Although originally intended as a form of clustering, we applied the algorithm to network inference by only keeping associations between transcription factors and genes. The original relevance networks algorithm generated one network at one threshold; for the algorithm comparison, we applied a range of thresholds to generate a precision versus recall curve. ARACNe. We obtained the authors' original implementation of the ARACNe algorithm [30] for Linux (http://amdec-bioinfo.cu-genome.org/html/aracneregistration.html). As recommended in the algorithm documentation, we restricted the tolerance threshold, τ, to between 0 and 0.15 and sampled this parameter evenly. As used by the ARACNe authors [2] and recommended in the distribution, we built mutual information tables using the “fast” method (sliding window/“naïve” estimator). In the original ARACNe algorithm, an edge is pruned when it falls outside of the tolerance threshold of every interaction triangle formed by applying the data processing inequality. We created a modified implementation of the ARACNe algorithm that uses the B-spline mutual information estimate and a probabilistic threshold to improve performance (Protocol S1). 
For the probabilistic threshold, we computed the frequencies of keeping each edge, on the basis of all of the data processing inequality comparisons in which it participated, and we pruned the network using these frequencies. We computed the mutual information matrix using every probe set (as we also did for CLR), including the intergenic regions, to make probabilistic scores using the largest possible distribution. CLR. Bias from uneven condition sampling, upstream regulation, and inter-laboratory variations in microarrays complicate network inference, because indirect regulatory influences and direct (physical) regulatory interactions may not be easily distinguishable from their expression profiles. Our new algorithm, CLR, increases the contrast between the physical interactions and the indirect relationships by taking the network context of each relationship into account. CLR uses the local network context to compute a significance estimate for any statistical metric of similarity between gene expression profiles. We have shown that CLR performs best with mutual information but also performs well with Pearson correlation (Protocol S1). Like relevance networks and ARACNe, CLR uses the matrix of mutual information values between all probe sets on the Affymetrix array. The CLR algorithm estimates a likelihood of the mutual information (MI) score for a particular pair of genes, i and j, by comparing the MI value for that pair of genes to a background distribution of MI values (the null model). The background distribution is constructed from two sets of MI values: {MIi}, the set of all the mutual information values for gene i (in row or column i), and {MIj}, the set of all the mutual information values for gene j (in row or column j) (Figure 2
, where Zi and Zj are the z-scores of MIij from the marginal distributions, and f(Zi,Zj) is the joint likelihood measure. We experimented with other approximations of the background MI distribution, including the generalized extreme-value distribution, the Rayleigh distribution, and the Gaussian kernel density estimator (empirical). These always gave similar results, but the more expensive distribution fits came at a cost in execution speed.
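The background-correction step can be sketched in a few lines of Python. The display equation is not reproduced in this text, so the form used below, f(Zi,Zj) = sqrt(Zi² + Zj²) with negative z-scores clipped to zero, is an assumption based on how the CLR score is usually stated. The function name `clr_scores` and the toy matrix are hypothetical, and for simplicity the diagonal of the matrix is left in each gene's background distribution.

```python
import numpy as np

def clr_scores(mi):
    """CLR-style scores from a symmetric mutual information (MI) matrix.

    Assumed form: each MI value is converted to a z-score against the
    row of gene i and against the row of gene j (Gaussian background),
    and the two z-scores are combined as sqrt(Zi^2 + Zj^2).
    """
    mi = np.asarray(mi, dtype=float)
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                 # avoid division by zero for flat rows
    z = (mi - mean) / std               # z[i, j]: MI_ij relative to gene i's background
    z = np.maximum(z, 0.0)              # assumption: below-background values count as zero
    return np.sqrt(z ** 2 + z.T ** 2)   # z.T[i, j]: MI_ij relative to gene j's background

# Toy 3-gene example: genes 0 and 1 share high MI, gene 2 is weakly related to both.
toy_mi = np.array([[0.0, 0.9, 0.1],
                   [0.9, 0.0, 0.1],
                   [0.1, 0.1, 0.0]])
toy_scores = clr_scores(toy_mi)
```

Because each pair is scored against the backgrounds of both genes, a pair whose MI is only average for either gene receives a low score even when its raw MI is high in absolute terms.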
Our method of estimating the likelihood of mutual information differs from the conventional ways of estimating significance. For example, both the analytical Roulston metric of significance [33] and shuffling [47] calculate the statistical significance given a random model of the interaction in question (Protocol S1). In contrast, CLR calculates the likelihood of mutual information given the observed network context, which consists of the background distribution formed by the mutual information for all possible incoming and outgoing edges for one gene in the network. Bayesian networks and linear regression networks. Unlike the pairwise algorithms described above, Bayesian networks exhaustively or heuristically search through the multivariate space of possible graphs (i.e., regulatory networks), scoring each, and keeping either the best-scoring network or a network constructed by averaging over all the searched graphs and weighting them by their score. This multivariate approach comes at a computational cost. None of the publicly available Bayesian network learning software packages that we tried was designed to infer a network of the size we attempted in this study, and none ran to completion. Those that did not crash due to memory problems were unable to generate a network after several weeks of computational time. For computational tractability, we wrote a Bayesian network algorithm that implemented a series of constraints. Every gene was restricted to having at most two regulators, and interactions were only allowed between transcription factors and genes. We tested several scoring functions for the algorithm: discrete (two-state, genes are OFF or ON), linear, logistic, polynomial approximation, and Hill function. Scores for the linear function were estimated with linear least-squares fitting. Scores for nonlinear functions were estimated with nonlinear least-squares fitting using the Levenberg-Marquardt algorithm. 
All scores were adjusted for the number of parameters using the Bayesian information criterion. Of the tested scoring methods, the linear function offered the best balance between speed and quality of reconstruction.

We used a model averaging procedure to score the likelihood of each edge in the regulatory network. Transcription factor/gene interactions were exhaustively scored as follows. For a particular gene Ai in a regulatory network allowing only one regulator per gene, the likelihood of being regulated by a given transcription factor Bj was calculated as
where k is indexed over all identified transcription factors. This function was generalized to the case of two transcription factors per gene to account for the scoring of the same transcription factor/gene interactions in multiple network models.
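The single-regulator case can be illustrated with a short Python sketch. It fits each candidate transcription factor to the target gene by linear least squares, scores the fit with the Bayesian information criterion, and normalizes exp(-BIC/2) over all candidates k. The exp(-BIC/2) weighting is the standard BIC approximation to a posterior model probability and is an assumption here, not necessarily the normalization used in this study; the function names `bic_linear` and `edge_weights` are hypothetical.

```python
import math


def bic_linear(x, y):
    """BIC of the least-squares fit y = a + b*x (lower is better)."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    rss = max(rss, 1e-12)  # guard against log(0) on a perfect fit
    k = 2  # fitted parameters: intercept and slope
    return n * math.log(rss / n) + k * math.log(n)


def edge_weights(target, tf_profiles):
    """Normalize exp(-BIC/2) over all candidate regulators of one target gene.

    target: expression values of the gene across conditions.
    tf_profiles: dict mapping a regulator name to its expression values.
    """
    bics = {tf: bic_linear(x, target) for tf, x in tf_profiles.items()}
    best = min(bics.values())
    # Subtracting the best BIC avoids overflow and leaves the ratios unchanged.
    raw = {tf: math.exp(-(b - best) / 2.0) for tf, b in bics.items()}
    total = sum(raw.values())
    return {tf: w / total for tf, w in raw.items()}
```

Because the weights are normalized over every candidate regulator, a transcription factor receives a high weight only when it explains the target clearly better than the alternatives.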
Transcription factor/transcription factor interactions were initially scored using a different approach. We sampled directed acyclic graphs of the transcription factor-only network using Metropolis Coupled Markov Chain Monte Carlo. We then applied the averaging approach above to the sampled networks. However, we found that in practice, both the speed and precision of the algorithm improved if the transcription factors were scored in the same way as for transcription factor/gene interactions (Protocol S1). This resulted in networks that were no longer directed acyclic graphs, and thus the algorithm was no longer a true Bayesian network and is referred to as a linear regression network.

Construction of the reference set of interactions. We obtained all known regulatory interactions catalogued in RegulonDB version 4 (http://regulondb.ccg.unam.mx/html/Data_Sets.jsp) [27]. Of the interactions, 2% could not be matched to probe sets on the expression array or had only self-regulation. We removed these interactions from the reference set. In addition, several genes thought to be E. coli regulators were present in our reference set but were not known to regulate anything within it. Therefore, we also removed these genes from all inferred networks when estimating the algorithm's performance, leaving 3,216 non-self interactions among 1,211 genes as the reference network. We also obtained a list of 328 putative and known transcription factors from RegulonDB.

Measurement of algorithm performance. To evaluate the performance of all algorithms, we constrained the resulting network maps to include only the genes available in our RegulonDB control set. We computed the precision and recall of the inferred networks by comparing the inferred network to the reference network.
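The comparison between an inferred network and the reference network can be sketched in Python as follows, using precision = TP/(TP + FP) and recall = TP/(TP + FN), both reported as percentages. The function name `precision_recall` and the representation of edges as (regulator, gene) tuples are assumptions made for illustration.

```python
def precision_recall(scored_edges, reference, threshold):
    """Return (precision, recall) in percent at one pruning threshold.

    scored_edges: dict mapping (regulator, gene) to an inferred score.
    reference: set of known (regulator, gene) interactions.
    Edges scoring below the threshold are pruned from the inferred network.
    """
    predicted = {edge for edge, score in scored_edges.items() if score >= threshold}
    tp = len(predicted & reference)  # predicted and known
    fp = len(predicted - reference)  # predicted but not known
    fn = len(reference - predicted)  # known but not predicted
    precision = 100.0 * tp / (tp + fp) if predicted else float("nan")
    recall = 100.0 * tp / (tp + fn) if reference else float("nan")
    return precision, recall
```

Sweeping the threshold from high to low typically trades precision for recall, which is how a precision-recall curve over pruning thresholds is traced.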
Precision is the fraction of predicted interactions that are correct [TP/(TP + FP)], and recall is the fraction of all known interactions that are discovered by the algorithm [TP/(TP + FN)], where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. Precision and recall were computed over a range of pruning thresholds; interactions with scores below the pruning threshold were removed from the inferred network. Both precision and recall are reported as percentages.

Figure S1: The Transcriptional Regulatory Map Inferred by CLR with an Estimated 80% Precision
The precision of the network is obtained by measuring the percentage of correctly inferred edges (blue lines) out of all the predicted edges for genes with known connectivity (blue lines and green lines). The green edges represent a mixture of false and novel predictions, making 80% an underestimate. The red edges are to genes without a previously identified regulator or from regulators without a previously known target. Transcription factor nodes are colored light gray. (162 KB PDF)

Figure S2: Transcription Factor Recall
Transcription factors with at least two inferred interactions (blue bar) have high recall (47% on average) of their known targets (green bar versus red bar); this suggests that when the transcription factors in the compendium are perturbed by the appropriate condition, much of that transcription factor's regulon is correctly identified. (88 KB PDF)

Figure S3: Motifs Detected for Four of the Transcription Factors with Five or More Target Operons
(A) The canonical LexA regulatory motif was detected in the promoters of eight out of the 13 genes inferred to be LexA targets.
(B) The canonical Lrp regulatory motif was also detected with high significance.
(C) A novel motif was found for YnaE, a transcription factor that may play a role in the regulation of a prophage or DNA repair.
(D) YmfN, another prophage-related transcription factor with no known regulatory targets, had a strong motif conserved in all of its predicted targets. (153 KB PDF)

Figure S4: Estimating the Distribution of Mutual Information
The distribution of mutual information for both genes of a potential regulatory interaction is used to estimate the significance of mutual information. The distribution of mutual information for one gene, lexA, illustrates different types of fit. Normal fit, while not the best approximation to the empirical distribution, penalizes the distal network neighborhood. (661 KB PDF)

Table S1: Functional Enrichment of YmfN Targets (34 KB DOC)

Table S2: Functional Enrichment of YnaE Targets (25 KB DOC)

Table S3: z-Scores of Motifs for Transcription Factors in the 60% Precision Network with ≥5 Predicted Operon Targets (194 KB DOC)

Table S4: The Clustered Microarrays of the E. coli Compendium (1.2 MB DOC)

Protocol S1 (810K, doc)

Acknowledgments
We thank Daniel Dwyer and Iris Keren for providing unpublished microarray data; Gwo-Yu Chuang for developing the motif comparison algorithm; Jadwiga Bienkowska, Martin Steffen, and Zoltan Szallasi for critically reading the manuscript; and David Lorenz for advice regarding the ChIP experiments.

Abbreviations
Footnotes

Competing interests. A portion of this work was conducted in collaboration with Cellicon Biotechnologies. JJC and TSG are founders and shareholders in the company. GC and JW are also shareholders in the company. All data, results, and algorithms from this collaboration have been made publicly available.

Author contributions. JJF, BH, JJC, and TSG conceived and designed the experiments. JJF, BH, JTT, IM, JW, and GC performed the experiments. JJF, BH, IM, SK, and TSG analyzed the data and contributed reagents/materials/analysis tools. JJF, BH, JTT, JJC, and TSG wrote the paper.

Funding. This work was supported by the Pharmaceutical Research and Manufacturers of America Foundation, the United States Department of Energy Office of Science (BER) grant number DE-FG02-04ER63803, the National Institutes of Health, National Science Foundation FIBR Award EF-0425719, the National Heart, Lung, and Blood Proteomics Initiative (HHSN268200248178C), the Whitaker Foundation, and Cellicon Biotechnologies, Inc.

References