Copyright © 2005 Yeang et al.; licensee BioMed Central Ltd.

Validation and refinement of gene-regulatory pathways on a network of physical interactions

1 Center for Biomolecular Science and Engineering, Baskin School of Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064, USA
2 Department of Bioengineering, University of California at San Diego, La Jolla, CA 92093, USA
3 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Corresponding author. #Contributed equally.

Chen-Hsiang Yeang: chyeang/at/soe.ucsc.edu; H Craig Mak: cmak/at/biomail.ucsd.edu; Scott McCuine: scott/at/bioeng.ucsd.edu; Christopher Workman: cworkman/at/bioeng.ucsd.edu; Tommi Jaakkola: tommi/at/csail.mit.edu; Trey Ideker: trey/at/bioeng.ucsd.edu

Received March 9, 2005; Revised May 3, 2005; Accepted June 3, 2005.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

As genome-scale measurements lead to increasingly complex models of gene regulation, systematic approaches are needed to validate and refine these models. Towards this goal, we describe an automated procedure for prioritizing genetic perturbations in order to discriminate optimally between alternative models of a gene-regulatory network. Using this procedure, we evaluate 38 candidate regulatory networks in yeast and perform four high-priority gene knockout experiments. The refined networks support previously unknown regulatory mechanisms downstream of SOK2 and SWI4.

Background

Recent advances in genomics and computational biology are enabling construction of large-scale models of gene-regulatory networks.
High-throughput technologies such as automated sequencing [1], gene-expression arrays [2], chromatin immunoprecipitation [3], and yeast two-hybrid assays [4] each probe different aspects of the gene-regulatory system through genome-wide datasets. These data have spawned a variety of methods to infer the structure of gene-regulatory networks or to study their high-level properties, as recently reviewed [5]. Regulatory network models generated thus far in Escherichia coli and budding yeast (Saccharomyces cerevisiae) have most often been validated against functional databases or previous literature [6,7]. In contrast, only a few studies have attempted to validate or refine models systematically [8-11]. However, if we are to accurately model large gene networks in complex organisms, including fly, worm, mouse, and human, automated procedures will be essential for analyzing the network, choosing the best new experiments to test the model, conducting the experiments, and integrating the resulting data.

The problem of choosing the best experiments to estimate a model, termed 'experimental design' or 'active learning', has been a significant area of research in statistics and machine learning [12-14]. Automating the experimental design process can greatly accelerate data collection and model building, leading to substantial savings in time, materials, and human effort. For these reasons, many industries such as electronic circuit fabrication and airplane manufacturing incorporate experimental design as an integral step in the design process [15,16]. A promising application of experimental design for biological systems was presented by King et al. [17], who integrated computational modeling and experimental design to reconstruct a small, well-studied metabolic pathway. Whether automated experimental design can be useful in a large and poorly characterized biological system with noisy data remains an open question.
We recently reported a procedure for inferring gene-regulatory network models by integrating gene-expression profiles with high-throughput measurements of protein interactions [18]. Here we extend this procedure to incorporate automated design of new experiments. First, we use the previously described modeling procedure to generate a library of models corresponding to different gene-regulatory systems in yeast. Many of these models contain transcriptional interactions for which the regulatory effects (inducer versus repressor) are ambiguous and cannot be determined from publicly available expression profiles. Next, to address these ambiguities we implement a score function that ranks possible genetic perturbation experiments on the basis of their projected information content over the models. We perform four of the highest-ranking perturbations experimentally and integrate the data back into the model. The new data support two out of three novel regulatory pathways predicted to mediate expression changes downstream of the yeast transcriptional regulator SWI4.

Results

Summary of physical regulatory models

We applied a previously described network-modeling procedure [18] to integrate three complementary sources of gene-regulatory information in yeast: 5,558 promoter-binding interactions for 106 transcription factors measured using chromatin immunoprecipitation followed by microarray chip hybridization (ChIP-chip) [3]; the set of all 15,116 pairwise protein-protein interactions recorded in the Database of Interacting Proteins as of April 2004 [19]; and a panel of mRNA expression profiles for 273 individual gene-deletion experiments [20]. Software for performing the network-modeling procedure is available as a plug-in to the Cytoscape package [21,22] on our supplementary website [23].
For each gene-deletion experiment, the modeling procedure identified the most probable paths of protein-protein and promoter-binding interactions that connect the deleted gene (the perturbation) to genes that were differentially expressed in response to the deletion (the effects of perturbation). Thus, a path represented one possible physical explanation by which a deleted gene regulates a second gene downstream. From the expression data, each interaction on a path was annotated with its probable direction of information flow and its probable regulatory effect as an inducer or repressor. For example, the model in Figure 1a [...]
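As a rough illustration of why some regulatory effects are determined by the data while others remain ambiguous, the sketch below enumerates inducer/repressor assignments along a single path. It assumes a simplified sign convention (deleting the gene at the start of a path changes the target by the negative of the product of the edge signs); the function name `consistent_annotations` is hypothetical, and the actual procedure in [18] is probabilistic rather than this hard-constraint toy.

```python
from itertools import product
from math import prod


def consistent_annotations(n_edges, observed_change):
    """List the edge-sign assignments (+1 inducer, -1 repressor) under which a
    path of n_edges interactions explains the observed change (+1 up, -1 down)
    in the target gene after deleting the gene at the start of the path."""
    # Assumed convention: deleting the source gene flips the net path effect.
    return [signs for signs in product((+1, -1), repeat=n_edges)
            if -prod(signs) == observed_change]


# One-edge path, target goes down: the single edge must be an inducer.
print(consistent_annotations(1, -1))
# Two-edge path, target goes down: two assignments remain, so each edge
# on its own could still be an inducer or a repressor.
print(consistent_annotations(2, -1))
```

Under this toy convention, longer paths leave more sign assignments open, which is the kind of ambiguity that additional knockouts of intermediate genes are meant to resolve.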
In total, the modeling process generated 4,836 paths, each explaining expression changes for a particular gene in one or more knockout experiments. Of the 965 interactions covered by paths, 194 had regulatory effects that were uniquely determined by the data, while regulatory effects of the remaining 771 interactions were ambiguous. For example, Figure 1b [...]

Paths with ambiguous interactions were partitioned into 37 independent network models (numbered 1-37), where each model contained a distinct region of the physical network (see Materials and methods and Additional data file 1). The remaining non-ambiguous paths were grouped into a single model (Model 0). As shown in Table 1, 21 of the models (55%) contained pathways that are well documented in the literature or are significantly enriched for genes belonging to specific Munich Information Center for Protein Sequences (MIPS) [25] functional categories. Of 132 protein-DNA interactions incorporated into Model 0, we found that 50 had been confirmed in classical (low-throughput) assays as reported in the Proteome BioKnowledge Library [26]. Moreover, the inferred regulatory roles (induction or repression) for 48 out of 50 of these interactions agreed with their experimentally determined roles (96%, binomial p-value < 1.22 × 10^-7). Wiring diagrams for Models 0 and 1 are given in Figure 1; [...]
Experiment selection

As shown in Figure 2, [...]
Among the highest-priority experiments, Model 1 (Figure 1b) [...]

Model validation

Knockout strains corresponding to the high-ranking perturbations sok2Δ, yap6Δ, hap4Δ, and msn4Δ were grown in quadruplicate under conditions identical to those for the initial 273 knockouts by Hughes et al. [20]. Gene-expression profiles were obtained for each knockout culture versus wild type using yeast genome microarrays. We sought to test the three regulatory cascades leading from SWI4 to SOK2 to either MSN4, HAP4, or YAP6 (Figure 1b). As shown in Figure 3a, [...]
Results were qualitatively similar for the HAP4 pathway (Figure 3b).

Automated model refinement

We used our modeling procedure to construct a new physical network model using the original 273 knockout gene-expression experiments of Hughes et al. combined with the new sok2Δ, hap4Δ, msn4Δ, and yap6Δ profiles. Overall, 60 protein-DNA interactions were disambiguated by our data: 50 interactions were resolved as definite inducers or repressors, whereas ten interactions were removed from the model because the expression of downstream genes did not change as a result of the knockout. In the updated Model 1, MSN4 and HAP4 were unambiguously annotated as inducers of downstream genes, SOK2 was annotated as a repressor of MSN4 and HAP4, and SWI4 was annotated as an inducer of SOK2 (Figure 3e).

Learning-curve analysis

We quantified the efficiency of our information-based approach by comparing it to two other methods of prioritizing gene knockout experiments: prioritizing hubs and prioritizing genes randomly. First, we generated a 'reference' model by fixing each ambiguous interaction in Models 1-37 to be an inducer or repressor. Assignments were chosen arbitrarily from the set of annotations that were consistent with the original knockout data. Next, we used each method (information, hub, or random) to iteratively 'learn' these assignments. In each iteration, we selected the highest-priority knockout experiment, simulated the resulting expression changes (up/down) using the reference model, updated the inferred model, and recorded the number of ambiguous interactions that were resolved. This iterative learning procedure was repeated 100 times. As shown in Figure 4, [...]
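The iterative learning loop described in the learning-curve analysis can be sketched on a toy problem. The example below is only an illustration under stated assumptions: a hypothetical five-gene chain G0 → G1 → ... → G4 whose four edge signs are unknown except that the net effect of the whole path is activating, a uniform prior over the remaining sign assignments, and deterministic up/down predictions. All names (`predict`, `resolved`, `learning_curve`, `by_information`, `by_random`) are invented for this sketch, and the hub-based strategy is omitted because every intermediate gene in a chain has the same degree.

```python
import math
import random
from itertools import product

N_EDGES = 4  # hypothetical chain G0 -> G1 -> ... -> G4; edge i links G_i to G_(i+1)


def predict(model, k):
    """Predicted sign changes of the genes downstream of G_k after deleting G_k."""
    out, effect = [], -1
    for sign in model[k:]:
        effect *= sign
        out.append(effect)
    return tuple(out)


def resolved(hypotheses):
    """Number of edges whose sign agrees across all remaining hypotheses."""
    return sum(len({m[i] for m in hypotheses}) == 1 for i in range(N_EDGES))


def info_gain(hypotheses, k):
    """Entropy (bits) of the predicted outcomes under a uniform prior; with
    deterministic predictions this equals the expected information gain."""
    counts = {}
    for m in hypotheses:
        y = predict(m, k)
        counts[y] = counts.get(y, 0) + 1
    n = len(hypotheses)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def learning_curve(reference, choose, candidates, start):
    """Pick knockouts one at a time, simulate each outcome with the reference
    model, discard inconsistent hypotheses, and record the resolved edges."""
    hypotheses, remaining, curve = list(start), list(candidates), []
    while remaining:
        k = choose(hypotheses, remaining)
        remaining.remove(k)
        observed = predict(reference, k)
        hypotheses = [m for m in hypotheses if predict(m, k) == observed]
        curve.append(resolved(hypotheses))
    return curve


def by_information(hypotheses, remaining):
    return max(remaining, key=lambda k: info_gain(hypotheses, k))


def by_random(hypotheses, remaining, rng=random.Random(0)):
    return rng.choice(remaining)


# Sign assignments consistent with the assumed prior knowledge (net activating path).
start = [m for m in product((+1, -1), repeat=N_EDGES) if math.prod(m) == +1]
reference = start[0]    # arbitrary choice among the consistent assignments
candidates = [1, 2, 3]  # intermediate genes G1..G3

print("information:", learning_curve(reference, by_information, candidates, start))
print("random:     ", learning_curve(reference, by_random, candidates, start))
```

In this toy chain the information-based rule picks the most upstream intermediate gene first, because its deletion yields the most distinct predicted profiles; a single run is shown here, whereas the analysis in the text repeats the procedure 100 times.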
Discussion

We have used global expression profiles to validate models of transcriptional regulation inferred from protein-protein interactions, genome-wide location analysis, and expression data. A previously described network inference algorithm [18] identifies probable paths of physical interactions connecting a gene knockout to genes that are differentially expressed as a result of that knockout. The proposed validation strategy uses information gain as a criterion for choosing optimal knockouts to profile using microarray experiments. This strategy agrees with intuition, in that optimal knockouts typically target intermediate genes along the pathways under consideration. If an intermediate gene knockout fails to affect downstream genes in a pathway, that pathway is removed from the model.

The validated pathways point to a combination of previously documented and novel findings. First, in agreement with previous literature, we confirm that MSN4 and HAP4 are inducers [27,28] and that SOK2 is a repressor [29]. For instance, SOK2 is known to act downstream of protein kinase A (PKA) to repress genes involved in stress response, glycogen storage, and pseudohyphal growth [29]. However, although SOK2 is thought to control these pathways via a transcriptional cascade, the components of this cascade have remained unclear. Here, we provide evidence for a model in which SOK2 acts as a negative regulator upstream of MSN4 and HAP4. Interestingly, MSN4 has been shown to activate stress-response genes [28], and HAP4 has been shown to activate genes involved in energy conservation and oxidative carbohydrate metabolism [27]. Thus, we have identified a candidate model for the transcriptional cascade downstream of PKA signaling that mediates stress response. This model includes two novel regulatory pathways from SWI4 to SOK2 to MSN4 and from SWI4 to SOK2 to HAP4. The validation experiments do not support the third predicted pathway from SWI4 to SOK2 to YAP6.
In model simulations, choosing new gene knockout experiments with an information-theoretic approach significantly outperformed both random and hub-based selection. It also outperformed the observed experimental results: approximately 280 interactions were disambiguated after four simulated knockouts (Figure 4), [...]

An important limitation of the single-gene knockout approach is that single perturbations do not identify pathway intermediates for which loss of function can be compensated by another gene. Furthermore, our approach may not identify regulatory pathways in which several transcription factors independently activate gene expression. Applying knockouts in combination may prove fruitful in these cases. For instance, approximately 4,000 double knockouts have been reported in yeast that lead to synthetic lethality: that is, a lethal phenotype that is not observed in either of the single knockouts individually [30]. These interactions suggest regulatory relationships which could be incorporated into future work.

Conclusion

Scientific discovery is an iterative process of building models to explain experimental observations and validating models with new experiments [31]. Experimental design is the essential link between these two aspects. Here we have explored a framework for modeling transcriptional networks in which experimental design and validation are central features. This framework is based on computational analysis and expression microarrays, both of which are amenable to automation, suggesting a high-throughput strategy for mapping gene-regulatory pathways.

Materials and methods

Model building and inference

Physical mechanisms of transcriptional regulation were modeled using an approach described previously [18]. Briefly, we postulated that the regulatory effects of deleting a gene are propagated along paths of physical interactions (protein-protein and protein-DNA).
We formalized the properties of these paths and interactions using a factor graph [32] and found the most probable set of paths using the max-product algorithm [32]. The resulting set of paths was partitioned into independent network models, also as described previously [18]. The raw data used in the modeling procedure included 5,558 promoter-binding interactions (at p-value < 0.001) for 106 transcription factors [3], the set of all 15,166 pairwise protein-protein interactions recorded in the Database of Interacting Proteins as of April 2004 [19], and mRNA expression profiles for 273 individual gene deletion experiments [20]. Expression changes with a p-value < 0.02 were considered significant.

Experiment scoring

We calculated the expected information gain for each of the 4,756 possible non-lethal single-gene deletion experiments that were not included in the set of 273 deletions used to generate our network models. Intuitively, information gain measures (the logarithm of) the number of ambiguous annotations in the model that are likely to be determined after generating a yeast-genome expression profile in response to a particular gene deletion under consideration. Each gene-deletion experiment predicts a distinct expression profile given a particular configuration of model annotations. Experiments with high information gain are those for which the predicted expression profiles are highly variable over the set of possible annotations. In these cases, only one (or at most a few) of the predicted profiles will match the true observed profile, efficiently constraining the space of possible model annotations.

The information gain discussed above arises from the expected value of information calculations in statistical decision theory [12]. Here we describe the score more directly in terms of reduction of model entropy.
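The intuition in the Experiment scoring paragraphs can be made concrete with a small sketch. It assumes a hypothetical three-gene chain A → B → C in which prior data fix only the net (activating) effect of the path, a uniform prior over the consistent sign assignments, and deterministic predicted profiles; it computes entropy before minus expected entropy after an experiment by direct enumeration. The names `EDGES`, `predict`, and `expected_information_gain` are invented for illustration, and the simplifying approximations used by the authors for genome-scale models are not reproduced here.

```python
import math
from itertools import product

EDGES = [("A", "B"), ("B", "C")]  # hypothetical chain, listed in path order


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def predict(model, deleted):
    """Predicted sign change of each gene downstream of the deleted gene.
    model[i] is the sign (+1 inducer, -1 repressor) of EDGES[i]."""
    changes, effect, current = {}, -1, deleted
    for (src, dst), sign in zip(EDGES, model):
        if src == current:
            effect *= sign
            changes[dst] = effect
            current = dst
    return tuple(sorted(changes.items()))


def expected_information_gain(models, prior, deleted):
    """Entropy of the model prior minus the expected entropy after observing
    the (deterministically predicted) profile for deleting one gene."""
    groups = {}
    for m, p in zip(models, prior):
        groups.setdefault(predict(m, deleted), []).append(p)
    h_after = 0.0
    for ps in groups.values():
        p_y = sum(ps)  # probability of this predicted profile
        h_after += p_y * entropy([p / p_y for p in ps])
    return entropy(prior) - h_after


# Assumed prior knowledge: the net effect of A on C is activating.
models = [m for m in product((+1, -1), repeat=len(EDGES)) if m[0] * m[1] == +1]
prior = [1 / len(models)] * len(models)

for gene in ("B", "C"):
    print(gene, expected_information_gain(models, prior, gene))
```

In this toy case deleting the intermediate gene B separates the two remaining sign assignments (a gain of one bit), whereas deleting the terminal gene C predicts the same profile under both and gains nothing.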
The entropy of a set of ambiguous model annotations is given by:

$$H(M) = -\sum_{M} P(M) \log P(M)$$

The expected information gain is the difference between the entropies before and after a hypothetical experiment:

$$I(M; Y_e) = H(M) - H(M \mid Y_e)$$

where Y_e denotes the vector of predicted expression changes for each gene in the model under experiment e. The conditional entropy H(M|Y_e) requires us to consider all possible models and corresponding outcomes resulting from experiment e. Direct enumeration of all values of M and Y_e is impractical; instead, we make several simplifying approximations as described at [23].

Expression profiling

Expression profiling experiments were based on the wild-type diploid BY4743 and homozygous gene knockout strains derived from this parent [33] (Invitrogen), with cultures grown identically to those of Hughes et al. [20]. Labeled cDNA from each gene knockout strain was co-hybridized versus wild-type cDNA in quadruplicate two-color microarray hybridizations. Total RNA was isolated by hot acid phenol extraction, purified to mRNA (Ambion PolyAPure kits), and labeled with Cy3 or Cy5 by direct incorporation (Amersham CyScribe First-Strand cDNA Labeling Kit). DNA microarrays were spotted from the Yeast Genome Oligo Set v1.1 (Qiagen) on Corning UltraGAPS slides using an OmniGrid 100 robot (Genomic Solutions). Lyophilized Cy3- and Cy5-labeled samples were resuspended in 50 μl buffer (5× SSC, 0.1% SDS, 1× Denhardt's solution, 25% formamide) and co-hybridized at 42°C beneath a coverslip for 15 h. Arrays were imaged at 10 μm resolution using a ScanArray Lite instrument (PerkinElmer). Raw quantitated background intensities were smoothed using a 7 × 7 median filter, separately for the Cy3 and Cy5 channels, and data were corrected for cyanine-dye dependent bias using a Qspline normalization [34]. The VERA/SAM package [34] was used to assign a log-likelihood statistic λ to each gene, indicating its significance of differential expression in each experiment.
Microarray expression data are deposited in the ArrayExpress database [35] under accession numbers A-MEXP-217 (Arrays) and E-MEXP-351 (Experiments).

Expression coherence

The expression coherence of a set of genes measures whether the expression levels of these genes behave similarly in a particular experiment. Each gene i in gene-deletion experiment e has an expression ratio r_ie (versus wild type) and associated p-value p_ie of differential expression. First, we filter out insignificant expression changes with a p-value > 0.5. Then, we use the inverse Gaussian cumulative distribution function, Φ^-1, to convert each remaining p-value into a z-score [36,37]:

$$z_{ie} = \Phi^{-1}(1 - p_{ie})$$

Next, we compute a 'signed z-score' by multiplying z by +1 if the expression level is increasing and by -1 if it is decreasing. The average signed z-score for a gene subset of size N is computed as: ![]()

Gene sets with expression changes that are significant and in the same direction result in large Z-values. A distribution of Z values obtained from random gene sets of size N was used to determine a p-value for each expression coherence score.

Additional data files

Additional data are available with the online version of this paper. Additional data file 1 contains Tables S1-S4 and wiring diagram illustrations for Models 0-44. Table S1 gives the internal validation for 17 out of 24 restricted network models; Table S2 lists the correlations between swi4Δ and gcn4Δ data and Rosetta and the new experiments; Table S3 gives the restricted subsets used to evaluate the reproducibility; and Table S4 gives the gene sets for external validation.
Acknowledgements

We are grateful to Owen Ozier, Ryan Kelley, and Rowan Christmas for their valuable assistance with model visualization, and to Julia Zeitlinger for commenting on the manuscript. C.M., C.W., and S.M. were supported by NIGMS grant GM070743-01 and NSF grant CCF-0425926. T.I. was supported by a David and Lucile Packard Fellowship award. C.Y. and T.J. were supported in part by NIH grants GM68762 and GM69676.