Copyright © 2007, EMBO and Nature Publishing Group

Prediction of phenotype and gene expression for combinations of mutations

Institute for Systems Biology, Seattle, WA, USA

Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103, USA. Tel.: +1 206 732 1396; Fax: +1 206 732 1299; Email: gcarter/at/systemsbiology.org

Received November 21, 2006; Accepted February 9, 2007.

Abstract

Molecular interactions provide paths for information flows. Genetic interactions reveal active information flows and reflect their functional consequences. We integrated these complementary data types to model the transcription network controlling cell differentiation in yeast. Genetic interactions were inferred from linear decomposition of gene expression data and were used to direct the construction of a molecular interaction network mediating these genetic effects. This network included both known and novel regulatory influences, and predicted genetic interactions. For corresponding combinations of mutations, the network model predicted quantitative gene expression profiles and precise phenotypic effects. Multiple predictions were tested and verified.

Keywords: computational biology, data integration, gene expression, genetic interaction, network model

Introduction

Identifying causal links between genetic variation and phenotype is a central challenge of modern genetics. The combination of gene perturbation technology (Fire et al, 1998; Winzeler et al, 2000) and high-throughput phenotype assays (Drees et al, 2005; Dudley et al, 2005) enables rapid identification of genes active in a biological response. Linkages between individual genes and specific phenotypes can also be established statistically by detecting quantitative trait loci (QTLs) (Barton and Keightley, 2002).
However, formulating biomolecular models based on these techniques is difficult, because most phenotypes are controlled by multiple genes with multiple allelic variants. Moreover, the alleles of these genes often interact in complex ways to affect phenotype (Shook and Johnson, 1999; Steinmetz et al, 2002; Carlborg and Haley, 2004; Sinha et al, 2006). Thus it is necessary to model functional relationships between relevant (i.e. trait) genes instead of viewing each gene as an independent factor. The resulting network models will have the capacity to predict, systematically and explicitly, the effects of multiple interacting genetic perturbations. This capacity will enable testing of genetically complex hypotheses, prioritization of candidate genes for targeted intervention, and the personalization of prognoses and therapies (Ideker et al, 2001; Galitski, 2004). The identification of functionally relevant interactions in databases of diverse high-throughput data types is a substantial challenge for the construction of predictive network models. Many recent efforts have sought to distill functionally important information by detecting systematic congruences in multiple large data sets (Wong et al, 2004; Sachs et al, 2005; Segre et al, 2005; Workman et al, 2006; Zhong and Sternberg, 2006). These approaches have had success in functionally classifying genes and identifying probable candidate gene pairs for the simple presence or absence of genetic interaction, often defined either very broadly as any genetic nonindependence (Zhong and Sternberg, 2006) or very narrowly as one particular interaction mode such as negative synthesis (Tong et al, 2004). In contrast, our goals are to infer specific functional relationships to drive network modeling, and to make precise testable predictions for novel combinatorial perturbations of genes. 
Accordingly, we developed an analysis of genetic interaction as a quantitative influence and used the results to direct the integration of molecular (physical) interaction data. We define these influences as positive or negative numbers of varying magnitude that account for the fraction of a measurable phenotype (e.g. the expression of a gene) inferred to be caused by a system element (e.g. a regulatory protein). The measured phenotype is modeled by multiple influences acting throughout the inferred network.

Our mathematical modeling is based on the classical genetic-interaction approach of observing how genetic perturbations interact to affect phenotypes, thereby revealing functional relationships such as activation, repression, and pathway order (Avery and Wasserman, 1992). However, because mutant phenotypes result from the activities of complex molecular pathways, the biochemical interpretation of a genetic interaction is often ambiguous and frequently involves multiple alternative molecular models and both direct and indirect mechanisms (Kelley and Ideker, 2005; Zhang et al, 2005). Conversely, molecular interactions, plentifully generated through high-throughput methods, often lack functional interpretation or are of uncertain relevance to specific genetic observations (Galitski, 2004). Our approach therefore exploits the complementarity of genetic and molecular interactions: we (i) decompose genetic-interaction data into influences encoding genetically direct and indirect effects and (ii) use the molecular wiring to constrain the molecular interpretation of genetic interactions, and thereby assign function to specific molecular interaction paths.

Results

Our approach required genetic manipulations, genome-scale molecular-interaction data sets, and efficient phenotype assays. Thus we used the filamentous growth response of budding yeast as a model system (Gimeno et al, 1992; Lengeler et al, 2000).
In response to environmental cues, yeast cells switch from their round single-cell growth form to a pathogen-like, adhesive, invasive, filamentous form (also known as pseudohyphal growth). Both the filamentous-growth phenotype (Drees et al, 2005) and microarray data (Van Driessche et al, 2005) have been shown to be suitable measurements for the study of genetic interactions. We therefore constructed a set of single-gene and double-gene deletion strains and assayed each for filamentation phenotype and gene expression (thousands of measurements per strain). We inferred specific genetic influences from these data and used the results to guide the integration of molecular-interaction data in a network controlling filamentous growth.

Combinatorial genetic perturbation

Transcription factor genes were chosen for perturbation in our study because they play a direct role in the regulation of gene expression. Thus, they offer a good prospect of modeling their effects on gene expression and phenotype. The genes of five specific transcription factors known to regulate the filamentous growth response were chosen for deletion: TEC1 (Gavrias et al, 1996), SOK2 (Ward et al, 1995), SKN7 (Lorenz and Heitman, 1998), SFL1 (Robertson and Fink, 1998), and CUP9 (Prinz et al, 2004). We refer to these starting-point genes as seed genes. They were chosen because: (i) they show a full range of single-mutant phenotypes (from strongly hypo-filamentous to strongly hyper-filamentous), creating interesting double-mutant combinations; and (ii) they are downstream of a representative group of major signaling pathways involved in filamentous growth.

The inference of a genetic interaction requires comparing the phenotypes of four genotypes: a ‘wild type’, two single mutants, and a double mutant that carries both mutant genes.
Thus, we studied 16 strains: the wild type, five single-gene deletion strains (tec1Δ, cup9Δ, sfl1Δ, sok2Δ, skn7Δ), and all 10 combinatorial double deletions (tec1Δcup9Δ, tec1Δsfl1Δ, etc.). Gene expression data were collected for each strain under filamentous growth conditions (Supplementary information). All subsequent analyses were restricted to 1863 genes showing differential expression (Supplementary information). Also, each strain was phenotyped for filamentation (Supplementary information). The results revealed a rich pattern of genetic interactions with frequent occurrences of classical epistasis, in which a double-mutant phenotype is the same as that of one of the single mutants (carrying the epistatic mutation) and the other (hypostatic) mutation is masked (Supplementary Table S1). For example, we found TEC1 deletion to be epistatic to all other seed gene deletions (i.e. the phenotypes of all tec1Δ-containing double-mutant genotypes were like the tec1Δ phenotype), in agreement with its known role as a major direct regulator of filamentation genes (Chou et al, 2006).

Model of interacting genetic influences on gene expression

We used data-driven linear decomposition to model genomic expression and quantify genetic interactions. Matrix decomposition methods, including singular value decomposition (SVD) (Alter et al, 2000; Carter et al, 2006) and generalized Network Component Analysis (gNCA) (Yang et al, 2005), have proven successful in disentangling multiple overlapping quantitative signals in microarray data. Our decomposition method was designed to dissect the complexities of genetic interactions (Materials and methods). The solution can be represented as a network of influences, as illustrated in Figure 1.
This procedure results in the decomposition of an expression data matrix D into two matrices: (i) an influence matrix, X, of coefficients for the genotype-independent influences of the seed genes on target genes; and (ii) a genotype matrix, G, of inferred activity levels for the seed genes in each genotype. This is concisely written as

$$D = X \cdot G \tag{1}$$

Thus, the genetically ‘direct’ (not necessarily molecularly direct) influences from the seed genes to target genes are separated quantitatively from the genetically ‘indirect’ effects that involve a second seed gene and a genetic interaction. In the genotype matrix, G, we define the wild-type activities to be equal to one ($g_A^{wt} = g_B^{wt} = \dots = 1$); activity levels of null alleles are fixed at zero (Materials and methods). Note that other allele types can be accommodated readily with a measured level of activity relative to the wild type. Other genotype matrix elements (capturing genetic interactions) are unknown a priori, but they can be calculated as activity changes relative to wild type under perturbations of other seed genes ($g_A^{B\Delta}$, $g_B^{A\Delta}$, $g_C^{A\Delta B\Delta}$, etc.). We performed a least-squares best-fit solution for the decomposition defined by Equation (1) (Supplementary information). For our data set, the resulting model showed high correlations with the observed expression profiles of all genes across all experimental conditions (Materials and methods). The inclusion of genetic interactions in the model accounted for much of this correlation (Supplementary Figure S1). We next integrated molecular interaction data with our decomposition results to construct regulatory network models (Materials and methods).
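As an illustration of the structure of this decomposition, the two-seed-gene case described in Materials and methods (seed genes A and B, target genes X and Y) can be written out explicitly. This is a sketch under stated assumptions rather than the authors' exact layout: the strain order (wild type, AΔ, BΔ, AΔBΔ) is chosen here for convenience, and the background contributions $x_0$ and $y_0$ are assumed to enter through a constant row of ones in G.

$$
\underbrace{\begin{pmatrix}
X^{wt} & X^{A\Delta} & X^{B\Delta} & X^{A\Delta B\Delta} \\
Y^{wt} & Y^{A\Delta} & Y^{B\Delta} & Y^{A\Delta B\Delta}
\end{pmatrix}}_{D}
=
\underbrace{\begin{pmatrix}
x_A & x_B & x_0 \\
y_A & y_B & y_0
\end{pmatrix}}_{X}
\cdot
\underbrace{\begin{pmatrix}
1 & 0 & g_A^{B\Delta} & 0 \\
1 & g_B^{A\Delta} & 0 & 0 \\
1 & 1 & 1 & 1
\end{pmatrix}}_{G}
$$

Reading off the third column, for example, gives $X^{B\Delta} = x_A\, g_A^{B\Delta} + x_0$: with B deleted, the only seed-gene contribution to X comes from A, scaled by the activity of A in the BΔ background. The two off-diagonal entries $g_A^{B\Delta}$ and $g_B^{A\Delta}$ are the unknowns that capture the genetic interaction.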
From the genotype matrix, G, in Equation (1), we inferred quantitative cross-influences between seed genes that genetically interact (Figure 4A).
Predictions of gene expression

We made quantitative predictions of gene expression for additional perturbations. Because we could not associate the background expression influences (Equation 2, $x_0$, $y_0$, etc.) with specific molecular pathways, it was not clear what the total effects of a novel single-mutant deletion would be. However, we could predict precise expression levels for double perturbations once the single perturbations were known. Thus for each new double mutant, we predicted the quantity $X^{A\Delta B\Delta} - X^{A\Delta} - X^{B\Delta}$ for each gene.

To make predictions for each new double mutant, we removed a genetic influence (quantitative contribution to gene expression) whenever the molecular path from influencer to influenced was broken by deletion of a gene on the path. If an alternative path of the same length existed, the influence was not removed. This gave more accurate results than removing the influence even when an alternative path exists (data not shown), and is consistent with studies showing that regulatory information often flows via parallel pathways (Kelley and Ideker, 2005). For broken paths, both expression influences in matrix X and activity influences in matrix G were removed. The matrix product X·G was recomputed (Equation (1)) to obtain the prediction for mutant gene expression.

We compared our predictions to observed quantities. We collected microarray gene expression data for four additional single- and double-deletion strains: yap6Δ, cup9Δ yap6Δ, sfl1Δ yap6Δ, and sok2Δ yap6Δ. YAP6, which is not a seed gene, was chosen for its central role mediating influences from SFL1, CUP9, and SOK2 to other genes (Supplementary Figure S2 and Supplementary Table S3). For these YAP6 deletion strains, we identified genes that receive expression influences putatively transmitted by Yap6 (e.g. DDR48 in Figure 2).

These gene expression predictions were evaluated. Table I lists results for expression of all 1863 genes in our data set for the additional double-mutant strains.
We also show results for a subset of genes determined to be filamentation-phenotype-correlated (Mode-2 genes; see below). To assess the accuracy of the model, we performed $\chi^2$ tests over all data and determined the likelihood of the result from a $\chi^2$ distribution (Supplementary information). We repeated the predictions using a linearly additive control model that lacks influences between seed genes and hence does not generate genetic interactions (Supplementary information). We computed the relative probability of the $\chi^2$ fits of the genetic-interaction model and the control model to determine the likelihood that the genetic-interaction model performed significantly better. Relative to the control, our model provided an improvement in fit across almost all genes rather than large fit improvements for a small subset of genes. The genome-wide improvement is highly significant (Table I), and provides direct evidence for both the biological importance of genetic interactions and the accuracy of our modeling technique.
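The path-based removal rule used for these predictions can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that each influence travels along a shortest directed path in the molecular network, and it treats an influence as surviving a deletion exactly when a path of the original shortest length still exists. The node names and edges are hypothetical placeholders, not interactions from the study.

```python
from collections import deque

def shortest_path_length(edges, source, target, removed=frozenset()):
    """Breadth-first search over directed edges, skipping removed nodes.

    Returns the number of edges on a shortest path, or None if no path exists.
    """
    if source in removed or target in removed:
        return None
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adjacency.get(node, []):
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def influence_survives(edges, source, target, deleted):
    """Keep an influence only if a path of the original shortest length remains
    after the deleted gene is taken out of the network (assumed rule)."""
    intact = shortest_path_length(edges, source, target)
    if intact is None:
        return False  # no molecular path supported the influence to begin with
    perturbed = shortest_path_length(edges, source, target, removed={deleted})
    return perturbed == intact

# Hypothetical toy network: A reaches T only through M; B reaches T through M or N.
toy_edges = [("A", "M"), ("M", "T"), ("B", "M"), ("B", "N"), ("N", "T")]

a_kept = influence_survives(toy_edges, "A", "T", deleted="M")  # path broken
b_kept = influence_survives(toy_edges, "B", "T", deleted="M")  # equal-length detour via N
print(a_kept, b_kept)
```

In this toy case the influence of A on T is removed when M is deleted, while the influence of B on T is retained because the detour through N has the same length. In the full procedure described above, a removed influence would then be dropped from X and G before recomputing X·G.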
Predictions of filamentation phenotype

We next sought to predict the filamentous-growth phenotype for novel double-mutant strains by integrating filamentation phenotype data and gene expression data. To find a connection between these two data types, we performed SVD (Alter et al, 2000) on the gene expression data matrix for our 1863 genes and compared the results with filamentation measurements. SVD is an unsupervised algebraic method that mathematically separates a data matrix into a set of ‘modes’ determined by quantitative patterns within the data. Each mode is manifest in the data as a global expression-pattern component that contributes to the expression of each gene to a degree varying from negligible to predominant. We examined the expression patterns of the SVD modes for correlation with filamentation data (Supplementary Table S1) for all 16 strains. We found that SVD Mode 2, quantitatively the second-greatest expression component (Supplementary Figure S3), was best correlated with the phenotype data (Supplementary Figure S4; Supplementary information). This implies that the 285 genes (Supplementary Table S4) that strongly exhibit the Mode-2 expression component are a quantitative proxy for the filamentation phenotype, even though this component is not the most dominant pattern in the data. Supporting this conclusion, ‘cell wall’ is the most significantly enriched (data not shown) Gene Ontology (Ashburner et al, 2000) annotation among the Mode-2 genes, which include the prototypical filamentation gene FLO11 encoding a cell-wall protein, as well as many other known filamentation genes.

The results raise the possibility that other SVD expression components and cognate gene sets might correlate with other phenotypes affected by our perturbations, such as cell adhesion (Robertson and Fink, 1998). More generally, SVD may be a quantitative unbiased approach to associate distinct expression patterns with specific phenotypes obtained from other assays.
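The idea of matching SVD modes to a phenotype can be sketched as follows. This is an illustrative example on synthetic data, not the study's analysis: the matrix size, the phenotype scores, the use of Pearson correlation, and the 0.5 loading threshold are all assumptions made for the sketch. The toy matrix is built so that the dominant component is unrelated to the phenotype and the second component tracks it.

```python
import numpy as np

# Hypothetical toy data: 5 genes x 6 strains (not the study's measurements).
phenotype = np.array([-2.0, -1.0, 0.0, 0.0, 1.0, 2.0])  # filamentation score per strain
dominant = np.array([1.0, 0.0, 2.0, 2.0, 0.0, 1.0])     # strain pattern unrelated to phenotype

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

gene_load_1 = unit(np.array([1.0, 1.0, 1.0, 1.0, 0.0]))   # genes carrying the dominant pattern
gene_load_2 = unit(np.array([1.0, -1.0, 0.0, 0.0, 0.0]))  # genes carrying the phenotype pattern

# Expression = strong unrelated component + weaker phenotype-tracking component.
D = (10.0 * np.outer(gene_load_1, unit(dominant))
     + 3.0 * np.outer(gene_load_2, unit(phenotype)))

# Rows of Vt are the strain-wise patterns of each mode, ordered by singular value.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Ignore numerically empty modes, then correlate each remaining mode with the phenotype.
n_modes = int(np.sum(s > 1e-8 * s[0]))
corr = [abs(np.corrcoef(Vt[k], phenotype)[0, 1]) for k in range(n_modes)]
best_mode = int(np.argmax(corr))  # zero-based index, so 1 corresponds to "Mode 2"

# Genes that strongly exhibit the best-correlated mode (threshold is arbitrary here).
mode_genes = np.flatnonzero(np.abs(U[:, best_mode]) > 0.5)
print(best_mode + 1, mode_genes)
```

Here the phenotype-correlated component is the second mode even though it explains less of the data than the first, mirroring the situation described above, and the genes with large loadings on that mode play the role of the Mode-2 gene set.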
The Mode-2 genes are significantly enriched with genes bound by eight transcription factors (Supplementary Table S5), of which six (TEC1, STE12, PHD1, SOK2, ROX1, and SKN7) were known to have filamentous-growth-related phenotypes. Subsequently, we found that deletions of the other two (MOT3 and SKO1) have filamentation phenotypes (see below). Three (TEC1, CUP9, and SFL1) of the five transcription factor seed genes had significant influences on the Mode-2 expression component (Supplementary information). These influences were positive, and represent the genetically ‘direct’ effects of these seed genes on the expression of the Mode-2 genes. We found the shortest paths of molecular interactions to connect these three seed genes to the enriched transcription factors, although in some cases involving SFL1, no paths shorter than five interactions could be found. This procedure generated the Mode-2 network (Figure 4B).

To initially probe the predictive value of the Mode-2 network, we constructed deletions of the newly implicated transcription factor genes YAP6, CIN5, UME6, MOT3, and SKO1 and assayed the filamentation phenotype. UME6 deletion was lethal in the filamentation-competent Σ1278b yeast background. All other deletions of transcription factor genes implicated by the Mode-2 network showed a filamentation phenotype: yap6Δ and sko1Δ mutants had filamentation defects, the mot3Δ mutant was strongly hyper-filamentous, and the cin5Δ mutant was marginally hypo-filamentous.

We then tested the capacity of the model to predict specific phenotypes for 13 novel combinatorial deletions. Predictions were based on three topological motifs present in the Mode-2 network and corresponding quantitative expectations (Figure 4C). Table II lists the predictions and experimental observations of the double-mutant phenotypes and phenotype inequalities.
We assessed the accuracy of the model predictions by comparison with results generated from a training set of 1809 genetic interactions for invasive growth (Drees et al, 2005), a closely related phenotype (Supplementary information). The model correctly predicted all 13 double-mutant phenotypes, which was a very unlikely outcome using the training set (P=0.0002). Six of the 13 phenotype inequalities proved correct, which is also a significant improvement over the training set (P=0.009) due to the much larger number of possible outcomes. Note also that all of the incorrect phenotype-inequality predictions differed minimally from the observed phenotype inequalities.
Discussion

The model of filamentous growth control based on the Mode-2 network (Figure 4B) […] In addition to implicating genes, our approach was often able to correctly infer functional relationships between genes that control filamentous growth. This is evident in the broad success in predicting double-mutant expression profiles (Table I) and phenotypes (Table II). In particular, all predictions involving YAP6 deletions proved accurate for both phenotype (P=0.02) and phenotype inequality (P=0.006), suggesting that its regulatory role in the network was mapped correctly. Predictions based on parallel and serial network topologies (Figure 4C) […]

Notwithstanding the successful performance of our approach, its linear approximation may over-simplify many functional relationships and may miss complicated regulatory effects that are not as relevant for modeling genome-wide transcript levels. Dynamic modeling of the seed gene network could encompass nonlinear, post-transcriptional influences and feedback loops that often lead to more complex effects. Potential transcriptional feedback loops are apparent for CUP9 in the Mode-2 network (Figure 4B).

Our methods are designed for application to any system in which multiple interacting genes are linked to phenotypes. The genetic influences decomposition can be used to dissect genetic-interaction effects between any number of seed genes, and a greater number can be expected to result in inference of a more comprehensive network of interactions. Combined with molecular data integration, this suggests an iterative approach in which a gene implicated in the system (such as YAP6 in the filamentation network) is taken as an additional seed gene in a subsequent round of experimentation and analysis. Furthermore, although we have exclusively used null alleles in this study, the method can incorporate hypomorphic and hypermorphic alleles by fitting the genotype matrix elements to appropriate activity values relative to wild type.
Possible methods to estimate these values include assays of protein levels (or phosphorylated protein levels for phospho-activated regulators) and using results fit with a cognate null mutant to constrain all parameters other than the activity levels of the non-null mutant allele. The method is extensible and can also predict the effects of higher-order combinatorial genotypes, such as triple gene deletions, through removal of the influence coefficients associated with every perturbed gene and the paths in which they form a critical link. Finally, the genetic influences decomposition is formulated to be directly applicable to all quantitative phenotypes, not only gene expression, with the requirement that the number of phenotypes assayed for each strain be equal to or greater than the number of seed genes plus one (Supplementary information). With the abundance of molecular interactions, there are often numerous possible paths of influence among gene products. Likewise, genetic interactions often have multiple possible molecular interpretations. By emphasizing the complementarity of these data types, our integration of genetic influences decomposition and molecular interaction data greatly constrained these possibilities and assigned specific functional significance to molecular interactions in a network model of the transcriptional control of filamentous growth. This model generated predictions that relied on both the accuracy of our genetic influence decomposition and our data integration strategy. The integration strategy exploited the availability of accurate, genome-scale molecular interaction data sets, and identified instances in which functionally important molecular data are missing. With the increasing availability of human interaction data (Stelzl et al, 2005) and further modeling developments to address allelic variation in outbred populations, similar quantitative and integrative techniques may ultimately be applied to disease-related models. 
Materials and methods

Genetic influences decomposition

The genetic influences decomposition method can be illustrated with the simplified case of two seed genes A and B that influence the expression of two genes X and Y. For a strain genotype labeled with superscript S, we write a linear pair of equations for gene expression:

$$X^S = x_A\, g_A^S + x_B\, g_B^S + x_0, \qquad Y^S = y_A\, g_A^S + y_B\, g_B^S + y_0 \tag{2}$$

The parameters $x_A$, $x_B$, and $x_0$ represent contributions to the expression of X from the gene A, gene B, and the remainder of the genetic background, respectively (similarly for gene Y). These parameters are independent of the strain genotype. The coefficients $g_A^S$ and $g_B^S$ are the inferred activity levels of the seed genes A and B in the strain background S, and are independent of the transcript being measured. Gene knockout strains are modeled by setting the activity of the deleted gene to zero, such as $g_A^{A\Delta}=0$ and $g_A^{A\Delta B\Delta}=0$ for strains with gene A deleted. Influences between seed genes (observed as genetic interactions) can be systematically and quantitatively inferred from changes in activity levels of one gene when the other gene is perturbed, and vice versa (Supplementary information). For example, $g_A^{B\Delta} < g_A^{wt}$ would evince a positive influence from gene B on the activity of gene A. Note that these activity changes are relative to wild type (all $g^{wt}=1$) and are calculated parameters. Rather than substituting transcript level data for these activities (as in many regression methods), these model-derived parameters conceptually include all levels of gene control from initiation of transcription to protein localization, modification, and degradation.

The system of equations in Equation (2) can be expanded to model an arbitrary number of gene expression measurements and perturbed seed genes by systematically adding parameters (Supplementary information). The equations can be recast in matrix form. In the decomposition of our genomic expression data, the number of measurements in Equation (1) far exceeds the number of model parameters.
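To make the two-seed-gene case concrete, the sketch below simulates noise-free expression for three target genes in the four genotypes and then recovers the cross-influence activities. It is not the authors' fitting procedure: all parameter values are hypothetical, and the recovery uses a shortcut that only works for this small case, namely that after subtracting the double-mutant (background) profile, the wild-type profile is linear in the reciprocals of the two unknown activities. With noisy data, the least-squares fit described in the text would be used instead.

```python
import numpy as np

# Hypothetical ground-truth parameters for three target genes (illustration only).
x_A = np.array([2.0, -1.0, 0.5])   # influence of seed gene A on each target
x_B = np.array([1.0, 3.0, -2.0])   # influence of seed gene B on each target
x_0 = np.array([5.0, 4.0, 6.0])    # background contribution to each target
g_A_in_Bdel = 0.5                  # activity of A when B is deleted (wild type = 1)
g_B_in_Adel = 1.25                 # activity of B when A is deleted (wild type = 1)

# Simulated expression in the four genotypes; deleted genes have activity zero.
wt = x_A + x_B + x_0
a_del = x_B * g_B_in_Adel + x_0
b_del = x_A * g_A_in_Bdel + x_0
ab_del = x_0.copy()

# Subtract the background, which the double mutant exposes directly.
a = b_del - ab_del   # equals x_A * g_A^{B-delta}
b = a_del - ab_del   # equals x_B * g_B^{A-delta}
c = wt - ab_del      # equals x_A + x_B

# c = a * u + b * v with u = 1 / g_A^{B-delta} and v = 1 / g_B^{A-delta},
# so u and v follow from an ordinary linear least-squares solve across genes.
(u, v), *_ = np.linalg.lstsq(np.column_stack([a, b]), c, rcond=None)
g_A_est, g_B_est = 1.0 / u, 1.0 / v
x_A_est, x_B_est = a * u, b * v

print(round(g_A_est, 6), round(g_B_est, 6))
```

In this toy example the activity of A falls below its wild-type value of one when B is deleted, which corresponds to a positive influence of B on A, while the activity of B rises above one when A is deleted, which corresponds to a negative influence of A on B.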
We found the least-squares best-fit solution (Supplementary information).

Directed data integration

To identify functionally important expression influences, we first determined coefficients in the influence matrix, X, that were significantly different from zero (Supplementary information). For example, the influences of TEC1, CUP9, and SKN7 on DDR48 are mapped in Figure 2A.

Genomic expression data have been deposited in the Gene Expression Omnibus, accession GSE5938.

Supplementary information

Supplementary Information and Tables (137K, doc); Supplementary Figure 1 (73K, jpg); Supplementary Figure 2 (231K, jpg); Supplementary Figure 3 (120K, jpg); Supplementary Figure 4 (108K, jpg); Supplementary Information (116K, xls); Supplementary Information (5.3K, txt).

Acknowledgments

We thank Pamela Troisch for assistance with microarray preparation and Christine Aldridge, David Galas, Ilya Shmulevich, and R James Taylor for their contributions. This work was supported by grant P50 GM076547 from NIH. GWC and TG were supported in part by grant FIBR EF-0527023 from NSF. TG is a recipient of a Burroughs Wellcome Fund Career Award in the Biomedical Sciences.