Copyright © 2009 The Author(s)

Gene network reconstruction from transcriptional dynamics under kinetic model uncertainty: a case for the second derivative

¹Ottawa Institute of Systems Biology, ²Department of Biochemistry, Microbiology, and Immunology, University of Ottawa, 451 Smyth Road, Ottawa, Ontario, ON K1H 8M5, Canada, ³Graduate Institute of Systems Biology and Bioinformatics, National Central University, No. 300, Jhongda Road, Jhongli City, Taoyuan County 32001, Taiwan (R.O.C.) and ⁴Pioneer Hi-Bred International, Inc., 7300 NW 62nd Avenue, PO Box 1004, Johnston, Iowa, IA 50131-1004, USA

*To whom correspondence should be addressed.

Associate Editor: Thomas Lengauer

Received May 9, 2008; Revised December 4, 2008; Accepted January 12, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Motivation: Measurements of gene expression over time enable the reconstruction of transcriptional networks. However, Bayesian networks and many other current reconstruction methods rely on assumptions that conflict with the differential equations that describe transcriptional kinetics. Practical approximations of kinetic models would enable inferring causal relationships between genes from expression data of microarray, tag-based and conventional platforms, but conclusions are sensitive to the assumptions made.

Results: The representation of a sufficiently large portion of genome enables computation of an upper bound on how much confidence one may place in influences between genes on the basis of expression data. Information about which genes encode transcription factors is not necessary but may be incorporated if available.
The methodology is generalized to cover cases in which expression measurements are missing for many of the genes that might control the transcription of the genes of interest. The assumption that the gene expression level is roughly proportional to the rate of translation led to better empirical performance than did either the assumption that the gene expression level is roughly proportional to the protein level or the Bayesian model average of both assumptions.

Availability: http://www.oisb.ca points to R code implementing the methods (R Development Core Team, 2004).

Contact: dbickel/at/uottawa.ca

Supplementary information: http://www.davidbickel.com

1 INTRODUCTION

1.1 Transcriptional network reconstruction

Much of the recent interest in biomolecular network reconstruction is motivated by the desire to map microscopic interactions to macroscopic traits that are of interest to the medical and food industries (Peccoud et al., 2004). For example, pharmaceutical companies have an interest in reverse-engineering molecular networks to find druggable targets (Hopkins and Groom, 2002; Schadt et al., 2007) or otherwise find genes that strongly influence disease and that could respond to therapy (Chen et al., 2008). Markowetz and Spang (2007) provide an introduction to the literature.

Transcriptional networks have been reconstructed from gene expression measured at a snapshot in time, often in response to some set of perturbations or treatments. Expression time-course experiments have also raised the prospect of inferring not only the existence of causal relationships between genes, but also the direction of causality from regulating genes to regulated genes, without requiring the manipulation of genes one by one.
1.2 Bayesian inference of biomolecular networks

1.2.1 Bayesian statistics

Approximate Bayesian inference tends to achieve a level of conservatism between, on one hand, hypothesizing the network deemed most likely irrespective of the degree of uncertainty in that network (e.g. Friedman et al., 2000; Husmeier, 2003) and, on the other hand, correcting P-values for multiple testing. A recent hierarchical model for inferring regulatory networks via Bayes's theorem is a case in point; information that provides evidence of transcription factors is represented in terms of a prior distribution, whereas the evidence associated with gene expression data remains in the likelihood function (Jensen et al., 2007). Such modeling of transcription factors seeks a more detailed understanding than does allowing unknown mediating genes that may encode transcription factors and other intermediate connections between genes in the network to remain unmodeled.

For the purpose of inferring previously unknown influences between transcriptional network components given appropriate data and any well specified model of such influences, many of the most important causal models may be classified either as directed acyclic graphs or as kinetic models. The use of Bayesian probability theory results in Bayesian networks when applied to directed acyclic graphs but can result in the methodology of this article when applied to kinetic models.

1.2.2 Bayesian networks

A Bayesian network is a directed acyclic graph whose nodes are random variables and whose edges are associations expressed in terms of conditional probabilities (Jensen, 2001). Bayesian networks interpreted causally (Pearl, 2000) have often been the tool of choice for biomolecular network reconstruction; they have, for example, been employed to uncover evidence of physical relationships between proteins after applying several perturbations (e.g. Sachs et al., 2005).
Bayesian networks have also been applied to time course data when generalized to networks called dynamic because each node corresponds to a gene at a given time (e.g. Husmeier, 2003; Kim et al., 2003). If the conditional independence topology of the network is unknown, as is the case in frontier network reconstruction, then edge indicators, conveniently labeled as 0 or 1 for absent or present edges, also have a joint prior distribution. In principle, the problem is solved with Bayes's theorem by computing the joint posterior distribution of the latent variables, edge indicators and parameters conditional on the observed data. Mathematical difficulties make the exact solution unattainable, sometimes leading to computational searches for a single network that, although optimal in some sense, may have very low posterior probability.

The simplicity of Bayesian networks has made them natural preliminary tools for applying Bayesian inference to the problem of biomolecular network reconstruction. However, the Bayesian statistical framework is general enough to instead incorporate knowledge of the dynamics of physical gene–gene interactions.

1.2.3 Bayesian inference of kinetic models

In common with other statistical approaches, Bayesian inference can give misleading results when used without adequate modeling. Consequently, rather than using Bayesian networks, we apply Bayes's theorem to a well studied class of kinetic models to infer causal relationships between genes. Such models have attracted recent attention, but usually without computing the posterior distributions of their parameters (e.g. Chen et al., 1999; de Hoon et al., 2002; Vander Velden and Peccoud, 2003). A discussion of two important non-Bayesian approaches to kinetic models appears in Section S5 of the Supplementary Material (Bonneau et al., 2006; Bonneau et al., 2007; Gardner et al., 2003).
Modeling at a sufficiently high level is supported by the finding that modeling the dynamical system in too much detail can lead to differential equation or stochastic process parameters that can only be identified, even in the absence of statistical error, when there are not only measurements of the abundance of transcripts over time, but also other information such as knowledge of specific interactions between promoters and transcription factors (Zak et al., 2002). More generally, model complexity at an inappropriate level for the data at hand often leads to wasted analyst effort, to computational intractability and to parameter overfitting.

Bridging the gap between high-level statistical methods and low-level differential equation models, we herein describe a Bayesian method of inferring causal relationships between genes on the basis of gene expression measurements that have little or no replication and that only roughly reflect numbers of molecules. This is accomplished by carefully approximating both the kinetic models that describe transcriptional dynamics and the posterior probabilities of gene–gene influences based on such models. Compared with other Bayesian methods of inferring kinetic model parameters (Wilkinson, 2006), our approach is simple in that it does not require advanced computation such as that of Markov chain Monte Carlo simulations. Select approximations may be relaxed for more precise inference once higher quality expression data or reliable information from other sources become available.

The transcriptional network reconstruction method we propose is applicable to studies of gene expression measured at four or more consecutive points separated by equal intervals of time (e.g. Serban and Wasserman, 2003; Spellman et al., 1998), provided that such intervals are small enough to capture the transcriptional dynamics, that the total time of measurement is large enough to capture translation, and that there is only one dominant cell type in the tissue samples. While the method requires neither replication (unless there are fewer than five time points) nor information about which genes encode transcription factors, straightforward ways to incorporate either or both into the data analysis are provided. Replication is handled at the level of the statistical model, whereas transcription factor information becomes part of a prior distribution.

These necessary requirements for application of the proposed methodology are not sufficient to ensure that there are enough biological samples to obtain reliable predictions of regulation; statistical power depends on the extent of biological variability as well as the sample size. However, the methods are designed to be robust to insufficient data in the sense that all the reported probabilities of regulatory relationships would in that case tend to be very small, thereby helping prevent unwarranted predictions. Section S4 of the Supplementary Material supplies details on the number of time points needed and on the reliability of network inference at different levels of biological variability. Section S5 relaxes the requirement of equal sampling times.

Section 2 describes the kinetic models that, under the conditions specified, hold in all cell systems. Section 3 introduces the regression model and prior distribution used to infer the model parameters representing gene–gene influences. The Supplementary Material reports the findings of a simulation study and illustrates this gene network reconstruction methodology by applying it to the replicated plant cell culture experiment that initially motivated the methodology.
The results of applications to data of non-replicated yeast and bacteria experiments are also presented in the Supplementary Material and are summarized in Section 4. In Section 5, we draw general conclusions.

2 TRANSCRIPTIONAL NETWORK MODELS

Consider the set of genes any of which might be the dominant regulator of any gene of interest i. The number m of such potentially regulating genes may equal the genome size in the absence of adequate information about transcription factors. Let xi(t) denote the transcript abundance of the i-th gene at time t, βij correspond to a real-valued strength of influence associated with a product of the j-th gene affecting the i-th gene and Di correspond to the non-negative degradation rate of the transcript of the i-th gene. βij>0 corresponds to activation, whereas βij<0 corresponds to repression. The linear transcription model of Chen et al. (1999) reduces to
The complete Bayesian solution to model (3) would be the joint posterior distribution of βij for all values of i and j computed on the basis of the observations, one or more error models, and a joint prior distribution encoding all relevant biological knowledge and its uncertainties. While that ideal cannot be attained, it supplies guidance for achieving approximate solutions and, as necessary, indicates a direction in which they may be improved.

3 STATISTICAL METHODS

3.1 First-order difference equations

3.1.1 Regression framework

To apply model (3) to gene expression data, the concentrations xj(t) are replaced by their observed values yj(t) after averaging over any technical replicates. These measurements are considered approximately proportional to the transcript copy numbers of their genes. For example, with a microarray platform, yj(t) could be a hybridization intensity (or a monotonic transformation of such an intensity) deemed roughly proportional to the mRNA concentration level; yj(t) may be more accurately estimated given platform-specific information (Frigessi et al., 2005). With a tag-based method of measuring expression, yj(t) is the abundance of tags corresponding to the j-th gene (Gainetdinov et al., 2007; Hu and Polyak, 2006). Also replacing the time derivative with the first-forward difference yields, at times t ∈ {τ, 2τ, …, tmax−τ, tmax} = T,
where εi(t) represents the error due to biological variability. If there is biological replication, the replicates are denoted by k ∈ {1, …, n} and the observed values by yjk(t). In a repeated measures design, there would be a total of n individual organisms or cell cultures, and each value of k would refer to the same individual over all points in time, leading to the autoregressive model
, the mean expression intensity of the i-th gene over the n replicates at time t. A simplistic data reduction method would then stipulate model
where εik(t) is a residual assumed drawn from a zero-mean normal distribution of standard deviation (SD) σi and density f. The unknown quantities, βij, βi and εik(t), are random variables in the Bayesian framework, whereas the design matrix y is fixed by the observations for all t; it has (tmax/τ−1)n = (|T|−1)n rows corresponding to the forward differences Δyik(t), also fixed, and m+1 columns corresponding to a unit intercept column and the m regression coefficients. The following methodology is developed with biological replication (n≥2) in mind for the sake of generality, but it applies equally to studies without replication since n=1 implies
3.1.2 Case of complete measurements

In the case of no missing regressors, expression measurements are available over time for all m potentially regulating genes. Unless all m regressor genes may be considered regulators for gene i, each regression coefficient βij equals 0 with some non-zero prior probability. Thus, facilitating the selection of non-zero coefficients, Equation (6) may conveniently be rewritten as
{1,…, M}). Conveniently, the number of regression parameters per sub-model is no more than the number of observations per sub-model even if m>>(|T|−1)n. Were the goal to find a set of predictive sub-models, we could proceed by stochastic search through the restricted sub-model space (see Casella and Moreno, 2006) or perhaps by conventional stepwise selection. However, Bayesian quantification of network uncertainty instead requires the computation of P(αij=1|y).

In the simplest situation, one regulating gene dominates all others and expression measurements are available for all m potentially regulating genes. The prior distribution of αij then satisfies $P(\alpha_{ij}=1, \alpha_{ij'}=1)=0$ for all $j$ and all $j' \neq j$, i.e. $\sum_{j=1}^{m} \alpha_{ij} = 1$ with probability 1, so that only m sub-models have non-zero probability. Unless there is evidence outside the expression experiment favoring some genes above others as the regulator of gene i, each possible sub-model has equal prior probability: $P(\alpha_{ij}=1) = 1/m$ for all $j \in \{1,\dots,m\}$. This uniformity of the prior does not represent all states of previous information; for example, given the results of a transcription factor prediction algorithm, the prior probability for each potentially regulating gene would increase monotonically with the algorithm's reported evidence that it codes for a transcription factor. To simplify notation, we express the posterior probability that the j-th gene is the regulator of the i-th gene in terms of Bayes factors BFij with respect to any arbitrarily chosen sub-model of positive probability:
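As a minimal sketch of the single-regulator calculation described above, the following Python code fits one least-squares sub-model per candidate regulator to the first-forward differences and normalizes the resulting weights under the uniform 1/m prior. The Bayes factors here come from a BIC-style approximation, which is an assumption standing in for the approximations of this article; the function names and the toy data are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def forward_difference(y, order=1):
    """Forward differences along the time axis for equally spaced samples."""
    return np.diff(y, n=order, axis=0)


def single_regulator_posterior(target, regressors, prior=None):
    """Posterior probability that each candidate gene is the sole regulator.

    target     : (T,) expression of the regulated gene i over time
    regressors : (T, m) expression of the m candidate regulators
    prior      : (m,) prior probabilities; uniform 1/m if omitted

    ASSUMPTION: each Bayes factor is approximated via BIC. Every
    sub-model has the same two parameters (intercept and slope), so the
    BIC penalty cancels and the weight is proportional to RSS ** (-N / 2).
    """
    dy = forward_difference(target)      # forward differences, length T - 1
    x = regressors[:-1]                  # y_j(t) aligned with each difference
    n_obs, m = x.shape
    if prior is None:
        prior = np.full(m, 1.0 / m)

    log_weight = np.empty(m)
    for j in range(m):
        design = np.column_stack([np.ones(n_obs), x[:, j]])
        coef, *_ = np.linalg.lstsq(design, dy, rcond=None)
        rss = np.sum((dy - design @ coef) ** 2)
        rss = max(rss, np.finfo(float).tiny)   # guard against log(0)
        log_weight[j] = -0.5 * n_obs * np.log(rss) + np.log(prior[j])

    log_weight -= log_weight.max()       # stabilize before exponentiating
    weight = np.exp(log_weight)
    return weight / weight.sum()


# Toy data (hypothetical): the increments of gene i follow candidate 1.
t = np.arange(8.0)
candidates = np.column_stack([np.sin(t), np.cos(1.3 * t), np.sqrt(t)])
steps = 0.2 + 0.8 * candidates[:-1, 1] + 0.01 * (-1.0) ** np.arange(7)
gene_i = np.concatenate([[1.0], 1.0 + np.cumsum(steps)])
posterior = single_regulator_posterior(gene_i, candidates)
```

Because the reference sub-model enters every weight as the same factor, it cancels in the normalization, which is why the choice of reference is arbitrary. In the toy data the posterior mass concentrates on candidate 1, the gene used to generate the increments.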
For purposes of inferring biological networks, the main advantage of genome-wide platforms such as those of microarrays and tag-based methods may be that enough of the genome is represented that P(αij=1|y) and analogous quantities can often attain high values even without previous information on the gene network. Such posterior probabilities provide informative upper bounds of how much confidence one may reasonably place in causal relationships between genes on the basis of observed data when honestly accounting for network uncertainty. In contrast, conventional technologies such as RT-PCR may suffer from the missing regulator problem of the next subsection. Although microarray platforms do not enable direct comparison of intensities between genes, that limitation does not affect our approach since gene-to-gene variations in intensity are included in the coefficients βij.

3.1.3 Generalization to missing regulators

To generalize the above methodology, consider the set of genes that could be the regulator of any gene of interest i. Suppose m′, the number of those genes with expression measurements over time, is less than m, the number of potentially regulating genes, which might be the genome size or the number of putative transcription factors. (If the uncertainty in m is substantial, it may be assigned a prior distribution to propagate that uncertainty to the posterior probability of each model.) Although the SD estimates of the missing m−m′ genes cannot be known, none of them would be greater than the SD estimate based on the intercept term of Equation (7) with $\alpha_{ij}=0$ for all $j \in \{m'+1,\dots,m\}$. Then the data for the first m′ genes provide an approximate upper bound on the posterior probability of each of their models:
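One hedged reading of this bound, not necessarily the exact form of the article's equation: if each of the m−m′ unmeasured genes would fit at least as well as the intercept-only sub-model, then each contributes at least the intercept-only weight to the normalizing sum, so substituting that weight for their unknown weights can only overstate the posterior probabilities of the measured genes. The Python sketch below assumes equal prior probabilities 1/m and weights already expressed relative to a common reference sub-model; all names are hypothetical.

```python
def posterior_upper_bounds(measured_weights, null_weight, m):
    """Upper bounds on P(alpha_ij = 1 | y) for the measured candidates.

    measured_weights : Bayes-factor-like weights of the m' measured genes
    null_weight      : weight of the intercept-only sub-model
                       (ASSUMED lower bound for every unmeasured gene)
    m                : total number of potentially regulating genes
    """
    m_measured = len(measured_weights)
    if m < m_measured:
        raise ValueError("m must be at least the number of measured genes")
    # The unmeasured genes contribute at least (m - m') * null_weight.
    denominator = sum(measured_weights) + (m - m_measured) * null_weight
    return [w / denominator for w in measured_weights]
```

When m equals the number of measured genes, the bound reduces to the ordinary normalization over measured genes; as m grows with m′ fixed, every bound shrinks.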
{m′+1,…,m}P(αij=1|y)=0:
3.2 Second-order difference equations

If doubling the mRNA copy number of a regulator doubles a rate of translation rather than doubling the amount of a protein in the cell, the methods of Section 3.1 require modification. Using the second-order derivative Equation (2) instead of the first-order derivative Equation (3) puts the second-forward difference equation
. If they lack such certainty, their uncertainty can be reflected in the amount of prior probability assigned to each model, as developed below.

3.3 Uncertainty in the difference equation order

To reflect complete uncertainty about whether doubling the mRNA copy number of a regulator doubles a rate of translation rather than doubling the amount of a protein in the cell, we assigned 50% of the prior probability to the model of Equation (2) and 50% to that of Equation (3). Then the model selection problem in the complete measurement case may be framed in terms of this analog of Equation (7):
and
. Then, following the same approximations that led to Equation (8),
for the second-forward difference model since each time t in the sum is one of |T|−2 elements of {τ,…, tmax−2τ}.

The posterior probability that the first-order difference model (6) is correct for the presumed regulated gene i can also be computed by averaging, now over all the possible regulating genes instead of over the two model orders:
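The two marginalizations just described can be sketched as follows, assuming the unnormalized weight of every (model order, regulator) pair has already been computed on a common scale; that comparability is an assumption of this illustration, and the names are hypothetical. Summing the normalized joint weights over the two orders gives each gene's probability of being the regulator, and summing over genes gives the probability of the first-order model.

```python
def marginal_posteriors(weights_first, weights_second, prior_first=0.5):
    """Marginalize joint (order, regulator) weights.

    weights_first / weights_second : per-gene weights under the first-
        and second-order difference models (ASSUMED comparable).
    prior_first : prior probability of the first-order model.
    Returns (list of P(regulator j | y), P(first-order model | y)).
    """
    if len(weights_first) != len(weights_second):
        raise ValueError("both orders need one weight per candidate gene")
    joint_first = [prior_first * w for w in weights_first]
    joint_second = [(1.0 - prior_first) * w for w in weights_second]
    total = sum(joint_first) + sum(joint_second)
    per_gene = [(a + b) / total for a, b in zip(joint_first, joint_second)]
    p_first = sum(joint_first) / total
    return per_gene, p_first
```

With the default `prior_first=0.5`, the sketch mirrors the equal split of prior probability between the two equation orders; other values would encode partial knowledge of the kinetics.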
4 VALIDATION USING PUBLIC DATA

We applied our models to two yeast datasets (de Lichtenberg et al., 2005; Spellman et al., 1998) and two bacterial datasets (Bansal et al., 2006; Kao et al., 2004). Each dataset is from a different strain. For each regulated gene of interest, we found the gene with the highest posterior probability of being its dominant regulating gene. We then checked the annotation of those probability-maximizing genes in EchoBASE, GeneDB and the Saccharomyces Genome Database and noted which among them were putative transcription factors. To quantify performance, we estimated the Area Under the receiver operating characteristic Curve (AUC) (Green and Swets, 1966) between the probability-maximizing genes that encode putative transcription factors and the probability-maximizing genes that do not. The AUC measures how accurately the models predicted putative transcription factors. An AUC of 0 indicates that low probabilities perfectly predict putative transcription factors (least desirable), an AUC of 0.5 indicates that there is no predictive power (also undesirable) and an AUC of 1 indicates that higher probabilities perfectly predict putative transcription factors (ideal). (The AUC has also been applied to the problem of determining which genes are differentially expressed between groups (Bickel, 2004; Pepe et al., 2003).) Table 1 summarizes the results for the four datasets.

[Figure 1]
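To make the AUC interpretation above concrete, the following sketch computes it as the probability that a randomly chosen putative transcription factor receives a higher posterior probability than a randomly chosen gene that is not one, counting ties as one half (the Mann-Whitney form of the AUC). The function name and the numbers in the usage lines are hypothetical, not values from Table 1.

```python
def auc(tf_scores, other_scores):
    """AUC as the fraction of (factor, non-factor) pairs ranked correctly.

    tf_scores    : posterior probabilities of putative transcription factors
    other_scores : posterior probabilities of the remaining genes
    Ties count as half a correctly ranked pair.
    """
    if not tf_scores or not other_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for s in tf_scores:
        for o in other_scores:
            if s > o:
                wins += 1.0
            elif s == o:
                wins += 0.5
    return wins / (len(tf_scores) * len(other_scores))


perfect = auc([0.9], [0.1, 0.2, 0.3])      # factor outranks every other gene
reversed_ = auc([0.1], [0.2, 0.3])         # factor ranked below every other gene
mixed = auc([0.8, 0.3], [0.5, 0.1])        # three of four pairs ranked correctly
```

The three usage lines correspond to the ideal, least desirable and intermediate cases described in the text.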
5 DISCUSSION

We highlight three unique aspects of our choice of prior distributions to represent available information for causal network inference. First, under substantial uncertainty about which differential equation best models the transcriptional influences on the transcription of a given gene of interest, we assign each differential equation a prior probability and then average the posterior probabilities of influences over all differential equations considered. This strategy, unlike the analysis of the data on the basis of a single kinetic model, propagates the uncertainty in the kinetic model to the uncertainty in the causal relationships between genes inferred in the analysis [Equation (14)]. Second, the proposed method of assigning a prior distribution is general enough to cover both the absence and presence of information about which genes encode transcription factors. This is accomplished by using previous information to assign a prior probability to each gene that could be the one that dominates the expression of the gene of interest. For example, lacking such information, every gene in the genome of the species studied has a prior probability equal to 1/m, the reciprocal of the genome size, thereby guarding against the practice of automatically inferring a causal network even when expression has been measured only over a small fraction of the genome. On the other hand, a selection of genes to measure guided by previous transcription factor information can be expected to yield better results. Our approach reflects this by assigning a prior probability for each gene selected on the order of 1/m, where m in this case is the number of transcription factors. Third, the distinction between m, the number of genes that might dominate the expression of the gene of interest, and m′, the number of genes with measured expression levels, generalizes the methodology without introducing a bias toward overconfidence in network connections.
In other words, unless the number of genes measured is comparable to the number of genes that have not been ruled out as the gene dominating the gene of interest, adequate specification of the prior distribution makes it unlikely that any measured gene will have 50% or more posterior probability of influencing the expression of the gene of interest [Equation (9)]. This applies separately to each of the regulated genes of interest, the number of which is limited only by the availability of expression measurements and of computational resources. This methodology may seem overly stringent when compared with algorithms that would hypothesize networks of hundreds of genes even from data of questionable adequacy. While we have indeed guarded against putting undue confidence in causal interpretations of estimated association networks, we have not embraced the conservatism in traditional hypothesis testing that seeks to avoid all false discoveries. Based in part on the results of the Supplementary Material, we take the moderate position that some aspects of causal gene networks may be inferred with at least some degree of confidence using current technology. This is accomplished by coherently inferring parameters of kinetic models as protection against overstating or understating how much can be learned from gene expression time courses. Systems biology as a field is discredited by the publication of more and more large transcriptional networks without quantifying the extent to which such networks are justified by experimental data and what is already known about the systems. Even with such precautionary measures, more thorough modeling of the uncertainty may yield lower posterior probabilities of gene–gene influences than those of Equation (14):
Nonetheless, the second-order model performed well in validation (Table 1). One AUC estimate is 100%, reflecting the case in which the only putative transcription factor has a higher posterior probability of being the dominant regulator of a gene of interest than any of the other 42 genes. [Supplementary Data]
ACKNOWLEDGEMENTS

We are very grateful to Maria Fedorova for information on BMS cell lines; Hai Zhu for computational support; George Casella and Merlise Clyde for conversations regarding Bayesian statistics; Jon Lightner, Chris Martin, Mark Whitsitt and Bob Merrill for institutional support; Jennifer Chung for assistance with sample collection; Brian Zeka and John Nau for profiling technical support; Danielle Dewar-Darch for helpful information about yeast; Kent Vander Velden and Corey Yanofsky for insightful comments on the manuscript; and Carolina Perez-Iratxeta for a discussion on the merits of tag-based measurements of gene expression. In addition, Jean Peccoud provided indispensable leadership of the plant case study during the brainstorming and experimental design phases. We also thank Mark Cooper, Tim Helentjaris, Chris Zinselmeier and Dave Selinger for field-trial, association-network discussions that prompted us to seek causal interactions between genes. Finally, the suggestions of all of the anonymous reviewers led to a much more thorough manuscript.

Funding: Supported in part by the Canada Foundation for Innovation (CFI16604) and the Ministry of Research and Innovation (MRI16604).

Conflict of Interest: none declared.