Copyright © The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

System estimation from metabolic time-series data

Integrative BioSystems Institute and The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, 313 Ferst Drive, Atlanta, GA 30332, USA
*To whom correspondence should be addressed.
Associate Editor: Jonathan Wren
Received June 26, 2008; Revised August 27, 2008; Accepted August 29, 2008.

Abstract
Motivation: At the center of computational systems biology are mathematical models that capture the dynamics of biological systems and offer novel insights. The bottleneck in the construction of these models is presently the identification of model parameters that make the model consistent with observed data. Dynamic flux estimation (DFE) is a novel methodological framework for estimating parameters for models of metabolic systems from time-series data. DFE consists of two distinct phases, an entirely model-free and assumption-free data analysis and a model-based mathematical characterization of process representations. The model-free phase reveals inconsistencies within the data, and between data and the alleged system topology, while the model-based phase allows quantitative diagnostics of whether, or to what degree, the assumed mathematical formulations are appropriate or in need of improvement. Hallmarks of DFE are the facility to: diagnose data and model consistency; circumvent undue compensation of errors; determine functional representations of fluxes uncontaminated by errors in other fluxes; and pinpoint sources of remaining errors. Our results suggest that the proposed approach is more effective and robust than presently available methods for deriving metabolic models from time-series data.
Its avoidance of error compensation among process descriptions promises significantly improved extrapolability toward new data or experimental conditions.
Contact: eberhard.voit/at/bme.gatech.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION
The construction of a mathematical model occurs in five stages. The first consists of collecting ideas, data and information, which are converted into a conceptual model that is often visualized as a diagram with nodes and arrows. The second stage includes the choice of a mathematical modeling framework and the formulation of suitable equations. The goal of the third stage is the determination of numerical parameter values that make the model consistent with observations, while the fourth and fifth stages are dedicated to diagnostics and to model use, respectively. In most cases the process is iterative, requiring the return to earlier stages.

Arguably the most challenging task is the estimation of parameter values. Until recently, this task was typically pursued from the bottom up, by characterizing model components and processes one at a time and subsequently merging all ‘local’ descriptions into one comprehensive model. This procedure often failed, for unknown or speculative reasons, and if it succeeded, it was the product of excruciatingly slow and cumbersome effort if models of even moderate size were considered.

Recent advances in molecular and systems biology have provided us with a strikingly different estimation strategy, which is based on experimentally determined time series of observations at the genomic, proteomic or metabolic levels. These time profiles contain enormous information about the structure, dynamics and regulatory mechanisms that govern the biological systems of interest.
However, extraction and integration of this information into fully functional, explanatory models is a daunting task, and about one hundred articles have appeared within the past 10 years, each improving certain aspects of the estimation process. Most of them used regression, genetic algorithms, simulated annealing or different evolutionary approaches (Cho et al., 2006; Daisuke and Horton, 2006; Gonzalez et al., 2007; Kikuchi et al., 2003; Kim et al., 2006; Kimura et al., 2004, 2005; Noman and Iba, 2007) to attack the main problem of optimizing parameter values against the observed time-series data. Other papers developed support algorithms, for instance, for smoothing overly noisy data, characterizing basins of attraction containing solutions with minimal error, or circumventing the costly integration of differential equations (Almeida and Voit, 2003; Kimura et al., 2004; Kutalik et al., 2007; Maki et al., 2002; Tsai and Wang, 2005; Vilela et al., 2007; Voit and Almeida, 2004; Voit and Savageau, 1982). All of the estimation methods proposed to date face significant problems in four distinctly different classes:
Many articles have acknowledged and discussed various computational issues in great detail and some have addressed issues related to data and models. However, there has been little if any substantial discussion of model validity and quality beyond residual errors, except for the common statement that the estimated parameter set may not be unique.

Here, we propose a novel approach to estimating metabolic pathway systems, called dynamic flux estimation (DFE), which resolves several of the issues mentioned above. The approach consists of two distinct phases. The first consists of an entirely model-free and assumption-free data analysis that reveals inconsistencies within the data, and between data and the alleged system topology. The second phase addresses the mathematical formulation of the processes in the biological system. In contrast to all currently available methods, this phase allows quantitative diagnostics of whether, or to what degree, the assumed mathematical formulations are appropriate or in need of improvement.

DFE builds upon the tenets of stoichiometric (Gavalas, 1968; Heinrich and Schuster, 1996; Stephanopoulos et al., 1998) and flux balance analysis [FBA; for a review see (Palsson, 2006)] in that it focuses on the stoichiometry at all nodes in the investigated system to ensure conservation of mass and to estimate flux distribution at each instant in time. However, in DFE the system is typically not in a steady state or quasi steady state (Ishii et al., 2007b; Maki et al., 2002; Okamoto, 2008; Sekiyama and Kikuchi, 2007; Teixeira et al., 2008; Wittmann, 2007; Yang et al., 2002), and its transient dynamics is utilized as a crucial indicator of the regulation within the system.

Because DFE consists of two phases that include several steps, some of which are new, some computational, some logistic (e.g. the choice of mathematical representations in the second phase) and some using any of a variety of existing methods, its exact computational time requirements and accuracy of solution are difficult to assess against currently available methods. Nonetheless, our results suggest that the proposed approach is more effective and robust than presently available methods for deriving metabolic models from time-series data. Specifically, its combined model-free and model-based analyses avoid compensation among and within equations and therefore promise significantly improved extrapolability toward new data or experimental conditions (see Supplementary Material). Its diagnostic tools pinpoint causes of inadequate fits between model and data, and suggest either changes in assumptions related to model choice or the use of data as unmodeled ‘off-line data’.

In the following, we describe DFE and demonstrate its features with a series of successively more complicated (and more realistic) situations, beginning with an idealized, yet representative case, and ending with actual experimental observations describing fermentation in the bacterium Lactococcus lactis.

The proposed method requires time-series data that characterize the dynamics of the system variables. Such data are still relatively rare, but are being generated with increased frequency and quality. Some suitable datasets that already exist have been obtained with in vivo NMR (Neves et al., 2005), mass spectrometry (Ishii et al., 2007a) and other methods (Du et al., 2008). Furthermore, the prospect of efficacious methods of analysis becoming available may inspire experimentalists to generate more data of this type, which is technically possible and probably worth the effort, even if it is more expensive.
Since much of the advantage of DFE is the result of natural constraints among fluxes, DFE is particularly useful for metabolic systems, but less so for gene expression and protein interaction systems (see Section 4).

2 METHODS
DFE is a phased approach with well-defined outcomes for each step and rigorous checks and balances that ensure consistency of the solution (Table 1). We first present the overall concepts and then discuss each step in greater detail.
Each phase facilitates incremental development and analysis of the metabolic target model.

Phase I, which is entirely model free, consists of two distinct sets of activities yielding slope estimates and dynamic flux profiles. First, the experimental data are analyzed for mass/material balance and smoothed as necessary. Slope estimates can then be derived using different numerical techniques. Next, the pathway structure (i.e. the system topology) is used to generate a system of symbolic equations describing the dynamics of the system. Substituting the slope estimates into this system of equations yields a system that is linear in the fluxes at each time step t. This linear system can be solved at each time step to obtain dynamic (time-series) profiles of all fluxes in the system. These dynamic flux profiles can be checked for flux balances at the overall system level and at the level of each metabolite pool.

Phase II is model-based. Here, based on the flux profiles from the previous phase, one evaluates each plot of a flux versus its alleged substrates and modulators to analyze and choose between possible mathematical representations for each flux. Once a functional form has been chosen, its parameters can be fitted with a standard regression technique to obtain a fully parameterized kinetic model for the system. The goodness of fit can be evaluated independently for each flux function, and the same can be done for the overall system performance.

A wide array of robust numerical techniques is available for the computational aspects of each component of DFE, including data smoothing, slope estimation, the assessment of linear flux systems and linear/non-linear regression methods for parameter estimation.

The proposed DFE workflow (Fig. 1
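As a rough illustration of the model-free Phase I described above, the following Python sketch estimates slopes from time-series data and then solves the linear flux system at every time point. The two-metabolite chain, the stoichiometric matrix `N`, the synthetic concentration profiles and the function name `estimate_fluxes` are assumptions made purely for this example; they are not taken from the L.lactis pathway. The sketch further assumes noise-free data (so no smoothing or mass balancing is shown) and a square, invertible stoichiometric matrix.

```python
import numpy as np

def estimate_fluxes(t, X, N):
    """Model-free flux estimation (hypothetical helper, not from the paper).

    t : (n_times,) sampling times
    X : (n_times, n_metabolites) concentration time series
    N : (n_metabolites, n_fluxes) stoichiometric matrix, assumed square
        and invertible so that N v(t) = dX/dt has a unique solution
    Returns V : (n_times, n_fluxes) dynamic flux profiles.
    """
    # Finite-difference slope estimates for every metabolite.
    slopes = np.gradient(X, t, axis=0)
    # Solve the linear system N v = slope for all time points at once.
    return np.linalg.solve(N, slopes.T).T

# Toy chain: X1 --v1--> X2 --v2--> (out), so that
#   dX1/dt = -v1   and   dX2/dt = v1 - v2.
N = np.array([[-1.0,  0.0],
              [ 1.0, -1.0]])

# Synthetic "data" generated with v1 = X1 and v2 = 0.5 * X2 (assumed).
t = np.linspace(0.0, 5.0, 201)
X1 = np.exp(-t)
X2 = 2.0 * (np.exp(-0.5 * t) - np.exp(-t))
X = np.column_stack([X1, X2])

V = estimate_fluxes(t, X, N)
# Largest deviation from the true v1 at interior time points.
print(np.max(np.abs(V[1:-1, 0] - X1[1:-1])))
```

Note that the fluxes are recovered without any assumption about their functional form; only the topology encoded in `N` and the slopes enter the calculation. With noisy measurements, a smoothing step would have to precede the slope estimation, and a stoichiometric matrix that is not square and invertible would require a different treatment than `np.linalg.solve`.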

3 RESULTS
We applied DFE to four case studies that were inspired by data describing how the bacterium L.lactis converts glucose into lactate via the pathway shown in Figure 2

3.1 Idealized situation
We applied DFE first to idealized data (Fig. 2
Note that these dynamic flux profiles were obtained purely from knowledge of the system topology and our ‘experimental data’, yet without any assumptions regarding an underlying functional model. Mimicking a realistic situation, we were then interested in a numerical model and made the default assumption that all fluxes could be validly modeled with products of power-law functions, as is customary in Biochemical Systems Theory (BST). Using a symbolic power-law representation for each flux that included all contributing variables, the estimation of the kinetic orders and rate constant was straightforward, since each flux term becomes linear when represented in logarithmic coordinates. The dynamic model with these flux representations was integrated and its behavior closely matched that of the experimental time-series data (Fig. 2

3.2 Simulated data with noise
To test the robustness of the DFE approach against noise, we added 10% artificial pseudo-random noise (drawn from a uniform distribution) to the ideal dataset from Case 1 (Section 3.1). Due to the noise, the total mass in the system was no longer constant and required balancing, along with smoothing (see Section 2 and Fig 3

3.3 Simulated data with non-power-law terms
In the first two cases, the data-generating system was implemented with power-law representations. To test and demonstrate the diagnostic capabilities of DFE, we simulated the same system (without noise) with a non-power-law, sigmoidal glucose uptake function (Fig. 4
Next, we estimated slopes and solved the dynamic stoichiometric system as before. The estimated fluxes were notably different from those obtained in the earlier studies, especially at the initial time points (Supplementary Fig. S3a). Attempts to model this system of fluxes exclusively with power-law functions failed. Other methods would have had to stop at this point, simply concluding that the fit was sub-optimal. Even worse in some sense, the simultaneous fitting of all equations or of all terms within each equation would have led to error compensation between terms, thereby not only mis-fitting the sigmoidal flux but other fluxes as well (see Supplementary Material for general discussion). The overall fit might actually have been acceptable, but attempts to extrapolate the resulting numerical model to other datasets or conditions would have become problematic (see Supplementary Material for general discussion).

In contrast to this ‘system-wide distribution of error’, DFE prevented such distribution of error and allowed us to pinpoint the source of error accurately by enabling us to test every flux individually against any hypothesized functional representations. We executed this analysis with power laws, using linear regression in log space. The result was encouraging: all fluxes were reasonably well represented with power laws except for the uptake process (Flux v1). Evaluation of the flux plots for this reaction step (Fig. 4

3.4 Real data
Many methods seem to function well for artificial data, yet break down in the real world. We therefore used actual experimental NMR data from the L.lactis pathway (Figs 2
As a first check, we assessed the total mass in the raw experimental data at each time point and detected that they were significantly unbalanced (Fig. 5
It is worth noting that the residual error of this model may be larger than the error in a model that is optimized with standard methods, because a standard estimator has the freedom of distributing errors throughout some or all fluxes, which DFE does not permit. As a consequence, the total error in DFE may be higher, but the fit to each individual flux is more reliable.

4 DISCUSSION
Biological time-series data that characterize trends in gene expression, protein prevalence or the accumulation of metabolites in vivo are being generated with increased frequency and quality. They contain valuable information about the structure, dynamics and regulatory mechanisms that govern the behavior of cellular systems. However, this information is not explicit and requires extraction methods that are by no means straightforward. While many methods have been proposed over the past years, none of them is effective in all cases. Furthermore, the existing methods have not addressed questions of diagnostics beyond CPU time and goodness of fit.

We have here proposed DFE as a new approach that resolves at least some of the open issues in the estimation of metabolic pathway systems. The first, model-free and essentially assumption-free phase of DFE permits consistency checks within the metabolic time-series data and leads to numerical representations of fluxes as functions of the variables affecting them. The second, model-based phase allows the objective testing of functional forms for fluxes and is not within the repertoire of any of the existing methods. The two-phased approach thus permits rigorous, quantitative diagnoses of the metabolic data, the alleged pathway structure, the assumptions made in the choice of flux representations and the causes of residual errors.
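The per-flux testing of functional forms described above (and applied in Section 3.3 with linear regression in log space) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the helper name `fit_power_law`, the use of R² in log space as the diagnostic, and the two synthetic single-substrate fluxes (one true power law, one Hill-type sigmoid) are hypothetical choices for this example and are not the fluxes or diagnostics of the paper.

```python
import numpy as np

def fit_power_law(v, X):
    """Fit v = gamma * prod_j X_j**f_j by least squares in log space.

    v : (n,) positive flux values (e.g. one flux profile from Phase I)
    X : (n, m) positive values of the variables assumed to affect v
    Returns (gamma, f, r2): rate constant, kinetic orders, and the
    coefficient of determination of the log-linear fit.
    """
    # In log coordinates the power law is linear:
    #   ln v = ln gamma + sum_j f_j * ln X_j
    A = np.column_stack([np.ones(len(v)), np.log(X)])
    y = np.log(v)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return np.exp(coef[0]), coef[1:], r2

# Two hypothetical fluxes that depend on a single substrate x.
x = np.linspace(0.2, 3.0, 50)
v_power = 2.0 * x ** 0.5            # a true power law
v_sigmoid = x ** 4 / (1.0 + x ** 4)  # a sigmoidal (Hill-type) flux

g1, f1, r2_power = fit_power_law(v_power, x[:, None])
g2, f2, r2_sigmoid = fit_power_law(v_sigmoid, x[:, None])

print("power law:", g1, f1, r2_power)
print("sigmoid:  ", g2, f2, r2_sigmoid)
```

Because each flux is fitted on its own, a poor fit such as the one for the sigmoidal flux stays confined to that flux and flags it as a candidate for a different functional form, rather than being absorbed by the parameters of other fluxes.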

DFE eliminates compensation of error among terms and among variables, which has been a tremendously complex problem with other methods, especially when it comes to extrapolations with the estimated model (see Supplementary Material). While DFE very significantly reduces error compensation between equations and between flux terms, it still admits error compensation among the parameters within a given flux, independent of what representation is chosen. In the context of BST, this type of compensation between a rate constant and the kinetic orders is well known (Berg et al., 1996; Chou et al., 2007; Sands and Voit, 1996). For reliable extrapolations, the within-flux compensation should also be removed. This removal seems to require data covering wide ranges of variation, multiple datasets or additional information about some of the parameter values, for instance, from traditional enzyme kinetics. Illustrations and discussion of different types of error compensation are presented in the Supplementary Material.

It has been observed in related work that the strategy of replacing differentials with slopes may lead to good fits for the dynamics of each variable in isolation, yet cause problems when all estimated parameter values are entered into the differential equation model (Voit and Almeida, 2004). The reason is that even small deviations between data and model results in one variable can lead to an amplification of error in other equations. This issue occurs in DFE as well. However, in contrast to other methods, DFE allows diagnostic analyses of the solution. For instance, it turned out in Case 4 (Section 3.4) that Flux v2, which determines the degradation of G6P to FBP, was fitted quite well in isolation with a power-law function. Yet, embedded within the system of ordinary differential equations, the deviations in its variables, G6P and ATP, were sufficient to cause notably different flux values (Supplementary Fig. S4b and c).
In response to such a situation, one may ignore the differences, search for causes of the deviations, or substitute smoothed data for a troublesome flux in the form of an off-line process (Voit et al., 2005, 2006). A key feature of DFE is the requirement of time-series data that are sufficient to capture the dynamics of the system. It is in general difficult to say how many data points are needed for reliable estimations. The key reason is that there is no good, quantitative criterion for the complexity of a time course. In simple dynamic responses, such as monotonically saturating functions, a few data points may be enough to characterize a time trend with sufficient reliability. In other cases, such as the example demonstrated here, the number of time points needed is higher. It seems quite evident that the number very much depends on the complexity of the time course and the noise in the data. Importantly, the types of data required for DFE are becoming more commonplace because modern methods of molecular biology permit their measurement with a variety of already existing experimental methods. DFE is an estimation approach particularly geared towards metabolic pathway systems, which are better suited for this type of estimation than genomic or proteomic systems because of conservation of mass at all nodes. Furthermore, DFE focuses on parameter estimation rather than on the identification of structure and regulation in ill-characterized pathway systems. Issues needing further development are related to missing data, missing flux information, underdetermined stoichiometric matrices and ill-characterized systems topologies. [Supplementary Data]

ACKNOWLEDGEMENTS
The authors are grateful to Dr. Santos and Dr. Neves at ITQB, Portugal for sharing their in vivo NMR data on et al.
Funding: National Heart, Lung and Blood Institute Proteomics Initiative (Contract N01-HV-28181; D. Knapp, PI); a Molecular and Cellular Biosciences (Grant MCB-0517135; E.O.V., PI) from the National Science Foundation; Biological Energy Science Center grant from the Department of Energy (M. Keller, PI).
Conflict of Interest: none declared.

REFERENCES