Biophys J. Aug 1, 2008; 95(3): 1487–1499.
PMCID: PMC2479599

Group Contribution Method for Thermodynamic Analysis of Complex Metabolic Networks

Abstract

A new, to our knowledge, group contribution method based on the group contribution method of Mavrovouniotis is introduced for estimating the standard Gibbs free energy of formation (ΔfG°) and reaction (ΔrG°) in biochemical systems. Gibbs free energy contribution values were estimated for 74 distinct molecular substructures and 11 interaction factors using multiple linear regression against a training set of 645 reactions and 224 compounds. The standard error for the fitted values was 1.90 kcal/mol. Cross-validation analysis was utilized to determine the accuracy of the methodology in estimating ΔrG° and ΔfG° for reactions and compounds not included in the training set, and based on the results of the cross-validation, the standard error involved in these estimations is 2.22 kcal/mol. This group contribution method is demonstrated to be capable of estimating ΔrG° and ΔfG° for the majority of the biochemical compounds and reactions found in the iJR904 and iAF1260 genome-scale metabolic models of Escherichia coli and in the Kyoto Encyclopedia of Genes and Genomes and University of Minnesota Biocatalysis and Biodegradation Database. A web-based implementation of this new group contribution method is available free at http://sparta.chem-eng.northwestern.edu/cgi-bin/GCM/WebGCM.cgi.

INTRODUCTION

Thermodynamics is increasingly being applied to improve our understanding of the metabolism of microorganisms, especially in the context of constraint-based analysis of genome-scale models of microorganisms (1–4). Constraints based on the laws of thermodynamics have been applied for the determination of feasible ranges for the rates of biochemical reactions and the concentrations of metabolites (2,5). Methods for quantifying the feasible ranges for the Gibbs free energy change of reaction (ΔrG′) have been applied to the curation of new metabolic reconstructions (4), the systematic assessment of the degree of reversibility of metabolic reactions (6), and the evaluation of the feasibility of biodegradation reactions (S. D. Finley, L. J. Broadbelt, and V. Hatzimanikatis, unpublished). Numerous methods based on thermodynamic constraints and the laws of thermodynamics have also been applied in the study of the regulatory network of the cell (2,5,8). All these studies require that the standard Gibbs free energy change of reaction (ΔrG°) be known so that the degree of thermodynamic favorability of the reactions in these systems can be quantified.

Thermodynamic analysis of metabolism based entirely on experimentally measured ΔrG° data has been restricted to either small-scale systems (8) or small subsections of genome-scale systems (5,6) due to the limited amount of experimental data currently available. For example, in the latest genome-scale model of Escherichia coli (4), experimentally measured ΔrG° data are available for only 169 (8.1%) of the 2077 reactions in the model (4). Due to this scarcity of experimentally measured values of ΔrG°, methods for its estimation are often applied to fill in the gaps in the experimental data. One of the most prevalent techniques for estimating ΔrG° of biochemical reactions is the group contribution method of Mavrovouniotis (9,10). This method allows the rapid calculation of accurate estimations of ΔrG° and the standard Gibbs free energy of formation (ΔfG°) for a wide variety of biological reactions and compounds (1). Unlike the group contribution method of Benson (11), this method is tailored for aqueous organic chemistry taking place at neutral pH involving ionic species.

This group contribution method has been applied to the study of the thermodynamic feasibility of numerous native (12–16) and novel (17,18) metabolic pathways. The method has been utilized to estimate ΔfG° and ΔrG° for the majority of the compounds and reactions contained in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (19) and in an earlier genome-scale model of E. coli (1). The method has also enabled the development of thermodynamic metabolic flux analysis, a framework for the genome-scale thermodynamic analysis of metabolism that accounts for the effect of metabolite activity levels on the thermodynamic feasibility of biochemical reactions embedded in a metabolic network (2).

In all these applications, the group contribution method of Mavrovouniotis has been demonstrated to be capable of rapidly producing accurate estimates of ΔfG° and ΔrG° for many of the common metabolites in the central metabolic pathways. However, the method could not be used to estimate ΔfG° for molecules involving some sulfur, nitrogen, and halogen substructures commonly found in large, genome-scale metabolic models or in databases of biochemical reactions such as the BioCyc (20), Brenda (21), KEGG (22,23), and University of Minnesota Biocatalysis and Biodegradation Database (UM-BBD) (24). Additionally, ΔfG° estimations calculated using the group contribution method of Mavrovouniotis differ significantly from the literature values for ΔfG° of many phosphorylated compounds (25,26), and ΔrG° estimations differ significantly from experimentally observed ΔrG° values for reactions involving the formation (or destruction) of thioester bonds or the formation (or destruction) of conjugated double bonds. Finally, the method of Mavrovouniotis provides only a limited ability to quantify the uncertainty in the ΔfG° and ΔrG° estimates. Although the initial work by Mavrovouniotis provided 68% and 95% confidence intervals of 3 and 5 kcal/mol, respectively, for the overall uncertainty in all estimated ΔfG°, no confidence intervals were provided for the uncertainty in the estimated ΔrG°. Additionally, insufficient data were provided for the quantification of the uncertainty in each specific ΔfG° estimate calculated using the method. These limitations result in imprecise predictions of uncertainty in estimated ΔfG° and ΔrG° values.

We introduce here an updated and expanded group contribution method which utilizes a larger and more current training set of ΔrG° and ΔfG° data including new tables of thermodynamic data found in the National Institute of Standards and Technology (NIST) database (27) and in the work by Alberty (25,26), Thauer (28,29), and Dolfing (30,31). Due to the availability of additional data, group contribution energies were fit to a number of molecular substructures involving halogens, sulfur, and nitrogen (Table 1) that were not included in Mavrovouniotis's original work. The addition of these new molecular substructures to the group contribution method enables the estimation of ΔfG° for a wider variety of molecules. The method also includes a set of seven new interaction factors to account for the energy contributions of the various types of conjugated double bonds, thioester bonds, and vicinal chlorine atoms (see Methods). Finally, the uncertainty analysis performed allows the uncertainty of each estimated ΔrG° and ΔfG° to be determined based on the uncertainty in the constituent group contribution energies.

TABLE 1
Structural groups used in group contribution method

METHODS

Group contribution method

The group contribution method was developed as a means of estimating ΔrG° of a reaction based on the molecular structures of the compounds involved in the reaction (9–11). In group contribution methods, the molecular structure of a single compound is decomposed into a set of smaller molecular substructures based on the hypothesis that ΔrG° and ΔfG° can be estimated using a linear model where each model parameter is associated with one of the constituent molecular substructures (or groups) that combine to form the compound. To estimate ΔfG° of the entire compound, the contributions of each of the groups to this property are summed as follows:

$$\Delta_f G^{\circ}_{est} = \sum_{i=1}^{N_{gr}} n_i \, \Delta_{gr} G^{\circ}_i$$
(1)

where ΔfG°est is the estimated ΔfG°, ΔgrG°i is the contribution of group i to ΔfG°est, ni is the number of instances of group i in the molecular structure, and Ngr is the number of groups for which ΔgrG° is known (i.e., the total number of groups in our database). Similarly, ΔrG° is estimated by summing the contribution of each structural group created or destroyed during the reaction:

$$\Delta_r G^{\circ}_{est} = \sum_{i=1}^{m} \nu_i \left( \sum_{j=1}^{N_{gr}} n_{i,j} \, \Delta_{gr} G^{\circ}_j \right)$$
(2)

where ΔrG°est is the estimated ΔrG°, νi is the stoichiometric coefficient of species i in the reaction, ni,j is the number of instances of group j in species i, and m is the number of species involved in the reaction. The advantage of estimating ΔrG° using Eq. 2 instead of using the estimated formation energies is that any structural groups unchanged during the reaction cancel out of Eq. 2, and this can include groups for which ΔgrG° is unknown.
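
As a minimal sketch of how Eqs. 1 and 2 might be evaluated, the Python fragment below sums group contributions for one compound and for one reaction. The group labels, counts, and energies are placeholders invented for illustration (they are not the fitted values of Table 1), and the function names are hypothetical.

```python
# Hypothetical group energies in kcal/mol; placeholders, NOT values from Table 1.
GROUP_ENERGY = {"A": -10.0, "B": 2.5, "C": -4.0}


def formation_energy(group_counts, energies=GROUP_ENERGY):
    """Eq. 1: sum of (number of instances) x (group energy) for one compound."""
    return sum(n * energies[g] for g, n in group_counts.items())


def reaction_energy(stoich, compound_groups, energies=GROUP_ENERGY):
    """Eq. 2: sum the contributions of groups created or destroyed.

    stoich maps species -> stoichiometric coefficient (negative = reactant).
    Groups with zero net change cancel, so their energies are never looked
    up and may be absent from `energies`.
    """
    net = {}
    for species, nu in stoich.items():
        for g, n in compound_groups[species].items():
            net[g] = net.get(g, 0) + nu * n
    return sum(dn * energies[g] for g, dn in net.items() if dn != 0)


# Group "U" has no known energy but is unchanged by the reaction S -> P.
compounds = {
    "S": {"A": 1, "B": 2, "U": 1},
    "P": {"A": 1, "C": 2, "U": 1},
}
print(formation_energy({"A": 1, "B": 2}))              # -5.0
print(reaction_energy({"S": -1, "P": 1}, compounds))   # -13.0
```

In this toy case the formation energy of S or P could not be estimated because group "U" is unknown, yet the reaction energy can be, which is the cancellation property described above.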

Determination of the groups involved in a molecular structure

In keeping with the group contribution scheme developed by Mavrovouniotis, this implementation of the group contribution method involves two different kinds of energy contributions: i), contributions from the structural groups that combine together to form the structure of the molecule (Table 1), and ii), contributions from the interaction factors that account for the effect of the interactions between various structural groups on the ΔfG° of a molecule (Table 2). When calculating ΔfG°est of a compound using the group contribution method, the molecular structure of the compound is first broken down into the set of structural groups that combine to form the compound. ΔfG°est can be calculated for a compound only if every single atom involved in the molecular structure of the compound can be assigned to exactly one structural group for which ΔgrG° is known.

TABLE 2
Interaction factors used in the group contribution method

Some of the larger structural groups included in this group contribution method can be further broken down into smaller structural groups. These larger groups, called “characteristic groups”, were included in the method because the properties of these groups are significantly different from the summed properties of their smaller constituent structural groups. For example, the −COO group could be further broken down into the >C=O group and the −O group. However, the ΔgrG° of the −COO group is −83.1 kcal/mol, whereas the sum of the ΔgrG° values for the >C=O group and the −O group is −61.2 kcal/mol. The characteristic groups used in this method were originally developed by Mavrovouniotis based on expert knowledge of biochemistry and goodness of fit of the group contribution model to the available experimental data.

Because these characteristic groups exist, often multiple structural groups can be mapped to the same atoms in the molecular structure of a compound. For example, the carbon in a carboxylic acid functional group can be assigned to either the −COO group or the >C=O group. When these cases arise, the atoms should always be assigned to the structural group with the smallest search priority number, which is provided along with the equation M24 values in Table 1 (Fig. 1 A). The only exception to this rule concerns the phosphate chains found in molecules such as NAD(H) or ATP. If every phosphate in a phosphate chain is assigned to the structural group with the smallest priority number, then every phosphate that is not a terminal phosphate would be assigned to the equation M25 group. This leads to the assignment of the oxygen bridging two neighboring phosphates to two equation M26 groups, which violates the requirement that every atom be assigned to exactly one group. To avoid this violation, terminal phosphate chains (like the phosphate chain in ATP) involving n phosphorus atoms are always decomposed into one equation M27 group and (n − 1) equation M28 groups (Fig. 1 B). Similarly, internal phosphate chains (like the phosphate chain in NADH) involving n phosphorus atoms were always decomposed into one equation M29 group and (n − 1) −OPO2- groups. An algorithm for automatically breaking down molecular structures into the appropriate structural groups in the group contribution method is discussed in the work by Forsythe, Karp and Mavrovouniotis (32).
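
The priority rule and the phosphate-chain exception can be sketched as follows. This is only an illustration: the priority numbers are invented (the real ones are in Table 1), the function names are hypothetical, and the phosphate group labels are passed in as arguments rather than assumed.

```python
# Hypothetical search priority numbers; the actual values are listed in Table 1.
SEARCH_PRIORITY = {"-COO": 1, ">C=O": 2, "-O": 3}


def assign_group(candidate_groups, priority=SEARCH_PRIORITY):
    """Pick the candidate structural group with the smallest priority number."""
    return min(candidate_groups, key=lambda g: priority[g])


def phosphate_chain_groups(n_phosphorus, first_group, repeat_group):
    """Chain exception: one `first_group` plus (n - 1) `repeat_group` entries."""
    if n_phosphorus < 1:
        raise ValueError("a phosphate chain needs at least one phosphorus atom")
    return {first_group: 1, repeat_group: n_phosphorus - 1}


# A carboxylate carbon matches both groups; the smaller priority number wins.
print(assign_group([">C=O", "-COO"]))                      # -COO
# A three-phosphorus chain, with placeholder labels for the two groups.
print(phosphate_chain_groups(3, "P_first", "P_repeat"))    # {'P_first': 1, 'P_repeat': 2}
```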

FIGURE 1
Decomposition of molecular structures into structural groups and interaction factors. When assigning atoms in a molecular structure to structural groups, atoms should always be assigned to the structural group with the lowest search priority number ( ...

The molecular structures being decomposed into structural groups must also be in the form of the predominant ion for the molecule in the same conditions at which the fitting of the ΔgrG° values was performed: pH 7, zero ionic strength, and a temperature of 298 K. The predominant ions of all the molecules involved in the training set at pH 7 were determined using pKa estimation software (MarvinBeans pKa estimation plug-in, ver. 4.0.3, ChemAxon, Budapest, Hungary). When a molecule exists in multiple isomeric or resonance forms in equilibrium, such as keto-enol tautomers, the most stable form (the form resulting in the lowest ΔfG°est) is decomposed into structural groups. This ensures that the form of the molecule used to calculate ΔfG°est is the predominant form in solution.

Stereochemistry is ignored when labeling atoms in a molecule according to their structural groups. For example, all forms of sugars with six carbon atoms including glucose, galactose, and mannose, which have ΔfG° values of −219, −217, and −217 kcal/mol, respectively, are decomposed into exactly the same structural groups and interaction factors, and as a result, all these sugars have identical ΔfG°est values. This is a reasonable assumption given the similarity of the ΔfG° values.

Once every single atom in the molecular structure of a compound has been assigned to the proper structural group, the interaction factors must be determined. The ΔgrG° associated with each interaction factor is then added to the compound ΔfG°est to account for the effect of the interaction factor on the formation energy. There are seven types of interaction factors used in this implementation of the group contribution method (Fig. 1 C). Four of the interaction factors used were originally proposed in the group contribution method of Mavrovouniotis: the hydrocarbon factor, the heteroaromatic ring factor, the three-member ring factor, and the amide factor. The hydrocarbon factor is added to ΔfG°est of any compound that consists of only carbon and hydrogen. The heteroaromatic ring factor is added to ΔfG°est of a compound for every heteroaromatic ring in the compound, as determined according to Hückel's rule. Similarly, the three-member ring factor is added to ΔfG°est of a compound for every three-member ring in the compound regardless of the atoms that make up the ring. The amide factor is added to ΔfG°est of a compound for every instance of a nitrogen atom neighboring a carbonyl group in the compound. Note that if a nitrogen atom is neighboring two carbonyl groups, this is counted as a single amide factor.

Three new types of interaction factors were introduced in this implementation of the group contribution method that were not included in the method of Mavrovouniotis: the thioester factor, the double bond conjugation factors, and the vicinal Cl factor. The thioester factor is added to ΔfG°est of a compound for every instance of a sulfur atom neighboring a carbonyl group in the compound. This factor accounts for the high energy of the thioester bond (33). Like the amide factor, if a sulfur atom is neighboring two carbonyl groups, this is counted as a single thioester factor.
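
The counting rule shared by the amide and thioester factors (one factor per nitrogen or sulfur atom, however many carbonyl groups it neighbors) can be illustrated with a small graph sketch. The data layout (an element table plus a bond list of heavy atoms) and the function name are assumptions made for this example, and a carbonyl carbon is taken to be any carbon double-bonded to oxygen.

```python
def count_carbonyl_neighbor_factors(elements, bonds):
    """Count amide (N) and thioester (S) factors in a simple molecular graph.

    elements: dict atom_id -> element symbol
    bonds: iterable of (atom_a, atom_b, bond_order)
    An N or S atom bonded to one or more carbonyl carbons counts once.
    """
    neighbors = {a: set() for a in elements}
    carbonyl_carbons = set()
    for a, b, order in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
        # A C=O double bond marks its carbon as a carbonyl carbon.
        if order == 2 and {elements[a], elements[b]} == {"C", "O"}:
            carbonyl_carbons.add(a if elements[a] == "C" else b)
    counts = {"amide": 0, "thioester": 0}
    for atom, elem in elements.items():
        if elem in ("N", "S") and neighbors[atom] & carbonyl_carbons:
            counts["amide" if elem == "N" else "thioester"] += 1
    return counts


# Toy fragments: a nitrogen flanked by two carbonyls, and a sulfur next to one.
elements = {1: "C", 2: "O", 3: "N", 4: "C", 5: "O", 6: "S", 7: "C", 8: "O"}
bonds = [(1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 2), (6, 7, 1), (7, 8, 2)]
print(count_carbonyl_neighbor_factors(elements, bonds))
# {'amide': 1, 'thioester': 1}
```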

The conjugation of double bonds has a significant stabilizing effect on the molecular structure of a molecule, making the removal of a conjugated double bond more difficult than the removal of an isolated double bond (33). Without any interaction factor for double bond conjugation, the group contribution method has no means of capturing these characteristics of conjugated double bonds. Therefore, the double bond conjugation factors were introduced to account for the stabilizing effect of double bond conjugation on a molecular structure. Ten forms of conjugated double bonds are possible in a molecular structure containing C, N, and O, and a separate double bond conjugation factor was initially introduced for each of these 10 forms (Table 3). Five of these forms were not included in the final implementation of the method due to a lack of data or because the conjugation factor was statistically insignificant (see Table 3 and Results). Note that double bond conjugation factors are not added for conjugated double bonds that are contained completely within an aromatic or heteroaromatic ring.

TABLE 3
Interaction factors for conjugated double bonds

The vicinal Cl factor was introduced based on the examination of the effect of chlorine substitution on the ΔfG° of aliphatic compounds performed by Dolfing and Janssen (31). Dolfing and Janssen proposed that chlorine atoms attached to neighboring carbon atoms have a destabilizing effect on one another, and an interaction factor is required to account for this destabilization to accurately estimate the ΔfG° of chlorinated compounds using the group contribution method. The vicinal Cl factor is an implementation of the interaction factor proposed by Dolfing and Janssen, and two variations of this interaction factor were explored. The first variation implemented, VCldistinct, is based on the hypothesis that a larger number of chlorine atoms attached to neighboring carbons results in a larger destabilizing effect, described mathematically as follows:

equation M48
(3)

where VCldistinct is the total value of the correction for the interaction of vicinal chlorine atoms that is added to ΔfG°est, ΔgrG°VCl is the group contribution energy for the vicinal Cl interaction factor, NC is the number of carbon atoms in the molecule, NCl,i is the number of chlorine atoms attached to carbon atom i, and δij is the Kronecker delta, a binary variable equaling zero unless carbon atom i is bonded to carbon atom j.

The second variation of the vicinal Cl interaction factor, VClbinary, is based on the hypothesis that the destabilizing effect of vicinal chlorine atoms is independent of the number of chlorine atoms attached to each of the neighboring carbons (Fig. 1 C), described mathematically as follows:

equation M50
(4)

Both variations of the vicinal Cl interaction factor were tested, and the VCldistinct interaction was selected for the final implementation of the method because it resulted in the best possible fit of the thermodynamic data included in the training set (see Results).

Multiple linear regression

The multiple linear regression (MLR) method (least squares) was used to determine the ΔgrG° values for the set of structural groups and interaction factors that allow the best fit of the observed ΔrG° (ΔrG°obs) and observed ΔfG° (ΔfG°obs) values included in a training set. The ΔgrG° values are calculated using the following:

$$\Delta_{gr} G^{\circ} = (X'X)^{-1} X' \, \Delta G^{\circ}_{obs}$$
(5)

where ΔgrG° is an Ngr × 1 vector of the energies associated with each group in the group contribution method, X is an Nobs × Ngr matrix of the number of each group contained in each molecular structure or created or destroyed in each reaction in the training set, X′ is the transpose of matrix X, Nobs is the number of ΔrG°obs and ΔfG°obs values included in the training set, and ΔG°obs is an Nobs × 1 vector of ΔrG°obs and ΔfG°obs values included in the training set (34).
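
The least-squares fit described above amounts to solving the normal equations (X′X) g = X′ y for the vector of group energies g. The following pure-Python sketch does this for a toy problem; the helper names and the three-row data set are invented for illustration, and the solver assumes X′X is nonsingular (i.e., the group columns are linearly independent).

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_group_energies(X, y):
    """Least-squares group energies from the normal equations (X'X) g = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * v for x, v in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy training set: rows 1-2 are compounds (group counts), row 3 is a
# reaction that destroys two instances of the second group.
X = [[1, 0], [1, 2], [0, -2]]
y = [-10.0, -5.0, -5.0]
print([round(v, 6) for v in fit_group_energies(X, y)])  # [-10.0, 2.5]
```

For a training set of realistic size, a numerically more robust routine (for example a QR- or SVD-based least-squares solver) would normally be preferred over forming X′X explicitly.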

MLR is the ideal technique for producing ΔgrG° values that optimally fit the training set only if the data included in the training set satisfy the following two conditions: (MLR.I) ΔrG° and ΔfG° must be linearly related to the model parameters (the ΔgrG° values) and the differences between the ΔG°est and ΔG°obs for each data point in the training set must be uncorrelated, and (MLR.II) the absolute uncertainty in each of the ΔrG° and ΔfG° observations included in ΔG°obs must be similar in magnitude (34). The random distribution of the residuals of the fit indicates that condition MLR.I is satisfied (see Fig. 2 B). The discussion of the uncertainty in the training set data explains why condition MLR.II is also satisfied by the data (see the section “Uncertainty in training set data”).

FIGURE 2
Distribution of residuals from the MLR fitting of the training set. Cumulative distribution (A) and histogram (B) of the deviations between ΔG°est calculated using the fitted ΔgrG° values and the ΔG°obs values in the training set. The cumulative probability for the deviations ...

Quantification of the goodness of fit

The goodness of the MLR fit is quantified using the standard deviation of the differences between ΔG°est and ΔG°obs for the compounds and reactions involved in the training set, SEMLR (34):

$$SE_{MLR} = \sqrt{\frac{R'R}{N_{obs} - N_{gr}}}$$
(6)

where R′ is the transpose of the vector R, and R is the vector of residuals for the fit, calculated as follows (34):

$$R = \Delta G^{\circ}_{obs} - X \, \Delta_{gr} G^{\circ}$$
(7)

If the differences between ΔG°est and ΔG°obs in the training set follow a normal distribution, then 68% of the residuals will be less than or equal to SEMLR in magnitude (34). The SEMLR is also used to assess the effect of removal or addition of interaction factors on the group contribution scheme (see the section “Whole model and individual parameter validation”).
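
A short sketch of the goodness-of-fit calculation follows. It assumes the residual is taken as observed minus estimated and that the sum of squared residuals is divided by the usual degrees of freedom, Nobs − Ngr; both conventions, the function names, and the numbers are assumptions made for illustration.

```python
import math


def residuals(X, y, g):
    """Residual vector: observed value minus the value estimated from g."""
    return [yi - sum(x * gj for x, gj in zip(row, g)) for row, yi in zip(X, y)]


def se_mlr(res, n_groups):
    """Standard error of the fit, sqrt(R'R / (Nobs - Ngr))."""
    dof = len(res) - n_groups
    return math.sqrt(sum(r * r for r in res) / dof)


# Illustrative group energies and observations (not fitted values).
X = [[1, 0], [1, 2], [0, -2], [1, 1]]
g = [-10.0, 2.5]
y = [-9.0, -6.0, -5.0, -7.5]

R = residuals(X, y, g)
se = se_mlr(R, n_groups=len(g))
within = sum(abs(r) <= se for r in R) / len(R)
print(R)       # [1.0, -1.0, 0.0, 0.0]
print(se)      # 1.0
print(within)  # 1.0
```

The last quantity is the fraction of residuals whose magnitude does not exceed SEMLR, which for a large, normally distributed sample would be expected to be close to 68%.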

Formation of the training set for the MLR

The ΔG°obs values used in the training set for the MLR involved a total of 3153 ΔrG°obs values and 288 ΔfG°obs values. The ΔrG°obs and ΔfG°obs values in the training set were pulled from a variety of literature sources including work on methanogenesis by Thauer (28,29), work on halogen thermodynamics by Dolfing and co-workers (30,31), work on formation energy standardization and redox potentials by Alberty (25,26), and thermodynamic data compiled in the NIST (27) and National Bureau of Standards (NBS) (35) databases. The experimentally measured ΔrG°obs values reported in these references were captured under a variety of temperature and pH conditions. Only data captured within one pH unit and 15 K of the chosen reference state of pH 7 and 298 K were utilized. Most of the data utilized were collected within 1 K of 298 K and 0.1 pH units of pH 7 (Fig. 3). Overall, 645 distinct biochemical reactions are represented in the 3153 ΔrG°obs values used in the training set, meaning that multiple data points were included for many reactions. Similarly, 224 distinct molecular structures are represented by the 288 ΔfG°obs values used in the training set. When multiple data points existed for single reactions or compounds, we used all data points in the data set rather than averaging the data and including the average. By using all the data points instead of the average, the variability in the data is included in the residuals, covariance matrix, and standard deviation for the fit, which results in a better quantification of the uncertainty in the group free energy values. All ΔrG°obs and ΔfG°obs values included in the training set are listed in Supplementary Material, Data S2 along with the associated reactions and compounds.
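
The acceptance window for measurements can be written as a simple filter. In this sketch the record layout, the field names, and the sample values are invented, and the window boundaries are treated as inclusive, which the text does not specify.

```python
REF_PH, REF_T = 7.0, 298.0  # reference state: pH 7 and 298 K


def within_reference_window(ph, temp_k, max_dph=1.0, max_dt=15.0):
    """True if a measurement lies within 1 pH unit and 15 K of the reference."""
    return abs(ph - REF_PH) <= max_dph and abs(temp_k - REF_T) <= max_dt


# Invented measurements; repeated measurements of one reaction are all kept.
measurements = [
    {"rxn": "r1", "pH": 7.0, "T": 298.15, "dG": -3.2},
    {"rxn": "r1", "pH": 7.4, "T": 310.0, "dG": -3.0},
    {"rxn": "r2", "pH": 8.5, "T": 298.0, "dG": 1.1},   # pH too far from 7
    {"rxn": "r3", "pH": 6.5, "T": 320.0, "dG": 4.0},   # T too far from 298 K
]
kept = [m for m in measurements if within_reference_window(m["pH"], m["T"])]
print([m["rxn"] for m in kept])  # ['r1', 'r1']
```

Both accepted r1 measurements stay in the list as separate data points, mirroring the choice above to use every data point rather than an average.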

FIGURE 3
pH, temperature, and ΔrG°obs distributions for the ΔrG°obs data within the training set. The distributions of pH (A), T (B), and ΔrG°obs (C) values for the 3153 ΔrG°obs measurements used in the training set to determine the group contribution energies are shown. The most prevalent ...

Uncertainty in training set data

To estimate the total uncertainty in each ΔrG°obs and ΔfG°obs data point included in the training set, the sources of uncertainty were enumerated and quantified. The total uncertainty in the ΔfG°obs values included in the training set was estimated from the precision of the reported ΔfG°obs values; the reported precision in the ΔfG°obs values ranges from 0.01 to 1 kcal/mol, implying that the absolute uncertainty in the ΔfG°obs values ranges from 0.005 to 0.5 kcal/mol (26,28,35,36). The total uncertainties in the ΔrG°obs values included in the training set were calculated from four primary sources: (UC.I) uncertainty in the method used to measure the equilibrium constant, (UC.II) uncertainty due to differences between the ionic strength at which each ΔrG°obs was measured and the reference ionic strength of zero, (UC.III) uncertainty due to differences between the pH at which each ΔrG°obs was measured and the reference pH of 7, and (UC.IV) uncertainty due to differences between the temperature at which each ΔrG°obs was measured and the reference temperature of 298 K.

Most of the ΔrG°obs values included in the training set were measured using spectroscopy, which has a typical precision of 1%–3% of the measured values when used to determine equilibrium constants (37). This translates into an absolute uncertainty of <0.30 kcal/mol for 95% of the ΔrG°obs values included in the training set. Uncertainty due to deviations of the conditions for the ΔrG°obs measurements from the reference ionic strength of zero was determined using the extended Debye-Hückel equation as described in Maskow and Stockar (12). For 95% of the reactions in the training set, this uncertainty was <1.45 kcal/mol when the deviation in ionic strength was <0.2 M. The absolute uncertainty due to deviations in the conditions for the ΔrG°obs measurements from the reference pH of 7 was <1.49 kcal/mol for 95% of the training set reactions within the allowed pH ranges (pH 6–8), as calculated using the methods described by Alberty (26). Uncertainties due to pH and ionic strength deviations are independent of the reference ΔrG°obs value.

As ΔrG°obs measurements were accepted into the training set if measured within 15 K of the reference temperature of 298 K, deviations of the ΔrG°obs measurement conditions from the reference temperature were another source of uncertainty in the ΔrG°obs values. A rearranged version of the Gibbs-Helmholtz relationship was utilized to determine how temperature changes affect ΔrG° of a reaction:

equation M117
(8)

where ΔrH° is the standard enthalpy change of reaction. Although measured ΔrH° values are unavailable for most of the reactions contained in the training set, the (1 − ΔrH°/ΔrG°) term in the Gibbs-Helmholtz relationship will typically have a maximum value of one for biochemical reactions. Based on this assumption, a 15 K maximum temperature change results in a maximum change of 5.7% in ΔrG°. This translates into an absolute error of <0.57 kcal/mol for 95% of the ΔrG°obs values in the training set. Overall, for 95% of the ΔrG°obs values included in the training set, the total absolute uncertainty is >0.1 kcal/mol and <2.2 kcal/mol, which satisfies the MLR.II condition that the uncertainty of all ΔG°obs values in the training set be similar in magnitude.

Quantification of the uncertainty in the ΔgrG° values

The uncertainties in the ΔgrG° values estimated using MLR were quantified using the covariance matrix of the MLR, which allowed the calculation of a standard error for each ΔgrG° in the group contribution method as follows (34):

$$SE_{gr,i} = SE_{MLR} \sqrt{\left[ (X'X)^{-1} \right]_{ii}}$$
(9)

where SEgr,i is the standard error for the group contribution value of group i, ΔgrG°i. The SEgr,i values can be used to quantify the uncertainty in the estimated Gibbs free energy of formation and reaction, SEΔG°est, calculated by taking the Euclidean norm of the uncertainties in each group ΔgrG° value multiplied by the number of instances of each group involved in the molecular structure or reaction (34):

$$SE_{\Delta G^{\circ}_{est}} = \sqrt{\sum_{i=1}^{N_{gr}} \left( n_i \, SE_{gr,i} \right)^2}$$
(10)
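
The propagation step described above (a Euclidean norm of each group's standard error weighted by its count) can be sketched as below. The group labels and standard errors are invented for illustration, and the function name is hypothetical.

```python
import math


def estimate_uncertainty(group_counts, se_gr):
    """Euclidean norm of (count x group standard error) over the groups involved.

    group_counts holds the number of instances of each group in a compound,
    or the net number created (+) or destroyed (-) in a reaction.
    """
    return math.sqrt(sum((n * se_gr[g]) ** 2 for g, n in group_counts.items()))


# Hypothetical per-group standard errors in kcal/mol.
se_gr = {"A": 0.75, "B": 1.0, "C": 0.5}

# A reaction that creates one "A" and destroys one "B".
print(estimate_uncertainty({"A": 1, "B": -1}, se_gr))  # 1.25
```

Groups that are unchanged by a reaction have a net count of zero and therefore add nothing to the estimated uncertainty.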
(10)

Whole model and individual parameter validation

An F-test was performed to validate the use of the linear group contribution model to estimate ΔrG° and ΔfG° for the data included in the training set. The F-test indicates whether or not the variability in the ΔG°obs values within the training set that is captured by the group contribution model is statistically significant compared to the variability not captured by the model (the variances between ΔG°est and ΔG°obs) (34). If the location of the F-value in the F-cumulative distribution function corresponds to a probability value >90%, the linear model is accepted.

A t-test was also used to validate the inclusion of each interaction factor in the group contribution model. The t-test indicates whether the value of ΔgrG° for an interaction factor is statistically significant compared to the uncertainty in the ΔgrG° value, SEgr,i (34). The interaction factor was retained as a part of the model if the location of its t value in the Student's t cumulative distribution function corresponds to a probability value of <5% (34). Although t-tests were performed on the structural groups as well, structural groups with high t-test values were not removed from the model because they were required for the complete decomposition of the molecular structures involved in the training set. For example, although the >C= group that participates in two fused aromatic rings has a ΔgrG° of −0.0245 kcal/mol and an SEgr,i of 0.927 kcal/mol, resulting in a t-test value of 0.98, it is retained because it is required for the complete decomposition of molecules involving fused aromatic rings.

Interaction factors with t-tests of over 5% (indicating insignificantly small ΔgrG° values) were eliminated from the final implementation of the group contribution method because interaction factors are not required for the complete decomposition of a molecular structure and removal of an interaction factor with a ΔgrG° of zero results in little or no increase in SEMLR of the fitting. However, passing the t-test does not guarantee that the addition of an interaction factor results in any significant reduction in SEMLR for the fitting. Therefore, in addition to performing a t-test for each interaction factor, the SEMLR with and without the interaction factor was also calculated as a measure of the effect of the interaction factor on the goodness of fit. Details of how the t-tests and F-test were calculated are provided in Data S1.

Cross-validation analysis

A cross-validation analysis of the training set used for the fitting was performed to validate the ability of the group contribution method to produce ΔfG°est and ΔrG°est estimates for compounds and reactions outside the training set with the same degree of accuracy as the ΔfG°est and ΔrG°est estimates for compounds and reactions within the training set. Two hundred distinct cross-validation runs were performed. In each run, 10% of the 869 distinct reactions and compounds involved in the training set were selected at random, and all the ΔG°obs values associated with each of the selected compounds and reactions were removed from the training set. When a compound was removed from the training set, the ΔfG°obs values associated with the stereoisomeric forms of the compound were also removed from the data set. However, reactions involving the removed compounds were left in the training set unless they were also randomly selected for removal. MLR was then performed on the data remaining in the training set to produce a new set of ΔgrG° values. The SEMLR, ΔG°est, and R were all calculated for the data included and excluded from the reduced training set using the new set of ΔgrG° values.
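
One cross-validation run begins by holding out every data point belonging to a random 10% of the distinct reactions and compounds. The sketch below shows only that selection step under simplifying assumptions: observations are (entity_id, value) pairs with invented identifiers, the function name is hypothetical, and the additional removal of stereoisomeric forms described above is not modeled.

```python
import random


def split_holdout(observations, fraction=0.10, seed=0):
    """Hold out all observations of a random fraction of distinct entities.

    observations: list of (entity_id, value) pairs; an entity (a reaction
    or a compound) may appear several times.
    Returns (training observations, held-out observations, held-out ids).
    """
    rng = random.Random(seed)
    entities = sorted({entity for entity, _ in observations})
    k = max(1, round(fraction * len(entities)))
    held = set(rng.sample(entities, k))
    train = [obs for obs in observations if obs[0] not in held]
    test = [obs for obs in observations if obs[0] in held]
    return train, test, held


# Twenty invented entities with two measurements each.
data = [(f"e{i:02d}", float(i + j)) for i in range(20) for j in range(2)]
train, test, held = split_holdout(data, seed=1)
print(len(held), len(train), len(test))  # 2 36 4
```

Repeating the split with 200 different seeds, refitting on `train` each time, and scoring on `test` would follow the procedure outlined in the text.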

RESULTS

Development of the improved group contribution method

The new, to our knowledge, group contribution method introduced here consists of 74 molecular substructures (called structural groups) and 11 factors to account for interactions between molecular substructures (called interaction factors) for which group contribution energies (ΔgrG°) are provided (Tables 1 and 2). The ΔgrG° values provided were determined based on an MLR fitting of a training set consisting of 224 compounds with 288 known ΔfG° values and 645 reactions with 3153 known ΔrG° values. The standard error for the fit of the group contribution model to this training set was 1.90 kcal/mol.
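
As a minimal sketch of how a group contribution estimate is assembled, the example below sums group energies weighted by their counts to get a formation energy, then combines formation energies with stoichiometric coefficients to get a reaction energy. The group names, compounds, and energy values are hypothetical placeholders, not entries from Tables 1 or 2.

```python
def formation_energy(group_counts, group_energies):
    """Estimate a formation energy as the sum of count x group energy."""
    return sum(n * group_energies[g] for g, n in group_counts.items())


def reaction_energy(stoichiometry, compounds, group_energies):
    """Estimate a reaction energy from stoichiometry-weighted formation energies.

    Stoichiometric coefficients are negative for reactants, positive for products.
    """
    return sum(nu * formation_energy(compounds[c], group_energies)
               for c, nu in stoichiometry.items())


# Hypothetical group energies (kcal/mol) and group decompositions.
group_energies = {"A": -3.0, "B": 1.5, "C": -10.0}
compounds = {
    "X": {"A": 1, "B": 2},
    "Y": {"A": 1, "C": 1},
}

print(formation_energy(compounds["X"], group_energies))                 # 0.0
print(reaction_energy({"X": -1, "Y": 1}, compounds, group_energies))    # -13.0
```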

Although this new group contribution method is based on the previous group contribution method developed by Mavrovouniotis (9,10), the new method is a significant improvement over the previous method both in the range of biochemical compounds and reactions for which ΔfG° and ΔrG° may be estimated and in the accuracy of the ΔfG° and ΔrG° estimates generated. The expanded applicability of this new group contribution method is due to the addition of 20 new structural groups to the method. When restricted to the structural groups included in the previous group contribution method, ΔfG° could be estimated for only 65% of the compounds and ΔrG° could be estimated for only 97% of the reactions in the training set for the new method. In contrast, the new method allows the estimation of ΔfG° and ΔrG° for 100% of the compounds and reactions in the training set.

The expanded applicability of the new group contribution method also extends to large databases of known biochemical reactions such as the KEGG, UM-BBD, iAF1260 (4), and iJR904 (38). The application of the current and previous group contribution methods to the estimation of ΔfG° and ΔrG° for these databases is discussed in detail later (see the section “Estimating ΔG° of known biochemical reactions”). For the compounds and reactions in the training set for which ΔfG° and ΔrG° could be estimated using the previous group contribution method, the standard error of the estimates generated by the previous method was 3.92 kcal/mol, compared to a standard error of 1.98 kcal/mol for the estimates generated by the new group contribution method. This difference in standard error confirms that the accuracy in the ΔG′° estimates produced using the new group contribution method is significantly improved.

Results from MLR fitting

To assess the goodness of fit of the new group contribution method to the training set of available thermodynamic data, the distribution of the residuals of the fit (the deviations between the estimated and the observed ΔG° values) was analyzed (Fig. 2). Analysis of the cumulative distribution of the residuals indicated that 85% and 96% of the estimated ΔG° values for the training set fall within one and two standard deviations of the observed ΔG° values, respectively (Fig. 2 A). This agrees well with the confidence intervals expected if the residuals from the training set were normally distributed (68% and 95% within one and two standard deviations, respectively). The distribution of residuals for the training set (shaded bars in Fig. 2 B) is also similar to a normal distribution with the same mean and standard deviation (dashed line in Fig. 2 B). The high peak in the distribution of the residuals above the expected normal distribution indicates the presence of a small number of outlying data points with uncharacteristically large errors that are causing the standard deviation to be larger than would be expected. Although the reactions and compounds associated with each of these outlying data points were carefully analyzed, no clear trends emerged to indicate the need for any additional structural groups or interaction factors in the group contribution method.
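
The coverage check described above can be sketched as below: count the share of residuals whose magnitude is within k standard deviations and compare it with the share expected for a normal distribution. The residuals are made-up numbers, and measuring each residual against zero is an assumption of this sketch.

```python
import math


def fraction_within(residuals, k):
    """Fraction of residuals with magnitude at most k sample standard deviations."""
    mean = sum(residuals) / len(residuals)
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (len(residuals) - 1))
    return sum(abs(r) <= k * sd for r in residuals) / len(residuals)


def normal_fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations."""
    return math.erf(k / math.sqrt(2.0))


# Hypothetical residuals (kcal/mol) with a single large outlier.
residuals = [-0.5, 0.2, 0.1, -0.3, 0.4, 0.0, -0.1, 0.3, -0.2, 6.0]

for k in (1, 2):
    print(k, fraction_within(residuals, k), round(normal_fraction_within(k), 4))
```

In this toy data set the single outlier inflates the standard deviation, so more residuals fall within one standard deviation than a normal distribution would predict, which is the same effect the text attributes to the outlying data points.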

F-tests and t-tests were also performed to validate the group contribution method as a whole and to validate each of the interaction factors included in the group contribution method (see Methods). The F-value calculated for the method corresponded to a probability of 100% on the F-cumulative distribution curve, indicating that the method passes the F-test. Additionally, all the t-tests for the interaction factors included in the final implementation of the new method scored below 5%, indicating that the ΔgrG° values for these factors are statistically significant.

The uncertainties in the ΔgrG° values of the structural groups (Table 1) and interaction factors (Table 2) (SEgr,i) were utilized to calculate the specific uncertainty in the estimated ΔfG° or ΔrG° for each data point in the training set (see Methods). We found that 73% and 87% of the observed ΔG° values in the training set fell within one and two of these estimate-specific uncertainties of the estimated ΔG° values, respectively, validating that the estimate-specific uncertainty calculated from the SEgr,i values provided for the individual structural groups and interaction factors is an effective predictor of the uncertainty in the estimated ΔG°. Furthermore, 93% and 99% of the estimate-specific uncertainties for the training set were lower than one and two SEMLR, respectively, verifying that using the estimate-specific uncertainty as the uncertainty in each estimated ΔG° provides tighter bounds than using the overall SEMLR as the uncertainty estimate for every estimated ΔG°.
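
One plausible way to turn group-level uncertainties into an uncertainty for a single estimate is a root-sum-of-squares over the groups, weighted by how many times each group appears. The equation the authors actually use (Eq. 10 in Methods) is not shown in this section, so the formula below is an assumption that treats group errors as independent; the names and values are hypothetical.

```python
import math


def estimate_uncertainty(group_counts, group_se):
    """Root-sum-of-squares uncertainty, assuming independent group errors.

    group_counts maps each group to its count (or net change, for a reaction);
    group_se maps each group to its standard error (the SEgr,i in the text).
    """
    return math.sqrt(sum((n * group_se[g]) ** 2 for g, n in group_counts.items()))


# Hypothetical standard errors (kcal/mol); not values from Tables 1 or 2.
group_se = {"A": 1.0, "B": 1.0}

print(estimate_uncertainty({"A": 3, "B": 4}, group_se))  # 5.0
print(estimate_uncertainty({"A": 0, "B": 4}, group_se))  # 4.0
```

Under this assumption, a group whose net count is zero adds nothing to the uncertainty, which is one reason estimate-specific uncertainties can be tighter than a single overall standard error.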

Results of cross-validation analysis

In addition to assessing the accuracy of the ΔfG° and ΔrG° estimates generated by the new group contribution method for the compounds and reactions included in the training set, we also performed a cross-validation analysis to assess the ability of the new method to estimate ΔfG° and ΔrG° for compounds and reactions outside the training set. After 200 cross-validation runs were performed (see Methods), the standard error for the data excluded from the training set in the cross-validation runs (SEExcluded) was compared to the standard error for the data remaining in the training set (SEMLR) (Fig. 4 A). The overall SEExcluded for all the cross-validation runs was 2.22 kcal/mol, which is only about 17% higher than the SEMLR for the entire training set (1.90 kcal/mol). These results indicate that the accuracy of the ΔG° estimates for the data included in and excluded from the training set is similar. Additionally, the distributions of the residuals for the data excluded from the training set (shaded bars in Fig. 4 B) and the data included in the training set (solid bars in Fig. 4 B) are nearly identical, further confirming that the accuracy of the ΔfG° and ΔrG° estimates for the data included in and excluded from the training set is similar.

FIGURE 4
Characterization of residuals from the cross-validation analysis. Characterization of residuals for the data associated with the 10% of the reactions and compounds removed from the training set during each cross-validation run. The standard deviation ...

To assess the sensitivity of the ΔgrG° values included in the group contribution method to the training set used to fit the method, we studied the variance of these values during the 200 cross-validation runs (Fig. 5). The median ΔgrG° value calculated for each group during the cross-validation analysis never differed from the final reported ΔgrG° value by more than 0.5 kcal/mol. Furthermore, 50% of the ΔgrG° values calculated for each group typically fell within 1.0 kcal/mol of the final reported value and always fell within 2 kcal/mol of the final reported value (Fig. 5). These results indicate that the sensitivity of the ΔgrG° values to the training set used to fit this group contribution method is within the same order of magnitude as the uncertainty in the ΔgrG° values.

FIGURE 5
Variation of group energy values during cross-validation analysis. The differences between the final reported ΔgrG° value and the median of the ΔgrG° values (equation M188) calculated during the 200 cross-validation runs for each structural group and interaction factor included ...

We also examined the accuracy of the estimate-specific uncertainties calculated for the estimated ΔG° values of the data excluded from the training set. The residual (difference between the estimated and the observed ΔG°) of each excluded data point was compared to the estimate-specific uncertainty for the same data point, and it was found that 62%, 75%, and 88% of the residuals were less than one, two, and four times this uncertainty, respectively. This study indicates that when estimating uncertainty in the estimated ΔG° for compounds and reactions not included in the data set, uncertainties of two and four times the estimate-specific uncertainty will provide approximately the same confidence interval as one and two standard deviations for normally distributed residuals. As a conservative limit, the overall SEExcluded from the cross-validation runs (2.22 kcal/mol) may be used for the uncertainty in any estimated ΔG° value, as has been previously proposed by Mavrovouniotis.

Contribution of the conjugation interaction factors

One significant advance in this new group contribution method compared with previous methods is the addition of interaction factors to account for the effect of double bond conjugation on the ΔfG° and ΔrG° values. Initially, one new interaction factor was introduced into the group contribution method for each of the types of double bond conjugation possible between carbon, oxygen, and nitrogen atoms (Table 3). Double bond conjugation involving sulfur atoms was not considered, as such structures are less common in biochemistry. Two interaction factors, NCNC and CNNC, were removed from the method before the fitting, as no example of this class of double bond conjugation was found in any of the molecules within the training set. When MLR was used to determine the ΔgrG°, SEgr, and t-test values for each of the interaction factors (Table 3), it was found that the interaction factors NCCN and CCNC both had t-tests well over 10%, indicating that these interaction factors were not statistically significant. Additionally, the interaction factor OCNC had an insignificant effect on the SEMLR for the fitting. For these reasons, these interaction factors were also removed from the method. All the remaining interaction factors had statistically significant ΔgrG° values, and the addition of each of the remaining interaction factors resulted in a significant drop in the overall SEMLR for the group contribution method. Overall, the inclusion of the five remaining interaction factors for double bond conjugation reduced the SEMLR for the fitting from 2.04 to 1.90 kcal/mol.

Estimating ΔG′° of known biochemical reactions

The group contribution method of Mavrovouniotis and the final implementation of the new group contribution method were both applied to calculating ΔfG° of the compounds and ΔrG° of the reactions in four databases of biochemical reactions: the iJR904 genome-scale model of E. coli, the iAF1260 genome-scale model of E. coli, the UM-BBD, and the KEGG (Table 4). The molecular structures of some of the metabolites contained in these databases involve pseudoatoms such as R, X, or *, and the molecular structures of some other metabolites are unknown. These metabolites were considered ineligible for ΔfG° estimation because the complete structure of a compound must be known for ΔfG° to be calculated. Similarly, some of the reactions contained in these databases are not mass or charge balanced or involve compounds with unknown molecular structures. Such reactions were also considered ineligible for ΔrG° estimation because ΔrG° can be calculated only for complete and balanced reactions. Once the ineligible reactions and compounds were removed from consideration, the failure to calculate ΔfG° or ΔrG° for any remaining compound or reaction was entirely due to the presence of molecular substructures for which the ΔgrG° value was unknown.

TABLE 4
Coverage of major biochemical databases

Both the group contribution method of Mavrovouniotis and the new group contribution method are capable of estimating ΔfG° and ΔrG° for nearly all the compounds and reactions involved in the iJR904 and iAF1260 models. The coverage of the new group contribution method is only slightly better for these genome-scale models. However, the new group contribution method performs significantly better than the Mavrovouniotis method in estimating ΔfG° and ΔrG° of the UM-BBD compounds and reactions. This is primarily due to the addition of ΔgrG° values for the halogen substructures, which are prevalent in biodegradation chemistry. The new group contribution method also performs significantly better in estimating ΔfG° and ΔrG° of the KEGG compounds and reactions. All 20 additional substructures that have been included in the new group contribution method contribute evenly to this improvement in the coverage of the KEGG. All the ΔfG° and ΔrG° values estimated for the compounds and reactions in the KEGG using the new group contribution method have been provided in Data S2. Note that in all four databases, the coverage of the reactions by the group contribution method is better than the coverage of the compounds. This is because structural groups with unknown ΔgrG° values cancel out of many reactions, as they are not created or destroyed in most reactions. Overall, the new group contribution method is demonstrated to be capable of estimating ΔfG° and ΔrG° for a wide range of biochemical compounds and reactions.
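
The cancellation argument can be made concrete with a small sketch. If the reaction energy is computed from the net change in group counts, a group with no known energy only blocks the estimate when its net change is nonzero. The compounds, groups, and energies below are hypothetical, and the function names are illustrative.

```python
from collections import Counter


def net_group_change(stoichiometry, compound_groups):
    """Net change in each group's count across a reaction (products minus reactants)."""
    net = Counter()
    for compound, nu in stoichiometry.items():
        for group, n in compound_groups[compound].items():
            net[group] += nu * n
    # Groups that appear unchanged on both sides drop out here.
    return {g: n for g, n in net.items() if n != 0}


def reaction_energy_from_net_groups(stoichiometry, compound_groups, group_energies):
    """Return the estimated reaction energy, or None if a changed group is unknown."""
    net = net_group_change(stoichiometry, compound_groups)
    if any(g not in group_energies for g in net):
        return None
    return sum(n * group_energies[g] for g, n in net.items())


# "X_unknown" has no energy value, so no formation energy can be computed for S or P.
compound_groups = {
    "S": {"X_unknown": 1, "A": 1},
    "P": {"X_unknown": 1, "B": 1},
    "Q": {"B": 1},
}
group_energies = {"A": -2.0, "B": -6.5}  # hypothetical values, kcal/mol

# X_unknown is conserved in S -> P, so the reaction energy is still available.
print(reaction_energy_from_net_groups({"S": -1, "P": 1}, compound_groups, group_energies))
# X_unknown is destroyed in S -> Q, so no estimate can be made.
print(reaction_energy_from_net_groups({"S": -1, "Q": 1}, compound_groups, group_energies))
```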

Prevalent substructures with unknown ΔgrG° values

Clearly, molecular substructures still exist in these databases for which the ΔgrG° value is unknown. Many of these structures are present in organic-inorganic complexes involving iron, nickel, or cobalt for which the new group contribution method has not been designed. However, a small number of prevalent organic substructures with unknown ΔgrG° values appear in many of the metabolites and reactions for which ΔfG° and ΔrG° could not be calculated (Table 5). These substructures represent important targets for future experiments involving the measurement of the thermodynamic properties of biochemical reactions. As experimental data for reactions involving these substructures emerge, new structural groups and interaction factors can be developed and added to this group contribution method. When determining ΔgrG° values for these additions to the model, it is recommended that the new ΔG° data be appended to the training set used to fit the entire group contribution model and that all ΔgrG° values be refit in the model rather than solely the values for the new groups. This will result in better accuracy and reveal the effect of the addition of the new data and groups on the method. To facilitate this kind of expansion and improvement of this group contribution method, details of the training set used in this method have been provided in Data S2. Molfiles created for the molecular structures of every compound involved in the training set in the correct ionic form at pH 7 are also available in Data S3.

TABLE 5
Prevalent molecular substructures with unknown ΔgrG° values

DISCUSSION

The group contribution method introduced here has numerous advantages over previous methods including i), the ability to calculate ΔG° for a greater variety of compounds and reactions; ii), improved accuracy in the ΔG° values produced using the method; iii), improved estimation for the uncertainty in the ΔG° values produced; and iv), complete disclosure of the training set used to fit the ΔgrG° values to facilitate the expansion of the method with additional data, interaction factors, and structural groups. The application of this group contribution method toward the estimation of ΔfG° and ΔrG° for the compounds and reactions in the iJR904 model, the iAF1260 model, the UM-BBD, and the KEGG confirms the ability of the method to predict ΔfG° and ΔrG° for a significant portion of the known biochemistry. The ΔfG° and ΔrG° estimations generated for the KEGG and provided in Data S2 represent the most complete and most accurate set of thermodynamic data compiled for the KEGG to date, to our knowledge, and the addition of halogens to the methodology allows the application of this method to new types of chemistry beyond genome-scale metabolic models in areas such as bioremediation (24).

All the changes introduced in this new group contribution method have not only expanded the applicability of the method to calculate ΔG° for a wider range of compounds and reactions but also improved the accuracy of the method. For the compounds and reactions in the training set for which ΔG° can be calculated using the Mavrovouniotis group contribution method, the standard deviation of the residuals is 3.92 kcal/mol, compared to a standard deviation of 1.98 kcal/mol when the new group contribution method is used to calculate ΔG° for the same reactions and compounds.

The quantification of the uncertainty in each ΔgrG° value in the method allows improved resolution in uncertainty estimates for all ΔG° estimates produced using the method. This enhanced resolution in the uncertainty in the ΔG° estimates is essential to any genome-scale analysis of metabolic pathways and metabolomic studies involving thermodynamics such as thermodynamics-based metabolic flux analysis (2). As a result of the cross-validation analysis, it is recommended that the uncertainty used for ΔG° values calculated with this method be four times the estimate-specific uncertainty calculated from the SEgr values using Eq. 10, as this uncertainty provides an 88% confidence interval for the estimated ΔG°.

A web interface has been developed to allow the automated estimation of the ΔfG° values for a set of compounds based on the molecular structures of the compounds using the new group contribution method. This interface is available free at the following web address: http://sparta.chem-eng.northwestern.edu/cgi-bin/GCM/WebGCM.cgi.

SUPPLEMENTARY MATERIAL

To view all of the supplemental files associated with this article, visit www.biophysj.org.

Acknowledgments

Many thanks to Stacey Pace for her assistance with the UM-BBD database analysis.

This work was supported by the U.S. Department of Energy Genomes to Life program, the DuPont Young Professor's grant, and a National Science Foundation Integrative Graduate Education and Research Traineeship Complex Systems fellowship.

Notes

Matthew D. Jankowski and Christopher S. Henry contributed equally to this work.

Editor: Costas D. Maranas.

References

1. Henry, C. S., M. D. Jankowski, L. J. Broadbelt, and V. Hatzimanikatis. 2006. Genome-scale thermodynamic analysis of Escherichia coli metabolism. Biophys. J. 90:1453–1461.
2. Henry, C. S., L. J. Broadbelt, and V. Hatzimanikatis. 2007. Thermodynamics-based metabolic flux analysis. Biophys. J. 92:1792–1805.
3. Price, N. D., J. A. Papin, C. H. Schilling, and B. O. Palsson. 2003. Genome-scale microbial in silico models: the constraints-based approach. Trends Biotechnol. 21:162–169.
4. Feist, A. M., C. S. Henry, J. L. Reed, M. Krummenacker, A. R. Joyce, P. D. Karp, L. J. Broadbelt, V. Hatzimanikatis, and B. Ø. Palsson. 2007. A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1261 ORFs and thermodynamic information. Mol. Syst. Biol. 3:1–18.
5. Kummel, A., S. Panke, and M. Heinemann. 2006. Putative regulatory sites unraveled by network-embedded thermodynamic analysis of metabolome data. Mol. Syst. Biol. 2:1–10.
6. Kummel, A., S. Panke, and M. Heinemann. 2006. Systematic assignment of thermodynamic constraints in metabolic network models. BMC Bioinformatics. 7:1–12.
7. Reference deleted in proof.
8. Beard, D. A., and H. Qian. 2005. Thermodynamic-based computational profiling of cellular regulatory control in hepatocyte metabolism. Am. J. Physiol. Endocrinol. Metab. 288:E633–E644.
9. Mavrovouniotis, M. L. 1991. Estimation of standard Gibbs energy changes of biotransformations. J. Biol. Chem. 266:14440–14445.
10. Mavrovouniotis, M. L. 1990. Group contributions for estimating standard Gibbs energies of formation of biochemical-compounds in aqueous-solution. Biotechnol. Bioeng. 36:1070–1082.
11. Benson, S. W. 1968. Thermochemical Kinetics; Methods for the Estimation of Thermochemical Data and Rate Parameters. Wiley, New York.
12. Maskow, T., and U. V. Stockar. 2005. How reliable are thermodynamic feasibility statements in biochemical pathways? Biotechnol. Bioeng. 92:223–230.
13. Scholten, J. C. M., J. C. Murrell, and D. P. Kelly. 2003. Growth of sulfate-reducing bacteria and methanogenic archaea with methylated sulfur compounds: a commentary on the thermodynamic aspects. Arch. Microbiol. 179:135–144.
14. VanBriesen, J. M. 2002. Evaluation of methods to predict bacterial yield using thermodynamics. Biodegradation. 13:171–190.
15. Weber, A. L. 2002. Chemical constraints governing the origin of metabolism: the thermodynamic landscape of carbon group transformations under mild aqueous conditions. Orig. Life Evol. Biosph. 32:333–357.
16. Magnus, J. B., D. Hollwedel, M. Oldiges, and R. Takors. 2006. Monitoring and modeling of the reaction dynamics in the valine/leucine synthesis pathway in Corynebacterium glutamicum. Biotechnol. Prog. 22:1071–1083.
17. Hatzimanikatis, V., C. Li, J. A. Ionita, C. S. Henry, M. D. Jankowski, and L. J. Broadbelt. 2004. Exploring the diversity of complex metabolic networks. Bioinformatics. 21:1603–1609.
18. Li, C., J. A. Ionita, C. S. Henry, M. D. Jankowski, V. Hatzimanikatis, and L. J. Broadbelt. 2004. Computational discovery of biochemical routes to specialty chemicals. Chem. Eng. Sci. 59:5051–5060.
19. Tanaka, M., Y. Okuno, T. Yamada, S. Goto, S. Uemura, and M. Kanehisa. 2003. Extraction of a thermodynamic property for biochemical reactions in the metabolic pathway. Genome Inform. 14:370–371.
20. Caspi, R., H. Foerster, C. A. Fulcher, R. Hopkinson, J. Ingraham, P. Kaipa, M. Krummenacker, S. Paley, J. Pick, S. Y. Rhee, C. Tissier, P. F. Zhang, and P. D. Karp. 2006. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 34:D511–D516.
21. Schomburg, I., A. J. Chang, O. Hofmann, C. Ebeling, F. Ehrentreich, and D. Schomburg. 2002. BRENDA: a resource for enzyme data and metabolic information. Trends Biochem. Sci. 27:54–56.
22. Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27:29–34.
23. Kanehisa, M., S. Goto, S. Kawashima, and A. Nakaya. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42–46.
24. Ellis, L. B. M., D. Roe, and L. P. Wackett. 2006. The University of Minnesota Biocatalysis/Biodegradation Database: the first decade. Nucleic Acids Res. 34:D517–D521.
25. Alberty, R. A. 1998. Calculation of standard transformed formation properties of biochemical reactants and standard apparent reduction potentials of half reactions. Arch. Biochem. Biophys. 358:25–39.
26. Alberty, R. A. 2003. Thermodynamics of Biochemical Reactions. Massachusetts Institute of Technology Press, Cambridge, MA.
27. Goldberg, R. N., Y. B. Tewari, and T. N. Bhat. 2004. Thermodynamics of enzyme-catalyzed reactions—a database for quantitative biochemistry. Bioinformatics. 20:2874–2877.
28. Thauer, R. K., K. Jungermann, and K. Decker. 1977. Energy conservation in chemotrophic anaerobic bacteria. Bacteriol. Rev. 41:100–180.
29. Thauer, R. K. 1998. Biochemistry of methanogenesis: a tribute to Marjory Stephenson. Microbiology. 144:2377–2406.
30. Dolfing, J., and B. K. Harrison. 1992. Gibbs free-energy of formation of halogenated aromatic-compounds and their potential role as electron-acceptors in anaerobic environments. Environ. Sci. Technol. 26:2213–2218.
31. Dolfing, J., and D. B. Janssen. 1994. Estimates of Gibbs free energies of formation of chlorinated aliphatic compounds. Biodegradation. 5:21–28.
32. Forsythe, R. G., P. D. Karp, and M. L. Mavrovouniotis. 1997. Estimation of equilibrium constants using automated group contribution methods. Comput. Appl. Biosci. 13:537–543.
33. Wade, L. G. 2003. Organic Chemistry. Prentice Hall/Pearson Education, Upper Saddle River, NJ.
34. Neter, J., W. Wasserman, and M. H. Kutner. 1990. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. Irwin, Homewood, IL.
35. Wagman, D. D. 1982. The NBS Tables of Chemical Thermodynamic Properties: Selected Values for Inorganic and C1 and C2 Organic Substances in SI Units. American Chemical Society and the American Institute of Physics for the National Bureau of Standards, Washington, DC.
36. Alberty, R. A. 1998. Calculation of standard transformed Gibbs energies and standard transformed enthalpies of biochemical reactants. Arch. Biochem. Biophys. 353:116–130.
37. Soovali, L., E. I. Room, A. Kutt, I. Kaljurand, and I. Leito. 2006. Uncertainty sources in UV-Vis spectrophotometric measurement. Accredit. Qual. Assur. 11:246–255.
38. Reed, J. L., T. D. Vo, C. H. Schilling, and B. O. Palsson. 2003. An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biol. 4:1–12.

Articles from Biophysical Journal are provided here courtesy of The Biophysical Society
