
# Group Contribution Method for Thermodynamic Analysis of Complex Metabolic Networks

^{*}Christopher S. Henry,

^{†}Linda J. Broadbelt,

^{‡}and Vassily Hatzimanikatis

^{§}

^{*}Mayo Clinic, Rochester, Minnesota 55905;

^{†}Mathematics and Computer Science, Argonne National Laboratory, Argonne, Illinois 60439;

^{‡}Department of Chemical and Biological Engineering, McCormick School of Engineering and Applied Sciences, Northwestern University, Evanston, Illinois 60208; and

^{§}Laboratory of Computational Systems Biotechnology, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland

## Abstract

A new, to our knowledge, group contribution method based on the group contribution method of Mavrovouniotis is introduced for estimating the standard Gibbs free energy of formation (Δ_{f}*G*′^{°}) and reaction (Δ_{r}*G*′^{°}) in biochemical systems. Gibbs free energy contribution values were estimated for 74 distinct molecular substructures and 11 interaction factors using multiple linear regression against a training set of 645 reactions and 224 compounds. The standard error for the fitted values was 1.90 kcal/mol. Cross-validation analysis was utilized to determine the accuracy of the methodology in estimating Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} for reactions and compounds not included in the training set, and based on the results of the cross-validation, the standard error involved in these estimations is 2.22 kcal/mol. This group contribution method is demonstrated to be capable of estimating Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} for the majority of the biochemical compounds and reactions found in the *i*JR904 and *i*AF1260 genome-scale metabolic models of *Escherichia* *coli* and in the Kyoto Encyclopedia of Genes and Genomes and University of Minnesota Biocatalysis and Biodegradation Database. A web-based implementation of this new group contribution method is available free at http://sparta.chem-eng.northwestern.edu/cgi-bin/GCM/WebGCM.cgi.

## INTRODUCTION

Thermodynamics is increasingly being applied to improve our understanding of the metabolism of microorganisms, especially in the context of constraint-based analysis of genome-scale models of microorganisms (1–4). Constraints based on the laws of thermodynamics have been applied for the determination of feasible ranges for the rates of biochemical reactions and the concentrations of metabolites (2,5). Methods for quantifying the feasible ranges for the Gibbs free energy change of reaction (Δ_{r}*G*′) have been applied to the curation of new metabolic reconstructions (4), the systematic assessment of the degree of reversibility of metabolic reactions (6), and the evaluation of the feasibility of biodegradation reactions (S. D. Finley, L. J. Broadbelt, and V. Hatzimanikatis, unpublished). Numerous methods based on thermodynamic constraints and the laws of thermodynamics have also been applied in the study of the regulatory network of the cell (2,5,8). All these studies require that the standard Gibbs free energy change of reaction (Δ_{r}*G*′^{°}) be known so that the degree of thermodynamic favorability of the reactions in these systems can be quantified.

Thermodynamic analysis of metabolism based entirely on experimentally measured Δ_{r}*G*′^{°} data has been restricted to either small-scale systems (8) or small subsections of genome-scale systems (5,6) due to the limited amount of experimental data currently available. For example, in the latest genome-scale model of *Escherichia coli* (4), experimentally measured Δ_{r}*G*′^{°} data are available for only 169 (8.1%) of the 2077 reactions in the model (4). Due to this scarcity of experimentally measured values of Δ_{r}*G*′^{°}, methods for its estimation are often applied to fill in the gaps in the experimental data. One of the most prevalent techniques for estimating Δ_{r}*G*′^{°} of biochemical reactions is the group contribution method of Mavrovouniotis (9,10). This method allows the rapid calculation of accurate estimations of Δ_{r}*G*′^{°} and the standard Gibbs free energy of formation (Δ_{f}*G*′^{°}) for a wide variety of biological reactions and compounds (1). Unlike the group contribution method of Benson (11), this method is tailored for aqueous organic chemistry taking place at neutral pH involving ionic species.

This group contribution method has been applied to the study of the thermodynamic feasibility of numerous native (12–16) and novel (17,18) metabolic pathways. The method has been utilized to estimate Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for the majority of the compounds and reactions contained in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (19) and in an earlier genome-scale model of *E. coli* (1). The method has also enabled the development of thermodynamic metabolic flux analysis, a framework for the genome-scale thermodynamic analysis of metabolism that accounts for the effect of metabolite activity levels on the thermodynamic feasibility of biochemical reactions embedded in a metabolic network (2).

In all these applications, the group contribution method of Mavrovouniotis has been demonstrated to be capable of rapidly producing accurate estimates of Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for many of the common metabolites in the central metabolic pathways. However, the method could not be used to estimate Δ_{f}*G*′^{°} for molecules involving some sulfur, nitrogen, and halogen substructures commonly found in large, genome-scale metabolic models or in databases of biochemical reactions such as the BioCyc (20), Brenda (21), KEGG (22,23), and University of Minnesota Biocatalysis and Biodegradation Database (UM-BBD) (24). Additionally, Δ_{f}*G*′^{°} estimations calculated using the group contribution method of Mavrovouniotis differ significantly from the literature values for Δ_{f}*G*′^{°} of many phosphorylated compounds (25,26), and Δ_{r}*G*′^{°} estimations differ significantly from experimentally observed Δ_{r}*G*′^{°} values for reactions involving the formation (or destruction) of thioester bonds or the formation (or destruction) of conjugated double bonds. Finally, the method of Mavrovouniotis provides only a limited ability to quantify the uncertainty in the Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} estimates. Although the initial work by Mavrovouniotis provided 68% and 95% confidence intervals of 3 and 5 kcal/mol, respectively, for the overall uncertainty in all estimated Δ_{f}*G*′^{°}, no confidence intervals were provided for the uncertainty in the estimated Δ_{r}*G*′^{°}. Additionally, insufficient data were provided for the quantification of the uncertainty in each specific Δ_{f}*G*′^{°} estimate calculated using the method. These limitations result in imprecise predictions of uncertainty in estimated Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} values.

We introduce here an updated and expanded group contribution method which utilizes a larger and more current training set of Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} data including new tables of thermodynamic data found in the National Institute of Standards and Technology (NIST) database (27) and in the work by Alberty (25,26), Thauer (28,29), and Dolfing (30,31). Due to the availability of additional data, group contribution energies were fit to a number of molecular substructures involving halogens, sulfur, and nitrogen (Table 1) that were not included in Mavrovouniotis's original work. The addition of these new molecular substructures to the group contribution method enables the estimation of Δ_{f}*G*′^{°} for a wider variety of molecules. The method also includes a set of seven new interaction factors to account for the energy contributions of the various types of conjugated double bonds, thioester bonds, and vicinal chlorine atoms (see Methods). Finally, the uncertainty analysis performed allows the uncertainty of each estimated Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} to be determined based on the uncertainty in the constituent group contribution energies.

## METHODS

### Group contribution method

The group contribution method was developed as a means of estimating Δ_{r}*G*′^{°} of a reaction based on the molecular structures of the compounds involved in the reaction (9–11). In group contribution methods, the molecular structure of a single compound is decomposed into a set of smaller molecular substructures based on the hypothesis that Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} can be estimated using a linear model where each model parameter is associated with one of the constituent molecular substructures (or groups) that combine to form the compound. To estimate Δ_{f}*G*′^{°} of the entire compound, the contributions of each of the groups to this property are summed as follows:

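$$\Delta_f G'^{\circ}_{est} = \sum_{i=1}^{N_{gr}} n_i \, \Delta_{gr} G'^{\circ}_i \qquad (1)$$

where Δ_{f}*G*′^{°}_{est} is the estimated Δ_{f}*G*′^{°}, Δ_{gr}*G*′^{°}_{i} is the contribution of group *i* to Δ_{f}*G*′^{°}_{est}, *n*_{i} is the number of instances of group *i* in the molecular structure, and *N*_{gr} is the number of groups for which Δ_{gr}*G*′^{°} is known (i.e., the total number of groups in our database). Similarly, Δ_{r}*G*′^{°} is estimated by summing the contribution of each structural group created or destroyed during the reaction:

$$\Delta_r G'^{\circ}_{est} = \sum_{i=1}^{m} v_i \left( \sum_{j=1}^{N_{gr}} n_{ij} \, \Delta_{gr} G'^{\circ}_j \right) \qquad (2)$$

where Δ_{r}*G*′^{°}_{est} is the estimated Δ_{r}*G*′^{°}, *v*_{i} is the stoichiometric coefficient of species *i* in the reaction, *m* is the number of species involved in the reaction, and *n*_{ij} is the number of instances of group *j* in species *i*. The advantage of estimating Δ_{r}*G*′^{°} using Eq. 2 instead of using the estimated formation energies is that any structural groups unchanged during the reaction cancel out of Eq. 2, and this can include groups for which Δ_{gr}*G*′^{°} is unknown.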
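The two sums above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the group labels (`"A"`, `"B"`, `"C"`, `"X"`), their energies, and the function names are hypothetical placeholders rather than values from Table 1.

```python
from collections import Counter

# Hypothetical group energies in kcal/mol (placeholders, not Table 1 values).
GROUP_ENERGY = {"A": -10.0, "B": 2.5, "C": -4.0}


def formation_energy(group_counts, group_energy=GROUP_ENERGY):
    """Eq. 1: sum n_i times the energy of group i over all groups present."""
    missing = [g for g in group_counts if g not in group_energy]
    if missing:
        raise KeyError(f"no energy known for groups: {missing}")
    return sum(n * group_energy[g] for g, n in group_counts.items())


def reaction_energy(stoichiometry, compound_groups, group_energy=GROUP_ENERGY):
    """Eq. 2: sum the energies of the groups created or destroyed.

    stoichiometry maps species -> coefficient (negative for reactants).
    Groups whose net count is zero cancel, so their energy is never looked up.
    """
    net = Counter()
    for species, v in stoichiometry.items():
        for g, n in compound_groups[species].items():
            net[g] += v * n
    changed = {g: n for g, n in net.items() if n != 0}
    return formation_energy(changed, group_energy)


compound_groups = {
    "S": {"A": 1, "X": 2},  # "X" has no known energy
    "P": {"B": 1, "X": 2},
}
print(reaction_energy({"S": -1, "P": 1}, compound_groups))  # 12.5
```

In this toy case the formation energy of either compound cannot be computed because group `"X"` is unknown, yet the reaction energy can, since `"X"` is unchanged by the reaction and cancels.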

### Determination of the groups involved in a molecular structure

In keeping with the group contribution scheme developed by Mavrovouniotis, this implementation of the group contribution method involves two different kinds of energy contributions: (i) contributions from the structural groups that combine to form the structure of the molecule (Table 1), and (ii) contributions from the interaction factors that account for the effect of the interactions between various structural groups on the Δ_{f}*G*′^{°} of a molecule (Table 2). When calculating Δ_{f}*G*′^{°}_{est} of a compound using the group contribution method, the molecular structure of the compound is first broken down into the set of structural groups that combine to form the compound. Δ_{f}*G*′^{°}_{est} can be calculated for a compound only if every single atom involved in the molecular structure of the compound can be assigned to exactly one structural group for which Δ_{gr}*G*′^{°} is known.

Some of the larger structural groups included in this group contribution method can be further broken down into smaller structural groups. These larger groups, called “characteristic groups”, were included in the method because the properties of these groups are significantly different from the summed properties of their smaller constituent structural groups. For example, the −COO^{−} group could be further broken down into the >C=O group and the −O^{−} group. However, the Δ_{gr}*G*′^{°} of the −COO^{−} group is −83.1 kcal/mol, whereas the sum of the Δ_{gr}*G*′^{°} values for the >C=O group and the −O^{−} group is −61.2 kcal/mol, a difference of 21.9 kcal/mol. The characteristic groups used in this method were originally developed by Mavrovouniotis based on expert knowledge of biochemistry and goodness of fit of the group contribution model to the available experimental data.

Because these characteristic groups exist, multiple structural groups can often be mapped to the same atoms in the molecular structure of a compound. For example, the carbon in a carboxylic acid functional group can be assigned to either the −COO^{−} group or the >C=O group. When these cases arise, the atoms should always be assigned to the structural group with the smallest search priority number, which is provided along with the Δ_{gr}*G*′^{°} values in Table 1 (Fig. 1 *A*). The only exception to this rule concerns the phosphate chains found in molecules such as NAD(H) or ATP. If every phosphate in a phosphate chain is assigned to the structural group with the smallest priority number, then every phosphate that is not a terminal phosphate would be assigned to the mid-chain phosphate group, which includes both of its bridging oxygen atoms. This leads to the assignment of the oxygen bridging two neighboring phosphates to two groups, which violates the requirement that every atom be assigned to exactly one group. To avoid this violation, terminal phosphate chains (like the phosphate chain in ATP) involving *n* phosphorus atoms are always decomposed into one terminal phosphate group and (*n* − 1) −OPO_{2}- groups (Fig. 1 *B*). Similarly, internal phosphate chains (like the phosphate chain in NADH) involving *n* phosphorus atoms are always decomposed into one mid-chain phosphate group and (*n* − 1) −OPO_{2}- groups. An algorithm for automatically breaking down molecular structures into the appropriate structural groups in the group contribution method is discussed in the work by Forsythe, Karp, and Mavrovouniotis (32).

**...**

The molecular structures being decomposed into structural groups must also be in the form of the predominant ion for the molecule under the same conditions at which the fitting of the Δ_{gr}*G*′^{°} values was performed: pH 7, zero ionic strength, and a temperature of 298 K. The predominant ions of all the molecules involved in the training set at pH 7 were determined using pK_{a} estimation software (MarvinBeans pK_{a} estimation plug-in, ver. 4.0.3, ChemAxon, Budapest, Hungary). When a molecule exists in multiple isomeric or resonance forms in equilibrium, such as keto-enol tautomers, the most stable form (the form resulting in the lowest Δ_{f}*G*′^{°}_{est}) is decomposed into structural groups. This ensures that the form of the molecule used to calculate Δ_{f}*G*′^{°}_{est} is the predominant form in solution.

Stereochemistry is ignored when labeling atoms in a molecule according to their structural groups. For example, all forms of sugars with six carbon atoms including glucose, galactose, and mannose, which have Δ_{f}*G*′^{°}_{obs} values of −219, −217, and −217 kcal/mol, respectively, are decomposed into exactly the same structural groups and interaction factors, and as a result, all these sugars have identical Δ_{f}*G*′^{°}_{est} values. This is a reasonable assumption given the similarity of the Δ_{f}*G*′^{°}_{obs} values.

Once every single atom in the molecular structure of a compound has been assigned to the proper structural group, the interaction factors must be determined. The Δ_{gr}*G*′^{°} associated with each interaction factor is then added to Δ_{f}*G*′^{°}_{est} of the compound to account for the effect of the interaction factor on the formation energy. There are seven types of interaction factors used in this implementation of the group contribution method (Fig. 1 *C*). Four of the interaction factors used were originally proposed in the group contribution method of Mavrovouniotis: the hydrocarbon factor, the heteroaromatic ring factor, the three-member ring factor, and the amide factor. The hydrocarbon factor is added to Δ_{f}*G*′^{°}_{est} of any compound that consists of only carbon and hydrogen. The heteroaromatic ring factor is added to Δ_{f}*G*′^{°}_{est} of a compound for every heteroaromatic ring in the compound, as determined according to Hückel's rule. Similarly, the three-member ring factor is added to Δ_{f}*G*′^{°}_{est} of a compound for every three-member ring in the compound regardless of the atoms that make up the ring. The amide factor is added to Δ_{f}*G*′^{°}_{est} of a compound for every instance of a nitrogen atom neighboring a carbonyl group in the compound. Note that if a nitrogen atom is neighboring two carbonyl groups, this is counted as a single amide factor.

Three new types of interaction factors were introduced in this implementation of the group contribution method that were not included in the method of Mavrovouniotis: the thioester factor, the double bond conjugation factors, and the vicinal Cl factor. The thioester factor is added to Δ_{f}*G*′^{°}_{est} of a compound for every instance of a sulfur atom neighboring a carbonyl group in the compound. This factor accounts for the high energy of the thioester bond (33). Like the amide factor, if a sulfur atom is neighboring two carbonyl groups, this is counted as a single thioester factor.
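The counting rule shared by the amide and thioester factors (one factor per nitrogen or sulfur atom, even when that atom neighbors two carbonyl groups) can be sketched as follows. This is an illustrative sketch only: the hydrogen-suppressed atom and bond representation and the function name `count_carbonyl_neighbor_factors` are assumptions made for the example, not part of the published method.

```python
def count_carbonyl_neighbor_factors(atoms, bonds, hetero):
    """Count atoms of element `hetero` bonded to at least one carbonyl carbon.

    atoms: dict atom_id -> element symbol (hydrogens omitted)
    bonds: iterable of (atom_id, atom_id, bond_order)
    Each qualifying heteroatom counts once, even with two carbonyl neighbors.
    """
    neighbors = {a: set() for a in atoms}
    carbonyl_carbons = set()
    for a, b, order in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
        # A carbonyl carbon is a carbon double-bonded to an oxygen.
        if order == 2 and {atoms[a], atoms[b]} == {"C", "O"}:
            carbonyl_carbons.add(a if atoms[a] == "C" else b)
    return sum(
        1
        for atom, element in atoms.items()
        if element == hetero and neighbors[atom] & carbonyl_carbons
    )


# Imide-like fragment C(=O)-N-C(=O): one nitrogen, two carbonyl neighbors.
imide_atoms = {1: "C", 2: "O", 3: "N", 4: "C", 5: "O"}
imide_bonds = [(1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 2)]

# Thioester fragment C(=O)-S-C.
thio_atoms = {1: "C", 2: "O", 3: "S", 4: "C"}
thio_bonds = [(1, 2, 2), (1, 3, 1), (3, 4, 1)]

print(count_carbonyl_neighbor_factors(imide_atoms, imide_bonds, "N"))  # 1
print(count_carbonyl_neighbor_factors(thio_atoms, thio_bonds, "S"))    # 1
```

Counting heteroatoms rather than heteroatom-carbonyl bonds is what makes the imide-like fragment contribute a single amide factor.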

The conjugation of double bonds has a significant stabilizing effect on the molecular structure of a molecule, making the removal of a conjugated double bond more difficult than the removal of an isolated double bond (33). Without any interaction factor for double bond conjugation, the group contribution method has no means of capturing these characteristics of conjugated double bonds. Therefore, the double bond conjugation factors were introduced to account for the stabilizing effect of double bond conjugation on a molecular structure. Ten forms of conjugated double bonds are possible in a molecular structure containing C, N, and O, and a separate double bond conjugation factor was initially introduced for each of these 10 forms (Table 3). Five of these forms were not included in the final implementation of the method due to a lack of data or because the conjugation factor was statistically insignificant (see Table 3 and Results). Note that double bond conjugation factors are not added for conjugated double bonds that are contained completely within an aromatic or heteroaromatic ring.

The vicinal Cl factor was introduced based on the examination of the effect of chlorine substitution on the Δ_{f}*G*′^{°} of aliphatic compounds performed by Dolfing and Janssen (31). Dolfing and Janssen proposed that chlorine atoms attached to neighboring carbon atoms have a destabilizing effect on one another, and an interaction factor is required to account for this destabilization to accurately estimate the Δ_{f}*G*′^{°} of chlorinated compounds using the group contribution method. The vicinal Cl factor is an implementation of the interaction factor proposed by Dolfing and Janssen, and two variations of this interaction factor were explored. The first variation implemented, VCl_{distinct}, is based on the hypothesis that a larger number of chlorine atoms attached to neighboring carbons results in a larger destabilizing effect, described mathematically as follows:

$$VCl_{distinct} = \Delta_{gr} G'^{\circ}_{VCl} \sum_{i=1}^{N_C} \sum_{j=i+1}^{N_C} \delta_{ij} \, N_{Cl,i} \, N_{Cl,j} \qquad (3)$$

where VCl_{distinct} is the total value of the correction for the interaction of vicinal chlorine atoms that is added to Δ_{f}*G*′^{°}_{est}, Δ_{gr}*G*′^{°}_{VCl} is the group contribution energy for the vicinal Cl interaction factor, *N*_{C} is the number of carbon atoms in the molecule, *N*_{Cl,i} is the number of chlorine atoms attached to carbon atom *i*, and *δ*_{ij} is the Kronecker δ, a binary variable equaling zero unless carbon atom *i* is bonded to carbon atom *j*.

The second variation of the vicinal Cl interaction factor, VCl_{binary}, is based on the hypothesis that the destabilizing effect of vicinal chlorine atoms is independent of the number of chlorine atoms attached to each of the neighboring carbons (Fig. 1 *C*), described mathematically as follows:

$$VCl_{binary} = \Delta_{gr} G'^{\circ}_{VCl} \sum_{i=1}^{N_C} \sum_{j=i+1}^{N_C} \delta_{ij} \, \min\left(1, \, N_{Cl,i} \, N_{Cl,j}\right) \qquad (4)$$

Both variations of the vicinal Cl interaction factor were tested, and the VCl_{distinct} interaction was selected for the final implementation of the method because it resulted in the best possible fit of the thermodynamic data included in the training set (see Results).
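The difference between the two variations is easiest to see on small chlorinated ethanes. The sketch below assumes one particular reading of the two hypotheses: the distinct variant counts every Cl-Cl pair across a C-C bond (the product of the two chlorine counts), and the binary variant counts each such bond at most once. The function name and the input format are hypothetical, and the returned numbers are multipliers of the vicinal Cl energy, not energies.

```python
def vicinal_cl_counts(cl_per_carbon, cc_bonds):
    """Return (distinct, binary) vicinal Cl multipliers for one molecule.

    cl_per_carbon: dict carbon_id -> number of attached Cl atoms
    cc_bonds: iterable of (carbon_id, carbon_id), one entry per C-C bond
    """
    distinct = 0
    binary = 0
    for i, j in cc_bonds:
        pairs = cl_per_carbon.get(i, 0) * cl_per_carbon.get(j, 0)
        distinct += pairs                # every Cl-Cl pair across the bond
        binary += 1 if pairs > 0 else 0  # the bond counts at most once
    return distinct, binary


ethane_bond = [(1, 2)]
print(vicinal_cl_counts({1: 2, 2: 1}, ethane_bond))  # 1,1,2-trichloroethane: (2, 1)
print(vicinal_cl_counts({1: 3, 2: 3}, ethane_bond))  # hexachloroethane: (9, 1)
print(vicinal_cl_counts({1: 3, 2: 0}, ethane_bond))  # 1,1,1-trichloroethane: (0, 0)
```

Under this reading the two variants agree for molecules with at most one chlorine per carbon and diverge as substitution increases, which is where the training data can discriminate between them.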

### Multiple linear regression

The multiple linear regression (MLR) method (least squares) was used to determine the Δ_{gr}*G*′^{°} values for the set of structural groups and interaction factors that allow the best fit of the observed Δ_{r}*G*′^{°} (Δ_{r}*G*′^{°}_{obs}) and observed Δ_{f}*G*′^{°} (Δ_{f}*G*′^{°}_{obs}) values included in a training set. The Δ_{gr}*G*′^{°} values are calculated using the following:

$$\Delta_{gr}\mathbf{G}'^{\circ} = \left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' \, \Delta\mathbf{G}'^{\circ}_{obs} \qquad (5)$$

where Δ_{gr}**G**′^{°} is an *N*_{gr} × 1 vector of the energies associated with each group in the group contribution method, **X** is an *N*_{obs} × *N*_{gr} matrix of the number of each group contained in each molecular structure or created or destroyed in each reaction in the training set, **X***′* is the transpose of matrix **X**, *N*_{obs} is the number of Δ_{r}*G*′^{°}_{obs} and Δ_{f}*G*′^{°}_{obs} values included in the training set, and Δ**G**′^{°}_{obs} is an *N*_{obs} × 1 vector of the Δ_{r}*G*′^{°}_{obs} and Δ_{f}*G*′^{°}_{obs} values included in the training set (34).
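As a minimal sketch of this fitting step, the least-squares solution can be computed from the normal equations (**X′X**)**b** = **X′y**. The pure-Python solver below is for illustration on a toy problem; the function name and the data are hypothetical, and for a real training set a library routine such as `numpy.linalg.lstsq` would normally be preferable for numerical stability.

```python
def fit_group_energies(X, y):
    """Least-squares solution of X b = y via the normal equations."""
    n_obs, n_gr = len(X), len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n_obs)) for j in range(n_gr)]
        + [sum(X[k][i] * y[k] for k in range(n_obs))]
        for i in range(n_gr)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_gr):
        pivot = max(range(col, n_gr), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: a group energy is not identifiable")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n_gr):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [row[-1] for row in A]


# Toy training set with two groups: two formation rows (group counts) and
# one reaction row (group 1 destroyed, group 2 created).
X = [[1, 0], [1, 1], [-1, 1]]
y = [-10.0, -7.5, 12.5]
print(fit_group_energies(X, y))
```

Formation and reaction observations enter the same matrix: a formation row holds group counts, while a reaction row holds the net number of each group created or destroyed.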

MLR is the ideal technique for producing Δ_{gr}*G*′^{°} values that optimally fit the training set only if the data included in the training set satisfy the following two conditions: (MLR.I) Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} must be linearly related to the model parameters (the Δ_{gr}*G*′^{°} values) and the differences between the Δ*G*′^{°}_{est} and Δ*G*′^{°}_{obs} for each data point in the training set must be uncorrelated, and (MLR.II) the absolute uncertainty in each of the Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} observations included in Δ**G**′^{°}_{obs} must be similar in magnitude (34). The random distribution of the residuals of the fit indicates that condition MLR.I is satisfied (see Fig. 2 *B*). The discussion of the uncertainty in the training set data explains why condition MLR.II is also satisfied by the data (see the section “Uncertainty in training set data”).

### Quantification of the goodness of fit

The goodness of the MLR fit is quantified using the standard deviation of the differences between Δ*G*′^{°}_{est} and Δ*G*′^{°}_{obs} for the compounds and reactions involved in the training set, *SE*_{MLR} (34):

$$SE_{MLR} = \sqrt{\frac{\mathbf{R}'\mathbf{R}}{N_{obs} - N_{gr}}} \qquad (6)$$

where **R′** is the transpose of the vector **R**, and **R** is the vector of residuals for the fit, calculated as follows (34):

$$\mathbf{R} = \Delta\mathbf{G}'^{\circ}_{obs} - \mathbf{X} \, \Delta_{gr}\mathbf{G}'^{\circ} \qquad (7)$$

If the differences between Δ*G*′^{°}_{est} and Δ*G*′^{°}_{obs} in the training set follow a normal distribution, then 68% of the residuals will be less than or equal to *SE*_{MLR} in magnitude (34). The *SE*_{MLR} is also used to assess the effect of removal or addition of interaction factors on the group contribution scheme (see the section “Whole model and individual parameter validation”).
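The goodness-of-fit calculation can be sketched as below. The sketch assumes the usual regression convention of dividing the sum of squared residuals by the degrees of freedom (observations minus fitted parameters) and of defining residuals as observed minus estimated values; the function name and the toy data are hypothetical.

```python
import math


def standard_error_of_fit(X, y, b):
    """Return (residuals, SE), with SE = sqrt(R'R / (N_obs - N_gr))."""
    # Residual = observed value minus the value estimated from the groups.
    residuals = [
        y_k - sum(x * b_j for x, b_j in zip(row, b)) for row, y_k in zip(X, y)
    ]
    dof = len(X) - len(b)
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    return residuals, math.sqrt(sum(r * r for r in residuals) / dof)


# One group observed three times; the fitted energy is the mean, 2.0.
residuals, se = standard_error_of_fit([[1], [1], [1]], [1.0, 2.0, 3.0], [2.0])
print(residuals, se)  # [-1.0, 0.0, 1.0] 1.0
```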

### Formation of the training set for the MLR

The Δ*G*′^{°}_{obs} values used in the training set for the MLR comprised a total of 3153 Δ_{r}*G*′^{°}_{obs} values and 288 Δ_{f}*G*′^{°}_{obs} values. The Δ_{r}*G*′^{°}_{obs} and Δ_{f}*G*′^{°}_{obs} values in the training set were drawn from a variety of literature sources including work on methanogenesis by Thauer (28,29), work on halogen thermodynamics by Dolfing and co-workers (30,31), work on formation energy standardization and redox potentials by Alberty (25,26), and thermodynamic data compiled in the NIST (27) and National Bureau of Standards (NBS) (35) databases. The experimentally measured values reported in these references were captured under a variety of temperature and pH conditions. Only data captured within one pH unit and 15 K of the chosen reference state of pH 7 and 298 K were utilized. Most of the data utilized were collected within 1 K of 298 K and 0.1 pH units of pH 7 (Fig. 3). Overall, 645 distinct biochemical reactions are represented in the 3153 Δ_{r}*G*′^{°}_{obs} values used in the training set, meaning that multiple data points were included for many reactions. Similarly, 224 distinct molecular structures are represented by the 288 Δ_{f}*G*′^{°}_{obs} values used in the training set. When multiple data points existed for single reactions or compounds, we used all data points in the data set rather than averaging the data and including the average. By using all the data points instead of the average, the variability in the data is included in the residuals, covariance matrix, and standard deviation for the fit, which results in a better quantification of the uncertainty in the group free energy values. All Δ_{r}*G*′^{°}_{obs} and Δ_{f}*G*′^{°}_{obs} values included in the training set are listed in Supplementary Material, Data S2 along with the associated reactions and compounds.

### Uncertainty in training set data

To estimate the total uncertainty in each Δ_{r}*G*′^{°}_{obs} and Δ_{f}*G*′^{°}_{obs} data point included in the training set, the sources of uncertainty were enumerated and quantified. The total uncertainty in the Δ_{f}*G*′^{°}_{obs} values included in the training set was estimated from the precision of the reported values; the reported precision in the Δ_{f}*G*′^{°}_{obs} values ranges from 0.01 to 1 kcal/mol, implying that the absolute uncertainty in the values ranges from 0.005 to 0.5 kcal/mol (26,28,35,36). The total uncertainties in the Δ_{r}*G*′^{°}_{obs} values included in the training set were calculated from four primary sources: (UC.I) uncertainty in the method used to measure the equilibrium constant, (UC.II) uncertainty due to differences between the ionic strength at which each Δ_{r}*G*′^{°}_{obs} was measured and the reference ionic strength of zero, (UC.III) uncertainty due to differences between the pH at which each Δ_{r}*G*′^{°}_{obs} was measured and the reference pH of 7, and (UC.IV) uncertainty due to differences between the temperature at which each Δ_{r}*G*′^{°}_{obs} was measured and the reference temperature of 298 K.

Most of the Δ_{r}*G*′^{°}_{obs} values included in the training set were measured using spectroscopy, which has a typical precision of 1%–3% of the measured values when used to determine equilibrium constants (37). This translates into an absolute uncertainty of <0.30 kcal/mol for 95% of the Δ_{r}*G*′^{°}_{obs} values included in the training set. Uncertainty due to deviations of the conditions for the measurements from the reference ionic strength of zero was determined using the extended Debye-Hückel equation as described in Maskow and von Stockar (12). For 95% of the reactions in the training set, this uncertainty was <1.45 kcal/mol when the deviation in ionic strength was <0.2 M. The absolute uncertainty due to deviations in the conditions for the measurements from the reference pH of 7 was <1.49 kcal/mol for 95% of the training set reactions within the allowed pH ranges (pH 6–8), as calculated using the methods described by Alberty (26). Uncertainties due to pH and ionic strength deviations are independent of the reference value.

As measurements were accepted into the training set if measured within 15 K of the reference temperature of 298 K, deviations of the measurement conditions from the reference temperature were another source of uncertainty in the Δ_{r}*G*′^{°}_{obs} values. A rearranged version of the Gibbs-Helmholtz relationship was utilized to determine how temperature changes affect Δ_{r}*G*′^{°} of a reaction:

$$\frac{d\,\Delta_r G'^{\circ}}{\Delta_r G'^{\circ}} = \left(1 - \frac{\Delta_r H'^{\circ}}{\Delta_r G'^{\circ}}\right) \frac{dT}{T} \qquad (8)$$

where Δ_{r}*H*′^{°} is the standard enthalpy change of reaction and *T* is the absolute temperature. Although measured Δ_{r}*H*′^{°} values are unavailable for most of the reactions contained in the training set, the (1 − Δ_{r}*H*′^{°}/Δ_{r}*G*′^{°}) term in the Gibbs-Helmholtz relationship will typically have a maximum value of one for biochemical reactions. Based on this assumption, a 15 K maximum temperature change results in a maximum change of 5.7% in Δ_{r}*G*′^{°}. This translates into an absolute error of <0.57 kcal/mol for 95% of the Δ_{r}*G*′^{°}_{obs} values in the training set. Overall, for 95% of the Δ*G*′^{°}_{obs} values included in the training set, the total absolute uncertainty is >0.1 kcal/mol and <2.2 kcal/mol, which satisfies the MLR.II condition that the uncertainty of all values in the training set be similar in magnitude.

#### Quantification of the uncertainty in the *Δ*_{gr}*G*′^{°} values

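The uncertainties in the Δ_{gr}*G*′^{°} values estimated using MLR were quantified using the covariance matrix of the MLR, which allowed the calculation of a standard error for each Δ_{gr}*G*′^{°} in the group contribution method as follows (34):

$$SE_{gr,i} = SE_{MLR} \sqrt{\left[\left(\mathbf{X}'\mathbf{X}\right)^{-1}\right]_{ii}} \qquad (9)$$

where *SE*_{gr,i} is the standard error for the group contribution value of group *i*, and [(**X′X**)^{−1}]_{ii} is the *i*th diagonal element of (**X′X**)^{−1}. The *SE*_{gr,i} values can be used to quantify the uncertainty in the estimated Gibbs free energy of formation and reaction, calculated by taking the Euclidean norm of the uncertainties in each group value multiplied by the number of instances of each group involved in the molecular structure or reaction (34):

$$SE_{\Delta G'^{\circ}_{est}} = \sqrt{\sum_{i=1}^{N_{gr}} \left(n_i \, SE_{gr,i}\right)^2} \qquad (10)$$

where *n*_{i} is the number of instances of group *i* in the molecular structure, or the net number of instances of group *i* created or destroyed in the reaction.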
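The propagation step described above, the Euclidean norm of each group's standard error multiplied by its count, can be sketched as follows. The group labels, standard errors, and function name are hypothetical placeholders. Note that this form uses only the per-group standard errors and so neglects any covariance between fitted group values.

```python
import math


def estimate_uncertainty(group_counts, group_se):
    """Euclidean norm of (count * standard error) over the groups involved.

    group_counts: dict group -> number of instances in a compound, or the
    net number created (+) or destroyed (-) in a reaction.
    group_se: dict group -> standard error of that group's energy (kcal/mol).
    """
    return math.sqrt(sum((n * group_se[g]) ** 2 for g, n in group_counts.items()))


# Hypothetical reaction: three "A" groups created, four "B" groups destroyed.
print(estimate_uncertainty({"A": 3, "B": -4}, {"A": 1.0, "B": 1.0}))  # 5.0
```

Because unchanged groups have a net count of zero in a reaction, they contribute nothing to the reaction uncertainty, mirroring their cancellation from the reaction energy itself.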

### Whole model and individual parameter validation

An F-test was performed to validate the use of the linear group contribution model to estimate Δ_{r}*G*′^{°} and Δ_{f}*G*′^{°} for the data included in the training set. The F-test indicates whether or not the variability in the Δ*G*′^{°}_{obs} values within the training set that is captured by the group contribution model is statistically significant compared to the variability not captured by the model (the variances between Δ*G*′^{°}_{est} and Δ*G*′^{°}_{obs}) (34). If the location of the F-value in the F-cumulative distribution function corresponds to a probability value >90%, the linear model is accepted.

A *t*-test was also used to validate the inclusion of each interaction factor in the group contribution model. The *t*-test indicates whether the value of Δ_{gr}*G*′^{°} for an interaction factor is statistically significant compared to the uncertainty in the value, *SE*_{gr,i} (34). The interaction factor was retained as a part of the model if the location of its *t* value in the Student's *t* cumulative distribution function corresponds to a probability value of <5% (34). Although *t*-tests were performed on the structural groups as well, structural groups with high *t*-test probabilities were not removed from the model because they were required for the complete decomposition of the molecular structures involved in the training set. For example, although the >C= group that participates in two fused aromatic rings has a Δ_{gr}*G*′^{°} of −0.0245 kcal/mol and an *SE*_{gr,i} of 0.927 kcal/mol, resulting in a *t*-test probability of 0.98, it is retained because it is required for the complete decomposition of molecules involving fused aromatic rings.
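The fused-ring example can be checked with a short calculation. The sketch below is an approximation, not the procedure of Data S1: it replaces the Student's *t* distribution with a standard normal, which is reasonable only because the fit has thousands of degrees of freedom (3441 observations and 85 parameters). The function name is hypothetical.

```python
import math


def two_sided_p_value(value, standard_error):
    """Return (t, p): the t statistic and a two-sided p-value.

    Uses the normal approximation to Student's t, which assumes a large
    number of degrees of freedom.
    """
    t = value / standard_error
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


# Group energy and standard error quoted in the text for the fused-ring >C= group.
t, p = two_sided_p_value(-0.0245, 0.927)
print(round(t, 4), round(p, 2))  # -0.0264 0.98
```

The group energy is only about 0.03 standard errors away from zero, so the probability that a value this small arises by chance is about 0.98, matching the figure quoted for this group.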

Interaction factors with *t*-test probabilities of over 5% (indicating insignificantly small Δ_{gr}*G*′^{°} values) were eliminated from the final implementation of the group contribution method because interaction factors are not required for the complete decomposition of a molecular structure and removal of an interaction factor with a Δ_{gr}*G*′^{°} of zero results in little or no increase in *SE*_{MLR} of the fitting. However, passing the *t*-test does not guarantee that the addition of an interaction factor results in any significant reduction in *SE*_{MLR} for the fitting. Therefore, in addition to performing a *t*-test for each interaction factor, the *SE*_{MLR} with and without the interaction factor was also calculated as a measure of the effect of the interaction factor on the goodness of fit. Details of how the *t*-tests and F-test were calculated are provided in Data S1.

### Cross-validation analysis

A cross-validation analysis of the training set used for the fitting was performed to validate the ability of the group contribution method to produce Δ_{r}*G*′^{°}_{est} and Δ_{f}*G*′^{°}_{est} estimates for compounds and reactions outside the training set with the same degree of accuracy as the Δ_{r}*G*′^{°}_{est} and Δ_{f}*G*′^{°}_{est} estimates for compounds and reactions within the training set. Two hundred distinct cross-validation runs were performed. In each run, 10% of the 869 distinct reactions and compounds involved in the training set were selected at random, and all the Δ*G*′^{°}_{obs} values associated with each of the selected compounds and reactions were removed from the training set. When a compound was removed from the training set, the Δ_{f}*G*′^{°}_{obs} values associated with the stereoisomeric forms of the compound were also removed from the data set. However, reactions involving the removed compounds were left in the training set unless they were also randomly selected for removal. MLR was then performed on the data remaining in the training set to produce a new set of Δ_{gr}*G*′^{°} values. The *SE*_{MLR}, Δ*G*′^{°}_{est}, and **R** were all calculated for the data included and excluded from the reduced training set using the new set of Δ_{gr}*G*′^{°} values.
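The key detail of this procedure is that whole reactions and compounds are held out, not individual data points, so every measurement of a selected item leaves the training set together. A minimal sketch of that split is shown below; the function name and data layout are hypothetical, and the additional removal of stereoisomeric forms described above is not handled here.

```python
import random


def cross_validation_split(data_points, fraction=0.10, rng=None):
    """Hold out all data points of a random fraction of distinct items.

    data_points: list of (item_id, value) pairs; an item_id names one distinct
    reaction or compound and may appear many times.
    Returns (training_points, held_out_points).
    """
    rng = rng or random.Random()
    item_ids = sorted({item_id for item_id, _ in data_points})
    n_held_out = max(1, round(fraction * len(item_ids)))
    held_out_ids = set(rng.sample(item_ids, n_held_out))
    training = [p for p in data_points if p[0] not in held_out_ids]
    held_out = [p for p in data_points if p[0] in held_out_ids]
    return training, held_out


# Toy data: 20 distinct reactions with three measurements each.
points = [(f"rxn{i}", float(i)) for i in range(20) for _ in range(3)]
train, held = cross_validation_split(points, 0.10, random.Random(0))
print(len(train), len(held))  # 54 6
```

Each run would then refit the group energies on `train` and compare the standard error of the residuals on `train` with that on `held`; repeating this with fresh random selections gives the distribution of out-of-sample errors.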

## RESULTS

### Development of the improved group contribution method

The new, to our knowledge, group contribution method introduced here consists of 74 molecular substructures (called structural groups) and 11 factors to account for interactions between molecular substructures (called interaction factors) for which group contribution energies (Δ_{gr}*G*′^{°}) are provided (Tables 1 and 2). The Δ_{gr}*G*′^{°} values provided were determined based on an MLR fitting of a training set consisting of 224 compounds with 288 known Δ_{f}*G*′^{°} values and 645 reactions with 3153 known Δ_{r}*G*′^{°} values. The standard error for the fit of the group contribution model to this training set was 1.90 kcal/mol.

Although this new group contribution method is based on the previous group contribution method developed by Mavrovouniotis (9,10), the new method is a significant improvement over the previous method both in the range of biochemical compounds and reactions for which Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} may be estimated and in the accuracy of the Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} estimates generated. The expanded applicability of this new group contribution method is due to the addition of 20 new structural groups to the method. When restricted to the structural groups included in the previous group contribution method, Δ_{f}*G*′^{°} could be estimated for only 65% of the compounds and Δ_{r}*G*′^{°} could be estimated for only 97% of the reactions in the training set for the new method. In contrast, the new method allows the estimation of Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for 100% of the compounds and reactions in the training set.

The expanded applicability of the new group contribution method also extends to large databases of known biochemical reactions such as the KEGG, UM-BBD, *i*AF1260 (4), and *i*JR904 (38). The application of the current and previous group contribution methods to the estimation of Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for these databases is discussed in detail later (see the section “Estimating Δ*G*′^{°} of known biochemical reactions”). For the compounds and reactions in the training set for which Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} could be estimated using the previous group contribution method, the standard error of the estimates generated by the previous method was 3.92 kcal/mol, compared to a standard error of 1.98 kcal/mol for the estimates generated by the new group contribution method. This difference in standard error confirms that the accuracy in the ΔG′^{°} estimates produced using the new group contribution method is significantly improved.

### Results from MLR fitting

To assess the goodness of fit of the new group contribution method to the training set of available thermodynamic data, the distribution of the residuals of the fit (the deviations between the estimated and the observed Δ*G*′^{°} values) was analyzed (Fig. 2). Analysis of the cumulative distribution of the residuals indicated that 85% and 96% of the estimated Δ*G*′^{°} values for the training set fall within one and two standard deviations of the observed values, respectively (Fig. 2 *A*). This agrees well with the confidence intervals expected if the residuals from the training set were normally distributed (68% and 95% within one and two standard deviations, respectively). The distribution of residuals for the training set (*shaded bars* in Fig. 2 *B*) is also similar to a normal distribution with the same mean and standard deviation (*dashed line* in Fig. 2 *B*). The high peak in the distribution of the residuals above the expected normal distribution indicates the presence of a small number of outlying data points with uncharacteristically large errors that are causing the standard deviation to be larger than would be expected. Although the reactions and compounds associated with each of these outlying data points were carefully analyzed, no clear trends emerged to indicate the need for any additional structural groups or interaction factors in the group contribution method.
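
The coverage check described above, and the way a few outliers can inflate the standard deviation, can be illustrated with a short sketch. The residuals below are invented for illustration, and the function name `coverage` is hypothetical.

```python
from statistics import mean, pstdev

def coverage(residuals, multiples=(1, 2)):
    """Fraction of residuals lying within k standard deviations of the mean."""
    center = mean(residuals)
    sd = pstdev(residuals)  # population standard deviation
    return {k: sum(abs(r - center) <= k * sd for r in residuals) / len(residuals)
            for k in multiples}

# Nine small residuals plus one large outlier (made-up values, kcal/mol).
residuals = [0.1, -0.2, 0.3, -0.1, 0.0, 0.2, -0.3, 0.1, -0.1, 5.0]
result = coverage(residuals)
print(result)  # {1: 0.9, 2: 0.9}
```

Here the single outlier enlarges the standard deviation so much that 90% of the points fall within one standard deviation, more than the 68% expected for normally distributed residuals.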

F-tests and *t*-tests were also performed to validate the group contribution method as a whole and to validate each of the interaction factors included in the group contribution method (see Methods). The F-value calculated for the method corresponded to a probability of 100% on the F-cumulative distribution curve, indicating that the method passes the F-test. Additionally, all the *t*-tests for the interaction factors included in the final implementation of the new method scored below 5%, indicating that the values for these factors are statistically significant.

The uncertainties in the Δ_{gr}*G*′^{°} values of the structural groups (Table 1) and interaction factors (Table 2) (*SE*_{gr,i}) were utilized to calculate the specific uncertainty in the estimated Δ_{f}*G*′^{°} or Δ_{r}*G*′^{°} for each data point in the training set (see Methods). We found that 73% and 87% of the observed values in the training set fell within one and two of these estimate-specific uncertainties of the estimated values, respectively, validating that the uncertainty calculated from the *SE*_{gr,i} values provided for the individual structural groups and interaction factors is an effective predictor of the uncertainty in each estimate. Furthermore, 93% and 99% of the estimate-specific uncertainties for the training set were lower than one and two *SE*_{MLR}, respectively, verifying that using the estimate-specific uncertainty provides tighter bounds on the uncertainty in the estimates than using the overall *SE*_{MLR} as the uncertainty estimate for every estimated value.
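
One plausible way to turn per-group standard errors into an uncertainty for a single estimate is to combine them in quadrature. The sketch below assumes independent group errors; the formula actually used is given in Methods, which is not reproduced in this section, and may differ. The group names and *SE* values are invented, and `estimate_uncertainty` is a hypothetical name.

```python
import math

def estimate_uncertainty(group_counts, group_se):
    """Quadrature sum sqrt(sum_i (n_i * SE_i)^2), assuming independent errors."""
    return math.sqrt(sum((n * group_se[g]) ** 2
                         for g, n in group_counts.items()))

# Hypothetical compound: two hydroxyl groups and one methyl group.
group_counts = {"-OH": 2, "-CH3": 1}
# Made-up standard errors in kcal/mol (not taken from Table 1 or Table 2).
group_se = {"-OH": 0.3, "-CH3": 0.4}

u = estimate_uncertainty(group_counts, group_se)
print(f"{u:.3f} kcal/mol")  # sqrt(0.6^2 + 0.4^2) = sqrt(0.52)
```

For a compound built from a few well-determined groups, a value of this kind can fall well below the overall 1.90 kcal/mol standard error of the fit.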

### Results of cross-validation analysis

In addition to assessing the accuracy of the Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} estimates generated by the new group contribution method for the compounds and reactions included in the training set, we also performed a cross-validation analysis to assess the ability of the new method to estimate Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for compounds and reactions outside the training set. After 200 cross-validation runs were performed (see Methods), the standard error for the data excluded from the training set in the cross-validation runs (*SE*_{Excluded}) was compared to the standard error for the data remaining in the training set (*SE*_{MLR}) (Fig. 4 *A*). The overall *SE*_{Excluded} for all the cross-validation runs was 2.22 kcal/mol, which is approximately 17% higher than the *SE*_{MLR} for the entire training set (1.90 kcal/mol). These results indicate that the accuracy of the estimates for the data included in and excluded from the training set is similar. Additionally, the distributions of the residuals for the data excluded from the training set (*shaded bars* in Fig. 4 *B*) and the data included in the training set (*solid bars* in Fig. 4 *B*) are nearly identical, further confirming that the accuracy of the Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} estimates for the data included in and excluded from the training set is similar.

**...**

To assess the sensitivity of the Δ_{gr}*G*′^{°} values included in the group contribution method to the training set used to fit the method, we studied the variance of these values during the 200 cross-validation runs (Fig. 5). The median Δ_{gr}*G*′^{°} value calculated for each group during the cross-validation analysis never differed from the final reported value by more than 0.5 kcal/mol. Furthermore, 50% of the values calculated for each group typically fell within 1.0 kcal/mol of the final reported value and always fell within 2 kcal/mol of the final reported value (Fig. 5). These results indicate that the sensitivity of the Δ_{gr}*G*′^{°} values to the training set used to fit this group contribution method is within the same order of magnitude as the uncertainty in the values.

**...**

We also examined the accuracy of the estimate-specific uncertainties (calculated from the *SE*_{gr,i} values) for the data excluded from the training set. The residual (difference between the estimated and observed Δ*G*′^{°} values) of each excluded data point was compared to the estimate-specific uncertainty for the same data point, and it was found that 62%, 75%, and 88% of the residuals were less than one, two, and four times this uncertainty, respectively. This study indicates that when estimating uncertainty in Δ*G*′^{°} for compounds and reactions not included in the data set, uncertainties of two and four times the estimate-specific uncertainty will provide approximately the same confidence interval as one and two standard deviations for normally distributed residuals. As a conservative limit, the overall *SE*_{Excluded} from the cross-validation runs (2.22 kcal/mol) may be used for the uncertainty in any estimated value, as has been previously proposed by Mavrovouniotis.
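
The comparison of each excluded residual against multiples of its own estimate-specific uncertainty can be sketched as follows. The residuals and uncertainties are invented for illustration, and `fraction_within` is a hypothetical name.

```python
def fraction_within(residuals, uncertainties, multiples=(1, 2, 4)):
    """Fraction of residuals smaller in magnitude than k times their own uncertainty."""
    pairs = list(zip(residuals, uncertainties))
    return {k: sum(abs(r) < k * u for r, u in pairs) / len(pairs)
            for k in multiples}

# Made-up held-out residuals and per-estimate uncertainties (kcal/mol).
residuals = [0.5, -1.5, 3.0, -0.2, 7.0]
uncertainties = [1.0, 1.0, 1.0, 0.5, 1.5]

fractions = fraction_within(residuals, uncertainties)
print(fractions)  # {1: 0.4, 2: 0.6, 4: 0.8}
```

Reading off the multiple at which the fraction reaches a target level is one way to choose how far to widen the per-estimate uncertainty for data outside the training set.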

### Contribution of the conjugation interaction factors

One significant advance in this new group contribution method compared with previous methods is the addition of interaction factors to account for the effect of double bond conjugation on the Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} values. Initially, one new interaction factor was introduced into the group contribution method for each of the types of double bond conjugation possible between carbon, oxygen, and nitrogen atoms (Table 3). Double bond conjugation involving sulfur atoms was not considered, as such structures are less common in biochemistry. Two interaction factors, NCNC and CNNC, were removed from the method before the fitting, as no example of this class of double bond conjugation was found in any of the molecules within the training set. When MLR was used to determine the Δ_{gr}*G*′^{°}, *SE*_{gr}, and *t*-test values for each of the interaction factors (Table 3), it was found that the interaction factors NCCN and CCNC both had *t*-tests well over 10%, indicating that these interaction factors were not statistically significant. Additionally, the interaction factor OCNC had an insignificant effect on the *SE*_{MLR} for the fitting. For these reasons, these interaction factors were also removed from the method. All the remaining interaction factors had statistically significant Δ_{gr}*G*′^{°} values, and the addition of each of the remaining interaction factors resulted in a significant drop in the overall *SE*_{MLR} for the group contribution method. Overall, the inclusion of the five remaining interaction factors for double bond conjugation reduced the *SE*_{MLR} for the fitting from 2.04 to 1.90 kcal/mol.

### Estimating ΔG′^{°} of known biochemical reactions

The group contribution method of Mavrovouniotis and the final implementation of the new group contribution method were both applied to calculating Δ_{f}*G*′^{°} of the compounds and Δ_{r}*G*′^{°} of the reactions in four databases of biochemical reactions: the *i*JR904 genome-scale model of *E. coli*, the *i*AF1260 genome-scale model of *E. coli*, the UM-BBD, and the KEGG (Table 4). The molecular structures of some of the metabolites contained in these databases involve pseudoatoms such as R, X, or *, and the molecular structures of some other metabolites are unknown. These metabolites were considered ineligible for Δ_{f}*G*′^{°} estimation because the complete structure of a compound must be known for Δ_{f}*G*′^{°} to be calculated. Similarly, some of the reactions contained in these databases are not mass or charge balanced or involve compounds with unknown molecular structures. Such reactions were also considered ineligible for Δ_{r}*G*′^{°} estimation because Δ_{r}*G*′^{°} can be calculated only for complete and balanced reactions. Once the ineligible reactions and compounds were removed from consideration, any remaining compounds and reactions for which Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} could not be calculated were entirely due to the presence of molecular substructures for which the Δ_{gr}*G*′^{°} value was unknown.

Both the group contribution method of Mavrovouniotis and the new group contribution method are capable of estimating Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for nearly all the compounds and reactions involved in the *i*JR904 and *i*AF1260 models. The coverage of the new group contribution method is only slightly better for these genome-scale models. However, the new group contribution method performs significantly better than the Mavrovouniotis method in estimating Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} of the UM-BBD compounds and reactions. This is primarily due to the addition of Δ_{gr}*G*′^{°} values for the halogen substructures, which are prevalent in biodegradation chemistry. The new group contribution method also performs significantly better in estimating Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} of the KEGG compounds and reactions. All 20 additional substructures that have been included in the new group contribution method contribute evenly to this improvement in the coverage of the KEGG. All the Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} values estimated for the compounds and reactions in the KEGG using the new group contribution method have been provided in Data S2. Note that in all four databases, the coverage of the reactions by the group contribution method is better than the coverage of the compounds. This is because structural groups with unknown Δ_{gr}*G*′^{°} values cancel out of many reactions, as they are not created or destroyed in most reactions. Overall, the new group contribution method is demonstrated to be capable of estimating Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for a wide range of biochemical compounds and reactions.
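
The reason reaction coverage exceeds compound coverage can be made concrete with a small sketch: a group whose energy is unknown drops out of the reaction estimate whenever its net count across the reaction is zero. The compounds, group names, and energies below are invented for illustration, and the function names are hypothetical.

```python
def net_group_change(stoich, compound_groups):
    """Net change in each group's count across a reaction (products positive)."""
    net = {}
    for compound, coeff in stoich.items():
        for group, count in compound_groups[compound].items():
            net[group] = net.get(group, 0) + coeff * count
    # Groups that are neither created nor destroyed cancel out.
    return {g: n for g, n in net.items() if n != 0}

def estimate_compound(groups, group_energy):
    """Formation estimate, or None if any group energy is unknown."""
    if any(g not in group_energy for g in groups):
        return None
    return sum(n * group_energy[g] for g, n in groups.items())

def estimate_reaction(stoich, compound_groups, group_energy):
    """Reaction estimate from net group changes, or None if one is unknown."""
    return estimate_compound(net_group_change(stoich, compound_groups),
                             group_energy)

# Hypothetical compounds sharing a ring substructure with no fitted energy.
compound_groups = {
    "A": {"ring_X": 1, "-OH": 1},
    "B": {"ring_X": 1, ">C=O": 1},
}
group_energy = {"-OH": -30.0, ">C=O": -25.0}  # made-up values, kcal/mol
stoich = {"A": -1, "B": 1}                     # reaction A -> B

print(estimate_compound(compound_groups["A"], group_energy))   # None
print(estimate_reaction(stoich, compound_groups, group_energy))  # 5.0
```

Neither compound can be estimated on its own, yet the reaction can, because `ring_X` appears once on each side.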

### Prevalent substructures with unknown Δ_{gr}*G*′^{°} values

Clearly, molecular substructures still exist in these databases for which the Δ_{gr}*G*′^{°} value is unknown. Many of these structures are present in organic-inorganic complexes involving iron, nickel, or cobalt for which the new group contribution method has not been designed. However, a small number of prevalent organic substructures with unknown Δ_{gr}*G*′^{°} values appear in many of the metabolites and reactions for which Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} could not be calculated (Table 5). These substructures represent important targets for future experiments involving the measurement of the thermodynamic properties of biochemical reactions. As experimental data for reactions involving these substructures emerge, new structural groups and interaction factors can be developed and added to this group contribution method. When determining Δ_{gr}*G*′^{°} values for these additions to the model, it is recommended that the new data be appended to the training set used to fit the entire group contribution model and that all Δ_{gr}*G*′^{°} values be refit in the model rather than solely the values for the new groups. This will result in better accuracy and reveal the effect of the addition of the new data and groups on the method. To facilitate this kind of expansion and improvement of this group contribution method, details of the training set used in this method have been provided in Data S2. Molfiles created for the molecular structures of every compound involved in the training set in the correct ionic form at pH 7 are also available in Data S3.

## DISCUSSION

The group contribution method introduced here has numerous advantages over previous methods, including (i) the ability to calculate Δ*G*′^{°} for a greater variety of compounds and reactions; (ii) improved accuracy in the Δ*G*′^{°} values produced using the method; (iii) improved estimation of the uncertainty in the Δ*G*′^{°} values produced; and (iv) complete disclosure of the training set used to fit the Δ_{gr}*G*′^{°} values to facilitate the expansion of the method with additional data, interaction factors, and structural groups. The application of this group contribution method toward the estimation of Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for the compounds and reactions in the *i*JR904 model, the *i*AF1260 model, the UM-BBD, and the KEGG confirms the ability of the method to predict Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} for a significant portion of the known biochemistry. The Δ_{f}*G*′^{°} and Δ_{r}*G*′^{°} estimates generated for the KEGG and provided in Data S2 represent the most complete and most accurate set of thermodynamic data compiled for the KEGG to date, to our knowledge, and the addition of halogens to the methodology allows the application of this method to new types of chemistry beyond genome-scale metabolic models in areas such as bioremediation (24).

The changes introduced in this new group contribution method have not only expanded the applicability of the method to calculate Δ*G*′^{°} for a wider range of compounds and reactions but also improved the accuracy of the method. For the compounds and reactions in the training set for which Δ*G*′^{°} can be calculated using the Mavrovouniotis group contribution method, the standard deviation of the residuals is 3.92 kcal/mol, compared to a standard deviation of 1.98 kcal/mol when the new group contribution method is used to calculate Δ*G*′^{°} for the same reactions and compounds.

The quantification of the uncertainty in each Δ_{gr}*G*′^{°} value in the method allows improved resolution in uncertainty estimates for all Δ*G*′^{°} estimates produced using the method. This enhanced resolution in the uncertainty in Δ*G*′^{°} is essential to any genome-scale analysis of metabolic pathways and metabolomic studies involving thermodynamics such as thermodynamics-based metabolic flux analysis (2). As a result of the cross-validation analysis, it is recommended that the uncertainty used for Δ*G*′^{°} values calculated with this method be four times the estimate-specific uncertainty calculated from the *SE*_{gr} values using Eq. 10, as this uncertainty provides an 83% confidence interval for the estimated values.

A web interface has been developed to allow the automated estimation of the *Δ*_{f}*G*′^{°} values for a set of compounds based on the molecular structures of the compounds using the new group contribution method. This interface is available free at the following web address: http://sparta.chem-eng.northwestern.edu/cgi-bin/GCM/WebGCM.cgi.

## SUPPLEMENTARY MATERIAL

To view all of the supplemental files associated with this article, visit www.biophysj.org.

## Acknowledgments

Many thanks to Stacey Pace for her assistance with the UM-BBD database analysis.

This work was supported by the U.S. Department of Energy Genomes to Life program, the DuPont Young Professor's grant, and a National Science Foundation Integrative Graduate Education and Research Traineeship Complex Systems fellowship.

## Notes

Matthew D. Jankowski and Christopher S. Henry contributed equally to this work.

Editor: Costas D. Maranas.

## References

**The Biophysical Society**


- Group Contribution Method for Thermodynamic Analysis of Complex Metabolic Networks. Biophysical Journal. Aug 1, 2008; 95(3):1487.
