# Identifying essential genes in *Escherichia coli* from a metabolic optimization principle

^{a}Andrea De Martino,

^{b}Enzo Marinari,

^{c,}

^{1}Matteo Marsili,

^{d}and Isaac Pérez Castillo

^{e}

^{a}Dipartimento di Fisica,

^{b}Consiglio Nazionale delle Ricerche (Istituto per i Sistemi Complessi)/Istituto Nazionale per la Fisica della Materia (Centre for Statistical Mechanics and Complexity) and Dipartimento di Fisica, and

^{c}Dipartimento di Fisica, Consiglio Nazionale delle Ricerche/Istituto Nazionale per la Fisica della Materia and Istituto Nazionale di Fisica Nucleare, Università di Roma “La Sapienza”, Piazzale A. Moro 2, 00185 Rome, Italy;

^{d}The Abdus Salam International Center for Theoretical Physics, Strada Costiera 11, 34014 Trieste, Italy; and

^{e}Department of Mathematics, King's College London, Strand, London WC2R 2LS, United Kingdom

^{1}To whom correspondence should be addressed. E-mail: ti.nfni.1amor@iraniram.ozne

Author contributions: C.M., A.D.M., E.M., M.M., and I.P.C. performed research.

## Abstract

Understanding the organization of reaction fluxes in cellular metabolism from the stoichiometry and the topology of the underlying biochemical network is a central issue in systems biology. In this task, it is important to devise reasonable approximation schemes that rely on the stoichiometric data only, because full-scale kinetic approaches are computationally affordable only for small networks (e.g., red blood cells, ≈50 reactions). Methods commonly used are based on finding the stationary flux configurations that satisfy mass-balance conditions for metabolites, often coupling them to local optimization rules (e.g., maximization of biomass production) to reduce the size of the solution space to a single point. Such methods have been widely applied and have proven able to reproduce experimental findings for relatively simple organisms in specific conditions. Here, we define and study a constraint-based model of cellular metabolism where neither mass balance nor flux stationarity are postulated and where the relevant flux configurations optimize the global growth of the system. In the case of *Escherichia coli*, steady flux states are recovered as solutions, although mass-balance conditions are violated for some metabolites, implying a nonzero net production of the latter. Such solutions furthermore turn out to provide the correct statistics of fluxes for the bacterium *E. coli* in different environments and compare well with the available experimental evidence on individual fluxes. Conserved metabolic pools play a key role in determining growth rate and flux variability. Finally, we are able to connect phenomenological gene essentiality with “frozen” fluxes (i.e., fluxes with smaller allowed variability) in *E. coli* metabolism.

**Keywords:**fluxomics, growth, stoichiometry, conserved moieties, gene essentiality

Cellular metabolism involves a complex network of interactions and cross-regulations among metabolites, proteins, and genes. Although our knowledge of regulatory functions at the genetic level and at the level of protein–protein interactions is still in its infancy, the biochemical network of reactions that describes metabolism has been characterized in detail for many organisms (1–3). Most information about the metabolic network is contained in the matrices **B** = {*b*_{i}^{μ}} and **A** = {*a*_{i}^{μ}} describing, respectively, the input and output stoichiometric coefficients of each metabolite μ (ranging from 1 to *P*) for all of the metabolic reactions *i* (1 to *N*). Their knowledge allows for the definition of constraint-based models from which a prediction of metabolic fluxes is possible (4–6). These models typically rely on a steady-state assumption that reflects the time scale separation between the (faster) equilibration of metabolic processes and the (slower) dynamics of genetic regulation. Under this condition, the concentration *X*^{μ} of metabolite μ and the flux ν_{i} ≥ 0 of reaction *i* are constant in time and globally linked by a set of mass-balance conditions:

or, in matrix notation, (**A** − **B**)**ν** = **0**, where **ν** = {ν_{1},…,ν_{N}} is a flux vector. The problem is that of finding a flux vector satisfying the set of *P* linear equations (1). For real metabolic networks, the above system is usually underdetermined because *N* > *P* [e.g., *Escherichia coli* has ≈1,100 reactions and <800 metabolites (1)], so that multiple solutions exist, and one has to specify which of these are the working, physical states of the network.

In organisms with high functional specificity or whose main objective is to produce certain specified metabolites, like, e.g., red blood cells, physical states are taken to be all those consistent with the mass-balance conditions. In such cases, it is important to be able to explore the solution space of Eq. **1** uniformly. Uniform sampling can be achieved in small networks with Monte Carlo methods (7), and, more recently, message-passing algorithms have been used (8, 9). However, computational considerations still prevent the application of such techniques to explore more general aspects of larger metabolic networks.

For more complex organisms, one normally complements Eq. **1** with the assumption that the physical state of the network obeys a specific optimization principle. In flux-balance analysis (FBA) (5), the choice usually falls on the maximization of biomass production, a useful proxy of the growth capabilities of an organism. FBA has been widely applied to different organisms (mostly bacteria) to investigate general aspects of metabolism, like flux distributions in different environments (10), evolutionary dispensability of enzymes (11), or the plasticity of the reaction network (12). Minimization of metabolic adjustment (MOMA) (6) is instead able to predict the reorganization of fluxes after a reaction knockout by minimizing the distance between the FBA solution and that obtained from Eq. **1** after the knockout.

Here, we consider a constraint-based model of metabolic activity with the aim of characterizing flux states corresponding to optimal net metabolic production. We neither assume constancy-in-time of fluxes nor do we impose mass-balance constraints on metabolites. Rather, we allow for these to be recovered as properties of the solutions. This assumption turns out to be able to reproduce the empirical statistics of fluxes for the bacterium *E. coli* in different environments with a remarkable accuracy, and the physical and biochemical origin of the robustness of the emerging picture can be completely understood. Further biological insight can be gained by comparing the variability of individual fluxes with data on phenomenological gene essentiality.

## Model Definitions

The abstract setup we consider was originally introduced by J. Von Neumann (13) to model growth in production economies as an autocatalytic process. We consider a system of *N* chemical reactions between *P* reagents, with fluxes and concentrations evolving in discrete time *t* = 0,1,… (we leave the time scale unspecified). For now, the system is assumed to be purely autocatalytic, so that the total input of any given metabolite at a certain time step must come from (all of or part of) the output at the previous time step. Let *S*_{i}(*t*) denote the scale at which reaction *i* operates at time *t*, so that the total input and output of metabolite μ are given by *I*^{μ}(*t*) = Σ_{i}*S*_{i}(*t*)*b*_{i}^{μ} and *O*^{μ}(*t*) = Σ_{i}*S*_{i}(*t*)*a*_{i}^{μ}, respectively. Stability requires that *C*^{μ}(*t*) := *O*^{μ}(*t*) − *I*^{μ}(*t* + 1) ≥ 0 at all times, otherwise the system would need an outside metabolic source to survive. We focus our attention on the feasibility of dynamical paths with constant growth rate ρ > 0, where *I*^{μ}(*t* + 1) = ρ*I*^{μ}(*t*). It is easy to see that, in such paths, reaction rates must behave as *S*_{i}(*t*) = *s*_{i}ρ^{t}, with constants *s*_{i} ≥ 0 satisfying the *P* linear constraints

The main assumption now is that the network's physical state s* = {*s*_{i}^{*}} corresponds to one of those for which the growth rate ρ attains its maximum possible value ρ* compatible with the constraints (Eq. **2**). The rationale for this choice lies in the fact that, under general conditions, it can be proven that paths with the optimal use of resources coincide, apart from an initial transient, with those of maximal expansion (14). Note that the *s*_{i}s are essentially the discrete-time version of the continuous time fluxes ν_{i} of Eq. **1**.

It is intuitive that the solution space of Eq. **2** shrinks as ρ is increased starting from ρ = 0, where any flux configuration is a viable state. Quite importantly, then, for a fixed set of input–output relations, the solutions of Eq. **2** are bound to coincide with those of Eq. **1** if ρ* = 1 and *c*^{μ} = 0 ∀μ.

The typical behavior of Von Neumann's model can be fully appreciated analytically in the case of random graphs. Depending on their size and topology, autocatalytic networks with random stoichiometry give rise to very different optimal states. As in other constrained optimization problems (15), the key control parameter is the ratio *N*/*P* = *n* of variables-to-constraints. In random topologies (16, 17), as *n* increases, the system crosses over from a contracting phase with ρ* < 1 to an expanding one with ρ* > 1, passing through a stationary regime with ρ* = 1 in which reaction fluxes are constant in time as in FBA (although the values at which fluxes settle may be different).

## Application to *E. coli*

A natural reaction network can be modeled as an autocatalytic system when the uptake reactions that provide resources and account for exchanges of metabolites with the surrounding environment are included. We have applied the Von Neumann scheme to the cellular metabolism of the bacterium *E. coli*, as reconstructed in ref. 1. To set the stage, 3 different operations have to be performed.

### Environment Selection.

To fix the environment where the cell lives, we have defined a set of external metabolites on which we applied uptake fluxes. We have considered 3 distinct environmental conditions: (*i*) isolated cell, without uptakes; (*ii*) minimal environment, with uptakes on a restricted set of metabolites (10), namely CO_{2}, K, NH_{4}, PI, O_{2}, SO_{4}, and one of glu-L, succ, or glc; (*iii*) rich environment, with uptakes on all external metabolites.

### “Leaf Removal.”

Once the network is built, one has to remove from the internal metabolites those that are never produced, because the corresponding constraints are satisfied only by taking ρ = 0 or by applying a null flux to the corresponding reactions (as is actually done in FBA).

### Accounting for Reversibility.

We have disposed of reversible reactions by introducing 2 separate fluxes and taking the absolute difference as the positive net flux.

The size of the resulting network, i.e., *N* and *P*, is ultimately different for different environments.

We calculate the maximum growth rate ρ* numerically via a generalized MinOver algorithm (18), detailed in ref. 17, in which solutions are found at fixed ρ, and then ρ is gradually increased. It has been shown rigorously (17) that this algorithm converges to a solution at a fixed ρ if at least 1 solution exists. Moreover, when multiple solutions occur (in which case they form a connected set), the algorithm provides a uniform sampling of the solution space [see supporting information (SI) for a test of this property at low dimensions]. Anticipating that indeed many flux vectors satisfy Eq. **2** at ρ* for *E. coli*, the latter is a particularly important advantage because a uniform sampling is required to characterize the solution space.

## Characterization of the Solutions

For the metabolic network of *E. coli*, we have found ρ* = 0.999 ± 0.001 independently of the environmental conditions we have tested, so that the state of optimal growth is compatible with one with constant fluxes. The distribution of reaction fluxes at ρ* (see Fig. 1) displays a regime (<2 decades) with a scaling form *P*(*s*) ≈ *s*^{−}^{γ} with exponent close to −1 [in reasonable agreement with the experimental evidence based on a limited set of measured fluxes (19, 10)], followed by a cutoff. Note that the flux histogram is remarkably stable over different solutions. This scenario has been partially reproduced by FBA (10), but the solution obtained optimizing the biomass production systematically overestimates γ. To compare individual fluxes with experiments, we aimed at studying the solution obtained in conditions similar to those described in ref. 19, focusing our attention on 17 fluxes from the central metabolism, as in ref. 6. Unfortunately, we are unable to reproduce in detail the M9 medium used in ref. 19, and can fix the uptakes of only 3 metabolites, namely glucose, CO_{2}, and O_{2} identically to ref. 19. For the remaining part of the environment, we chose to simulate a minimal medium and fixed the 4 remaining external uptakes at arbitrary values (we did not observe a strong dependency of the results on these parameters). Results are shown in Fig. 2. We stress that the medium we consider is not strictly identical to that used in experiments. It is worth noting (see also ref. 20) that experiments employ ^{13}C-labeled carbon sources as substrates for a growing bacterial culture kept at constant density, and that the relative abundance of metabolites can be captured by nuclear magnetic resonance or mass spectroscopy. Reaction fluxes of different metabolic pathways are then inferred, assuming a model of the reaction network.

*E. coli*in a minimal environment with 3 different carbon sources: glutamate (glu), succinate (suc), and glucose (glc). Here,

*N*= 1,053 and

*P*= 630 (glu);

*N*= 1,054 and

*P*= 630 (suc);

*N*= 1,057 and

*P*= 631 (glc). (

**...**

With stationarity recovered at ρ* = 1, as discussed above, the difference between the Von Neumann solution and that of FBA arises from the fact that not all *c*^{μ}s attain their lowest allowed value. This is clearly seen in Fig. 3. In each solution, some metabolites exist with a nonzero *c*^{μ}, implying that in the steady state, a net production of such metabolites occurs. This may also signal an incompleteness of the stoichiometric data.

*c*

^{μ}for different metabolites (horizontal axis) in 500 different solutions. The color code is reported on the right. The white background corresponds to

*c*

^{μ}= 0. Colored marks indicate a net production of the corresponding metabolite

**...**

To get further insight on the existence of multiple flux states compatible with Eq. **2** at ρ* and on the shape of the solution space, a rough but effective way consists in monitoring the mean overlap between different solutions. Given 2 solutions α and β at fixed ρ, we define their overlap *q*_{α}β as:

This quantity equals 1 when **s**_{α} and **s**_{β} coincide and becomes smaller and smaller as the distance between **s**_{α} and **s**_{β} in the space of flux vectors increases. In computing it, one should consider a flux to be zero whenever its value is below a threshold ε, to take into account the fact that there is a loss of information about relative fluctuations between different solutions in fluxes smaller than ε. We have chosen ε = 10^{−5} to ensure that overlaps are not overestimated. Results obtained with ε = 10^{−5}, 10^{−6}, and 10^{−7} are identical. We have furthermore taken into account the fact that the overlap between null fluxes must be defined by consistency to be equal to 1. In Fig. 4, we report the behavior of the mean overlap 〈*q*_{αβ}〉 (the angular brackets representing an average over many pairs of solutions) for *E. coli* and for a network with the same topology and *N*/*P* but random Gaussian stoichiometry. Random networks provide an important benchmark against which the behavior of real networks can be tested, to highlight the extent to which observations are specific of the particular organism that is being studied. In particular, the role of topology and stoichiometry can be fully analyzed. For instance, in ref. 17, it is shown that the power-law behavior of the flux distribution cannot be ascribed to the specific network topology.

*E. coli*and of its random counterpart. (

*A*and

*B*) Mean overlap among 500 different solutions as function of ρ in

*E. coli*and in a random metabolic network; the last point on the abscissa corresponds

**...**

One sees that for the artificial metabolic network, 〈*q*_{αβ}〉 is an increasing function of ρ that approaches 1 as ρ→ ρ* (Fig. 4*B*). Moreover, the overlap histogram has a marked δ-peak at *q* = 1 whose mass increases as ρ → ρ*. These results indicate that the optimal solution is unique. On the other hand, the behavior of 〈*q*_{αβ}〉 for *E. coli* suggests that the volume of solutions stops contracting when ρ reaches ≈0.8 from below. In particular, at ρ*, multiple solutions survive. From the corresponding histogram, Fig. 4*C*, we see that only ≈30% of reactions have an overlap close to 1. In a different jargon, one may say that ≈30% of the variables are frozen (i.e., assume the same value on all solutions of the constrained optimization problem), whereas the remaining are free. The existence of frozen variables characterizes many random constraint satisfaction problems (21), but it is normally hard to identify topological motifs where variables are more likely to be frozen. In the present case, it is reasonable to expect that for purely structural reasons, reaction chains are entirely frozen when the first reaction of the chain is frozen. This is indeed confirmed by the map of frozen/free fluxes in *E. coli*'s central metabolism, displayed in Fig. 5. The chain-like part of glycolysis appears indeed to be frozen.

*E. coli*'s central metabolism: Nodes represent metabolites; an arrow joining 2 nodes is present when a reaction exists converting one into the other. Red (green) links denote frozen (free) reactions, with overlap larger (smaller) than 0.9.

The existence of frozen fluxes raises the obvious question of their biological significance. To address this point, we have correlated reaction overlaps *q*_{αβ}^{(i)} with the essentiality of the corresponding genes according to the notion of universal essentiality used in ref. 22, which combines phenomenological relevance (a gene is essential if knocking it out causes the cell to die) with evolutionary retention (the presence of the gene in different species). In ref. 22, 55 essential genes of *E. coli* involved in metabolism have been identified that are also present in 80% of 32 different bacterial genomes. We have been able to link 52 of such genes to reactions in the reconstructed network. It turns out, see Fig. 6, that 43 of such genes correspond to reactions with overlap >0.8 and that only 7 genes relate to reactions with an overlap significantly <80%. Whether a gene is “essential” or not (according to the definition of ref. 22) depends on the choice of the environment. In particular, ref. 22 considers a rich medium as the environment. For comparison, we have considered theoretical values of the overlaps for fluxes calculated assuming a rich environment. Note that, in principle, the same gene can be linked to a more or less variable flux in different environments. However, the results presented are qualitatively preserved in the minimal environment with different carbon sources. This suggests that frozen fluxes, which in Von Neumann's framework are allowed a very limited variability if a state of optimal growth is to be kept, may carry an evolutionary significance.

## Role of Conserved Moieties

To trace back the physical origin of the results presented above, we have studied the rank of the matrix **A** − ρ**B** associated to the *P* linear constraints (Eq. **2**) as a function of the parameter ρ, see Fig. 7.

The singularity occurring at ρ = 1 is related to the presence of conserved pools of metabolites, groups of reagents whose total concentration is constant in time (23). Their existence is due to the fact that the concentrations of metabolites of a certain pool are always coupled with a common functional group. For example the 3 metabolites ATP, AMP, and ADP belong to a pool of cofactors that preserve the adenylate moiety (24). This implies that a change in the concentration of a given metabolite cannot be accomplished without considering the entire pool to which that metabolite belongs.

A conserved pool *g* is formally defined by a *P*-dimensional Boolean vector of elements *z*_{g}^{μ} such that *z*_{g}^{μ} = 1 if μ ∈ *g* and *z*_{g}^{μ} = 0 otherwise satisfying

Such conservation laws manifest themselves precisely at ρ = 1, By virtue of Eq. **2**, the relation Σ_{μ} *z*_{g}^{μ}*c*^{μ} ≥ 0 must hold for any *g*. It follows that either ρ* ≤ 1, or all fluxes connected to a metabolite belonging to a conserved pool must be equal to zero. Because the null solution **s** = **0** must be discarded on obvious physical grounds, if all fluxes are connected to metabolites in a conserved pool then necessarily ρ* ≤ 1. This is consistent with the results obtained in different environmental conditions and implies that this scenario must be stable against small perturbations of the network topology.

Moreover, from Eqs. **4** and **2**, it is easy to see that, for ρ = 1, μ ∈ *g* implies *c*^{μ} = 0, i.e., for a metabolite belonging to a conserved pool the mass-balance condition must be strictly valid. We have shown instead (Fig. 3) that the values of the constraints are not always zero: For some metabolites, the mass balance condition is not reached, and we can assert that they do not belong to any conserved pool (within the stoichiometric description used).

We conclude that in the state of optimal growth with ρ* = 1, in addition to the stationarity of reaction rates, a spontaneous condition of mass balance holds for most, not all, metabolites.

In summary, the Von Neumann model relies on the assumption that the arrangement of metabolic fluxes follows a principle of growth maximization. It does not imply a priori either the mass balance or the stationarity of fluxes, but these 2 conditions are essentially recovered at the maximum growth rate. This is due to the presence of conserved metabolic pools. The singularity in the rank of **A** − ρ**B** at ρ = 1 implies that the physically relevant solutions are those to which the system tends as ρ → 1. In this limit, mass balance is recovered (i.e., *c*^{μ} = 0) for most, but not all, metabolites. This provides a nontrivial correction to the picture extracted by FBA and reproduces the limited experimental evidence that is available.

## Acknowledgments.

We gratefully acknowledge important discussions with G. Bianconi, R. Monasson, A. Pagnani, and F. Ricci Tersenghi and a number of useful suggestions. M.M. was supported by Information Society Technologies STREP GENNETEC contract no. 034952.

## Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0813229106/DCSupplemental.

## References

*Escherichia coli*K-12 (iJR904 GSM/GPR) Genome Biol. 2003;4:R54. [PMC free article] [PubMed]

*Saccharomyces cerevisiae*iND750, a fully compartmentalized genome-scale metabolic model. Genome Res. 2004;14:1298–1309. [PMC free article] [PubMed]

*Escherichia coli*metabolic capabilities are consistent with experimental data. Nat Biotech. 2001;19:125–130. [PubMed]

*Escherichia coli*. Nature. 2004;427:839–843. [PubMed]

*Escherichia coli*. J Bacteriol. 2002;184:152–164. [PMC free article] [PubMed]

^{13}C-based flux analysis. Mol Syst Biol. 2006;2:62. [PMC free article] [PubMed]

*Escherichia coli*MG1655. J Bacteriol. 2003;185:5673–5684. [PMC free article] [PubMed]

**National Academy of Sciences**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (799K) |
- Citation

- Estimating the size of the solution space of metabolic networks.[BMC Bioinformatics. 2008]
*Braunstein A, Mulet R, Pagnani A.**BMC Bioinformatics. 2008 May 19; 9:240. Epub 2008 May 19.* - Including metabolite concentrations into flux balance analysis: thermodynamic realizability as a constraint on flux distributions in metabolic networks.[BMC Syst Biol. 2007]
*Hoppe A, Hoffmann S, Holzhütter HG.**BMC Syst Biol. 2007 Jun 1; 1:23. Epub 2007 Jun 1.* - A hybrid model of anaerobic E. coli GJT001: combination of elementary flux modes and cybernetic variables.[Biotechnol Prog. 2008]
*Kim JI, Varner JD, Ramkrishna D.**Biotechnol Prog. 2008 Sep-Oct; 24(5):993-1006.* - Flux analysis and control of the central metabolic pathways in Escherichia coli.[FEMS Microbiol Rev. 1996]
*Holms H.**FEMS Microbiol Rev. 1996 Dec; 19(2):85-116.* - Reconstruction and flux analysis of coupling between metabolic pathways of astrocytes and neurons: application to cerebral hypoxia.[Theor Biol Med Model. 2007]
*Cakir T, Alsan S, Saybaşili H, Akin A, Ulgen KO.**Theor Biol Med Model. 2007 Dec 10; 4:48. Epub 2007 Dec 10.*

- Identifying essential genes in Escherichia coli from a metabolic optimization pr...Identifying essential genes in Escherichia coli from a metabolic optimization principleProceedings of the National Academy of Sciences of the United States of America. 2009 Feb 24; 106(8)2607

Your browsing activity is empty.

Activity recording is turned off.

See more...