
# Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling

^{*†‡§} M. K. Stephen Yeung,^{*} Jeff Hasty,^{*¶} and James J. Collins^{*}

^{*}Center for BioDynamics and Department of Biomedical Engineering, Boston University, Boston, MA 02215;

^{†}Division of Computational Biology, Department of Physics, Linköping University, S-581 83 Linköping, Sweden;

^{‡}Stockholm Bioinformatic Center, Stockholm Center for Physics, Astronomy, and Biotechnology, S-106 91 Stockholm, Sweden; and

^{¶}Department of Bioengineering, University of California at San Diego, La Jolla, CA 92093-0412

^{§}To whom correspondence should be addressed. E-mail: jespert@ifm.liu.se.

## Abstract

While the fundamental building blocks of biology are being tabulated by the various genome projects, microarray technology is setting the stage for the task of deducing the connectivity of large-scale gene networks. We show how the perturbation of carefully chosen genes in a microarray experiment can be used in conjunction with a reverse engineering algorithm to reveal the architecture of an underlying gene regulatory network. Our iterative scheme identifies the network topology by analyzing the steady-state changes in gene expression resulting from the systematic perturbation of a particular node in the network. We highlight the validity of our reverse engineering approach through the successful deduction of the topology of a linear *in numero* gene network and a recently reported model for the segmentation polarity network in *Drosophila melanogaster*. Our method may prove useful in identifying and validating specific drug targets and in deconvolving the effects of chemical compounds.

The genome projects are rapidly generating extensive lists of the genes and proteins that govern cellular behavior, and the analysis of these lists is providing a wealth of clinically relevant information. Simultaneously, there has been impressive progress made toward the description of the regulatory mechanisms in many cellular systems (1). Transcriptional regulation, used by cells to control gene expression (2, 3), occurs when a regulatory protein increases or decreases the transcription rate through biochemical reactions that enhance or block polymerase binding at the promoter region. Because many genes code for regulatory proteins that can activate or repress other genes, the emerging picture is that of a complex web, or circuit, of interacting genes and proteins. The elucidation of how subcellular processes at the genetic level are manifest in macroscopic phenomena at the phenotypic level will be a major goal of postgenomic research.

Many cellular processes are described at the genetic level by diagrams that resemble complex electrical circuits (4), and there has been recent interest in two broad avenues of research relating to such genomic circuitry. At one end of the spectrum is the task of quantifying the fundamental laws of gene regulation. Within the context of the electrical circuit analogy, this question involves the deduction of a set of mesoscopic equations that faithfully quantify the information contained in the genetic circuit. A natural plan of attack is to use a *forward engineering* approach, whereby relatively simple circuits are designed and tested with respect to a set of equations generated from the underlying biochemistry. Recent work in this area has entailed the successful coupling of dynamical systems analysis with the construction of relatively simple genetic circuits, such as autoregulatory single-gene networks (ref. 5; F. Isaacs, J.H., C. R. Cantor, and J.J.C., unpublished work), genetic toggle switches (6), and genetic oscillators (7).

At the other end of the spectrum is the project of deducing the connectivity of the genes in a naturally occurring large-scale network. This work is being driven by recent technological advances that permit the simultaneous measurement of expression levels from thousands of genes. Such microarray technology, which rapidly produces vast catalogs of patterns of gene activity, highlights the need for systematic tools to identify the architecture and dynamics of the underlying gene networks. Here, the system identification problem (8) falls naturally into the category of *reverse engineering*; a complex genetic network underlies a massive set of expression data, and the task is to infer the connectivity of the genetic circuit.

The reverse engineering approach requires large data sets and extensive computational resources. There are typically an enormous number of network architectures that are compatible with a given set of expression data, and such a mapping problem initially makes the task of deducing a particular network seem daunting. Several studies have therefore targeted small networks by using genetic algorithms, nonlinear models, time-series analysis, and Bayesian models (9–15), but it is not clear whether these techniques scale for large networks (>100 genes). Techniques to analyze large data sets from whole-genome networks include cluster analysis and the systematic search for characteristic patterns of gene expression associated with some pathological state of interest (16–20) and typically provide only indirect information about network structure.

As mentioned above, several novel small-scale designer gene networks have been constructed and studied within the context of mathematical modeling (refs. 5–7 and 21; F. Isaacs, J.H., C. R. Cantor, and J.J.C., unpublished work). In the present work, we explore the utilization of such designer gene networks in the reverse engineering of large-scale networks. These small designer networks can be inserted into cells and used to provide a controlled perturbation mechanism for gene expression experiments. Our rationale is that the resulting changes in mRNA levels provide indirect information about the network topology. Our reverse engineering scheme is designed to provide experimentalists with a robust recipe for deducing network topology through the analysis of data generated from a series of rationally constructed, designer-perturbed microarray experiments.

## Methods

Here we address how to construct a dynamical model that captures the structure of a gene network, and how to design a reverse engineering scheme that is robust to noise, using only steady-state changes in gene expression, while using realistic *a priori* statistical constraints on the nature of gene networks. Because reliable large-scale measurement of protein and metabolite concentrations is not yet feasible, we focus on the mRNA dynamics.

Although our motivation stems from measurements of individual mRNA concentrations from microarray experiments, here we will demonstrate our reverse engineering procedure through the utilization of *in numero* models in generating simulated microarray data. We will designate these models as data-generating models, and we will demonstrate the validity of our procedure by perturbing these models and then using the generated data to deduce the underlying connectivity of the models.

We can project the dynamics of the network onto a general linear mapping model because we deliver the perturbations around a steady state. We consider a network of *N* genes, with typical time scales τ_{1}, τ_{2}, . . . , τ_{N}. We denote the mRNA expression levels of the genes by *x*_{1}, *x*_{2}, . . . , *x*_{N}. In the absence of any interaction, we let the *i*th mRNA species degrade at some rate γ_{i}. However, an mRNA, say the *j*th, may indirectly affect the dynamics of another mRNA, say the *i*th, through intermediates such as proteins and metabolites, and thus change its transcription rate. We represent this by an effective gene-to-gene coupling coefficient *w*_{ij}. We perturb the genes by using a ramp function *P*_{i} (*cf.* Fig. 1 *a* and *b*). The linearized mapping model around *x*_{1} = *a*_{1}, . . . , *x*_{N} = *a*_{N}, is

$$\tau_i \frac{dx_i}{dt} = -\gamma_i (x_i - a_i) + \sum_{j=1}^{N} w_{ij}(x_j - a_j) + P_i \qquad [1]$$

for *i* = 1, . . . , *N*, with 2*N* + *N*^{2} unknown parameters (*N* γ's, *N* τ's, and *N*^{2} *w*'s).

Fig. 1. (*a*) Schematic three-gene network. Arrows and filled circles indicate activation and repression, respectively, of magnitude *w*_{ij} from gene *x*_{j} to *x*_{i}. A genetic toggle perturbs *x*_{1} by using a ramp function. (*b*) The effect on the gene dynamics (solid lines) **...**

The general nature of the experiments and analysis we perform in the present study is illustrated with the three-gene network in Fig. 1. Using a genetic toggle switch (Fig. 1*a*; ref. 6), we can selectively perturb the activity of a given gene (here chosen to be *x*_{1}). The time course of the gene expression before, during, and after the sustained perturbation is monitored (Fig. 1*b*). A transient stimulus switches the toggle from the lower to the upper state, and the toggle is then left in its upper state (dashed line in Fig. 1*b*). The activities of the other two genes (*x*_{2}, *x*_{3}) change because of their interactions with *x*_{1}. Thus, measuring the gene expression levels before and after the perturbation gives us information on the network structure.
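The perturbation experiment described above can be mimicked numerically. The following is a minimal sketch, assuming the linearized model has the form τ_i dx_i/dt = −γ_i(x_i − a_i) + Σ_j w_ij(x_j − a_j) + P_i and that the perturbation P_i ramps linearly to a held value. The function name `simulate`, the three-gene wiring, and all parameter values are hypothetical choices for illustration, not taken from the paper.

```python
def simulate(W, gamma, tau, a, ramp_target, t_ramp, t_end, dt=0.01):
    """Euler-integrate an assumed linearized model around the steady state a.

    tau_i dx_i/dt = -gamma_i (x_i - a_i) + sum_j W[i][j] (x_j - a_j) + P_i(t)
    """
    n = len(a)
    x = list(a)          # start at the unperturbed steady state
    t = 0.0
    while t < t_end:
        # Ramp: P_i rises linearly to its target over t_ramp, then is held.
        frac = min(t / t_ramp, 1.0)
        P = [frac * p for p in ramp_target]
        dx = [
            (-gamma[i] * (x[i] - a[i])
             + sum(W[i][j] * (x[j] - a[j]) for j in range(n))
             + P[i]) / tau[i]
            for i in range(n)
        ]
        x = [x[i] + dt * dx[i] for i in range(n)]
        t += dt
    return x


# Hypothetical wiring: gene 0 activates gene 1, gene 1 represses gene 2.
W = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, -1.0, 0.0]]
a = [1.0, 1.0, 1.0]
x_after = simulate(W, gamma=[1.0] * 3, tau=[1.0] * 3, a=a,
                   ramp_target=[1.0, 0.0, 0.0], t_ramp=5.0, t_end=30.0)
delta = [xi - ai for xi, ai in zip(x_after, a)]
print([round(d, 3) for d in delta])  # approximately [1.0, 0.5, -0.5]
```

Only the difference between the final and initial steady states is used, which matches the idea of measuring expression before and after a sustained perturbation rather than following the transient.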

Because the reverse engineering method has to be robust against transient fluctuations caused by either intrinsic noise or experimental variability in microarrays, we measure only the average steady-state values of gene expression (thick lines in Fig. 1*b*). Thus, for a gene that is not itself perturbed (*P*_{i} = 0), Eq. 1 becomes

$$0 = w'_{i1}(x_1 - a_1) + \dots + w'_{i,i-1}(x_{i-1} - a_{i-1}) - (x_i - a_i) + w'_{i,i+1}(x_{i+1} - a_{i+1}) + \dots + w'_{iN}(x_N - a_N) \qquad [2]$$

where we absorb γ_{i} into *w*_{ii} and rescale the coupling parameters so that the coefficient of (*x*_{i} − *a*_{i}) is −1, viz, *w*′_{ij} = *w*_{ij}/(γ_{i} − *w*_{ii}), leaving only *N*^{2} parameters. The reverse engineering problem is to infer all of the unknown parameters *w*′_{ij}, constituting the matrix *W*, from the induced changes in gene expression Δ_{i} = *x*_{i} − *a*_{i}. The inference problem for large, dense networks is computationally intractable if we have to search through all possibilities, because the number of possible solutions consistent with the data is prohibitively large. Data from cellular networks, including protein–protein interactions (22) and metabolic networks (23), suggest a sparse topology because the maximal number of inputs (*k*_{max}) to a unit is *k*_{max} ≪ *N*. This constraint reduces the search space and the number of computations in our reverse engineering algorithm.
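The passage from the dynamical model to the steady-state relation can be written out step by step. The sketch below assumes the linearized model has the form τ_i dx_i/dt = −γ_i(x_i − a_i) + Σ_j w_ij(x_j − a_j) + P_i and considers a gene *i* that is not directly perturbed (P_i = 0). Both points are assumptions made for illustration. Under them, the divisor that leaves a coefficient of −1 on Δ_i is γ_i − w_ii.

```latex
\begin{aligned}
0 &= -\gamma_i \Delta_i + \sum_{j=1}^{N} w_{ij}\,\Delta_j ,
     \qquad \Delta_j = x_j - a_j \\
  &= -(\gamma_i - w_{ii})\,\Delta_i + \sum_{j \neq i} w_{ij}\,\Delta_j \\
\Rightarrow\quad
0 &= -\Delta_i + \sum_{j \neq i} w'_{ij}\,\Delta_j ,
     \qquad w'_{ij} = \frac{w_{ij}}{\gamma_i - w_{ii}} .
\end{aligned}
```

The division presumes γ_i − w_ii ≠ 0. The time scales τ_i drop out entirely, which is why steady-state data alone constrain only the rescaled couplings.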

### The Reverse Engineering Algorithm.

The underlying idea of our algorithm is to rationally select genes to perturb to maximize the amount of information. Without any prior knowledge, we make a random choice in the first perturbation (*cf.* Fig. 1). Next, we iteratively perturb genes whose activity has changed the least. Then we perturb, without repetition, the genes with connections that are most uncertain. We introduce an error term ε to quantitate the uncertainty. In practical terms, this means that if |*x*_{i}| < ε or |*x*_{i} − *x*_{j}| < ε in a given experiment, then we can neither distinguish *x*_{i} from noise nor differentiate between whether *x*_{i} or *x*_{j} connects to a given target gene *x*_{k}. The sources of the error include biological noise and measurement variability. We summarize our iterative procedure as follows.

- Step 1: Initialization. Randomly select a gene to perturb in the first experiment and measure the response of all genes.
- Step 2: Selection. Select, without repetition, the genes with the smallest change in expression (|Δ_{i}| = |*x*_{i} − *a*_{i}| < ε) resulting from the previous perturbation experiments. Repeat until each gene satisfies |Δ_{i}| > ε in at least one perturbation experiment. The number of perturbations, including the initialization step, is *r*.
- Step 3: Refined selection. Here we give a rational selection procedure for which genes to perturb in additional experiments to obtain a sufficient amount of data to identify the connectivity matrix.
- Step 3.1: Construction of consistent solutions. For every gene (*i* = 1, . . . , *N*), construct the *q*_{i} input solutions (vectors) to Eq. 2 that are consistent with the previous *r* experiments. This produces a *q*_{i} × *N* solution matrix (*M*_{i}) for every gene *x*_{i}. Note that the number of consistent solutions generally differs between the genes, *q*_{i} ≠ *q*_{j}.
- Step 3.2: Construction of *N* different ranked gene lists. For a given gene *i* and a matrix *M*_{i} we calculate the variance across the different possible inputs, i.e., Var[(*M*_{i}(:, *j*))] for every input gene *j* (matlab notation). This list of *N* variances, corresponding to possible inputs for a gene *x*_{i}, is sorted. The highest rank corresponds to the largest variation. This calculation is performed for all genes, thus producing *N* different lists, each consisting of *N* ranked elements.
- Step 3.3: Construction of a single ranked gene list. Using the *N* ranked lists, every gene has been ranked *N* times. The rankings for every gene in all lists are summed. The resulting single list is ordered so that the first gene corresponds to the input gene with the largest summed ranking across all *N* lists.
- Step 3.4: Perturb the gene(s) with the highest ranking in the list constructed in Step 3.3. From the results of experiment *r* + 1, filter out the inconsistent solutions in *M*_{i} for all *i*. Repeat Steps 3.2 and 3.3 and perform an additional perturbation experiment.
- Step 4: Convergence check and weight matrix reconstruction. Inspect the remaining matrices *M*_{i} to check whether the average {Mean[(*M*_{i}(:, *j*))]} is sufficiently large, to determine whether any *w*_{ij} differs significantly from zero, which would indicate the presence of an interaction. For any *w*_{ij} that cannot be resolved, repeat Step 3. The connectivity matrix *W* is thereby reconstructed.
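Steps 3.1 to 3.3 above can be sketched for a toy network as follows. This is an illustrative reading of the procedure, not the authors' implementation. The brute-force enumeration over a weight grid, the function names, the tie-breaking by gene index, the rule that a gene's own row is not tested in experiments where that gene is perturbed, and the noise-free three-gene data are all assumptions.

```python
from itertools import combinations, product
from statistics import pvariance


def consistent_rows(i, experiments, n, k_max, grid, eps):
    """Step 3.1 (sketch): enumerate sparse input rows for gene i that satisfy
    sum_j w'_ij * delta_j = delta_i within eps in every usable experiment."""
    others = [j for j in range(n) if j != i]
    rows = []
    for k in range(k_max + 1):                      # at most k_max inputs
        for inputs in combinations(others, k):
            for weights in product(grid, repeat=k):
                row = [0.0] * n
                for j, w in zip(inputs, weights):
                    row[j] = w
                ok = all(
                    abs(sum(row[j] * d[j] for j in others) - d[i]) < eps
                    for perturbed, d in experiments
                    if i not in perturbed           # skip if gene i was driven
                )
                if ok:
                    rows.append(row)
    return rows


def column_variances(rows, n):
    """Step 3.2 (sketch): variance of each candidate input across solutions."""
    return [pvariance([r[j] for r in rows]) if rows else 0.0 for j in range(n)]


def next_gene_to_perturb(all_rows, n, already_perturbed):
    """Step 3.3 (sketch): sum per-gene ranks over all N lists, pick the top."""
    score = [0] * n
    for rows in all_rows:
        var = column_variances(rows, n)
        order = sorted(range(n), key=lambda j: var[j])   # ascending variance
        for rank, j in enumerate(order):                 # larger rank = more uncertain
            score[j] += rank
    candidates = [j for j in range(n) if j not in already_perturbed]
    return max(candidates, key=lambda j: score[j])


n, k_max, eps = 3, 2, 1e-9
grid = [-1.0, -0.5, 0.5, 1.0]                 # hypothetical weight grid (s = 4)
exp1 = ({0}, [1.0, 0.5, -0.5])                # gene 0 perturbed, observed deltas
exp2 = ({2}, [0.0, 0.0, 1.0])                 # gene 2 perturbed, observed deltas

rows_after_1 = [consistent_rows(i, [exp1], n, k_max, grid, eps) for i in range(n)]
choice = next_gene_to_perturb(rows_after_1, n, already_perturbed={0})
gene1_after_2 = consistent_rows(1, [exp1, exp2], n, k_max, grid, eps)
print(len(rows_after_1[1]), choice, gene1_after_2)
```

In this toy case, one experiment leaves three candidate input rows for gene 1, and the second experiment removes the two rows that rely on an input from gene 2. The exhaustive enumeration is only practical for very small *N* and *k*_{max}.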

## *In Numero* Experiments and Results

In this section, we illustrate the validity of our reverse engineering algorithm with two *in numero* experiments. We demonstrate that we can identify the underlying architecture of the data-generating models by selectively and iteratively perturbing the gene network around the steady state. In the first *in numero* experiment, we consider a linear model that is equivalent to the mapping model. Because the mapping is exact in this case, we are able to draw conclusions that are independent of errors induced from the mapping. In the second *in numero* experiment, we consider a previously reported nonlinear data-generating model describing the *Drosophila* segmentation network (24). In this case, we are able to demonstrate that the use of a linear mapping leads to the correct deduction of the connectivity of an underlying nonlinear model.

### Example 1: A Linear Gene Network.

To illustrate our method, we constructed a random network *W* with *n* = 40 and *k*_{max} = 3. We arbitrarily chose 10 genes to have three input connections, 20 genes to have two input connections, and 10 genes to have one input connection. We then randomly assigned these connections. In this hypothetical 40-gene network with *k*_{max} = 3, there are *N*(*N*!/(*k*_{max}!(*N* − *k*_{max})!))2^{*k*_{max}} ≈ 10^{5} possible sets of inputs to a given gene. The number of possible matrices *W* is therefore on the order of 10^{200}. The goal of our reverse engineering algorithm is to reduce the number of connectivity matrices to one, and below we show how the expression data from a rationally chosen sequence of perturbations can be used to infer a unique *W*.
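The orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below counts input sets per gene as the binomial coefficient times 2^{k_max}, which is an assumed reading. The printed expression also carries a leading factor *N*, omitted here because the binomial term times 2^{k_max} alone already gives roughly 10^{5}. The number of matrices is taken as the per-gene count raised to the power *N*, also an assumption.

```python
from math import comb, log10

N, k_max = 40, 3

# Ways to choose k_max inputs out of N genes, times 2 sign choices per input.
per_gene = comb(N, k_max) * 2 ** k_max        # 9880 * 8 = 79040

# Exponent (base 10) of per_gene ** N, i.e. independent choices for N genes.
log10_matrices = N * log10(per_gene)

print(per_gene, round(log10(per_gene), 1), round(log10_matrices))
```

The per-gene count is a little under 10^{5}, and the exponent for the full matrix comes out close to 200, in line with the figures quoted in the text.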

As an example we focus here on identifying the inputs to a specific gene (17) in the network. We arbitrarily chose to perturb gene 1. Then we examined the changes in gene expression, Δ_{i} (Fig. 2 *Top*), and the variation within the set of possible solutions (Var[(*M*_{17}(:, *j*))]; Fig. 2 *Middle*). The induced changes in gene expression vary from small to large (Fig. 2 *Top*). We kept track of the possible inputs that are *consistent* with the expression data generated by the first perturbation. Intuitively, we expect a small change in expression level for a given gene (*j*) to provide poor constraints as to whether gene *j* influences another gene (*i*). This expected relationship between the variation of the proposed inputs from a given gene *j* to gene *i* and the magnitude of the induced expression change is indeed confirmed in the numerical experiments (Fig. 2 *Bottom*). This observation is the basis for selecting for subsequent perturbation the gene with the smallest expression change and maximal variation in the consistent inputs. To determine *W*, we perform this calculation for all genes in each step.

Fig. 2. (*Top*) The change in gene expression (Δ_{i} = *x*_{i} − *a*_{i}) for all genes (*x* axis, *n* = 40) when *x*_{1} is perturbed. (*Middle*) For every gene, there are several different possible inputs (solutions) from other genes that are consistent with the expression **...**

We reverse engineered several randomly generated 40-gene networks (*k*_{max} = 3). In Fig. 3, the number of consistent possible solutions averaged over all genes with *k*_{max} = 3 is plotted against the number of perturbation experiments. For comparison, we studied the determination of *W* when all single-gene perturbations are selected randomly without repetition (circles in Fig. 3). Here, 10 perturbation experiments are needed to identify the connectivity for all genes in the network. Selecting genes more judiciously, as prescribed by our scheme, is more efficient. It leads to the correct network architecture with only seven perturbations (triangles in Fig. 3). Because *W* is sparse by definition, the perturbation of a single gene often results in little change in expression activity across the network. Therefore, efficiency is gained by introducing multiple perturbations for different genes in each experiment (squares in Fig. 3).


The number of perturbations that are required to infer the network structure depends not only on the network complexity as determined by *N* and *k*_{max}, but also on possible sources of error, ε, including (*i*) experimental resolution, (*ii*) mapping errors such as those stemming from nonlinearities, and (*iii*) finite model resolution (i.e., the grid of *w*_{ij}). Here we examine how the critical number of experiments *E*_{C} depends on the ratio between the error ε and Δ, where Δ is the absolute change in gene expression |*x*_{i} − *a*_{i}| averaged over all genes and all experiments (Fig. 4*a*). Because both *N* and *k*_{max} determine the combinatorial complexity of a network, we consider the *ratio* between *E*_{C} and log(*N*_{CPS}), where *N*_{CPS} is the number of consistent possible solutions.

Fig. 4. (*a*) Increasing the error ε increases the number of critical experiments (*E*_{C}). The experimental resolution is reduced (increased ε) **...**

For reference, we have plotted the three cases from Fig. 3 in Fig. 4*a* (boxed). Clearly, perturbing multiple genes per experiment (squares) reduces the number of possible solutions more efficiently than perturbing only one gene per experiment (circles). In addition, increasing the error ε for the same set of expression data (i.e., Δ is unchanged for the different cases) necessitates more experiments to resolve the network in all cases (Fig. 4*a*). The selection algorithm produces a slope of ½, whereas there is a faster growth in *E*_{C}/log(*N*_{CPS}) when a random selection scheme is used (triangles). We therefore expect our gene selection algorithm to be particularly useful when the ratio between ε and Δ is large.

We have also examined how the ratio between *E*_{C} and log(*N*_{CPS}) depends on the number of genes *N* (Fig. 4*b*). Using multiple perturbations, we find that the critical number of experiments, *E*_{C}, is well approximated by log(*N*_{CPS}) for different *N*. This finding allows us to express *E*_{C} in terms of network parameters (Eq. **3**), where *s* is the number of different possible values for *w*_{ij} and the parameters *m* and *p* are constants found by fitting. Here *m* ≈ 0.75 and *p* ≈ 0.5 in the case of four algorithmically selected perturbations per experiment (Fig. 4*a*), whereas *p* ≈ 1 and *m* ≈ 0.75 for randomly selected perturbations. In the *N* ≫ *k*_{max} limit, *E*_{C} scales as log *N* (Fig. 4*c*). Fig. 4*c* also shows how *E*_{C} depends on both *k*_{max} and ε/Δ. Note that by expanding $\log\binom{N}{k_{\max}}$ to leading order, we can use ½ *k*_{max}[1 + ε/Δ] log[*N*] to estimate *E*_{C}.
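The leading-order estimate ½ *k*_{max}[1 + ε/Δ] log[*N*] is easy to evaluate. The sketch below is only meant to show the scaling. The base of the logarithm is not stated in the text, so the natural logarithm used here is an assumption, and the function name and sample values are hypothetical. Absolute numbers should therefore not be compared with the experiment counts reported in the paper.

```python
from math import log


def estimate_critical_experiments(n_genes, k_max, err_over_delta):
    """Leading-order estimate 0.5 * k_max * (1 + eps/Delta) * log(N).

    Natural log is assumed; a different base only rescales the result.
    """
    return 0.5 * k_max * (1.0 + err_over_delta) * log(n_genes)


# Doubling the network size adds a constant number of experiments ...
small = estimate_critical_experiments(1000, k_max=5, err_over_delta=1.0)
large = estimate_critical_experiments(2000, k_max=5, err_over_delta=1.0)
print(round(large - small, 2))

# ... while denser wiring or noisier data scale the estimate linearly.
denser = estimate_critical_experiments(2000, k_max=15, err_over_delta=1.0)
noisier = estimate_critical_experiments(2000, k_max=5, err_over_delta=10.0)
print(round(denser / large, 2), round(noisier / large, 2))
```

Whatever the base, the estimate grows logarithmically in *N* and linearly in both *k*_{max} and 1 + ε/Δ, so it stays far below *N* for large sparse networks.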

The dependence of *E*_{C} on *s*, the number of different values for *w*_{ij}, is relatively weak (Fig. 4*d*), because *s* is inside the logarithm (Eq. **3**). Hence, an enhanced resolution in *w*_{ij}, with the grid size Δ*w* = 2/(*s* + 1), would not increase the critical number of experiments significantly. To reverse engineer a network similar to the protein network in yeast (22), Fig. 4*d* illustrates that our method identifies ≈95% of the connections by using 25–100 experiments (ε/Δ = 1–10, *k*_{max} = 5, *s* = 10, *n* = 2,000), whereas 50–300 experiments (*k*_{max} = 15) are required to recover all connections. As a rule, we have *E*_{C} ≪ *N*, even though it takes more experiments to resolve denser networks.

### Example 2: A Nonlinear Gene–Protein Network.

Although the linear data-generating model of the previous example provided a systematic benchmark for our scheme, it is likely that naturally occurring gene regulatory networks contain significant nonlinearities (21). In this *in numero* experiment, we explore the utilization of our scheme in the context of a previously reported data-generating model describing the segmentation polarity network in *Drosophila* (24). Even though this model contains strong nonlinearities and has several protein–protein interactions that we assume we cannot measure directly, we find that we can recover the effective gene–gene interactions by using a linear model.

The *Drosophila* model is governed by 10 nonlinear equations describing the time evolution of both genes and proteins (see *Supporting Text*, which is published as supporting information on the PNAS web site, www.pnas.org). The form of the equations is given by the evolution of mRNA for gene *x*, *dx*/*dt* = *y*_{1}*P*^{v1}/(κ_{1}^{v1} + *y*_{1}*P*^{v1}), where *P* = 1 − *y*_{2}^{v2}/(κ_{2}^{v2} + *y*_{2}^{v2}). The *y*_{1} protein activates gene *x* and the *y*_{2} protein acts as a repressor. The half-maximal coefficients for activation and repression are κ_{1} and κ_{2}, respectively, and *v*_{1} and *v*_{2} are the Hill coefficients. The full model (Fig. 5*a*) has several protein–protein interactions, such as those between PTC, CI, and CN. However, when we reverse engineer the segmentation network, we monitor and consider only the changes in mRNA expression, attempting to recover the effective gene-to-gene interactions (Fig. 5*b*), in the absence of information concerning the protein dynamics.

Fig. 5. (*a*) Wiring diagram of the *Drosophila* segmentation cell model. Genes are shown as circles and proteins as squares. (*b*) The effective gene-to-gene wiring diagram. A link here indicates that there exists at least one protein pathway connecting two genes. **...**

Perturbing the network as outlined in the linear section was sufficient to identify the dominant connections. Our reverse engineering algorithm found that the interaction from the *ptc* gene to itself was significantly different from zero, indicating a strong effective negative coupling. The algorithm also suggested a weak positive connection from *ci* to *ptc* (dashed line in Fig. 6), whereas all other connections to *ptc* were proposed to be zero. This compares well with the original segmentation network. As can be discerned from Fig. 5, there are two pathways through which the *ptc* gene represses its own expression. The first pathway involves the *ptc* gene activating the protein PTC, which stimulates the production of protein CN, which then represses the expression of *ptc*. In the other net negative pathway, the PTC protein also represses the production of the CID protein and the CID protein enhances *ptc* expression. The negative *ptc* self-interaction detected by our algorithm thus corresponds to the combined inhibitory effect of these two pathways. The above procedure was repeated for each of the genes in the *Drosophila* model. The resulting network, as found by our reverse engineering scheme (Fig. 6), is an accurate reconstruction of the effective gene-to-gene interactions in the web of gene–protein pathways (Fig. 5*b*). All statistically significant connections (solid lines in Fig. 6) detected by our algorithm are correct as they have the same sign as the effective gene-to-gene connections displayed in Fig. 5*b*.

Fig. 6. *Drosophila* segmentation cell model. Connections that are statistically significant are shown as solid lines; suggested connections that are not statistically significant are indicated with dashed lines. Stars indicate that **...**

As an independent test of our scheme, we numerically computed the Jacobian of the *Drosophila* segmentation model. The gene-to-gene connections thus found are indicated as asterisks in Fig. 6. All of these connections are identified by the reverse engineering procedure (Fig. 6). Interestingly, the sign in the original pathway diagram (Fig. 5 *a* and *b*) is not a reliable predictor of the net strength of an intermediate reaction cascade. Indeed, our scheme reveals that the *wg* self-interaction and the *ptc*-to-*hh* projections are functionally weak. Of the three inputs to *hh*, the repression from the *ptc* gene had the smallest Jacobian term, which is in agreement with what we found from our reverse engineering scheme. Our algorithm also finds the net effect of parallel pathways in the complete gene–protein network. For example, the *ci*-to-*wg* interaction is determined to be a net positive weak connection. Adding the Jacobian terms from the two protein pathways confirms that the sum is positive but small (asterisks in Fig. 6). The *en* self-coupling is the single connection that does not have a corresponding Jacobian term. Finally, we note that if we can perturb and monitor the activity of any gene or protein, then we can reconstruct the full network corresponding to Fig. 5*a* (data not shown). The problem is then equivalent to reconstructing a gene network without proteins.
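A numerical Jacobian of the kind used for this cross-check can be obtained by central finite differences. The sketch below is generic. The two-variable system `toy_rhs` with one Hill-type repression term is a hypothetical stand-in, not the segmentation polarity model, and the step size `h` is an assumed default.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Central-difference estimate of J[i][j] = d f_i / d x_j at the point x."""
    n = len(x)
    m = len(f(x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return J


def toy_rhs(x):
    """Hypothetical system: x[1] represses x[0] (Hill coefficient 2),
    x[0] activates x[1] linearly, both decay at unit rate."""
    return [1.0 / (1.0 + x[1] ** 2) - x[0],
            x[0] - x[1]]


J = numerical_jacobian(toy_rhs, [0.5, 1.0])
print([[round(v, 4) for v in row] for row in J])
```

The sign of each off-diagonal entry indicates activation or repression and its magnitude indicates the local strength of the coupling, which is the information being compared with the inferred connections.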

## Discussion

We have developed an iterative reverse engineering approach suitable for reconstructing gene regulatory networks. By using a minimal linear model and by selectively and iteratively perturbing genes in a gene network, we can recover the network topology with a small number of experiments. We have calibrated our mapping scheme through a series of *in numero* experiments, including tests on a nonlinear *Drosophila* gene–protein network.

Our algorithm reverse engineers the network on a row basis, and so its efficiency depends on *k*_{max}, the bound for the number of input edges, but it is independent of the number of output edges. Our algorithm requires *O*(*k*_{max}(ε/Δ)log[*N*]) perturbation experiments to identify a network where each gene has at most *k*_{max} inputs. The *k*_{max}log[*N*] factor is in accordance with an earlier conjecture based on the scaling behavior of Boolean models (16). We also find, as one would intuit, that fewer perturbation experiments are needed when the error is small and the changes in gene expression are large. We observed that perturbations using multiple toggles were more efficient than single-gene perturbations. The signal (Δ) is larger with multiple perturbations because we are less likely to encounter situations where the activity of only a small fraction of genes is altered. This is different from the situation where individual perturbations are large, which, while increasing the signal, has the undesirable, possible consequence of causing significant nonlinear effects. Using multiple genetic perturbations is therefore an efficient method for increasing the average change in gene expression without causing nonlinear effects. Our approach differs from other schemes where only a single gene is perturbed at a time (13, 25). In these earlier studies, knockouts were used to induce changes in gene expression; however, knockouts may induce secondary compensatory changes and therefore change the connectivity matrix *W*. In contrast, perturbing the gene dynamics with synthetic gene networks, such as a toggle switch (6), provides a rapid and controlled means to induce changes in gene expression without changing *W*.

Our reverse engineering algorithm is very efficient in terms of the number of required experiments for a given network. However, this does not necessarily imply computational efficiency. Indeed, it is inefficient to search through all possible solutions for large values of *k*_{max} and *N* because such a scheme requires *O*(*N*^{*k*_{max}+1}) computations in the initialization step. To relieve this computational inefficiency, our selection algorithm can be readily combined with other computational methods to search for possible solutions. Because we use small changes in gene expression as an initial selection criterion (Step 2 of our algorithm), we can perform several initial perturbation experiments without inspecting *all* consistent possible solutions, thereby reducing *N*_{CPS} in the initialization step. Many methods, such as singular value decomposition (26) and dynamic programming techniques (27), can then be used to construct an initial set of possible solutions from such a data set (Step 3.1 of our algorithm). The number of remaining solutions can be further reduced by using carefully selected perturbations (Step 4 of our algorithm). We can therefore flexibly adjust the computational efficiency, depending on the complexity of the network and the experimental resources available.

The utility of our reverse engineering algorithm depends on how well controlled gene expression technologies can be applied and the quality of gene expression measurements. The magnitude of the perturbation is bounded by two constraints. On the one hand, if the perturbation is too large, the gene network will be pushed outside of its linear response regime and this will weaken the algorithm's assumption of linearity. As a result, the error in the predicted network will increase. On the other hand, if the perturbation is too small, the response of the gene network will be masked by instrument and biological noise. The exact size of the acceptable perturbation must be determined experimentally. Although it does not matter whether the expression of the perturbed gene is increased or decreased as a result of the perturbation, regulated overexpression technologies are preferable because of their simpler implementation and more reliable performance. In addition to the genetic toggle switch, there are multiple prokaryotic and eukaryotic expression systems available that provide the negligible baseline and controllable expression required, including the Pbad system (28, 29), the ARGENT system (30, 31), and the Pip system (32). Quantitative, real-time PCR provides the precision required for reliable measurement of fine changes in gene expression resulting from small perturbations. Real-time PCR also provides sufficient throughput to efficiently measure the response of 100 or more genes.

With maps of gene regulatory networks in our hands, we can ask new questions about diseases in terms of genetic circuits and attempt to manipulate the functional outputs of gene networks. Moreover, we can efficiently identify and validate specific drug targets and deconvolve the effects of chemical compounds. This could lead to the development of novel classes of drugs that are based on a network approach to cellular dynamics.

## Acknowledgments

We thank Drs. Jens Lagergren and Tim Gardner for valuable input and discussions. Research was supported by the Defense Advanced Research Projects Agency (Grant F30602-01-2-0579), the National Science Foundation Bio-QuBIC Program (Grant EIA-0130331), and the Fetzer Institute. J.T. also thanks the Wennergren Foundation, the Swedish Research Council (VR), and the Royal Academy of Science for support.

## Footnotes

This paper was submitted directly (Track II) to the PNAS office.



Proceedings of the National Academy of Sciences of the United States of America. May 13, 2003; 100(10):5944.
