Proc Natl Acad Sci U S A. Apr 30, 2002; 99(9): 6163–6168.
PMCID: PMC122920
Genetics

Reverse engineering gene networks using singular value decomposition and robust regression

Abstract

We propose a scheme to reverse-engineer gene networks on a genome-wide scale using a relatively small amount of gene expression data from microarray experiments. Our method is based on the empirical observation that such networks are typically large and sparse. It uses singular value decomposition to construct a family of candidate solutions and then uses robust regression to identify the solution with the smallest number of connections as the most likely solution. Our algorithm has O(log N) sampling complexity and O(N^4) computational complexity. We test and validate our approach in a series of in numero experiments on model gene networks.

With recent advances in cDNA and oligonucleotide microarray technologies (1), it has become possible to measure mRNA expression levels on a genome-wide scale. Data thus collected provide valuable descriptions of gene activities under various biochemical (2) and physiological (3) circumstances and allow one to reverse-engineer the gene networks, i.e., to infer the underlying network structures from experimental measurements. However, naturally occurring gene regulatory networks are embedded in genomes that typically consist of thousands of genes. To extract the topology of such networks and hence isolate the functional subnetworks represents a computationally daunting task; it also requires a very large amount of experimental data, which are expensive to obtain.

To circumvent this problem of data deficiency, many current research efforts have focused on clustering, i.e., grouping genes into hierarchical functional units based on correlations in expression patterns (3–8). This hierarchical approach has been fruitful in identifying coregulated genes in certain functional units (3–6). It has also been generalized to self-organizing maps (7) and supervised learning schemes (8) to cope with the sensitivity to noise and other deficiencies intrinsic to hierarchical clustering (9), at the cost of increased computation. However, a fundamental shortcoming of such clustering schemes is that they are based on the assumptions that: (i) gene regulatory networks are hierarchical in structure (3–6), and (ii) genes performing related biological functions exhibit similar expression patterns (and vice versa). These assumptions may not always be valid. At a structural level, there are data suggesting that gene regulatory networks are not strictly hierarchical in nature; rather, they are interwoven like a web (10), as in the cases of metabolic (11) and protein networks (12), with multiple pathways for similar functions to provide redundancy to protect against mutations and other deleterious effects (13). At a dynamical level, mRNA and protein expression levels for certain genes may not be correlated (14), suggesting a similar lack of strict correlation between gene expression and function. Therefore, although clustering is useful on a local scale to identify isolated coexpressing units, it is not suitable for large-scale reverse engineering.

Recently, there have been attempts to reconstruct models for gene regulatory networks on a global, genome-wide scale using ideas from system identification (15), such as genetic algorithms (16), neural networks (17), and Bayesian models (18). Although useful in specific contexts, these approaches are of restricted scope, as they typically require a large amount of data and computation to generate connectivity maps for large networks, such as those of genomic scales. To overcome these problems of data shortage and computational inefficiency, several researchers (19–22) have adopted a linear model and have used singular value decomposition (SVD) (23) to reverse-engineer the network architecture. As we will explain in greater detail below, although SVD provides a useful and condensed description of the data, it alone may not correctly identify the connectivity matrix and therefore may not accurately predict the behavior of the gene networks in response to novel stimuli. The method of SVD has to be supplemented by extra conditions, based on biological knowledge, to recover the network topology correctly.

Here we propose such a scheme to identify the entire network structure in gene regulatory networks on a genome-wide scale. Our approach is built on previous work using SVD (19–22) and is based on an insight provided by earlier studies on gene regulatory networks (24, 25) and bioinformatics databases (11, 12), namely, that gene networks in most biological systems are sparse. Thus, we first use SVD to construct a family of candidate networks, all consistent with the experimental data, and then use robust regression (26) to identify the sparsest network in this family as the most likely solution. As such, our scheme has O(log N) sampling complexity and O(N^4) computational complexity. Much in the spirit of systems biology (27), our goal is to extract the gene regulatory networks on a global scale and to do so efficiently, in order to identify individual subnetworks in a first draft of the topology of the entire network, on which further, more local, analysis can be based.

Method

The method we propose to reverse-engineer gene networks consists of two steps. We first use SVD to construct a set of feasible solutions that are consistent with the measured data and then we use robust regression to select the sparsest one as the solution.

For simplicity, we will consider only systems that are operating near a steady state, so that the dynamics can be approximated by a linear system of ordinary differential equations:

$$\dot{x}_i(t) = -\lambda_i x_i(t) + \sum_{j=1}^{N} W_{ij}\, x_j(t) + b_i(t) + \xi_i(t), \qquad i = 1, \ldots, N. \tag{1}$$

Here the x_i are the concentrations of the mRNAs that reflect the expression levels of the genes, the λ_i are the self-degradation rates, the b_i are the external stimuli, and the ξ_i represent noise. The matrix elements W_ij, which are real numbers, describe the type and strength of the influence of the jth gene on the ith gene, with a positive sign indicating activation, a negative sign indicating repression, and a zero indicating no interaction.

In an experiment, we can apply a prescribed stimulus (b_1, b_2, … , b_N)^T and use a microarray to measure simultaneously the concentrations of all of the N different mRNAs, i.e., (x_1, x_2, … , x_N)^T. Repeating this procedure M times, we get M measurements and can tabulate the results as

$$X = \left(x_i^j\right)_{N \times M}, \qquad \dot{X} = \left(\dot{x}_i^j\right)_{N \times M}, \qquad B = \left(b_i^j\right)_{N \times M}.$$

Here the subscript i indexes individual genes, and the superscript j denotes experiment number. That is, x_i^j is the concentration of the ith mRNA for the jth trial, with similar notations for Ẋ and B. Then we can rewrite Eq. 1 as

$$\dot{X} = A X + B,$$

where we have neglected noise and absorbed the self-degradation rates λ_i into the coupling constants W_ij to simplify notation. That is, A_ij := W_ij − δ_ij λ_i.

The goal of reverse engineering is to use the measured data B, X, and Ẋ to deduce A and hence the connectivity matrix W. In this context, we may take the transpose of the system and rewrite it as

$$X^T A^T = \dot{X}^T - B^T$$

to emphasize the fact that A is the unknown. If M = N and X has full rank, we can simply invert the matrix X to find A. However, typically M ≪ N because of the high cost of perturbations and measurements. We therefore have an underdetermined problem. One way to get around this is to use SVD (23) to decompose X^T into

$$X^T = U\, W\, V^T,$$

where U and V are each orthogonal:

$$U^T U = V^T V = I,$$

with I being the identity matrix, and W is diagonal:

$$W = \operatorname{diag}(w_1, w_2, \ldots, w_N).$$

Without loss of generality, we may assume that all nonzero elements of w_k are listed at the end, i.e., w_1 = w_2 = ⋯ = w_L = 0 and w_{L+1}, w_{L+2}, … , w_N ≠ 0, where L := dim(ker(X^T)). Then one particular solution for A is:

$$A_0 = \left(\dot{X} - B\right) U\, \operatorname{diag}\!\left(1/w_j\right) V^T, \tag{2}$$

with 1/w_j taken to be zero if w_j = 0, whereas the general solution is given by the affine space

$$A = A_0 + C\, V^T, \tag{3}$$

with C = (c_ij)_{N×N}, where c_ij is zero if j > L and is otherwise an arbitrary scalar coefficient. This family of solutions in Eq. 3 represents all the possible networks that are consistent with the microarray data. Among these solutions, the particular solution A_0 is the one with the smallest L2 norm.
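To make this construction concrete, the following sketch (in Python with NumPy; the function and variable names are ours, not the authors') computes the particular solution A_0 of Eq. 2 and a basis for ker(X^T), which together parameterize the family in Eq. 3:

```python
import numpy as np

def svd_candidates(X, Xdot, B, tol=1e-10):
    """Minimum-norm solution A0 (Eq. 2) and a basis for ker(X^T) spanning the family in Eq. 3.

    X, Xdot, B are N x M arrays (genes x experiments); names are illustrative.
    """
    # SVD of X^T = U diag(w) V^T.  NumPy returns V^T directly and sorts the
    # singular values in descending order, so the null-space directions sit at
    # the end here (the text's convention lists the zero w's first).
    U, w, Vt = np.linalg.svd(X.T, full_matrices=True)
    rank = int(np.sum(w > tol * w[0]))
    # Particular (smallest L2-norm) solution of A X = Xdot - B, cf. Eq. 2.
    A0 = (Xdot - B) @ np.linalg.pinv(X, rcond=tol)
    # Columns of V spanning ker(X^T); there are L = N - rank >= N - M of them.
    V_null = Vt[rank:].T
    return A0, V_null
```

Any matrix of the form A_0 + C V_null^T, with C an arbitrary N × L coefficient array, reproduces the data equally well; the step described below selects the sparsest member of this family.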

This idea of using SVD to reverse-engineer gene networks with a limited amount of data is not new (19–22). Nevertheless, these earlier efforts stopped at Eq. 2 and took A_0 as the solution, a choice that may not always recover the connectivity matrix correctly, as we shall see in Example 1.

Because SVD leads to nonunique solutions, we need additional constraints to isolate the true solution from the entire family in Eq. 3. Many choices are possible, and the particular choice depends on our knowledge of the biological system. For example, if we know a priori that certain genes are functionally related, we may impose this as a constraint to sieve through the family of solutions given by Eq. 3. Here we adopt the viewpoint that we have no prior knowledge of the network. In such cases, we may rely on insights provided by earlier works on gene regulatory networks (24, 25) and bioinformatics databases (11, 12), which suggest that naturally occurring gene networks are sparse, i.e., generally each gene interacts with only a small percentage of all the genes in the entire genome. It is in this global sense, on a genome-wide scale, that the entire network, which encompasses all the individual gene regulatory networks, is sparse; and it is this feature that we will exploit to resolve the ambiguity introduced by SVD.

Because the family in Eq. 3 represents all connectivity matrices that are consistent with the measurement data, we may concentrate on this set to look for the true connectivity matrix. Imposing sparseness on the family of solutions given by Eq. 3 means that we need to choose the coefficients c_ij to maximize the number of zero entries in A. This is a nontrivial problem, because we do not know in advance which entries are nonzero. As we shall see in Example 1, the solution A_0 as found by SVD alone may not be close to the true solution. We therefore cannot assume that a small entry in A_0 corresponds to a zero entry in the true solution (28). It is necessary to explore the family of solutions in Eq. 3. Nonetheless, a brute-force approach, enumerating all possibilities to see which choice will lead to a self-consistent solution (29), is computationally costly as it will take O(N!/(k!(N − k)!)) operations to solve a system of N equations with k nonzero entries whose locations are unknown. A more efficient method is needed for large N.

Our idea is to consider the dual problem, in which we proceed as if all entries could be zero, namely, setting A = 0 in Eq. 3 to obtain

$$A_0 + C\, V^T = 0,$$

which is an overdetermined problem, as there are only NL variables (c_ij for j ≤ L), whereas there are N^2 equations. In this context, the solution A = A_0, given by Eq. 2, is the closest solution in the L2 sense. However, that is not what we want. Instead, we want to satisfy as many equations as possible. Viewed as such, our task is equivalent to the problem of finding the exact-fit plane (30) in robust statistics, where we try to fit a hyperplane to a set of points containing a few outliers, with the objective being to pass through as many points as possible. This is a well-studied problem, and many methods have been developed to do this, each with its merits and shortcomings (26). Here we have chosen L1 regression (31), where the figure of merit is the minimization of the sum of the absolute values of the errors, for its efficiency. Among the many numerical algorithms for L1 regression, we have adopted the simplex method in refs. 32 and 33 for its simplicity. This algorithm takes approximately O(pn^2) computations to solve a regression problem with n data points and p parameters (31). Here we have n = N and p = L ≥ N − M for each row in Eq. 3. This relation implies that more experimental data points can relieve the burden on computation, whereas a small set of experimental data points will demand more computations to resolve the uncertainty. Because experiments are much more expensive than computations, it is desirable to reduce M even though that will increase p. As we shall demonstrate in the in numero experiments below, it is possible to reverse-engineer networks with M = O(log N) experimental measurements. In such cases, we have M ≪ N and so p = O(N). It therefore takes O(N^3) computations to reconstruct one row of the connectivity matrix, and the recovery of the entire matrix is an O(N^4) process.
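The row-wise sparsification step can be sketched as follows; here the L1 (exact-fit) regression is recast as a linear program and handed to a generic solver, a stand-in for the simplex algorithm of refs. 32 and 33, and the zeroing threshold at the end is an arbitrary choice of ours:

```python
import numpy as np
from scipy.optimize import linprog

def sparsest_row(a0_row, V_null):
    """Pick a member of {a0_row + V_null @ c} with (approximately) the most zero entries
    by minimizing its L1 norm: an L1 regression with L parameters and N data points."""
    N, L = a0_row.size, V_null.shape[1]
    if L == 0:
        return a0_row.copy()
    # Variables: [c (L), t (N)]; minimize sum(t) subject to -t <= a0_row + V_null @ c <= t.
    cost = np.concatenate([np.zeros(L), np.ones(N)])
    A_ub = np.block([[ V_null, -np.eye(N)],
                     [-V_null, -np.eye(N)]])
    b_ub = np.concatenate([-a0_row, a0_row])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * L + [(0, None)] * N, method="highs")
    if not res.success:
        return a0_row.copy()
    row = a0_row + V_null @ res.x[:L]
    row[np.abs(row) < 1e-8] = 0.0   # arbitrary threshold: snap tiny residuals to exact zeros
    return row
```

Applying sparsest_row to each row of A_0 (with V_null from the previous sketch) and stacking the results gives the reconstructed connectivity matrix; because the rows are treated independently, this step parallelizes trivially.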

In Numero Experiments.

In this section, we report on three numerical experiments that we have conducted to test our reverse engineering scheme. In the first experiment, we calibrated the scheme using a large sparse linear system. In the second experiment, we applied the scheme to a nonlinear model for a cascade of repressors. In the third experiment, we tested the scheme with a large sparse nonlinear network consisting of both activators and repressors. As we shall show, in all three cases, we were able to recover the network connectivity using a small number (compared to the network size) of measurements even in the presence of noise.

Example 1: A large sparse linear network.

In this first experiment, we considered a linear system governed by Eq. 1. To generate the connectivity matrix W, we proceeded row by row. For each row, we randomly picked an integer k from a power-law distribution with a cutoff (12) at k_max with k_max ≪ N. We then randomly selected k entries and assigned each of them a nonzero value randomly chosen from a uniform distribution. The condition k_max ≪ N ensures sparseness.
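A minimal sketch of this construction follows; the power-law exponent and the range of the uniform distribution are not specified in the text and are chosen here purely for illustration:

```python
import numpy as np

def random_sparse_W(N, kmax=10, gamma=2.0, rng=None):
    """Sparse random connectivity matrix built row by row: the row degree k is drawn
    from a truncated power law p(k) ~ k**(-gamma), 1 <= k <= kmax << N."""
    rng = np.random.default_rng() if rng is None else rng
    ks = np.arange(1, kmax + 1)
    pk = ks.astype(float) ** (-gamma)
    pk /= pk.sum()
    W = np.zeros((N, N))
    for i in range(N):
        k = rng.choice(ks, p=pk)                         # number of regulators of gene i
        targets = rng.choice(N, size=k, replace=False)   # which genes act on gene i
        W[i, targets] = rng.uniform(-1.0, 1.0, size=k)   # random strengths and signs
    return W
```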

To test the reverse engineering scheme, we perturbed the system 1 with transient random perturbations b_i(t). While the system was relaxing back to the steady state, we took measurements for X and estimated Ẋ by linear interpolation. We repeated this process M times to collect M data points and applied the reverse engineering scheme to attempt to reconstruct the connectivity matrix. The connectivity matrix thus reconstructed, denoted by A_R, was then compared with the true connectivity matrix A_T = W − Λ, entry by entry. Specifically, we measured the error by counting the number of discrepancies:

$$E = \sum_{i,j=1}^{N} e_{ij}, \tag{4}$$

where

$$e_{ij} = \begin{cases} 0, & \left|\,(A_R)_{ij} - (A_T)_{ij}\,\right| < \delta, \\ 1, & \text{otherwise}, \end{cases}$$

with δ being some prescribed small value for error tolerance, which was chosen in accordance with the noise level. We emphasize the importance of considering the difference A_R − A_T directly when calibrating a reverse engineering scheme. Demonstrating that A_R, when substituted back into Eq. 1, reproduces the experimental data closely (19–22) is insufficient, as this amounts to showing that the residual ‖A_R X + B − Ẋ‖ is small. But a small residual does not imply an accurate solution, especially when ill-conditioned equations are involved (34), which is the case here as we are attempting to invert nonsquare matrices by SVD. An A_R that leads to a small residual, although faithfully describing the experimental data, may not correctly predict the network's response to novel stimuli.
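For concreteness, here is a minimal sketch of one possible calibration loop: synthetic data are generated from a known A_T by crude Euler integration with a finite-difference estimate of Ẋ, and the reconstruction is scored with the error count of Eq. 4. The integration scheme, step sizes, noise level, and tolerance are our own placeholder choices, not the settings used in the paper.

```python
import numpy as np

def collect_data(A, M, dt=0.1, n_steps=5, noise=1e-3, rng=None):
    """Perturb the steady state, let x' = A x relax, record X and a crude estimate of Xdot."""
    rng = np.random.default_rng() if rng is None else rng
    N = A.shape[0]
    X, Xdot = np.zeros((N, M)), np.zeros((N, M))
    for m in range(M):
        x = rng.normal(scale=0.1, size=N)          # transient random perturbation
        for _ in range(n_steps):                   # relaxation back toward the steady state
            x_prev, x = x, x + dt * (A @ x) + rng.normal(scale=noise, size=N)
        X[:, m] = x
        Xdot[:, m] = (x - x_prev) / dt             # finite-difference ("linear interpolation") estimate
    return X, Xdot, np.zeros((N, M))               # B = 0 here because the stimulus is transient

def count_errors(A_R, A_T, delta=1e-2):
    """Entry-wise error count E of Eq. 4; delta is the prescribed tolerance."""
    return int(np.sum(np.abs(A_R - A_T) > delta))
```

Combining the sketches, one calibration run amounts to: build A_T from random_sparse_W, generate (X, Ẋ, B) with collect_data, obtain A_0 and V_null with svd_candidates, sparsify row by row with sparsest_row, and score the result with count_errors.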

The results from one such experiment are depicted in Fig. 1. For small M, there are many errors in the reconstruction, but the number of errors E drops rapidly as M increases. A more detailed study, in which we kept track of the e_ij row by row, revealed (data not shown) that our algorithm recovers sparser rows earlier than less sparse rows as the number of measurements M increases. This result suggests the possibility of partially recovering gene networks even with a very small amount of data; we shall not pursue this idea here and shall concentrate on the recovery of entire networks.

Figure 1
Number of errors, E, made by the reverse engineering scheme as a function of M, the number of measurements, for four linear networks of the form 1 with different sizes N.

We have also plotted in Fig. 2 the smallest number of measurements, M_c, that is needed to recover the entire N × N matrix correctly, i.e., with E = 0. It reveals that M_c = O(log N). This data requirement is maximally efficient, as the minimum number of measurements needed to recover the entire matrix is bounded by the information content in the matrix, which is conjectured (17) to be M_c = Ω(k log(N/k)).

Figure 2
Critical number of measurements, M_c, required to recover the entire connectivity matrix correctly, versus N, the size of the network, for linear systems of the form 1. Circles: numerical data. Line: least-squares fit of the form M_c = a + b log N.
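The least-squares fit quoted in the caption is a single polynomial fit in log N; a one-line sketch (the function name is ours):

```python
import numpy as np

def fit_log_scaling(N_values, Mc_values):
    """Fit Mc = a + b*log(N) by least squares and return (a, b)."""
    b, a = np.polyfit(np.log(np.asarray(N_values, float)),
                      np.asarray(Mc_values, float), 1)
    return a, b
```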

For comparison, we attempted to reverse-engineer the network using SVD alone (19–22), without imposing sparseness by robust regression, i.e., by taking A_0 in Eq. 2 as the solution. We found that we could recover the connectivity matrix only when we had a large number of measurements (M ≳ 0.99N) (Fig. 3). For small M, we found that A_R was far from A_T, although the A_R thus found could reproduce the time series data.

Figure 3
Number of errors, E, made by SVD alone (without the imposition of sparseness) as a function of M, the number of measurements, for four linear networks of the form 1 with different sizes N.

Example 2: A repressing cascade.

In this second in numero experiment, we considered a one-dimensional cascade of genes, each of which is induced by an external stimulus and repressed by an immediate neighbor, as illustrated in Fig. 4. This system can be modeled by the following nonlinear system of ordinary differential equations (35, 36):

$$\dot{u}_i(t) = -\lambda_i u_i(t) + \frac{\alpha_i}{1 + u_{i-1}^{\beta_i}(t)} + \xi_i(t), \qquad i = 1, \ldots, N, \tag{5}$$

where u_i is the concentration of the ith mRNA, λ_i is the degradation rate of the ith mRNA, α_i is the synthesis rate of the ith repressor, β_i is the repression cooperativity of the ith repressor, and ξ_i(t) represents noise. Here we adopt the convention that u_0 ≡ 0.

Figure 4
Schematic of a one-dimensional gene network with a cascade structure.

In the absence of noise, Eq. 5 has a unique stable fixed point, given recursively by

$$u_i^* = \frac{\alpha_i}{\lambda_i \left(1 + \left(u_{i-1}^*\right)^{\beta_i}\right)}, \qquad i = 1, \ldots, N, \quad \text{with } u_0^* = 0. \tag{6}$$

Near this steady state, the dynamics is governed by the linearization 1 with x_i = u_i − u_i^*, b_i = 0, and

$$A_{ij} = \begin{cases} -\lambda_i, & j = i, \\[4pt] -\dfrac{\alpha_i \beta_i \left(u_{i-1}^*\right)^{\beta_i - 1}}{\left(1 + \left(u_{i-1}^*\right)^{\beta_i}\right)^{2}}, & j = i - 1, \\[4pt] 0, & \text{otherwise}. \end{cases}$$
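A short sketch of how the fixed point of Eq. 6 and this Jacobian can be evaluated numerically (the function names are ours, and parameters are passed as plain sequences):

```python
import numpy as np

def cascade_steady_state(lam, alpha, beta):
    """Fixed point of the repressing cascade, computed recursively as in Eq. 6 (u*_0 = 0)."""
    N = len(lam)
    u, prev = np.zeros(N), 0.0
    for i in range(N):
        u[i] = alpha[i] / (lam[i] * (1.0 + prev ** beta[i]))
        prev = u[i]
    return u

def cascade_jacobian(u_star, lam, alpha, beta):
    """Linearization of Eq. 5 about u*: -lam_i on the diagonal, one sub-diagonal Hill term per row."""
    N = len(lam)
    A = np.diag(-np.asarray(lam, dtype=float))
    for i in range(1, N):
        v = u_star[i - 1]
        A[i, i - 1] = -alpha[i] * beta[i] * v ** (beta[i] - 1) / (1.0 + v ** beta[i]) ** 2
    return A
```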

As in Example 1, to test the reverse engineering scheme, we repeatedly perturbed the system 5 from the steady state 6 with transient small random perturbations and took measurements while the system was relaxing back to the steady state. We iterated this process M times to collect M data points and applied our reverse engineering scheme to attempt to reconstruct the connectivity matrix. For small M, there are many errors in the reconstruction, but the number of errors E, as defined in Eq. 4, drops rapidly as M increases (Fig. 5). In particular, with the parameter values chosen, we could recover the entire connectivity matrix correctly with M ≥ M_c = 70 ≪ N = 400.

Figure 5
Number of errors, E, made by the reverse engineering scheme as a function of M, the number of measurements, for the repressing cascade 5 with N = 400 genes.

We emphasize the importance of keeping the system close to a steady state for the reverse engineering scheme presented here to be applicable. We have tried applying large perturbations to the system 5 so that the dynamics is far from equilibrium, i.e., u is no longer close to u*, as may occur during development, disease, injury, or certain genetic or biochemical perturbations. In these cases, we could not recover the connectivity matrix even with a large (>N) number of measurements. This failure is not surprising, given that under such conditions the linearization 1 is no longer valid. In such cases, we need a nonlinear model to capture the dynamics and to recover the connectivity topology.

Example 3: A large sparse gene network.

In this third in numero experiment, we considered a random network of genes. Each gene is induced by an external stimulus while also activated and repressed by other genes that are randomly chosen from a power-law distribution with a cutoff k_max ≪ N, as in Example 1. This system is illustrated in Fig. 6. Assuming that different regulatory effects do not interfere with one another, we can model the system by the following nonlinear system of ordinary differential equations (35, 36):

$$\dot{u}_i(t) = -\lambda_i u_i(t) + \sum_{j \in \mathcal{A}_i} \frac{\alpha_i\, u_j^{\gamma_{ij}}(t)}{1 + u_j^{\gamma_{ij}}(t)} + \sum_{k \in \mathcal{R}_i} \frac{\alpha_i}{1 + u_k^{\beta_{ik}}(t)} + \xi_i(t), \qquad i = 1, \ldots, N, \tag{7}$$

where u_i is the concentration of the ith mRNA, λ_i is the degradation rate of the ith mRNA, α_i is the synthesis rate of the ith mRNA, γ_ij is the activation cooperativity of the jth gene on the ith gene, β_ik is the repression cooperativity of the kth gene on the ith gene, and ξ_i(t) represents noise. The sets 𝒜_i and ℛ_i, whose cardinalities obey a power-law distribution, specify the genes that activate or repress, respectively, the ith gene.

Figure 6
Schematic of a nonlinear gene network with a random structure.

Unlike Example 2, here we were unable to find the steady states analytically; we could not even ascertain analytically whether a steady state exists. We therefore resorted to numerics. By following the dynamics directly with numerical integration, we found that a stable fixed point exists for certain choices of parameters. Near a steady state u^*, the dynamics is governed by the linearization 1 with x_i = u_i − u_i^*, b_i = 0, and

$$A_{ij} = -\lambda_i\,\delta_{ij} \;+\; \begin{cases} \dfrac{\alpha_i \gamma_{ij} \left(u_j^*\right)^{\gamma_{ij} - 1}}{\left(1 + \left(u_j^*\right)^{\gamma_{ij}}\right)^{2}}, & j \in \mathcal{A}_i, \\[6pt] -\dfrac{\alpha_i \beta_{ij} \left(u_j^*\right)^{\beta_{ij} - 1}}{\left(1 + \left(u_j^*\right)^{\beta_{ij}}\right)^{2}}, & j \in \mathcal{R}_i, \\[4pt] 0, & \text{otherwise}. \end{cases}$$
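A sketch of this Jacobian, written against the form of Eq. 7 given above (the containers for 𝒜_i, ℛ_i and the cooperativities, and all names, are our own choices):

```python
import numpy as np

def random_net_jacobian(u_star, lam, alpha, activators, repressors, gamma, beta):
    """Linearization of the random network about a steady state u_star.

    activators[i] / repressors[i] list the index sets; gamma[(i, j)] and beta[(i, k)]
    hold the activation and repression cooperativities.
    """
    N = len(u_star)
    A = np.diag(-np.asarray(lam, dtype=float))
    for i in range(N):
        for j in activators[i]:        # derivative of alpha_i * u_j^g / (1 + u_j^g)
            v, g = u_star[j], gamma[(i, j)]
            A[i, j] += alpha[i] * g * v ** (g - 1) / (1.0 + v ** g) ** 2
        for k in repressors[i]:        # derivative of alpha_i / (1 + u_k^b)
            v, b = u_star[k], beta[(i, k)]
            A[i, k] -= alpha[i] * b * v ** (b - 1) / (1.0 + v ** b) ** 2
    return A
```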

As before, to test the reverse engineering scheme, we repeatedly perturbed the system 7 from a steady state with transient small random perturbations and took measurements while the system was relaxing back to the steady state. Repeating this process M times, we obtained M data points and applied our reverse engineering scheme to attempt to reconstruct the connectivity matrix. For small M, there are many errors in the reconstruction, but the number of errors E, as defined in Eq. 4, drops rapidly to zero as M increases (data not shown), indicating that we can elucidate the whole network correctly with a small number of measurements. As Fig. 7 shows, the smallest number of measurements, M_c, that is needed to recover the entire N × N matrix again scales logarithmically with N, i.e., M_c = O(log N).

Figure 7
Critical number of measurements, M_c, required to recover the entire connectivity matrix correctly, versus N, the size of the network, for nonlinear systems of the form 7. Circles: numerical data. Line: least-squares fit of the form M_c = a + b log N.

Discussion

We have proposed a reverse engineering scheme suitable for global reconstruction of gene networks, which are large and sparsely connected on a genome-wide scale. Our algorithm requires O(log N) sample points and O(N^4) computations. We have tested and validated our scheme in three in numero experiments. In this section, we compare our method with other reverse engineering schemes. We also discuss how our scheme can be improved and generalized.

One method to reverse-engineer gene networks is to use genetic algorithms (16), which allow network models to evolve under selective pressure in an attempt to fit the data. A related idea is to train neural networks (17) to learn the network topology. However, these approaches are of restricted scope, as they typically require an unrealistically large amount of data and computation to generate connectivity maps for large networks. Similarly, Bayesian models (18), although useful for evaluating the likelihood of a particular hypothesis and for identifying the most likely network from a small set of competing candidate models, are inefficient if used to reconstruct network architectures de novo for networks of genomic scales. In comparison, adopting a linear model and then using SVD to reverse-engineer the network architecture (19–22) is more efficient in terms of both data requirement and computation. The SVD method attempts to reconstruct the networks directly from experimental data without prior knowledge of their structures and is therefore useful in generating a first draft of the network topology in novel situations. Nonetheless, one has to use SVD with care to get biologically meaningful results. What SVD does is to provide a family of candidate networks that are consistent with the microarray data. It does not identify which one of those candidates is the correct solution. The solution with the smallest L2 norm may not correspond to the real structure, as discussed in Example 1. We therefore have to amend the SVD method with some additional criteria to select the most likely solution. We have proposed using the sparseness of the networks as one such criterion. Our numerical experiments have shown that this amended method can recover the network topology correctly, using O(log N) sample points and O(N^4) computations.

One attractive feature of our reverse engineering scheme is that it can be easily parallelized, as it recovers the connectivity matrix row by row, with mutually independent operations. Moreover, if we focus on a particular gene, it takes O(NM^2) computations to perform SVD and then O(N^3) computations to impose sparseness to find the elements that are immediately regulating this particular gene. Successive iterations then identify upstream genes layer by layer (Fig. 8) without solving for the entire network, which requires O(N^4) computations. Such a recovery process allows rapid elucidation of pathways that lead to the particular gene under study, identifying its regulating genes as potential drug targets for pharmaceutical purposes.

Figure 8
Layer-by-layer recovery of network topology. Focusing on Gene 1, we can identify Genes 2 and 3 as the immediate upstream elements directly regulating Gene 1, and then Genes 4, 5, and 6 as next-immediate upstream elements indirectly regulating Gene 1.
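A sketch of this layer-by-layer recovery, reusing sparsest_row and the (A_0, V_null) pair from the earlier sketches (the function name and the traversal depth are ours):

```python
import numpy as np

def upstream_layers(gene, A0, V_null, depth=2):
    """Trace regulators of one gene of interest layer by layer (cf. Fig. 8),
    sparsifying only the rows that are actually needed."""
    layers, frontier, seen = [], {gene}, {gene}
    for _ in range(depth):
        regulators = set()
        for g in frontier:
            row = sparsest_row(A0[g, :], V_null)   # ~O(N^3) per gene instead of O(N^4) overall
            # nonzero off-diagonal entries are the immediate regulators of gene g
            regulators.update(int(j) for j in np.flatnonzero(row) if j != g)
        regulators -= seen
        if not regulators:
            break
        layers.append(sorted(regulators))
        seen |= regulators
        frontier = regulators
    return layers
```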

Nevertheless, our reverse engineering scheme has a counter-intuitive feature. Although efficient in recovering the architectures of large networks, it is less efficient for small networks, requiring almost as many measurements as the number of genes to reconstruct the network topology, because the notion of sparseness is relative and is ill-defined for small networks. In a network with only a few genes, even if each gene is interacting with a small number of genes, it is still interacting with a significant portion of the network. Thus, the network may no longer be regarded as sparse. We have been unable to quantify this notion of sparseness and to pinpoint the critical size of a network as a function of the average number of connections. This issue is related to the concept of exact-fit point (30) in L1 regression and has been solved only under very specific conditions governing the sample points. As a rule of thumb, our numerical experiments suggest that if the average number of connections is fewer than 10, which is a reasonable estimate for biological systems (24, 25), then our algorithm starts to show its efficiency when the network size is larger than 200 genes or so. As a consequence, our method is useful for rapidly furnishing on a global scale a first draft of the topology of the entire network that encompasses all of the gene regulatory networks in naturally occurring genomes but is not suitable for fine-tuning to improve the local resolution of small subnetworks that govern individual biological functions. To extract these local features to build biologically meaningful models for the various genetic and biochemical pathways, we need to integrate our method with other statistical methods and experimental data (27). Such methods may include using Bayesian networks to verify the likelihood of the paths (18) or probing the networks iteratively to gain more information (J.T., M.K.S.Y., J. Hasty, and J.J.C., unpublished work).

As for the data, we may need to incorporate the activities of proteins (14, 37, 38) in addition to the mRNA expression levels. Our scheme is applicable to protein networks as well, because they are also sparsely connected large networks (38). Advances in experimental techniques have provided proteomic data on a massive scale and have sparked attempts to reverse-engineer protein networks (37–39). Most of these reverse engineering methods are based on clustering (40) and suffer from the same drawbacks as the clustering schemes for reverse engineering gene networks, as noted earlier.

One potential drawback of our scheme is that it requires data on time derivatives (Ẋ in Eq. 2), which can be difficult to obtain especially in the presence of noise. However, with careful instrumentation (41) and statistics (42), it is possible to estimate the gene expressions relatively accurately by repeating microarray measurements only a small number of times. The large number of gene expressions per microarray provides the large number of samples to avoid small-sample biases for reliable estimation of the noise parameters. We conjecture that a similar approach can be used to obtain reasonable estimates of the time derivatives of the gene expressions. In our in numero experiments, we adopted an unsophisticated approach, where we measured only the gene expressions X, and then estimated the time derivatives Ẋ by linear interpolation. Even with such naïveté, we could correctly reverse-engineer the network architectures, as demonstrated in the in numero experiments.
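One simple way to realize such an estimate is sketched below, under the assumption that replicate measurements at two nearby time points are available; the averaging-then-differencing scheme is our illustration, not the authors' protocol:

```python
import numpy as np

def estimate_xdot(x_reps_t0, x_reps_t1, dt):
    """Average replicate expression measurements at two nearby times, then take a finite difference.

    x_reps_t0, x_reps_t1: arrays of shape (replicates, N); dt: time between the two measurements.
    """
    x0 = np.asarray(x_reps_t0, dtype=float).mean(axis=0)   # averaging suppresses measurement noise
    x1 = np.asarray(x_reps_t1, dtype=float).mean(axis=0)
    return (x1 - x0) / dt
```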

There is considerable theoretical work that can be done to improve our method, both algorithmically and in terms of modeling. On the algorithmic side, there is much to investigate regarding how to impose sparseness. For convenience we have adopted the simplex method from refs. 32 and 33 to solve the L1 regression problem. This computational algorithm may not be the most efficient one given our unconventional situation, with the number of regression parameters being approximately N − M, which increases linearly with the number of data points N. Other methods, such as interior point methods (43), may be worth considering. Indeed, minimizing the L1 norm may not be the optimal way to unmask the outliers. Many different methods, such as least median of squares and least trimmed squares, have been proposed as alternatives (26). It is as yet unclear which of these methods is best suited for the task at hand. This promises to be a rich problem in robust statistics. Another idea that we may borrow from the study of robust statistics is to use designed experiments rather than random sampling to achieve higher exact-fit points (30), which translate into smaller numbers of measurements to recover the connectivity matrix. This construction can be experimentally realized by using genetic toggle switches (44) to perturb the gene networks systematically. A similar idea has been exploited (in J.T., M.K.S.Y., J. Hasty, and J.J.C., unpublished work) to reverse-engineer gene networks.

On the modeling side, for simplicity we have adopted Eq. 1, a deterministic linear system of ordinary differential equations with constant coefficients, to model gene networks. We have neglected the effects of nonlinearities, noise, time delays (45), and combinatorial effects (46). Another complication is that network connectivity can change dynamically in response to changes in the experimental conditions, as proteins and metabolites are synthesized or destroyed to create or block pathways. It is futile to try to capture such a dynamically changing network by using a static model. Instead, we have to adopt a dynamical model with time-dependent matrix elements W_ij(t) in Eq. 1. Similarly, we can combine time series data from different experiments only if the data are obtained under comparable conditions, and in particular, with the system operating near the same steady state. Otherwise, the system may be governed by ẋ = Cx in one case and by ẋ = Dx in another, with C ≠ D, so that we cannot capture all the data by using one model ẋ = Ax.

We expect that our reverse engineering scheme will be useful for reconstructing gene networks on a genome-wide scale when more experimental data become available. Because the number of sample points that are needed to recover the network scales logarithmically with the size of the network, we expect to be able to recover the network topology of real genomes, which consist of O(10^4) genes, with several hundred measurements instead of tens of thousands of measurements. This amount of data should be obtainable in the near future as the cost of experimental data collection drops (1, 47).

Acknowledgments

We thank the referees for carefully reading an earlier draft of this manuscript and for their comments, which have significantly improved the quality of this manuscript. This work was supported by the National Science Foundation through the Bio-QuBIC Program (Grant no. EIA-0130331) and by the Defense Advanced Research Projects Agency (Grant no. F30602-01-2-0579).

Abbreviation

SVD
singular value decomposition

Footnotes

This paper was submitted directly (Track II) to the PNAS office.

References

1. Lockhart D J, Winzeler E A. Nature (London) 2000;405:827–836.
2. DeRisi J L, Iyer V R, Brown P O. Science. 1997;278:680–686.
3. Wen X, Fuhrman S, Michaels G S, Carr D B, Smith S, Barker J L, Somogyi R. Proc Natl Acad Sci USA. 1998;95:334–339.
4. Spellman P T, Sherlock G, Zhang M Q, Iyer V R, Anders K, Eisen M B, Brown P O, Botstein D, Futcher B. Mol Biol Cell. 1998;9:3273–3297.
5. Eisen M B, Spellman P T, Brown P O, Botstein D. Proc Natl Acad Sci USA. 1998;95:14863–14868.
6. Tavazoie S, Hughes J D, Campbell M J, Cho R J, Church G M. Nat Genet. 1999;22:281–285.
7. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E S, Golub T R. Proc Natl Acad Sci USA. 1999;96:2907–2912.
8. Brown M P, Grundy W N, Lin D, Cristianini N, Sugnet C W, Furey T S, Ares M, Jr, Haussler D. Proc Natl Acad Sci USA. 2000;97:262–267.
9. Morgan B J T, Ray A P G. Appl Statist. 1995;44:117–134.
10. Marcotte E M. Nat Biotechnol. 2001;19:626–627.
11. Jeong H, Tombor B, Albert R, Oltvai Z N, Barabási A-L. Nature (London) 2000;407:651–654.
12. Jeong H, Mason S P, Barabási A-L, Oltvai Z N. Nature (London) 2001;411:41–42.
13. Wagner A. Nat Genet. 2000;24:355–361.
14. Gygi S P, Rochon Y, Franza B R, Aebersold R. Mol Cell Biol. 1999;19:1720–1730.
15. Ljung L. System Identification: Theory for the User. 2nd Ed. Englewood Cliffs, NJ: Prentice–Hall; 1999.
16. Wahde M, Hertz J. BioSystems. 2000;55:129–136.
17. D'haeseleer P, Liang S, Somogyi R. Bioinformatics. 2000;16:707–726.
18. Hartemink A J, Gifford D K, Jaakkola T S, Young R A. Pac. Symp. Biocomputing. Vol. 6. 2001. pp. 422–433.
19. D'haeseleer P, Wen X, Fuhrman S, Somogyi R. Pac. Symp. Biocomputing. Vol. 4. 1999. pp. 41–52.
20. Alter O, Brown P O, Botstein D. Proc Natl Acad Sci USA. 2000;97:10101–10106.
21. Raychaudhuri S, Stuart J M, Altman R B. Pac. Symp. Biocomputing. Vol. 5. 2000. pp. 452–463.
22. Holter N S, Maritan A, Cieplak M, Fedoroff N V, Banavar J R. Proc Natl Acad Sci USA. 2001;98:1693–1698.
23. Press W H, Teukolsky S A, Vetterling W T, Flannery B P. Numerical Recipes in Fortran: The Art of Scientific Computing. 2nd Ed. New York: Cambridge Univ. Press; 1992.
24. Arnone M I, Davidson E H. Development. 1997;124:1851–1864.
25. Thieffry D, Huerta A M, Pérez-Rueda E, Collado-Vides J. BioEssays. 1998;20:433–440.
26. Hampel F R, Ronchetti E M, Rousseeuw P J, Stahel W A. Robust Statistics: The Approach Based on Influence Functions. New York: Wiley; 1986.
27. Ideker T, Galitski T, Hood L. Annu Rev Genom Hum Genet. 2001;2:343–372.
28. Weaver D C, Workman C T, Stormo G D. Pac. Symp. Biocomputing. Vol. 4. 1999. pp. 112–123.
29. van Someren E P, Wessels L F A, Reinders M J T. Proc. of the 21st Symp. on Information Theory in the Benelux. 2000. pp. 215–222.
30. Ellis S P, Morgenthaler S. J Am Stat Assoc. 1992;87:143–148.
31. Bloomfield P, Steiger W L. Least Absolute Deviations: Theory, Applications, and Algorithms. Boston: Birkhäuser; 1983.
32. Barrodale I, Roberts F D K. SIAM J Numer Anal. 1973;10:839–848.
33. Barrodale I, Roberts F D K. Comm ACM. 1974;17:319–320.
34. Noble B. Applied Linear Algebra. Englewood Cliffs, NJ: Prentice–Hall; 1969.
35. Yagil G, Yagil E. Biophys J. 1971;11:11–27.
36. Edelstein-Keshet L. Mathematical Models in Biology. New York: McGraw–Hill; 1988.
37. Marcotte E M, Pellegrini M, Thompson M J, Yeates T O, Eisenberg D. Nature (London) 1999;402:83–86.
38. Uetz P, Giot L, Cagney G, Mansfield T A, Judson R S, Knight J R, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al. Nature (London) 2000;403:623–627.
39. Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T, et al. Science. 2001;293:2101–2105.
40. Eisenberg D, Marcotte E M, Xenarios I, Yeates T O. Nature (London) 2000;405:823–826.
41. Eisen M B, Brown P O. Methods Enzymol. 1999;303:179–205.
42. Ideker T, Thorsson V, Siegel A F, Hood L E. J Comput Biol. 2000;7:805–817.
43. Wright S J. Primal-Dual Interior-Point Methods. Philadelphia: SIAM; 1996.
44. Gardner T S, Cantor C R, Collins J J. Nature (London) 2000;403:339–342.
45. McAdams H H, Arkin A. Proc Natl Acad Sci USA. 1997;94:814–819.
46. Pilpel Y, Sudarsanam P, Church G M. Nat Genet. 2001;29:153–159.
47. Kalir S, McClure J, Pabbaraju K, Southward C, Ronen M, Leibler S, Surette M G, Alon U. Science. 2001;292:2080–2083.
