
# Dynamic modeling of gene expression data

Amos Maritan^{†‡}, Marek Cieplak^{*§}, Nina V. Fedoroff^{¶}, and Jayanth R. Banavar^{*‖}

^{*}Department of Physics and Center for Materials Physics, 104 Davey Laboratory, and

^{¶}Department of Biology and the Life Sciences Consortium, 519 Wartik Laboratory, Pennsylvania State University, University Park, PA 16802;

^{†}International School for Advanced Studies, Via Beirut 2-4, 34014 Trieste, Italy;

^{‡}Istituto Nazionale de Fisica Materia and the Abdus Salam International Center for Theoretical Physics, 34014 Trieste, Italy; and

^{§}Institute of Physics, Polish Academy of Sciences, 02-668 Warsaw, Poland

^{‖}To whom reprint requests should be addressed. E-mail: jayanth@phys.psu.edu.

## Abstract

We describe the time evolution of gene expression levels by using a time translation matrix to predict future expression levels of genes based on their expression levels at some initial time. We deduce the time translation matrix for previously published DNA microarray gene expression data sets by modeling them within a linear framework, using the characteristic modes obtained by singular value decomposition. The resulting time translation matrix provides a measure of the relationships among the modes and governs their time evolution. We show that a truncated matrix linking just a few modes is a good approximation of the full time translation matrix. This finding suggests that the number of essential connections among the genes is small.

The development and application of DNA and oligonucleotide–microarray techniques (1, 2) for measuring the expression of many or all of an organism's genes have stimulated considerable interest in using expression profiling to elucidate the nature and connectivity of the underlying genetic regulatory networks (3–9). Biological systems, whether organismal or suborganismal, are robust, adaptable, and redundant (10). It is increasingly apparent that such robustness is inherent in the evolution of networks (11). More particularly, it is the result of the operation of certain kinds of biochemical and genetic mechanisms (12–18).

Analysis of global gene expression data to group genes with similar expression patterns has already proved useful in identifying genes that contribute to common functions and are therefore likely to be coregulated (19–23). Whether information about the underlying genetic architecture and regulatory interconnections can be derived from the analysis of gene expression patterns remains to be determined. Both the subcellular localization and activity of transcription factors can be influenced by posttranslational modifications and interactions with small molecules and proteins. These can be extremely important from a regulatory perspective but undetectable at the gene expression level, complicating the identification of causal connections among genes. Nonetheless, a number of conceptual frameworks for modeling genetic regulatory networks have been proposed (3–9).

Several groups have recently applied standard matrix analysis to large gene expression data sets, extracting dominant patterns or “modes” of gene expression change (24–26). It has become evident that the complexity of gene expression patterns is low, with just a few modes capturing many of the essential features of these patterns. The expression pattern of any particular gene can be represented precisely by a linear combination of the modes with gene-specific coefficients (25). Furthermore, a good approximation of the exact pattern can be obtained by using just a few of the modes, underscoring the simplicity of the gene expression patterns.

In the present communication, we consider a simple model in which the
expression levels of the genes at a given time are postulated to be
linear combinations of their levels at a previous time. We show that
the temporal evolution of the gene expression profiles can be described
within such a linear framework by using a “time translation”
matrix, which reflects the magnitude of the connectivities between
genes and makes it possible to predict future expression levels from
initial levels. The basic framework has been described previously,
along with initial efforts to apply the model to actual data sets (5,
7–9). The number of genes, *g*, typically far exceeds the
number of time points for which data are available, making the problem
of determining the time translation matrix an ill-posed one. The basic
difficulty is that to uniquely and unambiguously determine the
*g*^{2} elements of the time translation
matrix, one needs a set of *g*^{2} linearly
independent equations. D'haeseleer *et al.* (8) used a
nonlinear interpolation scheme to guess the shapes of gene expression
profiles between the measured time points. As noted by the authors,
their final results depend crucially on the precise interpolation
scheme and are therefore speculative. Van Someren *et al.* (9)
instead chose to cluster the genes and study the interrelationships
between the clusters. In this situation, it is possible to determine
the time translation matrix unambiguously, provided the clustering is
meaningful. However, most clustering algorithms are based on profile
similarity, the biological significance of which is not entirely clear.

Here we construct the time translation matrix for the characteristic
modes obtained by using singular value decomposition (SVD). The
polished expression data (22) for each gene may be viewed as a unit
vector in a hyperspace, each of whose axes represents the expression
level at a measurement time of the experiment. The SVD construction
ensures that the modes correspond to linearly independent basis
vectors, a linear combination of which exactly describes the expression
pattern of *each* gene. Furthermore, this basis set is
optimally chosen by SVD so that the contributions of the modes
progressively decrease as one considers higher-order modes (24–26).
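The decomposition into modes can be illustrated with a minimal sketch. It assumes NumPy, a synthetic genes × time points matrix `X` standing in for real expression data, and the hypothetical helper names `characteristic_modes` and `reconstruct`; the polishing step of ref. 22 is not reproduced here.

```python
import numpy as np

def characteristic_modes(X):
    """Split a genes x time-points matrix into modes and gene coefficients.

    Returns (coeffs, modes, s): each row of `modes` is one characteristic
    mode over time, `coeffs[g, k]` is the weight of mode k in gene g, and
    `s` holds the singular values in decreasing order.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    coeffs = U * s  # scale each column of U by its singular value
    return coeffs, Vt, s

def reconstruct(coeffs, modes, n_modes):
    """Rebuild every gene profile from the first n_modes modes only."""
    return coeffs[:, :n_modes] @ modes[:n_modes]

# Synthetic stand-in (an assumption, not data from the paper):
# 200 genes, 12 time points, two hidden temporal patterns plus small noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 12)
patterns = np.vstack([np.sin(t), np.cos(t)])
X = rng.normal(size=(200, 2)) @ patterns + 0.05 * rng.normal(size=(200, 12))

coeffs, modes, s = characteristic_modes(X)
exact = reconstruct(coeffs, modes, len(s))   # all modes: exact representation
two_mode = reconstruct(coeffs, modes, 2)     # top two modes: approximation
rel_err = np.linalg.norm(X - two_mode) / np.linalg.norm(X)
```

Using all modes reproduces `X` to rounding error, whereas `rel_err` measures how much is lost when only the two leading modes are kept.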

Our results suggest that the causal links between the modes, and thence the genes, involve just a few essential connections. Any additional connections among the genes must therefore provide redundancy in the network. An important corollary is that it may be impossible to determine detailed connectivities among genes with just the microarray data, because the number of genes greatly exceeds the number of contributing modes.

## Methods

It was shown recently (24–26) that the essential features of the gene expression patterns are captured by just a few of the distinct characteristic modes determined through SVD. In the previous work (25), we treated the gene expression pattern of all of the genes as a “static” image and derived the underlying genome-wide characteristic modes of which it is composed. Here we carry out a dynamical analysis, exploring the possible causal relationships among the genes by deducing a time translation matrix for the characteristic modes defined by SVD.

To deduce the time translation matrix, we consider an exact
representation (25) of the gene expression data as a linear combination
of all of the *r* modes obtained from SVD. Each gene is
characterized by *r* gene specific coefficients, where
*r* is one less than the number of time points in the polished
data set (22). The key goal is to attack the inverse problem and infer
the nature of the gene network connectivity. However, the number of
time points is smaller than the number of genes, and thus the problem
is underdetermined. Nevertheless, the inverse problem is mathematically
well defined and tractable if one considers the causal relationships
among the *r* characteristic modes obtained by SVD. This is
because, as noted earlier, the *r* modes form a linearly
independent basis set.

Let

$$Y(t) = \begin{pmatrix} y_1(t) \\ y_2(t) \\ \vdots \\ y_r(t) \end{pmatrix} \tag{1}$$

represent the expression levels of the *r* modes at time
*t*. Then, mathematically, our linear model is expressed as

$$Y(t + \Delta t) = M \cdot Y(t), \tag{2}$$

where *M* is a time-independent *r* × *r* time translation matrix, which provides key information on the influence of the modes on each other. The time step, Δ*t*, is chosen to be the highest common factor among all of the experimentally measured time intervals so that the time of the *j*th measurement is *t*_{*j*} = *n*_{*j*}Δ*t*, where *n*_{*j*} is an integer. For equally spaced measurements, *n*_{*j*} = *j*.
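As a small illustration of choosing the time step, the sketch below takes the highest common factor of the measurement intervals and converts each measurement time into its integer step count. The function name and the example schedule (in minutes) are hypothetical, and times are assumed to be integers in a common unit, counted from the first measurement.

```python
from functools import reduce
from math import gcd

def time_step_and_indices(times):
    """Return (dt, steps) for increasing integer measurement times.

    dt is the highest common factor of the successive intervals, and
    steps[j] is the number of dt-sized steps from the first measurement
    to measurement j.
    """
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    dt = reduce(gcd, intervals)
    steps = [(t - times[0]) // dt for t in times]
    return dt, steps

# Hypothetical, unequally spaced schedule in minutes.
dt, steps = time_step_and_indices([0, 30, 120, 300, 420, 540, 690])
```

For equally spaced measurements the step counts reduce to 0, 1, 2, and so on.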

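To determine *M*, we define a quantity *Z*(*t*) with the initial condition *Z*(*t*_{0}) = *Y*(*t*_{0}) and, for all subsequent times, *Z* determined from *Z*(*t* + Δ*t*) = *M* · *Z*(*t*). For any integer *k*, we have

$$Z(t_0 + k\,\Delta t) = M^{k} \cdot Y(t_0). \tag{3}$$

The *r*^{2} coefficients of *M* are chosen to minimize the cost function, *CF*, which measures the mismatch between the predicted *Z*(*t*_{*j*}) and the measured *Y*(*t*_{*j*}) over the measurement times.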

For equally spaced measurements, *M* can be determined
exactly by using a linear analysis so that *CF* = 0. For
unequally spaced measurements, the problem becomes nonlinear, and it is
necessary to deduce *M* by using an optimization technique
such as simulated annealing (27). The outcome of this analysis is that
the gene expression data set can be reexpressed precisely by using the
*r* specific coefficients for each gene (a linear combination
of the *r* modes with these coefficients gives the gene
expression profile), the *r* × *r* time
translation matrix, *M*, deduced as described above, and the
initial values of each of the *r* modes.
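For equally spaced measurements, the linear analysis can be sketched as a least-squares solve. The code below assumes NumPy and a mode matrix `Y` of shape (*r*, number of time points); the function names are hypothetical, and the cost is written as a squared mismatch normalized by the squared data, which is an assumed form rather than the one given in the paper. When there are *r* modes and *r* + 1 time points, the system is square and the fit is exact, provided the first *r* columns of `Y` are linearly independent.

```python
import numpy as np

def fit_time_translation(Y):
    """Least-squares M such that Y[:, k+1] ~= M @ Y[:, k] for all k.

    Y has one row per mode and one column per equally spaced time point.
    """
    past, future = Y[:, :-1], Y[:, 1:]
    # future = M @ past  is equivalent to  past.T @ M.T = future.T
    M_T, *_ = np.linalg.lstsq(past.T, future.T, rcond=None)
    return M_T.T

def propagate(M, y0, steps):
    """Return Z with columns M**n @ y0 for each integer step count n."""
    return np.column_stack(
        [np.linalg.matrix_power(M, int(n)) @ y0 for n in steps]
    )

def cost_function(M, Y, steps):
    """Assumed cost: squared mismatch between data and prediction,
    normalized by the squared data."""
    Z = propagate(M, Y[:, 0], steps)
    return np.sum((Y - Z) ** 2) / np.sum(Y ** 2)

# Synthetic check: 3 modes, 4 equally spaced time points (r + 1).
M_true = np.array([[0.9, -0.3, 0.0],
                   [0.3,  0.9, 0.1],
                   [0.0, -0.1, 0.8]])
steps = [0, 1, 2, 3]
Y = propagate(M_true, np.array([1.0, 0.5, -0.2]), steps)
M_fit = fit_time_translation(Y)
cf = cost_function(M_fit, Y, steps)
```

Together with the gene-specific coefficients, `M_fit` and the initial column `Y[:, 0]` are enough to regenerate every mode at every measured time.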

## Results

We have determined *M*, the *r* ×
*r* time translation matrix, for three different data sets of
gene expression profiles: yeast cell cycle (CDC15) (20) by using the
first 12 equally spaced time points representing the first two cycles,
yeast sporulation (21), which has 7 time points, and human fibroblast
(22), which has 13 time points (Table 1). The matrix element *M*_{*i,j*} describes the influence of mode *j* on mode *i*. Specifically, the coefficient *M*_{*i,j*} multiplied by the expression level of mode *j* at time *t* contributes to the expression level of mode *i* at time (*t* + Δ*t*). A positive matrix element leads to the *i*th mode being positively reinforced by the *j*th mode's expression level at a previous time. *M* is determined exactly and uniquely for the yeast cell-cycle data. The unequal spacing of the time points in the two other data sets precluded an exact solution, and *M* is an approximation derived by using simulated annealing techniques (27). We have verified that the accuracy of *M* is very high by showing that the temporal evolution of the modes is reproduced well and that the reconstructed gene expression patterns are virtually indistinguishable from the experimental data. The singular values are spread out, and the amplitudes of the modes decrease as one considers higher-order modes (25). This fact implies that the influence of the dominant modes on the other modes is generally small. Interestingly, for the cdc15 and sporulation data sets, the converse is also true, and the dominant modes are not strongly impacted by the other modes, especially when one takes into account the lower amplitudes of the higher-order modes. This finding suggests that a few-mode approximation ought to be excellent for these two cases.
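When the measurements are unequally spaced, the prediction at measurement *j* involves the *n*_{*j*}th power of *M*, so the fit is no longer linear in the matrix elements. The following is a minimal simulated-annealing sketch of such a fit, assuming NumPy. It is a generic Metropolis scheme with a geometric cooling schedule, not the authors' implementation; the cost normalization, the function names, and all tuning parameters are assumptions.

```python
import numpy as np

def cost(M, Y, steps):
    """Assumed cost: normalized squared mismatch at the measured times."""
    y0 = Y[:, 0]
    Z = np.column_stack([np.linalg.matrix_power(M, int(n)) @ y0 for n in steps])
    return np.sum((Y - Z) ** 2) / np.sum(Y ** 2)

def anneal(Y, steps, n_iter=4000, t_start=1e-2, t_end=1e-6,
           step_size=0.05, seed=0):
    """Search for M minimizing cost(M, Y, steps); returns (best_M, best_cost)."""
    rng = np.random.default_rng(seed)
    r = Y.shape[0]
    M = np.eye(r)                      # start from "nothing changes"
    c = cost(M, Y, steps)
    best_M, best_c = M.copy(), c
    for k in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (k / (n_iter - 1))
        trial = M.copy()
        i, j = rng.integers(r, size=2)
        trial[i, j] += step_size * rng.normal()   # perturb one matrix element
        c_trial = cost(trial, Y, steps)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if c_trial < c or rng.random() < np.exp(-(c_trial - c) / temp):
            M, c = trial, c_trial
            if c < best_c:
                best_M, best_c = M.copy(), c
    return best_M, best_c

# Synthetic two-mode data sampled at unequal step counts (hypothetical).
angle = np.deg2rad(20.0)
M_true = 0.98 * np.array([[np.cos(angle), -np.sin(angle)],
                          [np.sin(angle),  np.cos(angle)]])
steps = [0, 1, 4, 10, 14, 18, 23]
y0 = np.array([1.0, 0.0])
Y = np.column_stack([np.linalg.matrix_power(M_true, n) @ y0 for n in steps])
M_best, cf_best = anneal(Y, steps)
```

Because the best matrix seen so far is tracked separately, the returned cost can never exceed the cost of the starting guess; how close it gets to zero depends on the schedule and the number of iterations.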

Once the matrix *M* characterizing the interrelationship
between the *r* modes is determined, it is a simple matter to
deduce a matrix that similarly describes the interactions between any
other set of *r* linearly independent profiles. Specifically,
one can straightforwardly determine the interrelationships between
*r* clusters of genes. As an example, consider the sporulation
data (21), which is characterized by *r* = 6. The problem
of deriving the time translation matrix is underdetermined if the
number of clusters exceeds six, and then there is no unique solution.
When the number of clusters is less than six, there is no guarantee
that there exists even one solution. We therefore consider six clusters
(metabolic, early I, early II, middle, midlate, and late), excluding
the early-mid cluster, which forms the least coherent group. The
average expression patterns of the six clusters
(*c*_{1}, …, *c*_{6}) are obtained as averages over the
genes within the cluster and can be expressed as linear combinations of
the six modes as

$$C(t) = S \cdot Y(t), \tag{5}$$

where *C*(*t*) is the column vector of the six cluster averages, *c*_{1}(*t*), …, *c*_{6}(*t*), and *S* is a 6 × 6 matrix. The rows of
*S* are the components of each of the characteristic modes
that make up the average expression pattern for the six clusters. The
interrelationships between the cluster expression patterns are
determined with a time translation matrix of the form

$$N = S \cdot M \cdot S^{-1}, \tag{6}$$

so that

$$C(t + \Delta t) = N \cdot C(t). \tag{7}$$
The averages of the experimental measurements (circles) and the predicted expression patterns (lines) of the six clusters are shown in Fig. 1 and are in excellent agreement, confirming the accuracy of the *M* matrix for the sporulation data in Table 1. The matrix *N* is shown in Table 2. The significance of the entries in *N* is similar to that described earlier for *M*. That is, the matrix element *N*_{*i,j*} describes the influence of cluster *j* on cluster *i*. Specifically, the coefficient *N*_{*i,j*} multiplied by the expression level of cluster *j* at time *t* contributes to the expression level of cluster *i* at time (*t* + Δ*t*). A positive matrix element leads to the *i*th cluster being positively reinforced by the *j*th cluster's expression level at a previous time.
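The change of basis from modes to clusters can be sketched in a few lines. If the cluster averages are a fixed invertible linear combination *S* of the modes and the modes advance under *M*, then the cluster matrix *N* must satisfy *N* · *S* = *S* · *M*. The code below assumes NumPy; the function name and the random test matrices are hypothetical.

```python
import numpy as np

def cluster_time_translation(S, M):
    """Return N with N @ S = S @ M, i.e. N = S M S^{-1}.

    Solving the linear system avoids forming the inverse of S explicitly.
    """
    # N @ S = S @ M  is equivalent to  S.T @ N.T = (S @ M).T
    return np.linalg.solve(S.T, (S @ M).T).T

# Hypothetical 6 x 6 example with a well-conditioned S.
rng = np.random.default_rng(0)
S = np.eye(6) + 0.1 * rng.normal(size=(6, 6))
M = 0.3 * rng.normal(size=(6, 6))
N = cluster_time_translation(S, M)

# One time step taken in mode space and mapped to clusters should match
# one time step taken directly in cluster space.
y = rng.normal(size=6)
via_modes = S @ (M @ y)
via_clusters = N @ (S @ y)
```

The construction requires *S* to be invertible, which is why the number of clusters must equal the number of modes.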

Does one need the full *r* × *r* time
translation matrix to describe the gene expression patterns? Or is an
appropriately chosen truncated time translation matrix adequate to
reconstruct the expression patterns with reasonable fidelity? We now
consider a linear interaction model (Eq. 2) within which
*M* is a 2 × 2 matrix, and only the two most important
modes are used. The values of the four entries in the matrix
*M* are determined by using an optimization scheme that
minimizes the cost function similar to that given in Eq. 4.
The resulting *M* matrices are shown in Table 3, and a comparison of the calculated modes (solid lines) with those obtained by SVD (dashed lines) for the three sets of gene expression profiles is shown in Fig. 2. It is interesting to compare these 2 × 2 matrices with the corresponding portion of the full matrices shown in Table 1. The two-mode approximation is excellent for the cdc15 data set (CF = 0.05), moderate for the sporulation data set (CF = 0.18), and poorest for the fibroblast data set (CF = 0.31). As noted before, the use of the full *r* × *r* time translation matrix leads to an exact reproduction of the data set. Not unexpectedly, the quality of the fit improves as the number of modes considered is increased. Fig. 3 shows, for the three data sets, (*a*) the expression profiles reconstructed from the initial values by using the 2 × 2 time translation matrix, (*b*) the profiles obtained as a linear combination of the top two modes with appropriate gene-specific coefficients, and (*c*) the experimental data. In all three cases, the main features of the expression patterns are reproduced quite well by the time translation matrix with just two modes. The two-mode reconstruction of the CDC15 profiles is the most accurate of the three.

Fig. 2. (*a*) cdc15, (*b*) sporulation, and (*c*) fibroblast data sets. The circles correspond to the measured data, and the lines show the approximations based on the best-fit 2 × 2 time translation matrices.

Fig. 3. cdc15 (*Left*), sporulation (*Center*), and fibroblast (*Right*) data sets. For each set, *a* shows the results obtained by using the 2 × 2 time translation matrix to determine the temporal evolution ...

It can be shown that, in general, a 2 × 2 time translation matrix produces only two types of behavior, depending on its eigenvalues. If the eigenvalues are real, the generated modes will independently grow or decay exponentially. When the eigenvalues are complex conjugates of each other, as they are for all three cases we have examined, the two generated modes are oscillatory with growing or decaying amplitudes. Mathematically, the two modes are constrained to have the form:
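For a real 2 × 2 matrix with complex-conjugate eigenvalues λ = |λ|e^{±iθ}, each application of *M* multiplies both modes by |λ| and advances their common phase by θ. One way to write the resulting form is sketched below. The normalization of the growth factor (taken here per time step Δ*t*, so that *G* = |λ| and τ = 2πΔ*t*/θ) and the symbols *y*_{1}, *y*_{2}, and φ are assumptions of this sketch.

$$
y_1(t) = c\,G^{t/\Delta t}\,\sin\!\left(\frac{2\pi t}{\tau} + \Delta\right),
\qquad
y_2(t) = c\,A\,G^{t/\Delta t}\,\sin\!\left(\frac{2\pi t}{\tau} + \Delta + \phi\right)
$$

Here *c* and Δ set the overall amplitude and starting phase, *A* is the amplitude of the second mode relative to the first, and φ is the phase difference between the two modes.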

Both modes are described by a single time period, τ, and a single growth or decay factor, *G*. Because there are four parameters in the matrix *M*, there can be only four independent attributes in the generated modes. Two other parameters, *c* and Δ, are determined from the initial conditions. In addition to τ and *G*, we can also determine the phase difference between the two modes, φ, and the relative amplitude of the two modes, *A*. These attributes can be determined from the coefficients in *M* by using the equations in Table 4. Table 5 shows the four attributes for each of the three data sets. The self-consistency of our analysis is underscored by the fact that the magnitude of the growth factor, *G*, is close to one for all three cases, which is a biologically pleasing result in that the modes neither grow explosively nor decay. For the cell-cycle data, the characteristic period is about 115 min. In the other two cases, the data are not periodic, and hence the best-fit periods are comparable to the duration of the measurement. For the yeast cell-cycle data, φ, the phase difference between the top two modes, is 90°, suggesting a simple sine–cosine relationship, as noted by Alter *et al.* (26). Indeed, this result is self-consistent. When *G* is equal to 1 and an integer number of periods is considered, orthogonality of the top two modes requires that the phase difference be 90°.
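The four attributes can be read off the eigen-decomposition of a 2 × 2 matrix, as in the following sketch. It is not a transcription of the equations in Table 4: it assumes that the growth factor is defined per time step (*G* = |λ|), that the period is τ = 2πΔ*t*/arg λ, and that the relative amplitude and phase difference are the modulus and argument of the eigenvector component ratio (λ − *M*_{11})/*M*_{12}. The function name and the example matrix are hypothetical.

```python
import cmath
import math

def two_mode_attributes(M, dt):
    """Period, growth factor, relative amplitude, and phase difference
    of the two modes generated by a real 2 x 2 matrix M (nested lists)."""
    (m11, m12), (m21, m22) = M
    trace = m11 + m22
    det = m11 * m22 - m12 * m21
    disc = trace * trace - 4.0 * det
    if disc >= 0:
        raise ValueError("real eigenvalues: the modes do not oscillate")
    # Eigenvalue with positive imaginary part; its conjugate is the other one.
    lam = complex(trace / 2.0, math.sqrt(-disc) / 2.0)
    theta = cmath.phase(lam)            # phase advance per time step, in (0, pi)
    ratio = (lam - m11) / m12           # mode 2 relative to mode 1 (m12 != 0 here)
    return {
        "G": abs(lam),                          # growth factor per time step
        "tau": 2.0 * math.pi * dt / theta,      # oscillation period
        "A": abs(ratio),                        # relative amplitude
        "phi_deg": math.degrees(cmath.phase(ratio)),  # phase difference
    }

# Hypothetical example: a pure rotation by 30 degrees every 10 min.
angle = math.radians(30.0)
rotation = [[math.cos(angle), -math.sin(angle)],
            [math.sin(angle),  math.cos(angle)]]
attrs = two_mode_attributes(rotation, dt=10.0)
```

For a pure rotation the two generated modes have equal amplitude, no growth or decay, and a quarter-period offset. The sign of the phase difference depends on the ordering and sign conventions of the modes, which SVD leaves arbitrary.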

In summary, we have shown that it is possible to describe genetic expression data sets by using a simple linear interaction model with only a small number of interactions. One important implication is that it is impossible to determine the exact interactions among individual genes in these data sets. The problem is underdetermined, because the number of genes is much larger than the number of time points in the experiments. Nonetheless, we have shown that it is possible to accurately describe the interactions among the characteristic modes. Moreover, an interaction model with only two connections reconstructs the key features of the gene expression in the simplest cases with good fidelity. Our results imply that, because there are only a few essential connections among modes and therefore among genes, additional links provide redundancy in the network.

## Acknowledgments

This work was supported by an Integrative Graduate Education and Research Training Grant from the National Science Foundation, Istituto Nazionale di Fisica Nucleare (Italy), Komitet Badan Naukowych Grant 2P03B-146–18, Ministero dell'Università e della Ricerca Scientifica, National Aeronautics and Space Administration, and National Science Foundation Plant Genome Research Program Grant DBI-9872629.

## Abbreviation

- SVD: singular value decomposition


Proceedings of the National Academy of Sciences of the United States of America. Feb 13, 2001; 98(4):1693.
