# Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms

^{*}Genetics and ^{‡}Biochemistry, Stanford University, Stanford, CA 94305

^{†}To whom correspondence should be addressed. E-mail: orly@genome.stanford.edu.

## Abstract

We describe a comparative mathematical framework for two genome-scale expression data sets. This framework formulates expression as superposition of the effects of regulatory programs, biological processes, and experimental artifacts common to both data sets, as well as those that are exclusive to one data set or the other, by using generalized singular value decomposition. This framework enables comparative reconstruction and classification of the genes and arrays of both data sets. We illustrate this framework with a comparison of yeast and human cell-cycle expression data sets.

**Keywords:** DNA microarrays, cell cycle, yeast *Saccharomyces cerevisiae*, human HeLa cell line

Recent advances in high-throughput genomic technologies enable acquisition of different types of molecular biological data, e.g., DNA-sequence and mRNA-expression data, on a genomic scale. Comparative analysis of these data among two or more model organisms promises to enhance fundamental understanding of the universality as well as the specialization of molecular biological mechanisms. It also may prove useful in medical diagnosis, treatment, and drug design. Comparisons of the DNA sequence of entire genomes already give insights into evolutionary, biochemical, and genetic pathways.

Comparative analysis of mRNA-expression data requires mathematical tools that are able to distinguish the similar from the dissimilar among two or more large-scale data sets. These tools should provide mathematical frameworks for the description of the data, where the variables and operations may represent some biological reality. Recently we showed that singular value decomposition (SVD) provides such a framework for genome-wide expression data (refs. 1–3; see also refs. 4–7).

Now we show that generalized SVD (GSVD) (8) provides a comparative mathematical framework for two genome-scale expression data sets. GSVD is a linear transformation of the two data sets from the two genes × arrays spaces to two reduced and diagonalized “genelets” × “arraylets” spaces. The genelets are shared by both data sets. Each genelet is expressed only in the two corresponding arraylets, with a corresponding “angular distance” that indicates the relative significance of this genelet, i.e., its significance in one data set relative to that in the other.

We show that a genelet of equal significance in both data sets may represent a process common to both data sets. The two corresponding arraylets may represent the cellular states in each data set that correspond to this common process. A genelet of no significance in one data set relative to the other may represent a process exclusive to the latter data set. The corresponding arraylet of this data set may represent the cellular state that corresponds to this exclusive process.

We also show that mathematical reconstruction of gene expression in a subset of genelets may simulate experimental observation of only the process that these genelets are inferred to represent. Similarly, reconstruction of array expression in the subset of corresponding arraylets may simulate observation of only the corresponding cellular state. Reconstruction of each data set in two or more subspaces may simulate observation of genome-scale differential expression in the processes, which these subspaces are inferred to span. We demonstrate comparative classification of both sets of genes and arrays based on similarity in their reconstructed rather than overall expression.

We illustrate this framework with a comparison of yeast (9) and human (10) cell cycle-expression data sets.

## Mathematical Methods: GSVD

A single microarray probes the relative expression levels of $N_1$ genes in a single sample. A series of $M_1$ arrays probes the genome-scale expression levels in $M_1$ different samples, i.e., under $M_1$ different experimental conditions. Let the matrix $\hat{e}_1$, of size $N_1$-genes × $M_1$-arrays, tabulate the full expression data. The vector in the $n$th row of the matrix $\hat{e}_1$, $\langle g_{1,n}| \equiv \langle n|\hat{e}_1$, lists the expression of the $n$th gene across the different samples that correspond to the different arrays.^{§} The vector in the $m$th column of the matrix $\hat{e}_1$, $|a_{1,m}\rangle \equiv \hat{e}_1|m\rangle$, lists the genome-scale expression measured by the $m$th array. Let the matrix $\hat{e}_2$, of size $N_2$-genes × $M_2$-arrays, tabulate the relative expression levels of $N_2$ genes under $M_2 = M_1 \equiv M < \max\{N_1, N_2\}$ experimental conditions that correspond one to one to the $M_1$ conditions underlying $\hat{e}_1$. This one-to-one correspondence between the two sets of conditions is at the foundation of the GSVD comparative analysis of the two data sets and should be mapped out carefully.

GSVD then is a simultaneous linear transformation of the two expression data sets $\hat{e}_1$ and $\hat{e}_2$ from the two $N_1$-genes × $M$-arrays and $N_2$-genes × $M$-arrays spaces to the two reduced $M$-genelets × $M$-arraylets spaces (see Fig. 5, which is published as supporting information on the PNAS web site, www.pnas.org, and also at http://genome-www.stanford.edu/GSVD/),

$$\hat{e}_1 = \hat{u}_1 \hat{\epsilon}_1 \hat{x}^{-1}, \qquad \hat{e}_2 = \hat{u}_2 \hat{\epsilon}_2 \hat{x}^{-1}. \tag{1}$$

In these spaces the data are represented by the diagonal nonnegative matrices $\hat{\epsilon}_1$ and $\hat{\epsilon}_2$, which satisfy $\langle k|\hat{\epsilon}_1|m\rangle \equiv \epsilon_{1,m}\delta_{km} \geq 0$ and $\langle k|\hat{\epsilon}_2|m\rangle \equiv \epsilon_{2,m}\delta_{km} \geq 0$ for all $1 \leq k, m \leq M$. The $m$th genelet is expressed only in the two $m$th arraylets, each of which corresponds to one of the two data sets. Therefore, each genelet is decoupled from all other genelets in both data sets simultaneously.

The antisymmetric angular distance between the data sets,

$$\theta_m = \arctan(\epsilon_{1,m}/\epsilon_{2,m}) - \pi/4, \tag{2}$$

indicates the relative significance of the $m$th genelet, i.e., its significance in the first data set relative to that in the second in terms of the ratio of the expression information captured by this genelet in the first data set to that in the second. An angular distance of 0 indicates a genelet of equal significance in both data sets, with $\epsilon_{1,m} = \epsilon_{2,m}$; ±π/4 indicates no significance in the second data set relative to the first, with $\epsilon_{1,m} \gg \epsilon_{2,m}$, or in the first relative to the second, with $\epsilon_{1,m} \ll \epsilon_{2,m}$, respectively. The angular distances are arranged in decreasing order of significance in the first data set relative to the second such that $\pi/4 \geq \theta_1 \geq \cdots \geq \theta_M \geq -\pi/4$. The “generalized fractions of eigenexpression” of each data set separately indicate the significance of each genelet and its corresponding arraylet in this data set in terms of the fraction of the overall expression information that they capture in this data set alone (see *Appendix*, Eqs. **4** and **5**, and Fig. 6, which are published as supporting information on the PNAS web site).
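As a small illustration of the angular distance, the sketch below assumes the definition $\theta_m = \arctan(\epsilon_{1,m}/\epsilon_{2,m}) - \pi/4$, which reproduces the limiting cases described above (0 for equal significance, ±π/4 for exclusive significance). The function name `angular_distances` and the input values are hypothetical, chosen only for this example.

```python
import numpy as np

def angular_distances(eps1, eps2):
    """Angular distance arctan(eps1 / eps2) - pi/4 for each genelet."""
    eps1 = np.asarray(eps1, dtype=float)
    eps2 = np.asarray(eps2, dtype=float)
    # arctan2 covers eps2 == 0 (ratio -> infinity) without dividing by zero.
    return np.arctan2(eps1, eps2) - np.pi / 4

# Hypothetical generalized singular values for three genelets:
# equal in both data sets, exclusive to the first, exclusive to the second.
theta = angular_distances([1.0, 2.0, 0.0], [1.0, 0.0, 3.0])
```

Using `arctan2` rather than an explicit ratio is a design choice of this sketch, so that a genelet with no significance in the second data set still maps to exactly π/4.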

The transformation matrix $\hat{x}^{-1}$ defines the $M$-genelets × $M$-arrays basis set that is shared by both data sets. The transformation matrices $\hat{u}_1$ and $\hat{u}_2$ define the $N_1$-genes × $M$-arraylets and $N_2$-genes × $M$-arraylets basis sets that correspond to the first and second data sets, respectively. The vector in the $m$th row of $\hat{x}^{-1}$, $\langle\gamma_m| \equiv \langle m|\hat{x}^{-1}$, lists the expression of the $m$th genelet across the different arrays in both data sets simultaneously. The vectors in the $m$th columns of $\hat{u}_1$ and $\hat{u}_2$, $|\alpha_{1,m}\rangle \equiv \hat{u}_1|m\rangle$ and $|\alpha_{2,m}\rangle \equiv \hat{u}_2|m\rangle$, list the genome-scale expression in the $m$th arraylets of the first and second data sets, respectively. The genelets are normalized, such that $\langle\gamma_m|\gamma_m\rangle = 1$ for all $1 \leq m \leq M$, but not necessarily orthogonal, superpositions of the genes of the first and, at the same time, the second data set. The arraylets of either data set are orthonormal superpositions of the arrays of this data set such that, in general, $\hat{x}^{-1}$ is nonorthogonal, whereas $\hat{u}_1$ and $\hat{u}_2$ are both orthogonal,

$$\hat{u}_1^T\hat{u}_1 = \hat{u}_2^T\hat{u}_2 = \hat{I}, \tag{3}$$

where $\hat{I}$ is the identity matrix. Therefore, each arraylet of either data set is decoupled and decorrelated from all other arraylets of this data set. The genelets and arraylets are unique, and therefore also data-driven, up to a phase factor of ±1, because each genelet and arraylet capture both parallel and antiparallel gene- or array-expression patterns, respectively, except in degenerate subspaces, defined by subsets of equal angular distances.

### GSVD Calculation.

From Eqs. 1 and 3, the $M$-arrays × $M$-arrays symmetric correlation matrices $\hat{a}_1 = \hat{e}_1^T\hat{e}_1 = (\hat{x}^{-1})^T\hat{\epsilon}_1^2\hat{x}^{-1}$ and $\hat{a}_2 = \hat{e}_2^T\hat{e}_2 = (\hat{x}^{-1})^T\hat{\epsilon}_2^2\hat{x}^{-1}$ are represented in the $M$-genelets × $M$-genelets space by the simultaneously diagonal matrices $\hat{\epsilon}_1^2$ and $\hat{\epsilon}_2^2$, respectively. In theory, it is possible to calculate the GSVD of the two data sets $\hat{e}_1$ and $\hat{e}_2$ by (*i*) diagonalizing $\hat{a}_2^{-1}\hat{a}_1 = \hat{x}(\hat{\epsilon}_2^{-1}\hat{\epsilon}_1)^2\hat{x}^{-1}$ to obtain $\hat{x}$; (*ii*) projecting $\hat{x}$ onto $\hat{e}_1$ and $\hat{e}_2$ to obtain $\hat{\epsilon}_1^2 = (\hat{u}_1\hat{\epsilon}_1)^T(\hat{u}_1\hat{\epsilon}_1) = (\hat{e}_1\hat{x})^T(\hat{e}_1\hat{x})$ and $\hat{\epsilon}_2^2$; and (*iii*) projecting $\hat{x}$, $\hat{\epsilon}_1^{-1}$, and $\hat{\epsilon}_2^{-1}$ onto $\hat{e}_1$ and $\hat{e}_2$ to obtain $\hat{u}_1 = \hat{e}_1\hat{x}\hat{\epsilon}_1^{-1}$ and $\hat{u}_2$. In practice, we avoid computing the quotient of the correlation matrices, $\hat{a}_2^{-1}\hat{a}_1$, and use the numerically robust GSVD algorithm (8, 9) to obtain $\hat{x}$.

### Comparative Pattern Inference.

The decorrelation of the arraylets suggests that some of the significant arraylets of each data set, i.e., those with the largest generalized fractions of eigenexpression (see *Appendix*, Eqs. **4** and **5**, and Fig. 6), may represent independent cellular states, where the corresponding genelets represent the corresponding regulatory programs, biological processes, or experimental artifacts that contribute to the overall expression signal in each data set. The one-to-one correspondence between the two sets of experimental conditions that underlie the two data sets suggests that among these genelets, a genelet of equal significance in both data sets with angular distance of ≈0 may represent a process common to both data sets; a genelet of no significance in one data set relative to the other with angular distance of ≈±π/4 may represent a process exclusive to the latter data set. We infer that a genelet represents a process exclusive to one or common to both data sets when its expression pattern across the corresponding one or both sets of arrays is biologically or experimentally interpretable. We associate this genelet with a biological process when this inference is supported by one or two coherent biological themes, reflected in the functions of the genes of the corresponding one or both data sets, whose coefficients of this genelet in the GSVD expansion, as listed in the corresponding one or both arraylets, are largest in magnitude compared to those coefficients of all other genes. With this we assume that the corresponding one or both arraylets represent the cellular states of this exclusive or common process, respectively. We estimate the probabilistic significance of these associations by annotations using combinatorics (ref. 10; see *Appendix*, Fig. 7, and Table 1, which are published as supporting information on the PNAS web site).
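The combinatorial estimate can be illustrated with a short sketch. The Results state that the *P* values assume a hypergeometric distribution of the annotations among the genes; the sketch below assumes the *P* value is the upper tail, i.e., the probability of finding at least the observed number of annotated genes among those selected. The function name `hypergeom_p_value` and its arguments are hypothetical, and the exact procedure of ref. 10 and the *Appendix* may differ.

```python
from math import comb

def hypergeom_p_value(n_genes, n_annotated, n_selected, n_hits):
    """P(X >= n_hits) when n_selected genes are drawn without replacement
    from n_genes genes, n_annotated of which carry the annotation."""
    total = comb(n_genes, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total

# Toy example: 10 genes, 5 annotated, 5 selected, all 5 selected are annotated.
p_all_hits = hypergeom_p_value(10, 5, 5, 5)
# With no hits required, the tail covers every outcome.
p_no_requirement = hypergeom_p_value(10, 5, 5, 0)
```

Exact integer binomial coefficients are used here so that small tail probabilities are not lost to floating-point cancellation before the final division.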

### Comparative Data Reconstruction.

The decoupling of the genelets and both sets of arraylets allows reconstructing either data set in a given subspace of $K$ genelets and corresponding arraylets without eliminating genes or arrays, $\hat{e}_i \rightarrow \sum_{k=1}^{K} \epsilon_{i,k}|\alpha_{i,k}\rangle\langle\gamma_k|$, where $i = 1, 2$. For visualization and classification, we set the arithmetic mean of each genelet across the arrays and that of each arraylet across the genes to 0, such that the expression of each gene and array in the reconstructed data set is centered at its array- or gene-invariant level, respectively.
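The reconstruction in a subspace of genelets can be sketched as follows, assuming the factors of one data set are already available as arrays: `u` (genes × arraylets), `eps` (generalized singular values), and `x_inv` (genelets × arrays). These names, the function `reconstruct`, and the synthetic factors are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def reconstruct(u, eps, x_inv, subset, center=False):
    """Sum eps_k |alpha_k><gamma_k| over the genelets listed in `subset`."""
    ks = np.asarray(subset)
    arraylets = u[:, ks]       # genes x K
    genelets = x_inv[ks, :]    # K x arrays
    if center:
        # Zero mean of each arraylet across genes and each genelet across arrays.
        arraylets = arraylets - arraylets.mean(axis=0)
        genelets = genelets - genelets.mean(axis=1, keepdims=True)
    return (arraylets * eps[ks]) @ genelets

# Synthetic factors for a data set of 30 genes and 5 arrays.
rng = np.random.default_rng(1)
u, _ = np.linalg.qr(rng.standard_normal((30, 5)))
eps = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
x_inv = rng.standard_normal((5, 5))
x_inv /= np.linalg.norm(x_inv, axis=1, keepdims=True)
e = (u * eps) @ x_inv

partial = reconstruct(u, eps, x_inv, [2, 3])
rest = reconstruct(u, eps, x_inv, [0, 1, 4])
centered = reconstruct(u, eps, x_inv, [2, 3], center=True)
```

Note that the reconstructed matrix keeps the full genes × arrays shape: restricting to a subset of genelets removes patterns, not genes or arrays.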

### Comparative Data Classification.

Inferring that subsets of genelets and arraylets represent independent processes or states, exclusive to one or common to both data sets, allows classifying the genes and arrays of one or simultaneously both data sets by similarity in their expression of these genelets or arraylets, respectively, rather than their overall expression. We least-squares-approximate a subspace spanned by $K > 2$ genelets with that spanned by the two orthonormal vectors $|x\rangle$ and $|y\rangle$, which maximize $\sum_{k=1}^{K}\langle\gamma_k|(|x\rangle\langle x| + |y\rangle\langle y|)|\gamma_k\rangle$. We plot the projection of each gene of either data set $\langle g_{i,n}|$, where $i = 1, 2$, from the $K$-genelets subspace onto $|y\rangle$, $\sum_{k=1}^{K}\epsilon_{i,k}\langle n|\alpha_{i,k}\rangle\langle\gamma_k|y\rangle/N_{i,n}$, along the $y$ axis vs. that onto $|x\rangle$ along the $x$ axis, normalized by its ideal amplitude, where the contribution of each genelet to the overall projected expression of the gene adds up rather than cancels out, $N_{i,n}^2 = \sum_{k=1}^{K}\sum_{l=1}^{K}\epsilon_{i,k}\epsilon_{i,l}|\langle n|\alpha_{i,k}\rangle\langle\alpha_{i,l}|n\rangle\langle\gamma_k|(|x\rangle\langle x| + |y\rangle\langle y|)|\gamma_l\rangle|$. In this plot, the distance of each gene from the origin, $r_{i,n}$, is the amplitude of its normalized projection. An amplitude of 1 indicates that the genelets add up; 0 indicates that they cancel out. The phase difference of each gene from the $x$ axis, $\varphi_{i,n}$, is its phase in the progression of expression across the genes from $|x\rangle$ to $|y\rangle$ and back to $|x\rangle$, going through the projections of all $K$ genelets in this subspace, $(|x\rangle\langle x| + |y\rangle\langle y|)|\gamma_k\rangle$. We sort the genes according to $\varphi_{i,n}$. Similarly, we plot the projection of each array, $|a_{i,m}\rangle$, from the $K$-arraylets subspace onto $\sum_{k=1}^{K}|\alpha_{i,k}\rangle\langle\gamma_k|y\rangle$, $\sum_{k=1}^{K}\epsilon_{i,k}\langle y|\gamma_k\rangle\langle\gamma_k|m\rangle/N_{i,m}$, along the $y$ axis vs. that onto $\sum_{k=1}^{K}|\alpha_{i,k}\rangle\langle\gamma_k|x\rangle$ along the $x$ axis, normalized by its ideal amplitude, $N_{i,m}^2 = \sum_{k=1}^{K}\sum_{l=1}^{K}\epsilon_{i,k}\epsilon_{i,l}|\langle m|\gamma_k\rangle\langle\gamma_l|m\rangle\langle\gamma_k|(|x\rangle\langle x| + |y\rangle\langle y|)|\gamma_l\rangle|$. We sort the arrays according to their phase differences from the $x$ axis, $\varphi_{i,m}$.

## Biological Results: Comparison of Yeast and Human Cell-Cycle Expression Data Sets

Spellman *et al.* (11) monitored mRNA levels for 6,113 putative ORFs of the yeast *Saccharomyces cerevisiae* over two cell-cycle periods in a yeast culture synchronized initially in the cell-cycle stage M/G_{1} by the pheromone α factor, relative to reference mRNA from an asynchronous culture, at 7-min intervals for 119 min. The data set for the yeast experiments we analyze (see Data Sets 1–4, which are published as supporting information on the PNAS web site, and the Mathematica notebook at http://genome-www.stanford.edu/GSVD/) tabulates the ratios of gene-expression levels for the $N_1$ = 4,523 genes with no missing data in at least 15 of the $M_1$ = 18 arrays. Of these genes, 604 were classified as cell cycle-regulated by Spellman *et al.*, and 77 were classified by traditional methods. Whitfield *et al.* (12) monitored mRNA levels for 43,198 human gene clones over two and a half cell-cycle periods in a HeLa cell-line culture synchronized initially in S by a double-thymidine block, relative to reference mRNA from an asynchronous HeLa culture, at 2-h intervals for 34 h. The data set for the human experiments we analyze (see Data Sets 5–8, which are published as supporting information on the PNAS web site) tabulates the ratios of gene-expression levels for the $N_2$ = 12,056 clones with no missing data in at least 15 of the $M_2$ = 18 arrays. Of these clones, 750 were classified as cell cycle-regulated by Whitfield *et al.*, and 73 were classified by traditional methods. We estimate the missing data in each data set using SVD (ref. 2; see *Appendix* and Figs. 8–11, which are published as supporting information on the PNAS web site) and calculate the GSVD of both data sets.
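The selection of genes with sufficient data can be sketched as below, assuming missing measurements are encoded as NaN in a genes × arrays matrix of expression ratios. The function name `keep_rows_with_enough_data`, the NaN encoding, and the toy matrix are assumptions of this example; the SVD-based estimation of the remaining missing values is described in the *Appendix* and is not reproduced here.

```python
import numpy as np

def keep_rows_with_enough_data(ratios, min_valid=15):
    """Keep genes (rows) measured in at least `min_valid` arrays (columns)."""
    valid_counts = np.count_nonzero(~np.isnan(ratios), axis=1)
    mask = valid_counts >= min_valid
    return ratios[mask], mask

# Toy matrix of 3 genes x 4 arrays; require at least 3 valid arrays per gene.
ratios = np.array([
    [1.0, np.nan, 2.0, 0.5],
    [np.nan, np.nan, 1.5, 0.7],
    [0.9, 1.1, 1.0, 1.2],
])
kept, mask = keep_rows_with_enough_data(ratios, min_valid=3)
```

For the data sets described above, the default `min_valid=15` would correspond to the threshold of 15 of the 18 arrays.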

### Common Yeast and Human Cell-Cycle Subspace.

The time, i.e., array variations of the third, fourth, and fifth genelets, 〈γ_{3}|, 〈γ_{4}|, and 〈γ_{5}|, that are almost equally significant in both data sets (slightly more in the yeast data), with 0 < θ_{3}, θ_{4}, θ_{5} < π/16 (Fig. 1), fit normalized cosine functions of two periods and initial phases of π/3, 0, and −π/3, respectively, superimposed on time-invariant expression (Fig. 2). The genelets 〈γ_{14}|, 〈γ_{15}|, and 〈γ_{16}|, which are also almost equally significant in both data sets (slightly more in the human data), with −π/6 < θ_{14}, θ_{15}, θ_{16} < 0, fit normalized cosines of two and a half periods and initial phases of −π/3, π/3, and 0, respectively. Coherent themes of yeast and human cell-cycle programs emerge from the annotations of the 100 yeast and 100 human genes (13, 14), with largest parallel and separately also antiparallel contributions from each one of these six genelets as listed in the corresponding yeast and human arraylets (see Data Sets 9 and 10, which are published as supporting information on the PNAS web site). We associate all these six genelets with the cell-cycle gene-expression oscillations common to both the yeast and human genomes and manifested in both data sets. We assume that the corresponding six yeast and six human arraylets represent the yeast and human cell-cycle cellular states, respectively. The probabilistic significance of these associations by annotations, estimated using combinatorics, is high: Most of the *P* values, calculated assuming hypergeometric probability distribution of the annotations among the genes, are orders of magnitude <0.01 (ref. 10; see *Appendix*, Fig. 7, and Table 1). Following the traditional classifications, the 0-phase genelet 〈γ_{4}| is associated in parallel with the yeast cell-cycle stage M/G_{1}, in which the yeast culture is initially synchronized, and both 0-phase genelets 〈γ_{4}| and −〈γ_{16}| are associated in parallel with the human cell-cycle stage S, in which the human culture is initially synchronized.
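To illustrate what fitting a genelet to a cosine of a given number of periods involves, the sketch below estimates amplitude, initial phase, and time-invariant offset by linear least squares. It assumes the arrays are equally spaced and that array $m$ of $M$ sits at the fraction $m/M$ of the time course; the function name `fit_cosine_phase` and the synthetic pattern are hypothetical, and the fitting procedure actually used for Fig. 2 may differ.

```python
import numpy as np

def fit_cosine_phase(values, n_periods):
    """Least-squares fit of offset + A*cos(2*pi*n_periods*t + phase), t = m/M."""
    values = np.asarray(values, dtype=float)
    t = np.arange(values.size) / values.size
    w = 2 * np.pi * n_periods * t
    design = np.column_stack([np.cos(w), np.sin(w), np.ones_like(w)])
    (a, b, offset), *_ = np.linalg.lstsq(design, values, rcond=None)
    # A*cos(w + phase) = A*cos(phase)*cos(w) - A*sin(phase)*sin(w)
    return np.hypot(a, b), np.arctan2(-b, a), offset

# Synthetic 18-array pattern: two periods, initial phase pi/3, constant offset.
t = np.arange(18) / 18
pattern = 0.1 + 0.3 * np.cos(2 * np.pi * 2 * t + np.pi / 3)
amplitude, phase, offset = fit_cosine_phase(pattern, n_periods=2)
```

Writing the cosine as a linear combination of a cosine and a sine at the same frequency turns the phase estimate into an ordinary linear regression, so no iterative optimizer is needed.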

**Fig. 1.** (*a*) Raster display of $\hat{x}^{-1}$, the expression of 18 genelets in 18 yeast and human arrays simultaneously, centered at their array-invariant levels. (*b*) Bar chart of the angular distances showing 〈γ ...

**Fig. 2.** (*a*) 〈γ_{3}| (red), 〈γ_{4}| (blue), and 〈γ_{5}| (green), which are associated with the common yeast and human cell-cycle gene-expression ...

When the expression of the 18 yeast arrays is projected from this six-dimensional yeast arraylets subspace onto the two-dimensional subspace that approximates it, ≥50% of the contributions of the six arraylets add up (rather than cancel out) in the overall expression of 16 arrays, the normalized amplitudes of which satisfy $0.5 \leq r_{1,m} < 1$ (Fig. 3). Sorting the arrays according to their phases, $\{\varphi_{1,m}\}$, gives an array order similar to that of the cell-cycle time points measured by the arrays, which describes the yeast cell-cycle progression from the M/G_{1} stage through G_{1}, S, S/G_{2}, and G_{2}/M back to M/G_{1} twice. Because the projection of the 0-phase arraylets |α_{1,4}〉 and −|α_{1,16}〉, which correspond to the 0-phase genelets, 〈γ_{4}| and −〈γ_{16}|, is correlated with the arrays $|a_{1,1}\rangle$, $|a_{1,2}\rangle$, and $|a_{1,10}\rangle$ and also $|a_{1,9}\rangle$ and $|a_{1,18}\rangle$, we associate both yeast 0-phase arraylets with the cell-cycle cellular state of transition from G_{2}/M to M/G_{1}, in which the yeast culture is synchronized initially. When the expression of the 18 human arrays is projected from the six-dimensional human arraylets subspace onto the two-dimensional subspace that approximates it, ≥50% of the contributions of the six arraylets add up in the expression of 16 arrays. Sorting the arrays describes the human cell-cycle progression from S through G_{2}, G_{2}/M, M/G_{1}, and G_{1}/S back to S two and a half times. Because the projection of the 0-phase arraylets, |α_{2,4}〉 and −|α_{2,16}〉, is correlated with the arrays $|a_{2,2}\rangle$ and $|a_{2,9}\rangle$, we associate both human 0-phase arraylets with the cell-cycle stage S, in which the human culture is synchronized.

**Fig. 3.** Yeast (*a–c*) and human (*d–f*) expression reconstructed in the six-dimensional cell-cycle subspaces approximated by two-dimensional subspaces. (*a*) Yeast array expression, projected onto π/2-phase along the *y* axis vs. that onto 0-phase ...

Projecting the expression of the yeast and human genes from the six-dimensional genelets subspace onto the two-dimensional subspace that approximates it, ≥50% of the contributions of the six genelets add up in the overall expression of 547 of the 604 yeast genes that were classified as cell cycle-regulated by Spellman *et al.* (11), 709 of the 750 human genes classified by Whitfield *et al.* (12), and 71 of the 77 yeast and 71 of the 73 human genes classified by traditional methods (including, e.g., 14 of 16 human histones, that were not classified by Whitfield *et al.* as cell cycle-regulated based on their overall expression). Simultaneous classification of the yeast and human genes into the five cell-cycle stages describes the yeast and human cell cycles' progression along the yeast and human genes, respectively, and is in good agreement with the classifications by Spellman *et al.* and Whitfield *et al.* and also the traditional ones. Because the projection of the 0-phase genelets, 〈γ_{4}| and −〈γ_{16}|, is correlated with yeast genes that peak late in G_{2}/M and early in M/G_{1} and human genes that peak in S, we associate 〈γ_{4}| and −〈γ_{16}| with cell-cycle expression oscillations of yeast at the transition from G_{2}/M to M/G_{1} and human at S. This simultaneous classification therefore outlines a correspondence between the groups of yeast genes and those of human genes, e.g., yeast genes that peak at M/G_{1} correspond to human genes that peak at S, the cell-cycle stages in which the yeast and human cultures are synchronized initially, respectively.
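The amplitude and phase used in this classification can be sketched as follows. The sketch takes the two orthonormal vectors to be the top two right singular vectors of the matrix of selected genelets, which is one way to maximize the summed squared projections of the genelets; the orientation of the two axes within that plane is arbitrary here, whereas the analysis above aligns them with the 0-phase and π/2-phase patterns. The names `classify`, `coeffs`, and `genelets`, and the random inputs, are hypothetical; `coeffs[n, k]` stands for the generalized singular value of genelet $k$ times the entry of gene $n$ in the corresponding arraylet.

```python
import numpy as np

def classify(coeffs, genelets):
    """Normalized amplitude r and phase phi of each gene in a K-genelet subspace."""
    # |x>, |y>: top two right singular vectors of the K x M genelet matrix.
    _, _, vt = np.linalg.svd(genelets, full_matrices=False)
    gx, gy = genelets @ vt[0], genelets @ vt[1]      # <gamma_k|x>, <gamma_k|y>
    px, py = coeffs @ gx, coeffs @ gy                # projections of each gene
    # |<gamma_k|(|x><x| + |y><y|)|gamma_l>| for all pairs k, l.
    proj = np.abs(np.outer(gx, gx) + np.outer(gy, gy))
    a = np.abs(coeffs)
    # Ideal amplitude: every genelet contribution adds up instead of cancelling.
    ideal = np.sqrt(np.einsum("nk,kl,nl->n", a, proj, a))
    r = np.divide(np.hypot(px, py), ideal, out=np.zeros_like(ideal), where=ideal > 0)
    return r, np.arctan2(py, px)

# Synthetic example: 200 genes, K = 3 unit-norm genelets across 18 arrays.
rng = np.random.default_rng(2)
genelets = rng.standard_normal((3, 18))
genelets /= np.linalg.norm(genelets, axis=1, keepdims=True)
coeffs = rng.standard_normal((200, 3))
coeffs[0] = [1.0, 0.0, 0.0]      # a gene that expresses a single genelet
r, phi = classify(coeffs, genelets)
order = np.argsort(phi)          # genes sorted by phase
mostly_adding = r >= 0.5         # at least 50% of the contributions add up
```

A gene that expresses a single genelet has nothing to cancel, so its normalized amplitude is 1; the same computation applies to the arrays after exchanging the roles of genelets and arraylets as in the formulas for $N_{i,m}$.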

With all 4,523 yeast and 12,056 human genes sorted, the gene variations of the six yeast and six human arraylets approximately fit one-period cosines of π/3, 0, and −π/3 initial phases (Fig. 4) such that the initial phase of each arraylet is similar to that of its corresponding genelet. Both sorted and reconstructed yeast and human expressions approximately fit traveling waves of one-period cosinusoidal variation across the genes and of two or two and a half periods across the arrays, respectively.

### Exclusive Yeast Pheromone-Response Subspace.

The genelets 〈γ_{1}| and 〈γ_{2}|, insignificant in the human data set relative to that of the yeast, with θ_{1}, θ_{2} > π/7 (Fig. 1), describe initial transient increase and decrease in expression, respectively (Fig. 2). A theme of yeast response to pheromone synchronization emerges from the annotations of those yeast genes with contributions from 〈γ_{1}| and 〈γ_{2}| that are largest in magnitude. The genelet 〈γ_{6}|, equally significant in both data sets with θ_{6} ∼ 0, describes an initial transient increase in expression superimposed on cosinusoidal variation. A theme of transition from pheromone response to cell-cycle progression emerges from the annotations of those yeast genes with contributions from 〈γ_{6}|, as listed in the corresponding yeast arraylet |α_{1,6}〉, that are largest in magnitude (see Data Set 9). We associate these three genelets and corresponding three yeast arraylets with the pheromone response, which is exclusive to the yeast genome. Classification of the yeast genes and arrays into pheromone-response stages in the subspaces spanned by these genelets and arraylets, respectively, is in good agreement with the traditional understanding of this program (ref. 13; Figs. 12–14, which are published as supporting information on the PNAS web site).

### Exclusive Human Stress-Response Subspace.

The genelets 〈γ_{17}| and 〈γ_{18}| are insignificant in the yeast data set relative to that of the human, with θ_{17}, θ_{18} < −π/6. A theme of human synchronization stress response emerges from the annotations of those human genes with contributions from 〈γ_{17}| and 〈γ_{18}| that are largest in magnitude. Also, from the annotations of those human genes with contributions from 〈γ_{6}|, as listed in the corresponding human arraylet |α_{2,6}〉, that are largest in magnitude emerges a theme of transition from stress response to cell-cycle progression (see Data Set 10). We associate these three genelets and corresponding three human arraylets with this human-exclusive stress response. Classification of the human genes and arrays into stress-response stages in the subspaces spanned by these genelets and arraylets, respectively, is in agreement with current understanding of this program (ref. 12; Figs. 15–17, which are published as supporting information on the PNAS web site).

### Differential Expression of Yeast Genes in the Exclusive Pheromone-Response and the Common Cell-Cycle Subspaces.

According to their expression in the yeast-exclusive pheromone-response subspace, the mRNA expression of both yeast genes *KAR4* and *CIK1* peaks early in the time course (together with that of other genes known to be involved in the α-factor response) (Fig. 3). In the common cell-cycle subspace, *KAR4* peaks at the G_{1} cell-cycle stage, whereas *CIK1* peaks almost half a cell-cycle period later (and also earlier) at S/G_{2} (Fig. 12). This differential expression of *CIK1* and *KAR4* in the pheromone-response program vs. the cell-cycle program is in agreement with the experimental observation of Kurihara *et al.* (15), who showed that induction of *CIK1* depends on that of *KAR4* during mating and is independent of *KAR4* during mitosis.

### Differential Expression of Human Genes in the Exclusive Stress-Response and the Common Cell-Cycle Subspaces.

In the human-exclusive stress-response subspace, most human histones reach their expression minima early (Fig. 3). In the common cell-cycle subspace, most histones peak early, together with other genes known to peak in the cell-cycle stage S (Fig. 14). This differential expression of most histones may explain why these histones do not appear to be cell cycle-regulated based on their overall expression.

## Conclusions

We have shown that GSVD provides a comparative mathematical framework for two genome-scale expression data sets, in which the variables and operations may represent some biological reality. Using GSVD in a comparison of yeast and human cell-cycle expression data sets, we were able to find (*i*) biological similarity in these two disparate organisms in terms of their mRNA expression during their cell-cycle programs; (*ii*) experimental dissimilarity in terms of yeast and human mRNA expression during their different synchronization-response programs; and (*iii*) differential gene expression in the yeast and human cell-cycle programs vs. their synchronization-response programs, respectively.

Possible additional applications of GSVD include comparison of two genomic data sets, each corresponding to (*i*) the same experiment repeated, e.g., using different experimental protocols, to separate the biological signal that is similar in both data sets from the dissimilar experimental artifacts; (*ii*) one of two different types of genomic information (e.g., DNA copy number, mRNA expression, or protein abundance) collected from the same set of samples (e.g., tumor samples) to elucidate the molecular composition of the overall biological signal in these samples; (*iii*) one of two chromosomes of the same organism to illustrate the relation, if any, between these chromosomes in terms of their, e.g., mRNA expression in a given set of samples; and (*iv*) one of two interacting organisms, e.g., during infection, to illuminate the exchange of biological information in these interactions.

## Acknowledgments

We thank G. H. Golub for insightful discussions of matrix computation, M. L. Whitfield for discussions of the human cell-cycle data and careful reading, and G. M. Church, S. R. Eddy, and E. Rivas for thoughtful reviews of this manuscript. This work was supported by National Cancer Institute Grants CA77097 (to D.B.) and CA85129 (to P.O.B.) and National Institute of General Medical Sciences Grant GM46406 (to D.B.). O.A. is a Sloan Foundation/Department of Energy Postdoctoral Fellow in Computational Molecular Biology (DE-FG03-99ER62836) and a National Human Genome Research Institute Individual Mentored Research Scientist Development Awardee in Genomic Research and Analysis (5 K01 HG00038-01). P.O.B. is a Howard Hughes Medical Institute Investigator.

## Abbreviations

- SVD, singular value decomposition
- GSVD, generalized SVD

## Notes

^{§}In this article *$\widehat{m}$* denotes a matrix, *|v〉* denotes a column vector, and *〈u|* denotes a row vector such that *$\widehat{m}$|v〉, 〈u|$\widehat{m}$,* and *〈u|v〉* all denote inner products, and *|v〉〈u|* denotes an outer product.

## References

**,**10101-10106. [PMC free article] [PubMed]

*et al.*(2002) Lancet 359

**,**1301-1307. [PubMed]

**,**334-339. [PMC free article] [PubMed]

**,**453-459. [PubMed]

**,**8409-8414. [PMC free article] [PubMed]

**,**398-405.

**,**281-285. [PubMed]

**,**3273-3297. [PMC free article] [PubMed]

**,**1977-2000. [PMC free article] [PubMed]

*et al.*(2002) Nucleic Acids Res. 30

**,**69-72. [PMC free article] [PubMed]

*et al.*(2001) Nucleic Acids Res. 29

**,**152-155. [PMC free article] [PubMed]

**,**3990-4002. [PMC free article] [PubMed]

**National Academy of Sciences**


Proceedings of the National Academy of Sciences of the United States of America. 2003 Mar 18; 100(6):3351.
