- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- NIHPA Author Manuscripts
- PMC2813864

# A Multivariate Approach for Integrating Genome-wide Expression Data and Biological Knowledge

^{1}Department of Cardiology, Children’s Hospital Boston, 300 Longwood Avenue, Boston, MA 02115, USA

^{2}Informatics Program, Children’s Hospital Boston, 300 Longwood Avenue, Boston, MA 02115, USA

^{3}Harvard-Partners Center for Genetics and Genomics, 77 Avenue Louis Pasteur, Boston, MA 02115, USA

## Abstract

### Motivation

Several statistical methods that combine analysis of differential gene expression with biological knowledge databases have been proposed for a more rapid interpretation of expression data. However, most such methods are based on a series of univariate statistical tests and do not properly account for the complex structure of gene interactions.

### Results

We present a simple yet effective multivariate statistical procedure for assessing the correlation between a subspace defined by a group of genes and a binary phenotype. A subspace is deemed significant if the samples corresponding to different phenotypes are well separated in that subspace. The separation is measured using Hotelling’s *T*^{2} statistic, which captures the covariance structure of the subspace. When the dimension of the subspace is larger than that of the sample space, we project the original data to a smaller orthonormal subspace. We use this method to search through functional pathway subspaces defined by Reactome, KEGG, BioCarta, and Gene Ontology. To demonstrate its performance, we apply this method to the data from two published studies, and we visualize the results in the principal component space.

## INTRODUCTION

Microarray technology enables the simultaneous monitoring of expression profiles on a genome-scale and has numerous areas of application. With the identification of differentially expressed genes between different cancer types, for example, the microarray technique can aid in diagnosis (Pomeroy *et al.*, 2002). By correlating expression with phenotypic data such as patient survival time and responsiveness to treatment, it may also help in the prognosis and treatment of diseases (Park *et al.*, 2002; Holleman *et al.*, 2004).

Recently proposed analytical tools for microarray data combine statistical analysis with *a priori* biological knowledge from expert-curated databases. In general, a univariate statistical score is first computed for each gene. Then a second statistical procedure is used to determine whether a particular category of genes are over-represented among the top-scoring genes. Annotations from different databases provide a multitude of different gene categories for this process. Fisher’s exact statistic, Kolmogorov-Smirnov statistic, and a simple average of univariate statistical scores (e.g., average of negative log *p*-values) are among the common measures of significance for a group of genes. A large number of tools are available for this type of analysis including MAPPFinder (Doniger *et al.*, 2003), DAVID/EASE (Dennis *et al.*, 2003), Gene Set Enrichment Analysis (GSEA) (Lamb *et al.*, 2003; Mootha *et al.*, 2003; Subramanian *et al.*, 2005), and ermineJ (Pavlidis *et al.*, 2004). Especially notable is GSEA, in which genes are rank-ordered according to a signal-to-noise ratio and the distribution on this list of the genes from a gene set is used to compute the score for each gene set. If a large fraction of genes in a gene set shows up near the top of the ordered list, that gene set gets a high score, as measured by the Kolmogorov-Smirnov test. Model-based methods include global test by Goeman *et al.* (2004) and the analysis of covariance (ANCOVA) approach by Mansmann and Meister (2005). These two tests are equivalent in the case of independent genes but appear to lose some power for correlated genes (Mansmann and Meister, 2005).

Advantages of the approach combining expression data and biological knowledge are at least two-fold. First, a biological interpretation of the results can be facilitated by categorizing the differentially expressed genes into functional groups. This is especially true when the user is not familiar with the statistically significant genes. Second, a weak effect in a group of genes may be missed when each gene is considered individually, but it may be captured when they are considered together (Mootha *et al.*, 2003).

However, in the simple methods using Fisher’s test or *χ*^{2} test from a contingency table, one problem has been the instability of the results that depends on the threshold value for statistical significance (Pan *et al.*, 2005). This problem has been alleviated to an extent with some more recent methods, which employ a statistic that considers the entire list of genes and avoids the thresholding. Another shortcoming shared by most current tools, however, is the use of a univariate statistic to score each gene, prior to the calculation of a group score. Genes in a gene set are functionally related and are not independent; the complex structure of gene interactions within a gene set are not fully captured using univariate approaches. Sample groups that do not seem to be separated according to a series of univariate measures may be well separated when a joint distribution is considered.

In this paper, we present a multivariate approach to address this problem. As in other methods, we seek to determine the significance of every pre-defined group of genes in order to choose the most relevant ones. The novelty in the proposed method is that we recast this problem as measuring the separation of the samples by their phenotype in each subspace of genes. Considered this way, it becomes natural to employ a variety of multivariate methods commonly used for class discovery and prediction. In the context of two-group comparison, the *t*-statistic or some variation based on it is a standard measure of significance. Such a univariate method is used in most of the methods described above. In the proposed method, we instead use Hotelling’s *T*^{2} (or, equivalently, the Mahalanobis distance, which differs only by a constant). This is the multidimensional analogue of the *t*-statistic that accounts for the correlation structure. A number of methods, such as the Between-Group Analysis (Culhane *et al.*, 2002) based on correspondence analysis, also attempt to find a subspace in which the samples are maximally separated. In the current work, we use a formal statistic that allows us to compare different subspaces.

Hotelling’s *T*^{2} has been already employed for identification of differentially expressed genes (Szabo *et al.*, 2003; Kim *et al.*, 2005; Lu *et al.*, 2005). Kim *et al.* (2005) compared Hotelling’s *T*^{2} statistic and a univariate procedure in detection of differentially expressed genes and found Hotelling’s *T*^{2} to be more efficient. Lu *et al.* (2005) used the same statistic with a search algorithm to identify a set of differentially expressed genes. They reported that Hotelling’s *T*^{2} gave fewer false positives and false negatives than the univariate *t*-test when a spike-in data set from Affymetrix was analyzed. One shortcoming in that work, however, was that the number of differentially expressed genes found by *T*^{2} had to be smaller than the number of samples to avoid singularity in the inversion of the covariance matrix. In this study, we address this problem by transforming the data onto an orthonormal subspace using principal components first. This allows us to calculate the score even for the subspaces whose dimension is larger than that of the samples.

In the following, we describe the methodology in detail and apply it to two data sets of our interest. The first data set on the effect of TOR inhibitor on Akt transgenic mice (Majumder *et al.*, 2004) was originally analyzed using GSEA. In the re-analysis of this data set, we compare the performance of the proposed method to GSEA as well as the global test for a group of genes by Goeman *et al.* (2004). While we analyze the first data set using a collection of gene sets similar to the one used in the original paper for comparison, we search through a much larger set of subspaces for the second data set on the effect of left ventricular assist device (Hall *et al.*, 2004). These subspaces are defined by several biological knowledge databases. Previously defined sets include those curated from KEGG, BioCarta, and Gene Ontology (GO) (Tian *et al.*, 2005). For this work, we also derived new sets from the Reactome (Joshi-Tope *et al.*, 2005) database. The R source code and the gene sets implemented as an object in R are available from the authors upon request.

## METHODS

The current problem is naturally phrased in the language of vector spaces, where *n* samples are points in a *q* dimensional space of genes. An element in a gene set is an additional dimension of the vector space. Hence, concepts such as projection and subspaces developed in vector algebra provide a natural framework (Kuruvilla *et al.*, 2002). In the subsequent discussions, we use the term ‘subspace’ in place of ‘gene set’ or ‘functional category’ to emphasize that a statistical test is performed in a multidimensional space where the dimension is the number of genes.

### Comparing multivariate means of two groups using Hotelling’s *T*^{2}

Let **X**_{0} be the data matrix containing expression values, with *n*_{1} samples from first group and *n*_{2} samples from second group. *n* = *n*_{1} + *n*_{2} columns correspond to samples and *p* rows correspond to genes. Suppose the subspace that we wish to test is a *q* × *n* matrix **X** that contains *q* functionally related genes. The null hypothesis is that the multivariate means of the two groups are equal. The covariance matrices of the two groups are assumed to be the same and the pooled within-group covariance matrix is denoted by **S**.

Our test statistic is based on Hotelling’s *T*^{2}:

where
${\overline{\mathbf{X}}}_{i}=1/{n}_{i}{\sum}_{j=1}^{{n}_{i}}{\mathbf{X}}_{ij}$ denotes the mean vector of the *i*th group obtained by summing over the *j*th *q*-dimensional vector **X*** _{ij}* in group

*i*;

**S**denotes the pooled covariance matrix ((

*n*

_{1}− 1)

**S**

_{1}+ (

*n*

_{2}− 1)

**S**

_{2})/(

*n*− 2), where ${\mathbf{S}}_{i}=1/({n}_{i}-1){\sum}_{j=1}^{{n}_{i}}({\mathbf{X}}_{ij}-{\overline{\mathbf{X}}}_{i}){({\mathbf{X}}_{ij}-{\overline{\mathbf{X}}}_{i})}^{t}$. For the null hypothesis

*H*

_{0}:

_{1}=

_{2}, the sampling distribution of

*T*

^{2}follows (

*n*− 2)

*q*/(

*n*−

*q*− 1)

*F*

_{q,n}_{−}

_{q}_{−1}. If the covariance matrix cannot be assumed to be the same, a similar statistic (with

**S**

_{1}/

*n*

_{1}+

**S**

_{2}/

*n*

_{2}as the covariance term and a different constant) follows a

*χ*

^{2}distribution but only for a large sample size. But the unequal variance case can be easily incorporated since our significance testing is based on permutation. Multiple testing problem is addressed by false discovery rate (FDR) as described previously in Storey and Tibshirani (2003).

### Dimensionality Reduction

When the dimension of the subspace is smaller than that of the sample size (*q < n* − 1), Hotelling’s *T*^{2} can be applied in a straightforward way. When the dimension of the subspace is larger (*q* ≥ *n* − 1), a modification is necessary to deal with the singularity of the within-group covariance matrix **S**. For the gene set subspaces we compiled, the dimensions vary widely and handling the latter case well is crucial. There are several common ways to deal with this issue. The simplest is to ignore the correlations among genes and to set the off-diagonals in **S** to be zero, which results in a squared Euclidean distance between the two mean vectors. This is the approach taken in, e.g., von Heydebreck *et al.* (2001). The correlations among the genes, however, should play a critical role in determining the significance of the subspace, and should not be disregarded. Another way is to regularize **S** by adding a small constant *ε* to the diagonal. This has the effect of shifting the eigenvalues by *ε*. In one simple but effective variation, only the diagonal of **S** is still used but shrunken centroids for each group are used (Tibshirani *et al.*, 2002). Other sophisticated methods include Regularized Discriminant Analysis (Friedman, 1989), in which the separate covariance matrices as in Quadratic Discriminant Analysis are shrunk toward the pooled covariance matrix, and Penalized Discriminant Analysis (Hastie *et al.*, 1995), in which *ε***S**′ is added to **S** with a small constant *ε >* 0 and a suitably chosen symmetric and non-negative definite penalty matrix **S**′.

In our approach, we diagonalize the within-group covariance matrix by projecting the data onto an orthonormal subspace spanned by principal components of the covariance matrix. After this transformation, the coordinates are uncorrelated and each principal component has unit variance. This type of pretreatment of data are common in linear discriminant analysis (LDA) and independent component analysis (ICA). Mathematically, this involves computing the new data matrix

where the diagonal **D** and orthogonal **U** matrices are obtained from the decomposition of the covariance matrix **S** = **UDU*** ^{t}*. Since

**S**is singular, only the columns corresponding to the nonzero eigenvalues in

**D**are used for this transformation. Numerically, the eigenvalues decay rapidly in our examples and it is not difficult to find a reasonable threshold (e.g., 10

^{−4}) for determining the rank. We have verified that in working with the transformed data, the transition between the singular and the non-singular cases is smooth and that addition or subtraction of few genes from a subspace does not change the significance of a subspace unexpectedly. We note that a relatively simple method for projecting the data into a lower dimensional space was chosen for this step here and that more sophisticated methods may result in improved performance.

Computationally, the statistic for each subspace can be calculated fast. The data matrix **X** is generally small (*q* *p*), and, even when we have *q* *n*, the matrix decomposition can be done quickly using a trick. Instead of decomposing *q* × *q* matrix **S**, we adjust **X** by subtracting their group means and use the singular value decomposition relationship **X** = **UDV*** ^{t}*. Using the fact that the columns of

**V**in are the eigenvectors of a smaller

*n*×

*n*matrix

**X**

^{t}**X**, we first obtain

**V**and then calculate

**U**by

**X**=

**UDV**

*. It is easy to show that this is the desired*

^{t}**U**.

## RESULTS

### Application to the microarray data sets

We used two publicly available gene expression data sets to evaluate our method. For each subspace, Hotelling’s *T*^{2} statistic was calculated to test the null hypothesis that the two groups are indistinguishable in the subspace. Statistical significance was computed based on the null distribution obtained by permuting the group labels. To differentiate among many highly significant genes, a large number of permutations (20,000) was performed.

For the first data set, nearly the same gene sets used in the original study were used; for the second data set, functional annotations were collected from BioCarta, KEGG, Reactome, and Gene Ontology. The Affymetrix probe sets were mapped to the subspaces through the Entrez Gene identifiers. The conversion from probe set ID to Entrez Gene ID was performed using the annotation packages available through Bioconductor (http://www.bioconductor.org) in the statistical language R. When there were several probe set IDs for a single Entrez Gene ID, all of them were included. The annotations were dated March, 2005.

### Effect of mTOR inhibitor on Akt transgenic mice

Target of rapamycin (TOR) is a key protein that regulates the translation of ribosomal proteins in eukaryotes. Pathways upstream of TOR as well as the TOR pathway itself are frequently activated in cancer, and it has been extensively studied as a novel anticancer therapeutic target (Bjornsti and Houghton, 2004). In Majumder *et al.* (2004), Akt transgenic (Akt-Tg) mice were treated with mTOR inhibitor RAD001 and placebo, and the samples after 12 or 48 hours of RAD001 or placebo treatment in both wild-type and Akt-Tg mice were hybridized to microarrays. From this experiment, they reported that mTOR inhibition induced apoptosis of epithelial cells and a complete reversal of a neoplastic phenotype in the prostate of mice expressing human AKT1 in the ventral prostate. To identify the targets altered by AKT expression and by subsequent mTOR inhibition, they searched BioCarta gene sets and other manually curated gene sets for enriched categories of genes using GSEA. It was found that the main transcriptional response to AKT activation and mTOR inhibition involved the targets of Hif (Hypoxia-Inducible Factor)-1*α*.

To reanalyze this data set, we downloaded the Mouse 430A gene sets similar to the ones used in the original study from http://www.broad.mit.edu/gsea/msigdb/msigdb_index.html. The two are not exactly the same, as the authors have updated some pathways and added new ones. The current version contains 539 gene sets, which includes 274 BioCarta pathways. We calculated Hotelling’s *T*^{2} between the RAD001-treated (n = 9) and placebo-treated (n = 10) groups in the subspace defined by each of these pathways. The main results are shown in Table 1. In the original study, the Hif pathway was reported as the only significant gene set, with the *p*-value 0.034. In our analysis, the Hif pathway was also significant with the permutation *p*-value 0.0001 (*q*-value 0.000313) but was ranked 4th among the BioCarta pathways. The most significant pathway was Erk1/Erk2 Mapk Signaling Pathway (‘erkPathway’ in the GSEA result). The second ranked one was Hemoglobin’s Chaperon (‘ahspPathway’), which includes ALAS1/2, Hbs, and GATA1. Its role in Akt-Tg treated with RAD001 was not clear, but the Hif targets were considered to play an important role in the experiment and so the hypoxia-related hemoglobin changes can be expected. We also found that multiple pathways related to carbohydrate metabolism on the top of our list. Six out of the top 10 KEGG gene sets were from carbohydrate metabolism, which includes Glycolysis/Gluconeogenesis (KEGG: MAP00010). Among the BioCarta pathways, 6th ranked Feeder Pathway for Glycolysis (‘feederPathway’) was related to carbohydrate metabolism. This finding is also in accordance with the observations in the original paper.

*p*-value is based on 20,000 permutations of samples labels, and false discovery rates (

**...**

It is also important to note that our method appears to have much greater statistical power than GSEA, resulting in 234 curated pathways under the significance level of the permutation *p*-value *<* 0.05 (The top ones are shown in Table 1). In the original publication using GSEA, only one pathway was found to be significant. We also compared our method to the global test in the generalized linear model setting (Goeman *et al.*, 2004). For this example, the global test appears to detect few more subspaces, e.g., it detected 270 under the permutation *p*-value of 0.05, while our method detected 234. In this case, the overlap between the two were 152. Important subspaces such as the Erk1/Erk2 Mapk Signaling Pathway, Hif pathway, and KEGG TCA cycle were found in both cases.

Among the significant pathways identified, there often are a large number of shared genes. For instance, the vegfPathway (VEGF, Hypoxia, and Angiogenesis), which was significant with permutation *p*-value 0.00465 (*q*-value 0.00305), shares 5 genes (Hif-1*α*, NOS 3, Vegf, von Hippel-Lindau syndrome, and Arnt) with the Hif pathway. The relationship among pathways that share some of their members and their proper interpretation are subject to further investigation.

### Cardiac remodeling by left ventricular mechanical unloading

To investigate the underlying mechanism of cardiac remodeling after mechanical unloading in patients with heart failure, Hall *et al.* (2004) studied 19 paired human left ventricular apex samples that were harvested at the time of implant of a left ventricular assist device (pre-LVAD) and at the time of explant (post-LVAD). They identified 107 genes with false discovery rate of less than 1% that were differentially regulated in pre-LVAD versus post-LVAD groups. Among down-regulated genes, neurophilin-1 (a VEGF receptor), FGF9, Sprouty 1 (FGFR/EGFR antagonist), stromal-derived factor 1, and endomucin have been implicated in the regulation of vascular organization. Consequently the authors inferred that vascular gene expression is modulated by mechanical unloading. In addition, they found that GATA-4, a central regulator of cardiac gene transcription, was downregulated after mechanical unloading.

We applied our multivariate approach to discover gene subspaces in which gene expression was coordinately regulated between pre-LVAD and post-LVAD conditions. Unlike the previous example in which the same gene sets as in the original publication was used for comparison, we used a much larger database of subspaces. We collected 5,400 gene set subspaces (5000 from Gene Ontology and 400 from Reactome, KEGG, and BioCarta; described in Tian *et al.* (2005) and available from authors’ website), from which we selected 2,338 subspaces containing 5 to 250 genes for further analysis. The lower limit on size was imposed to eliminate subspaces that were too easily influenced by single genes and did not correspond to a pathway. The upper limit was necessary to avoid categories that were too general to be helpful in interpretation.

Thirty-eight subspaces were significant at permutation *p*-value 0.01 (Table 2). Compared to the first example in which 119 had *p*-values *<*0.01, not as many subspaces were significant and the *q*-values were relatively high. This is most likely due to the greater heterogeneity expected in the human clinical samples than in transgenic models. The distribution of univariate *p*-values confirms that the number of differentially expressed genes is much greater for the first data set for any threshold. Interestingly, the BioCarta ‘NFAT and Hypertrophy of the heart transcription in the broken heart’ subspace was ranked 5th (permutation *p*-value 0.003). As the title suggests, this subspace has multiple genes related to the cardiac hypertrophy and remodeling such as GATA-4, Nkx2.5, MAP kinases, NFAT, IGF, Akt, and calcium/calmodulin-dependent protein kinases. This subspace summarizes the changes of many genes that are known to play a role in the heart failure and cardiac remodeling, and its occurrence supports the ability of our method to identify significantly co-regulated genes many of which are not significant by univariate analysis.

**...**

The BioCarta gene subspace ‘MAP Kinase inactivation of SMRT corepressor’ was the most significant. Modulation of MAP kinase cascades was also supported by results from the GO Biological Process subspaces, where ‘GO:000185 activation of MAPKKK’ (permutation *p*-value *<* 0.00005) and its parent node ‘GO:0000165 MAPKKK cascade’ (permutation *p*-value 0.00825) were found to be significant. MAP kinase signaling pathways are pivotal mediators of diverse cellular functions, including growth, differentiation, and apoptosis. MAP kinase pathways are activated in heart failure (Haq *et al.*, 2001), and chronic activation in transgenic over-expression models has been associated with dilated cardiomyopathy and heart failure (reviewed in Liang and Molkentin (2003)).

The third ranked BioCarta pathway was ‘TACI and BCMA stimulation of B cell immune responses.’ As members of the TNF receptor gene family, TACI and BCMA interact with TNF receptor associated factors (TRAFs) to activate NF-*κ*B activation and MAP kinase pathways. The significant rank of this subspace again points to modulation of MAP kinase signaling, and also suggests that myocardial unloading alters TNF signaling. TNF and related cytokines are produced by myocardial cells after injury, and activation of TNF pathways in heart failure is well documented (Mann, 2003). Chronic TNF activation results in cardiomyopathy and detrimental cardiac remodeling (Mann, 2003). NF-*κ*B is a major target of TNF signaling pathways, and it has been found to be activated in failing human myocardium, where it is an important regulator of cardiomyocyte hypertrophy and apoptosis (Purcell and Molkentin, 2003). NF-*κ*B and its negative regulator IKB were also important components of two additional BioCarta pathways with highly significant permutation *p*-values: ‘Activation of PKC through G protein coupled receptor’ (permutation *p*-value 0.0022) and ‘NF-*κ*B activation by Nontypeable Hemophilus influenzae’ (permutation *p*-value 0.0092).

Activation of receptor tyrosine kinases, either by direct stimulation by ligand or by receptor transactivation through G protein coupled receptors, is important for the pathogenesis of heart failure (Asakura *et al*., 2002; Iwamoto *et al*., 2003). The BioCarta pathway ‘Role EGF Receptor Transactivation by GPCRs in Cardiac Hypertrophy’ received a highly significant permutation *p*-value (0.0058), consistent with a significant co-regulation of genes in this subspace during mechanical unloading. This was reinforced by the BioCarta pathway ‘Sprouty regulation of tyrosine kinase signals’, which delineates tyrosine kinase signaling pathways and their negative regulation by Sprouty. Placed in this broader context, the significant upregulation of Sprouty found in the original report may be related to regulation of myocardial receptor tyrosine kinase signaling, which is known to be perturbed in heart failure.

As shown above, a detailed examination of the BioCarta pathways with highly significant permutation *p*-values reveals a number of signaling pathways that play an important role in the pathogenesis of heart failure, as determined by experimental data from a number of model systems. Identification of these pathways highlights the ability of this approach to identify significant patterns of differential gene expression that are not apparent by single-gene analysis.

For this example, we again compared our method to the global test (Goeman *et al*., 2004) using their R package, which allows one to iterate through the possible subspaces given by the user easily. The results in this example were more discordant than those of the first. Similar number of subspaces were identified, but a couple of key subspaces were not found by the global test. For example, ‘TACI and BCMA stimulation of B cell immune responses’ had the permutation *p*-value of 0.0027 in our method but 0.193 in the global test. Similarly, ‘Sprouty regulation of tyrosine kinase signals,’ which was validated in the original paper, obtained the *p*-value of 0.0029 in our method and it obtained 0.11 in the global test. It is difficult to draw a strong conclusion of this evaluation, but this result appears to favor the proposed method in this example. The bigger advantage of the proposed method may be its simplicity and intuitive interpretation.

## DISCUSSION

Our goal in this study was to integrate biological knowledge with microarray data using a multivariate statistical approach. The main advantage of the proposed method is the simple conceptual framework of finding a subspace in which the two groups belonging to different phenotypes are most separated. This method is made possible by the dimensionality step described earlier, in which we project a subspace of high dimension onto a orthonormal space by matrix decomposition and sphering. This step is essential because most subspaces contain more genes than the number of samples in the study. A problem might still arise if the subspace is too large, but most subspaces do not contain more than one or two hundred genes. The proposed method should be intuitively appealing to biologists, many of whom may be already familiar with visualizing the samples in a two- or three-dimensional principal component plots.

A statistical method adjusting for different covariance structure and dimension of subspaces was described previously (Tian *et al.*, 2005), but the current approach has a simpler interpretation, especially with the aid of visualization. The statistical test proposed here is similar to the second of the two null hypothesis described in that paper but is in some sense less aggressive and more omnibus. For example, the current statistic does not distinguish between up- and down-regulated genes. If there is no correlation among genes, the statistic becomes the sum of squared *t*-statistics. The first null hypothesis in Tian *et al.* (2005) can still be used in addition if desired but we felt that this was not necessary here. Exactly which features of the subspaces should be considered most relevant from a biological perspective is not yet clear. With respect to other available methods, the proposed method should be able capture those relationship missed by univariate methods. Evidence on the data sets of our interest strongly suggests that the proposed method performs better, but a more extensive comparison is made difficult due to the differences in the set of subspaces used. Due to the evolution of biological databases, the original subspaces are not available for comparison.

The usefulness of this type of analysis depends on the quality of the gene sets as well as the statistical method used. The BioCarta, KEGG, and Reactome databases were curated by human experts and the functional annotation from these databases are considered to be specific and of high quality. The contents of many GO terms, however, are not as accurate, as their annotations come from several different sources with varying degrees of reliability. In particular, those derived by algorithmic approaches and not by human experts tend to be low in accuracy. Some of the inaccuracies may be due to the ambiguities in names and symbols (Weeber *et al.*, 2003). The IEA (Inferred from Electronic Annotation) type especially is regarded as the evidence of lowest quality, but this type of annotation is extremely common: in case of human Biological Process category, only 8,377 out of 20,566 annotations are from non-IEA sources as of April 2005. In addition, the relationships among the three categories, Biological Process, Molecular Function, and Cellular Component are not clear. For this study, we collected the annotations in the Biological Process domain and excluded IEAs. In the current analysis, we also limited the subspace size to between 5 and 250 based on our experience. It is conceivable that a subspace with larger or small dimension can give useful information, but we have found that the difficulty of examining a longer list of gene sets outweighs any potential benefit. The limits are also similar to 5–100 used in Segal *et al.* (2004) and same as 5–250 used in Pavlidis *et al.* (2004).

The proposed analysis generates a list of potential pathways but interpreting that list itself can be time-consuming. Understanding the relationship between the significant subspaces then becomes critical to interpretation. The obvious candidates are the complex parent-children relationships in the GO graph. For example, in the result for the second data set, ‘GO:000185 Activation of MAPKKK’ is a part of ‘GO:000165 MAPKKK cascade’ and is also a part of the different GO term ‘GO:0045860 Positive regulation of protein kinase activity.’ In this case, the first two were significant while the third was not. A couple of methods for adjusting statistical significance in such cases have been proposed recently (Alexa *et al.*, 2006; Grossmann *et al.*, 2006). However, we have also noticed that pathways distant in the graph, from different GO graph (e.g., biological function vs. cellular component), or referred to by different names in databases without hierarchical structure can involve significant overlaps. Hence, a tool for facilitating the process of relating different pathways would be helpful. When the relationships between pathways are clarified, the analysis may resemble the method of understanding the data in terms of modules. Defining a network of modules rather than of genes to describe an underlying phenomenon has become popular recently (Segal *et al.*, 2004). By examining the genes contained in each significant pathway, it should be possible to generate a network of interactions by groups of genes.

The idea of identifying a relevant subspace can be particularly helpful in pharmacogenomic screening. By comparing before- and after-treatment data, the pathway targeted by a compound can be more easily identified. The mTOR example was one such example. The proposed approach is general and can be applied to other data types as well as other phenotypes. When the phenotype involves more than two classes, for example, MANOVA (Multiple Analysis of Variance) is the natural extension.

## Acknowledgments

This work was supported by the grants from the Specialized Centers of Clinically Oriented Research in Pediatric Heart Development and Disease (to S.W.K. and W.T.P), the National Institute of General Medical Sciences and the NIH Roadmap for Medical Research Grant U54LM008748 (to S.W.K. and P.J.P.).

## References

- Alexa A, Rahnenfuhrer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006 p. To appear. [PubMed]
- Asakura M, Kitakaze M, Takashima S, Liao Y, Ishikura F, Yoshinaka T, Ohmoto H, Node K, Yoshino K, Ishiguro H, et al. Cardiac hypertrophy is inhibited by antagonism of ADAM12 processing of HB-EGF: metalloproteinase inhibitors as a new therapy. Nat Med. 2002;8:35–40. [PubMed]
- Bjornsti MA, Houghton PJ. The TOR pathway: a target for cancer therapy. Nat Rev Cancer. 2004;4:335–348. [PubMed]
- Culhane AC, Perriere G, Considine EC, Cotter TG, Higgins DG. Between-group analysis of microarray data. Bioinformatics. 2002;18:1600–1608. [PubMed]
- Dennis JG, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID: Database for annotation, visualization, and integrated discovery. Genome Biol. 2003;4:P3. [PMC free article] [PubMed]
- Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR. Mappfinder: using gene ontology and genmapp to create a global gene-expression profile from microarray data. Genome Biol. 2003;4:R7. [PMC free article] [PubMed]
- Friedman JH. Regularized discriminant analysis. Journal of the American Statistical Association. 1989;84:165–175.
- Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93–99. [PubMed]
- Grossmann S, Bauer S, Robinson PN, Vingron M. An improved statistic for detecting over-representated gene ontology annotations in gene sets. Research in Computational Molecular Biology: 10th Annual International Conference; Proceedings: Lecture Notes in Computer Science; 2006. pp. 85–98.
- Hall JL, Grindle S, Han X, Fermin D, Park S, Chen Y, Bache RJ, Mariash A, Guan Z, Ormaza S, et al. Genomic profiling of the human heart before and after mechanical support with a ventricular assist device reveals alterations in vascular signaling networks. Physiol Genomics. 2004;17:283–91. [PubMed]
- Haq S, Choukroun G, Lim H, Tymitz KM, del Monte F, Gwathmey J, Grazette L, Michael A, Hajjar R, Force T, Molkentin JD. Differential activation of signal transduction pathways in human hearts with hypertrophy versus advanced heart failure. Circulation. 2001;103:670–677. [PubMed]
- Hastie T, Buja A, Tibshirani R. Penalized discriminant analysis. The Annls of Statistics. 1995;23:73–102.
- Holleman A, Cheok MH, den Boer ML, Yang W, Veerman AJ, Kazemier KM, Pei D, Cheng C, Pui CH, Relling MV, et al. Gene-expression patterns in drug-resistant acute lymphoblastic leukemia cells and response to treatment. N Engl J Med. 2004;351:533–42. [PubMed]
- Iwamoto R, Yamazaki S, Asakura M, Takashima S, Hasuwa H, Miyado K, Adachi S, Kitakaze M, Hashimoto K, Raab G, et al. Heparin-binding EGF-like growth factor and ErbB signaling is essential for heart function. Proc Natl Acad Sci U S A. 2003;100:3221–3226. [PMC free article] [PubMed]
- Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005;33:D428–32. [PMC free article] [PubMed]
- Kim BS, Kim I, Lee S, Kim S, Rha SY, Chung HC. Statistical methods of translating microarray data into clinically relevant diagnostic information in colorectal cancer. Bioinformatics. 2005;21:517–528. [PubMed]
- Kuruvilla FG, Park PJ, Schreiber SL. Vector algebra in the analysis of genome-wide expression data. Genome Biol. 2002;3:research0011.1–0011.11. [PMC free article] [PubMed]
- Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, Zahnow CA, Patterson N, Golub TR, Ewen ME. A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell. 2003;114:323–334. [PubMed]
- Liang Q, Molkentin JD. Redefining the roles of p38 and JNK signaling in cardiac hypertrophy: dichotomy between cultured myocytes and animal models. J Mol Cell Cardiol. 2003;35:1385–1394. [PubMed]
- Lu Y, Liu PY, Xiao P, Deng HW. Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics. 2005;21:3105–3113. [PubMed]
- Majumder PK, Febbo PG, Bikoff R, Berger R, Xue Q, McMahon LM, Manola J, Brugarolas J, McDonnell TJ, Golub TR, et al. mTOR inhibition reverses Akt-dependent prostate intraepithelial neoplasia through regulation of apoptotic and HIF-1-dependent pathways. Nat Med. 2004;10:594–601. [PubMed]
- Mann DL. Stress-activated cytokines and the heart: from adaptation to maladaptation. Annu Rev Physiol. 2003;65:81–101. [PubMed]
- Mansmann U, Meister R. Testing differential gene expression in functional groups. Goeman’s global test versus an ANCOVA approach. Methods Inf Med. 2005;44:449–453. [PubMed]
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. Pgc-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267–73. [PubMed]
- Pan KH, Lih CJ, Cohen SN. Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc Natl Acad Sci U S A. 2005;102:8961–8965. [PMC free article] [PubMed]
- Park PJ, Tian L, Kohane IS. Linking gene expression data with patient survival times using partial least squares. Bioinformatics, 18 Suppl. 2002;1:120–127. [PubMed]
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res. 2004;29:1213–1222. [PubMed]
- Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415:436–442. [PubMed]
- Purcell NH, Molkentin JD. Is nuclear factor kappaB an attractive therapeutic target for treating cardiac hypertrophy? Circulation. 2003;108:638–640. [PubMed]
- Segal E, Friedman N, Koller D, Regev A. A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004;36:1090–8. [PubMed]
- Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci U S A. 2003;100:9440–9445. [PMC free article] [PubMed]
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102:15545–15550. [PMC free article] [PubMed]
- Szabo A, Boucher K, Jones D, Tsodikov AD, Klebanov LB, Yakovlev AY. Multivariate exploratory tools for microarray data analysis. Biostatistics. 2003;4:555–567. [PubMed]
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A. 2005;102:13544–13549. [PMC free article] [PubMed]
- Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002;99:6567–6572. [PMC free article] [PubMed]
- von Heydebreck A, Huber W, Poustka A, Vingron M. Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics, 17 Suppl. 2001;1:107–114. [PubMed]
- Weeber M, Schijvenaars BJ, Van Mulligen EM, Mons B, Jelier R, Van Der Eijk CC, Kors JA. Ambiguity of human gene symbols in Locus-Link and MEDLINE: creating an inventory and a disambiguation test collection. AMIA Annu Symp Proc. 2003:704–708. [PMC free article] [PubMed]

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (1.6M)

- A factor analysis model for functional genomics.[BMC Bioinformatics. 2006]
*Kustra R, Shioda R, Zhu M.**BMC Bioinformatics. 2006 Apr 21; 7:216. Epub 2006 Apr 21.* - GO-Diff: mining functional differentiation between EST-based transcriptomes.[BMC Bioinformatics. 2006]
*Chen Z, Wang W, Ling XB, Liu JJ, Chen L.**BMC Bioinformatics. 2006 Feb 16; 7:72. Epub 2006 Feb 16.* - Hotelling's T2 multivariate profiling for detecting differential expression in microarrays.[Bioinformatics. 2005]
*Lu Y, Liu PY, Xiao P, Deng HW.**Bioinformatics. 2005 Jul 15; 21(14):3105-13. Epub 2005 May 19.* - Putting microarrays in a context: integrated analysis of diverse biological data.[Brief Bioinform. 2005]
*Troyanskaya OG.**Brief Bioinform. 2005 Mar; 6(1):34-43.* - Dissecting complex transcriptional responses using pathway-level scores based on prior information.[BMC Bioinformatics. 2007]
*Bussemaker HJ, Ward LD, Boorsma A.**BMC Bioinformatics. 2007 Sep 27; 8 Suppl 6:S6. Epub 2007 Sep 27.*

- Gene set analysis methods: statistical models and methodological differences[Briefings in Bioinformatics. 2014]
*Maciejewski H.**Briefings in Bioinformatics. 2014 Jul; 15(4)504-518* - MAVTgsa: An R Package for Gene Set (Enrichment) Analysis[BioMed Research International. 2014]
*Chien CY, Chang CW, Tsai CA, Chen JJ.**BioMed Research International. 2014; 2014346074* - Pathway-based outlier method reveals heterogeneous genomic structure of autism in blood transcriptome[BMC Medical Genomics. ]
*Campbell MG, Kohane IS, Kong SW.**BMC Medical Genomics. 634* - Gene set analysis using variance component tests[BMC Bioinformatics. ]
*Huang YT, Lin X.**BMC Bioinformatics. 14210* - Harnessing the complexity of gene expression data from cancer: from single gene to structural pathway methods[Biology Direct. ]
*Emmert-Streib F, Tripathi S, Matos Simoes RD.**Biology Direct. 744*

- PubMedPubMedPubMed citations for these articles

- A Multivariate Approach for Integrating Genome-wide Expression Data and Biologic...A Multivariate Approach for Integrating Genome-wide Expression Data and Biological KnowledgeNIHPA Author Manuscripts. Oct 1, 2006; 22(19)2373PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...