
- Journal List
- J Comput Biol
- PMC3131840

# Analysis of Gene Sets Based on the Underlying Regulatory Network

Corresponding author.

*Ali Shojaie, Department of Statistics, University of Michigan, 269 West Hall, 1085 South University Avenue, Ann Arbor, MI 48109.* E-mail: shojaie@umich.edu

## Abstract

Networks are often used to represent the interactions among genes and proteins. These interactions are known to play an important role in vital cell functions and should be included in the analysis of genes that are differentially expressed. Methods of gene set analysis take advantage of external biological information and analyze *a priori* defined sets of genes. These methods can potentially preserve the correlation among genes; however, they do not directly incorporate the information about the gene network. In this paper, we propose a latent variable model that directly incorporates the network information. We then use the theory of mixed linear models to present a general inference framework for the problem of testing the significance of subnetworks. Several possible test procedures are introduced and a network based method for testing the changes in expression levels of genes as well as the structure of the network is presented. The performance of the proposed method is compared with methods of gene set analysis using both simulation studies, as well as real data on genes related to the galactose utilization pathway in yeast.

**Key words:** gene networks, gene set analysis, latent variable model, mixed linear model

## 1. Introduction

In standard analysis of differential expression, statistical significance of each gene is assessed independently, and some method of multiple testing correction is then used to adjust the estimated *p*-values. Such methods are usually less sensitive in detecting genes that have smaller differences in mRNA abundance between different experimental conditions and may therefore be less powerful than desired. Furthermore, analyzing individual genes (single-gene analysis) often generates results that are not reproducible and lack meaningful biological interpretations. The focus of current research has thus shifted to analyzing *a priori* defined sets of genes (gene set analysis) and using external information to strengthen the analysis of differential expression. Analysis of gene sets results in increased power compared to single-gene analysis. Furthermore, methods of gene set analysis can preserve the correlation among genes, which may lead to more reliable inference. These methods, however, do not directly incorporate the external information about the interactions among genes represented by the gene network. In this paper, we develop a model that directly incorporates the network information, and propose a general inference framework for testing the significance of genetic pathways.

### 1.1. A motivating example

In an interesting approach, Ideker et al. (2001) integrated gene expression and protein level data to study significant signaling and metabolic pathways in yeast *Saccharomyces cerevisiae*. They reported interactions among genes and proteins in different pathways along with information on the estimated correlation among genes in the network. The authors also grouped the genes into subnetworks (pathways) based on their biological functions. Figure 1, which was originally presented in Ideker et al. (2001), illustrates the network of genes under consideration. We also update the network of Ideker et al. (2001) based on newly defined interactions among genes reported in Bader et al. (2004). This results in a network of 343 genes with 419 interactions for which estimates of correlations among genes are also available (this data is referred to as the *Ideker data* henceforth).

The mRNA expression levels of genes in the Ideker data are measured in 9 different perturbations of GAL genes along with the wild type yeast. For each perturbation, two samples of data are available. The first set of samples represents the expression levels of genes in cells grown in presence of galactose (gal+), while the second set includes expression levels for cells grown in absence of galactose (gal−), where the main source of carbon is raffinose. Our primary goal is to determine the pathways that are *involved* (either induced or suppressed) in cell growth in gal+ compared to gal− environments. In other words, we would like to test whether each of 15 gene sets defined by yeast pathways in the network of Ideker et al. (2001) is differentially expressed in gal+ compared to gal− medium.

In this section, we analyze the Ideker data using methods of gene set analysis. More specifically, we apply the *Gene Set Enrichment Analysis* (GSEA) method of Subramanian et al. (2005). This method uses a permutation-based test (permuting the class labels) to determine whether genes in *a priori* defined gene sets have non-random associations with the phenotype. To that end, we first normalize the data so that the expression levels only represent the effect of the growth environment.^{1} The results of the analysis are displayed in Table 1.

The first line of the table presents an expected outcome; the expression levels of genes in the Galactose Utilization pathway are expected to change in response to perturbations of GAL genes in the gal+ environment. On the other hand, although some of the pathways seem to have differential expression when cells lack galactose (e.g., Stress and Vesicular Transport), no other pathway appears significant after adjusting for multiple testing using the False Discovery Rate (FDR) controlling procedure of Benjamini and Hochberg (1995) with a *q*-value of 0.05. In Section 5, we revisit the analysis of the Ideker data based on the method proposed in this paper, which directly incorporates the network information represented by the gene network in Figure 1.

### 1.2. Background

Recent research on gene set analysis can be broadly classified into permutation-based methods motivated by the GSEA paper and model-based approaches that make specific distributional assumptions about the gene expression data. The literature can be further categorized according to whether direct or indirect external information on the gene network is employed. Tian et al. (2005) considered the problem of gene set analysis and described two hypotheses that should be considered when studying the significance of sets of genes. One of these hypotheses, which is the same as the hypothesis considered in GSEA, focuses on non-random association of genes in the gene set with the phenotype. The other hypothesis considers non-random correlations between genes in a gene set. The test method proposed for the first hypothesis is based on permuting the class labels (column permutation) and the second hypothesis is tested by permuting genes (row permutation). Efron and Tibshirani (2007) formalized the idea of gene set analysis in a coherent statistical framework and examined the hypotheses presented in Tian et al. (2005). They also proposed an alternative test statistic with superior power properties and analyzed the effects of row and column permutations. Goeman and Bühlmann (2007) reviewed different methods proposed for testing significance of gene sets and highlighted important issues in selecting appropriate methods.

Although the above permutation-based methods are computationally intensive, they include minimum assumptions about the underlying biological model and are therefore more robust to model misspecification. An alternative approach is based on model-based test procedures, where specific distributions for the expression data are assumed. In one such approach, Jiang and Gentleman (2007) extended the idea of gene set analysis by adapting a linear model approach and adjusting for other covariates. They presented the gene sets in the form of an index matrix and offered a heuristic argument for using a normal approximation for testing per gene set sums. One major difficulty regarding model-based methods is the large number of variables (genes) compared to the small number of samples, known as the large *p*, small *n* problem (West, 2000). In such situations, estimation of model parameters becomes a challenging task and may result in unstable outcomes. However, additional sources of information besides the expression levels of genes could be used to make the estimation more accurate. One such source of external information is the underlying relationship between genes, which is itself of independent interest. It is known that genes interact with each other through their protein products and form gene regulatory networks. Also, the protein products of groups of genes are involved in controlling specific functions in cells through genetic pathways. An increasing amount of information about these relationships is becoming available in public repositories, like the KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa and Goto, 2000) and the Gene Ontology (GO) (Ashburner et al., 2000), and can be used to improve the estimates of model parameters.

A number of researchers have recently used external information about gene networks to improve the analysis of gene sets. Rahnenführer et al. (2004) demonstrated that the sensitivity of detecting relevant pathways can be improved by integrating information about pathway topology. Barry et al. (2005) presented a permutation based procedure, called SAFE, that considers the underlying network structure. More recently, Wei and Li (2007) have proposed a Markov random field model to incorporate the information on the gene network in the analysis. In a related approach, Wei and Pan (2008) have modeled the network information via latent variables into a spatially correlated mixture model. Both of these methods consider the problem of analysis of *single genes* on the network.

The above methods either assume that the underlying network does not change as the experimental conditions change or they do not incorporate this change directly into the model. However, changes in the underlying network structure can amplify the change in expression patterns and should be included in the analysis. For instance, Li (2002) demonstrated that the correlation patterns among ARG2 and other members of the urea-cycle pathway can change drastically as the expression level of ARG2 changes. Another concern in analyzing network data is to decorrelate subnetworks from the effects of other nodes in the network and to deal with nodes that belong to multiple networks. Alexa et al. (2006) present one such method which is an attempt to decorrelate GO graph structures. Their method focuses on decorrelating nodes at lower levels (children) from upper level nodes (parents).

In this paper, we propose a latent variable model to directly incorporate the underlying gene network and present test statistics for testing the significance of arbitrary sub-networks based on the theory of mixed linear models. One major advantage of the method proposed in this paper is that it not only considers the change in the expression levels of the genes in different conditions, but also reflects the change in network structures and correlations among genes. We also present a systematic approach that decorrelates each subnetwork from the other nodes while maintaining the interactions among genes in the subnetwork.

The rest of the paper is organized as follows. In the next section, the proposed latent variable model is introduced and some basic graph theoretical properties related to this model are discussed. In Section 3, we represent the latent variable model using the framework of mixed linear models and propose a general testing scheme based on the theory of mixed linear models. Section 3 ends with a result that is used to test the *pure* effect of each subnetwork. This result prevents tests of significance of subnetworks from being confounded with the effects of other subnetworks and also allows testing the effect of genes that belong to multiple networks. Section 4 includes three simulation studies for evaluating the performance of the new model under different testing conditions as well as studying the effect of noise in the network information on the proposed inference procedure. In Section 5, we revisit the Ideker data, introduced in Section 1.1, and test the significance of pathways using the proposed model. Section 6 includes a discussion on limitations of the proposed model and future extensions.

## 2. The Latent Variable Model

Consider gene expression data organized as a *p* × *n* matrix $\mathcal{D}$ comprised of the expression levels of *p* genes for *n* samples, and let *Y* be the *k*th sample in the expression data (*k*th column of $\mathcal{D}$).

To model the correlation structure caused by the gene network, we represent the network as a directed graph *G* = (*V*, *E*) with vertex set *V*, and edge set *E*, where *E* is represented by the *p* × *p* adjacency matrix *A*. Each nonzero element of the adjacency matrix, *A*_{ij}, represents a directed edge in the network. Elements of the adjacency matrix correspond to the strength of association among genes in the graph and are real values in (−1, 1).

Consider the simple network of Figure 2. Suppose $Y = X + \varepsilon$, where $X$ represents the *signal* and $\varepsilon$ the *noise*. Consider two adjacent genes $i$ and $j$, where $i$ affects $j$. One can represent the relationship between $i$ and $j$ using a simple linear model $X_j = \rho_{ij} X_i$. However, to account for unknown associations among genes and/or errors in the association weights, $\rho_{ij}$, we also add *latent variables* to represent the baseline expression level of gene $j$. For instance, $\gamma_2$ represents the expression level of gene 2 without the effect from gene 1. Thus, for the simple gene network of Figure 2, we obtain

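$$
\begin{aligned}
X_1 &= \gamma_1, \\
X_2 &= \rho_{12} X_1 + \gamma_2 = \rho_{12}\gamma_1 + \gamma_2, \\
X_3 &= \rho_{23} X_2 + \gamma_3 = \rho_{12}\rho_{23}\gamma_1 + \rho_{23}\gamma_2 + \gamma_3.
\end{aligned}
$$

These equations can be summarized in vector notation as:

$$Y = \Lambda\gamma + \varepsilon, \qquad \gamma \sim N_p(\mu, \sigma_\gamma^2 I_p), \quad \varepsilon \sim N_p(0, \sigma_\varepsilon^2 I_p), \tag{1}$$

where Λ is called the *influence matrix* of the graph. In the simple example above, we have

$$\Lambda = \begin{pmatrix} 1 & 0 & 0 \\ \rho_{12} & 1 & 0 \\ \rho_{12}\rho_{23} & \rho_{23} & 1 \end{pmatrix}.$$

Under such a model, $Y$ is a normal random variable with mean $\Lambda\mu$ and variance $\sigma_\gamma^2 \Lambda\Lambda' + \sigma_\varepsilon^2 I_p$, where Λ′ denotes the transpose of matrix Λ.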

In the remainder of this section, we study the relationship between the influence matrix, Λ, and the adjacency matrix of the graph, *A*. We provide a general result for the relationship between Λ and *A* as well as a compact expression that can be used to efficiently evaluate Λ for specific classes of graphs. We also discuss conditions under which the matrix Λ has full rank, which will be used in the analysis of the proposed inference procedure in Section 3.

#### Lemma 2.1.

*For any graph G* = (*V*, *A*) *we have* $\Lambda = \sum_{r=0}^{\infty} A^r$ *(here A*^{0} *is defined to be the identity matrix).*

**Proof.** From the matrix representation of the latent variable model in (1), we have $Y = \Lambda\gamma + \varepsilon$, where $\Lambda_{ii} = 1$ and $\Lambda_{ij} \neq 0$ only if there is a path (of some length) on the graph from node $i$ to node $j$. But for any graph $G$, the number of paths of length $r$ from $v_i$ to $v_j$ is given by the $(i, j)$ element of $A^r$ (Diestel, 2006). Therefore, $\Lambda_{ij} \neq 0$ whenever there exists $r$ such that $[A^r]_{ij} > 0$. Hence, all possible paths from $i$ to $j$ are given by $\sum_{r=0}^{\infty} [A^r]_{ij}$. This implies that $\Lambda = \sum_{r=0}^{\infty} A^r$.

#### Corollary 2.2.

*For any Directed Acyclic Graph (DAG),* $\Lambda = \sum_{r=0}^{p} A^r$.

**Proof.** This follows immediately from Lemma 2.1 by noting that, since there are no cycles in a DAG, no path can be longer than *p*, so that $A^r = 0$ for every $r > p$.
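As a minimal numerical sketch of Lemma 2.1 and Corollary 2.2, the snippet below sums the powers of the adjacency matrix for a hypothetical three-gene chain 1 → 2 → 3. The function name `influence_matrix`, the weights 0.5 and 0.4, and the convention that `A[j, i]` stores the weight of the edge from gene *i* to gene *j* (so that Λ comes out lower triangular) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def influence_matrix(A, max_power=None):
    """Return I + A + A^2 + ... + A^max_power (max_power defaults to p)."""
    p = A.shape[0]
    if max_power is None:
        max_power = p  # sufficient for a DAG, since longer paths cannot exist
    Lam = np.eye(p)
    term = np.eye(p)
    for _ in range(max_power):
        term = term @ A      # next power of A
        Lam = Lam + term
    return Lam

# Hypothetical chain 1 -> 2 -> 3 with illustrative association weights.
rho12, rho23 = 0.5, 0.4
A = np.zeros((3, 3))
A[1, 0] = rho12  # edge from gene 1 to gene 2 (assumed storage convention)
A[2, 1] = rho23  # edge from gene 2 to gene 3

Lam = influence_matrix(A)
print(Lam)
```

Under these assumptions, the entry in the third row and first column equals the product of the two weights, which is the indirect influence of gene 1 on gene 3 along the chain.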

The following results provide sufficient conditions for the matrix Λ to be of full rank. Although this guarantees validity of the model for at least some classes of directed graphs, it does not provide a necessary condition. Based on experiments with randomly generated adjacency matrices, there are in fact larger classes of graphs satisfying this property.

#### Lemma 2.3.

*For any Directed Acyclic Graph (DAG), the matrix* Λ *has full rank.*

**Proof.** The full rankness of Λ is proved by showing that Λ can be re-arranged into a lower triangular matrix with 1's on the diagonal.

First observe that $\Lambda_{ij} \times \Lambda_{ji} = 0$ for $i \neq j$, since otherwise there will be a cycle in the graph. Also, from Lemma 2.1 we have $\Lambda_{ii} = 1$.

Consider a reordering of rows (and correspondingly of columns) of the matrix in decreasing number of zeros. Every DAG has at least one root (a node that is not affected by any other node). This means that there is at least one row with $\Lambda_{kk} = 1$ and $\Lambda_{kj} = 0$ for all $j \neq k$. Permute Λ so that row $k$ is the first row of the matrix and continue in the same way. Denote the number of zero elements of row $i$ by $\phi_{R_i}$ and the number of zeros in column $j$ by $\phi_{C_j}$. Then by the above observation, $\phi_{R_i} \geq p - \phi_{C_i}$ (here $p - \phi_{C_i}$ is the number of nonzero elements in column $i$).

To complete the proof, we need to show that the rearranged matrix Λ can be further permuted to result in a lower triangular matrix. Suppose there exists $j > i$ such that $\Lambda_{ij} > 0$ and therefore $\Lambda_{ji} = 0$. If $\phi_{R_j} = \phi_{R_i}$, switch $i$ and $j$ to get a lower triangular matrix. However, if $\phi_{R_j} < \phi_{R_i}$ (i.e., if $i$ is affected by a row with fewer zeros), there exists $l$ such that $\Lambda_{jl} > 0$ but $\Lambda_{il} = 0$. However, $\Lambda_{jl} > 0$ means there exists a path from $l$ to $j$, and $\Lambda_{ij} > 0$ means that there exists a path from $j$ to $i$. Thus there exists a path from $l$ to $i$, i.e., $\Lambda_{il} > 0$, a contradiction. Therefore Λ must be a lower triangular matrix with $\Lambda_{ii} = 1$.

#### Lemma 2.4.

*Consider a graph G = (V,A) with influence matrix* Λ

*a) If G is a Directed Acyclic Graph (DAG), then A* = *I* − Λ^{−1}.

*b) If the sum of absolute values of weights of edges ending at every node of the graph G is less than 1 (i.e. A is sub-stochastic), then A = I* − Λ^{−1} *and* Λ *has full rank*.

**Proof.** a) From Corollary 2.2, $\Lambda = \sum_{r=0}^{p} A^r$ and hence

$$A\Lambda = \sum_{r=1}^{p+1} A^r = \Lambda - I + A^{p+1}.$$

But when $G$ is a DAG, $A^{p+1} = 0$, hence $A\Lambda = \Lambda - I$. By full rankness of Λ, $A = I - \Lambda^{-1}$.

b) The condition in (b) implies that the sum of the absolute values of the off-diagonal elements in each row of $A$ is less than 1. Let $s_i$ be the sum of absolute values of off-diagonal elements of the $i$th row of $A$. Since the diagonal elements of $A$ are 0, by Gershgorin's Ring Theorem (Friedberg et al., 1996), if λ is an eigenvalue of $A$, we have $|\lambda| \leq \max_i s_i < 1$. Now let $\Lambda_m = \sum_{r=0}^{m} A^r$. Then $\Lambda = \lim_{m \to \infty} \Lambda_m$ and, using an argument similar to part (a),

$$A\Lambda_m = \Lambda_m - I + A^{m+1}.$$

Since eigenvalues of $A$ are less than 1 in magnitude, $\lim_{m \to \infty} \Lambda_m$ exists (Friedberg et al., 1996) and, by the eigen-decomposition of $A$, $A^{m+1} \to 0$ as $m \to \infty$. Hence, taking the limit, we get $A\Lambda = \Lambda - I$. On the other hand, the established bound on the eigenvalues of $A$ implies that all eigenvalues of $I - A$ are nonzero, which means that $I - A$ and, therefore, Λ are full rank. Thus $A = I - \Lambda^{-1}$.

Lemma 2.4 establishes an alternative relationship between Λ and *A* and determines two classes of graphs for which such a relationship is valid. As noted before, conditions presented in this result are only sufficient. For the general graph *G* = (*V*, *A*), if the spectral radius of *A* is less than 1, Λ has full rank and the relationship between *A* and Λ established in Lemma 2.4 holds. On the other hand, in special cases where Λ is not of full rank, it may be possible to modify the graph and therefore apply the model presented here. For instance, one large class of graphs where Λ is not full rank consists of *cyclic* graphs. The cycles in biological networks are often representatives of feedback loops which are common features of cell cycle related networks. However, the feedback is usually effective after a time delay and therefore, when time series data is used to study these networks, the cycles can be broken down by distinguishing between nodes at the beginning and end of each cycle. Undirected edges (e.g., protein-protein interactions) can also be transformed into two directed edges using a common latent variable affecting both nodes. More generally, it is often possible to transform the graph by introducing dummy nodes and can hence apply the model presented here.
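The relationship in Lemma 2.4 can be checked numerically. The sketch below uses a hypothetical three-node cycle whose incoming edge weights sum to less than 1 in absolute value at every node, so that condition (b) holds even though the graph is not a DAG. The weights and the storage convention (`A[j, i]` is the weight of the edge from node *i* to node *j*) are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical cycle 1 -> 2 -> 3 -> 1 with illustrative weights.
A = np.array([[0.0, 0.0, 0.5],   # edge 3 -> 1
              [0.3, 0.0, 0.0],   # edge 1 -> 2
              [0.0, 0.4, 0.0]])  # edge 2 -> 3
p = A.shape[0]

# Condition (b): total absolute weight of edges ending at each node is below 1.
row_sums = np.abs(A).sum(axis=1)
spectral_radius = np.abs(np.linalg.eigvals(A)).max()

# Closed form implied by A = I - Lambda^{-1}.
Lam = np.linalg.inv(np.eye(p) - A)

# Truncated power series from Lemma 2.1 (200 terms is ample at this radius).
Lam_series = sum(np.linalg.matrix_power(A, r) for r in range(200))

# Recover the adjacency matrix from the influence matrix.
A_recovered = np.eye(p) - np.linalg.inv(Lam)

print(row_sums.max(), spectral_radius, np.linalg.matrix_rank(Lam))
```

In this example the truncated series and the closed form agree to numerical precision, and Λ has full rank, which is what the spectral radius condition stated above suggests.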

## 3. Inference

### 3.1. Preliminaries

In this section, we study the inference procedure for the proposed model. Although this method can be used to test a variety of hypotheses, in order to simplify the presentation, we focus on testing the equality of means of two experimental conditions. The extension to more complicated settings is discussed at the end of the section. As before, let $Y$ be a given sample in the expression data ($k$th column of the data matrix $\mathcal{D}$) and let $Y^C$ and $Y^T$ represent *control* and *treatment* conditions, with $n_1$ columns of $\mathcal{D}$ corresponding to control samples and $n_2 = (n - n_1)$ columns to treatment samples. Also let two sets of parameters $(\mu^C, \Lambda^C)$ and $(\mu^T, \Lambda^T)$ represent mean vectors and influence matrices under control and treatment conditions, respectively.

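Let **b** be an indicator vector determining genes that belong to a specific gene set (pathway). In other words, $\mathbf{b}_j = 1$ if gene $j$ is in the gene set and 0 otherwise. We can test the significance of the gene set by defining the test statistic $\mathbf{V} = \mathbf{b}Y^T - \mathbf{b}Y^C$ and testing:

$$H_0: E(\mathbf{V}) = 0 \quad \text{vs.} \quad H_1: E(\mathbf{V}) \neq 0. \tag{2}$$

Then under $H_0$:

$$E(\mathbf{V}) = \mathbf{b}\Lambda^T\mu^T - \mathbf{b}\Lambda^C\mu^C = 0$$

and

$$\mathrm{Var}(\mathbf{V}) = \sigma_\gamma^2\, \mathbf{b}\left(\Lambda^T\Lambda^{T\prime} + \Lambda^C\Lambda^{C\prime}\right)\mathbf{b}' + 2\sigma_\varepsilon^2\, \mathbf{b}\mathbf{b}',$$

where $\sigma_\gamma^2$ and $\sigma_\varepsilon^2$ denote the variances of the latent variables and of the noise, respectively.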

Although the hypothesis in (2) can be tested using a generalized likelihood ratio test, it turns out that the latent variable model of Section 2 can be represented as a *Mixed Linear Model* (MLM). Using this framework, we can study a variety of spatio-temporal models and consider more general hypothesis testing problems.

### 3.2. Mixed linear model representation

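Let $\mathbf{Y}$, $\boldsymbol{\gamma}$ and $\boldsymbol{\varepsilon}$ represent the rearrangement of the vectors $Y$, $\gamma$, and $\varepsilon$ into $np \times 1$ column vectors. Then $\mathbf{Y} = \Psi\beta + \Pi\boldsymbol{\gamma} + \boldsymbol{\varepsilon}$, where:

$$\Psi = \begin{pmatrix} \Lambda^C & 0 \\ \vdots & \vdots \\ \Lambda^C & 0 \\ 0 & \Lambda^T \\ \vdots & \vdots \\ 0 & \Lambda^T \end{pmatrix}, \qquad \Pi = \mathrm{diag}(\Lambda^C, \ldots, \Lambda^C, \Lambda^T, \ldots, \Lambda^T), \qquad \beta = \begin{pmatrix} \mu^C \\ \mu^T \end{pmatrix}.$$

In this model, $\boldsymbol{\gamma}$ is the vector of (unknown) *random effects*, and $\boldsymbol{\gamma}$ and $\boldsymbol{\varepsilon}$ are normally distributed random vectors with:

$$E\begin{pmatrix} \boldsymbol{\gamma} \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

and

$$\mathrm{Var}\begin{pmatrix} \boldsymbol{\gamma} \\ \boldsymbol{\varepsilon} \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}.$$

For the latent variable model presented in the previous section, $R = \sigma_\varepsilon^2 I_{np}$ and $G = \sigma_\gamma^2 I_{np}$, and the variance of $\mathbf{Y}$ is given by $W = \sigma_\gamma^2 \Pi\Pi' + \sigma_\varepsilon^2 I_{np}$.

The estimate of β in the mixed linear model is given by (Searle, 1971):

$$\hat{\beta} = (\Psi' \hat{W}^{-1} \Psi)^{-1} \Psi' \hat{W}^{-1} \mathbf{Y}.$$

The estimate of β depends on estimates of $\sigma_\gamma^2$ and $\sigma_\varepsilon^2$, which can be obtained via the *Restricted Maximum Likelihood* (REML) procedure.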
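To make the mixed linear model representation concrete, here is a small sketch that assembles Ψ and Π and computes the generalized least squares estimate of β. It assumes a particular block layout that matches the stated dimensions (*np* × 2*p* and *np* × *np*): control samples are stacked first, each contributing a block [Λ^C, 0] to Ψ, followed by treatment samples contributing [0, Λ^T], with Π block diagonal. The function names are hypothetical, the variance components are treated as known rather than estimated by REML, and the dense matrix inverse is used only for clarity on a toy problem.

```python
import numpy as np

def design_matrices(Lam_C, Lam_T, n1, n2):
    """Build Psi (np x 2p) and Pi (np x np) under the assumed block layout."""
    p = Lam_C.shape[0]
    zero = np.zeros((p, p))
    control_block = np.hstack([Lam_C, zero])
    treatment_block = np.hstack([zero, Lam_T])
    Psi = np.vstack([control_block] * n1 + [treatment_block] * n2)
    n = n1 + n2
    Pi = np.zeros((n * p, n * p))
    for k in range(n):
        Lam = Lam_C if k < n1 else Lam_T
        Pi[k * p:(k + 1) * p, k * p:(k + 1) * p] = Lam
    return Psi, Pi

def gls_estimate(Y, Psi, Pi, sigma2_gamma, sigma2_eps):
    """GLS estimate of beta and C = (Psi' W^-1 Psi)^-1 for given variances."""
    W = sigma2_gamma * (Pi @ Pi.T) + sigma2_eps * np.eye(Pi.shape[0])
    W_inv = np.linalg.inv(W)
    C = np.linalg.inv(Psi.T @ W_inv @ Psi)
    beta_hat = C @ Psi.T @ W_inv @ Y
    return beta_hat, C

# Toy example: no network under control, a three-gene chain under treatment.
Lam_C = np.eye(3)
Lam_T = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.2, 0.4, 1.0]])
Psi, Pi = design_matrices(Lam_C, Lam_T, n1=2, n2=3)

beta = np.array([0.0, 0.0, 0.0, 1.0, 2.0, 3.0])  # (mu_C, mu_T), illustrative
Y = Psi @ beta                                    # noise-free data for a sanity check
beta_hat, C = gls_estimate(Y, Psi, Pi, sigma2_gamma=1.0, sigma2_eps=0.5)
print(beta_hat)
```

With noise-free data the estimate reproduces β exactly, which is a quick way to check that the matrices are assembled consistently. For realistic values of *p* the dense matrices would be impractically large, which is the computational concern raised in Section 3.3.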

The framework of mixed linear models allows us to test a variety of hypotheses about *β* by considering tests of the form:

$$H_0: l\beta = 0 \quad \text{vs.} \quad H_1: l\beta \neq 0. \tag{3}$$

Here *l* is in general any *estimable* linear combination of *β*'s (Searle, 1971). An example of such a vector is a *contrast vector*, which satisfies the constraint **1**′*l* = 0. In the ensuing discussion, any linear combination of β's satisfying the estimability requirement is referred to as a *contrast vector*.

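Based on the theory of mixed linear models, we can test (3) using the test statistic:

$$T = \frac{l\hat{\beta}}{\sqrt{l\hat{C}l'}},$$

where $\hat{C}$ is the estimate of $C = (\Psi' W^{-1} \Psi)^{-1}$.

Under the null hypothesis in (3), $T$ has approximately a $t$ distribution with $\nu$ degrees of freedom, where the degrees of freedom are estimated using the Satterthwaite approximation method (McLean and Sanders, 1988):

$$\nu = \frac{2\,(l\hat{C}l')^2}{\tau' K \tau},$$

with $\tau = \partial (lCl') / \partial\theta$, where $\theta = (\sigma_\gamma^2, \sigma_\varepsilon^2)'$ denotes the variance components and $K$ is the empirical covariance matrix of $\hat{\theta}$.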

### 3.3. Computational issues and the use of the mixed linear model

The mixed linear model facilitates the representation of the latent variable introduced in Section 2. However, estimation and inference in this framework involve forming the matrices **Ψ** and **Π**, and performing operations involving products and inverses of these matrices. In the context of analysis of genetic data, the dimensions of these matrices (*np* × 2*p* and *np* × *np*) can cause serious difficulties in terms of computation time, memory requirement and numerical stability of the estimation algorithms. It is therefore necessary to derive alternative methods for estimation of parameters in the model. It turns out that due to the special structure of the model presented in Section 2, and the sparsity pattern of matrices **Ψ** and **Π**, the formulas presented in the previous section can be substantially simplified. More specifically, for the problem stated in Section 3.2 we have:

and

In the particular case considered here, the REML estimates of the variance components can be directly computed as the maximizers of the REML equation without any need for iterative methods. However, profiling out one of the variance components may result in more stable solutions.

### 3.4. Role of the contrast vector

The estimates of *β* based on the mixed linear model represent the individual expression level of each gene in the network. Thus, in order to evaluate the combined effect of each gene set using the test statistic *T*, the choice of contrast vector *l* proves fairly crucial. More specifically, the choice of *l* determines the null and alternative hypotheses of the test in (3), which in turn affects its significance level and power. In this section, we present different choices of contrast vectors and study their properties and effects on the power of tests.

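A simple choice for the contrast vector *l* is to use the indicator vector of the gene set. In other words, with $\beta = (\mu^{C\prime}, \mu^{T\prime})'$,

$$l^{(1)} = (-\mathbf{b}, \mathbf{b}).$$

This simple choice of *l* corresponds to testing the following hypothesis:

$$H_0: \mathbf{b}\mu^C = \mathbf{b}\mu^T \quad \text{vs.} \quad H_1: \mathbf{b}\mu^C \neq \mathbf{b}\mu^T,$$

which for each gene set $g$ is equivalent to

$$H_0: \sum_{j \in g} \mu_j^C = \sum_{j \in g} \mu_j^T \quad \text{vs.} \quad H_1: \sum_{j \in g} \mu_j^C \neq \sum_{j \in g} \mu_j^T. \tag{6}$$

Such a contrast vector, however, only considers the mean expression levels of genes and does not reflect the combined effect of the set of genes in **b**, which is affected by interactions among genes in the network.

When the underlying network structure and therefore the correlation among genes is known, a natural alternative to *l*^{(1)} is to also include the influence matrices Λ^{C} and Λ^{T}. This leads to the following choice of contrast vector:

$$l^{(2)} = (-\mathbf{b}\Lambda^C, \mathbf{b}\Lambda^T),$$

which corresponds to testing the following hypotheses:

$$H_0: \mathbf{b}\Lambda^C\mu^C = \mathbf{b}\Lambda^T\mu^T \quad \text{vs.} \quad H_1: \mathbf{b}\Lambda^C\mu^C \neq \mathbf{b}\Lambda^T\mu^T. \tag{9}$$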

The null hypothesis presented in (9) may first seem less intuitive and the choice of *l*^{(2)} rather arbitrary. However, the rationale behind the latter choice of contrast vector becomes clearer when we examine the test statistics corresponding to each one of the two null hypotheses in (6) and (9). In the case of the two-population test considered here, the above choices of contrast vectors lead to (after some algebra) the following test statistics:

and

From the above two equations it becomes clear that choosing *l*^{(2)} as the contrast vector leads to a very familiar test statistic. The numerator of test statistic *T*_{2} considers the difference in average observed values of expression levels and its denominator represents the variance of this difference based on the mixed linear model.

It is also important to study the effect of the contrast vector on the power of tests. The two null hypotheses presented in (6) and (9) are different and therefore the usual power analysis cannot be applied to choose the right test. However, when Λ^{C} = Λ^{T} = Λ, the hypothesis presented in (6) is a special case of (9) (assuming that Λ has full rank) and it is possible to compare the powers of the two tests in this special case. When Λ^{C} = Λ^{T} = Λ, the null and alternative hypotheses are given in (6) and the test statistics *T*_{1} and *T*_{2} have the following simplified forms:

From these equations we can see that when no underlying network structure is taken into account (Λ = *I*), the two test statistics are the same. However, if there is an underlying network structure (Λ ≠ *I*), the test statistic in (13) represents the likelihood ratio test for testing the null hypothesis in (6), which is asymptotically most powerful. On the other hand, as ||Λ^{T} − Λ^{C}|| increases, the test presented by *l*^{(1)} will no longer be appropriate and we could expect *l*^{(2)} to have a better performance.

In the more general case, where Λ^{C} ≠ Λ^{T}, it is desirable for the test statistic to account for all of the interactions between genes in the specific subnetwork and to not include any effects from genes outside the subnetwork. Consider again the simple gene network in Figure 2 and let **b** = (0, 1, 1). It is then desirable for the test statistic to include the interaction between genes 2 and 3, while excluding the effect of gene 1 (Fig. 3). The following result describes a choice of a contrast vector that achieves this goal.

#### Lemma 3.1.

*Consider a* 1 × *p indicator vector* **b** *and let* x · y *represent the element-wise product of x and y.*

*Then* (**b**Λ · **b**)*γ* *includes the effects of all the nodes in* **b** *on each other, but it is not affected by any node outside of the set of nodes indexed by* **b**.

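**Proof.** Let $I_b = \{i : \mathbf{b}_i = 1\}$. Based on the latent variable model, the $j$th column of Λ includes the influences of node $j$ on all other nodes in the network. Therefore, $(\mathbf{b}\Lambda)_j$ is the influence of the $j$th node on all nodes in **b**. Also, note that $\Lambda_{ii} = 1$ for all $i$ and $\Lambda_{ij}$ is non-zero only if there is a path from $j$ to $i$. Thus,

$$(\mathbf{b}\Lambda)_j = \sum_{i \in I_b} \Lambda_{ij}.$$

But $(\mathbf{b}\Lambda \cdot \mathbf{b})_j$ is non-zero only if $j \in I_b$ and therefore

$$(\mathbf{b}\Lambda \cdot \mathbf{b})\gamma = \sum_{j \in I_b} \sum_{i \in I_b} \Lambda_{ij}\gamma_j,$$

which means that $(\mathbf{b}\Lambda \cdot \mathbf{b})\gamma$ only includes the effects of elements of **b** on each other.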
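Lemma 3.1 can be illustrated on a hypothetical three-gene chain 1 → 2 → 3 with **b** = (0, 1, 1). The association weights (0.5 and 0.4), the identity influence matrix under control, and the ordering and signs used to assemble the contrast from its control and treatment parts are assumptions made for this sketch.

```python
import numpy as np

# Influence matrix of a chain 1 -> 2 -> 3 (illustrative weights).
rho12, rho23 = 0.5, 0.4
Lam_T = np.array([[1.0,           0.0,   0.0],
                  [rho12,         1.0,   0.0],
                  [rho12 * rho23, rho23, 1.0]])
Lam_C = np.eye(3)               # assumed: no interactions under control

b = np.array([0.0, 1.0, 1.0])   # gene set containing genes 2 and 3

# (b Lam)_j: total influence of gene j on the genes in the set.
b_Lam = b @ Lam_T

# Element-wise product with b removes contributions from genes outside the set.
b_Lam_masked = b_Lam * b

# Assumed layout: control part (negated) followed by treatment part.
l_network = np.concatenate([-(b @ Lam_C) * b, (b @ Lam_T) * b])

print(b_Lam)         # gene 1 still contributes here
print(b_Lam_masked)  # gene 1's contribution is zeroed out
print(l_network)
```

In this example the first entry of `b_Lam` is nonzero because gene 1 influences genes 2 and 3, while the masked vector keeps only the weight of gene 2 on itself and on gene 3, and the weight of gene 3 on itself.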

The estimated *β*'s in the latent variable model reflect the individual effect of each gene and, therefore, can be thought of as the "pure signals." Based on Lemma 3.1, in order to include interactions among genes in each subnetwork and prevent any confounding effects, we define the *network contrast vector* by

$$l^{(N)} = (-\mathbf{b}\Lambda^C \cdot \mathbf{b},\ \mathbf{b}\Lambda^T \cdot \mathbf{b}).$$

### 3.5. Comparison with other gene set analysis techniques

In this section, we discuss the main differences between the approach proposed in this paper and the idea of gene set enrichment analysis (GSEA) presented in Subramanian et al. (2005) and generalized by Efron and Tibshirani (2007).

Permutation based methods of gene set analysis, including GSEA, first compute an association measure relating the expression levels of each gene in the list to the phenotype (e.g., the *p*-value from the two sample *t*-test). The individual association measures are then combined into an *enrichment score* for each gene set (GSEA uses a version of the Kolmogorov-Smirnov test statistic, while a maxmean function is used in GSA). The main strength of the GSEA method, which is also inherited by its extensions, is that the correlation structure of genes in the gene set is preserved, and the permutation based distribution of the enrichment score also represents the correlation among genes. However, these methods compute the individual association measures of each gene separately and do not directly include the correlation among genes when calculating the enrichment score.

Alternatively, if efficient estimation of the covariance matrix is possible, parametric test statistics may be used to test the difference between the expression levels of the two treatment groups. This is not usually possible since in most microarray analysis applications the number of parameters needed to be estimated is considerably larger than the number of samples available ($n \ll p$). However, the external information about the underlying gene network can make this estimation problem tractable. For instance, in the mixed linear model proposed in this paper, the covariance matrix is modeled as a function of a few parameters which can be efficiently estimated from the data. Thus, it is possible to test the significance of each gene set using tests that include the expression levels of *all* genes in the gene set and also directly incorporate the covariance structure of the genes in each subnetwork. An example of such a test statistic is the *T*_{2} test statistic discussed in Section 3.4, which is a version of the two-sample *t*-test. If the model is correctly specified, one could expect such a test statistic to be sensitive to changes in both the expression levels and the covariance structure. However, in the absence of external information about the network, estimation of the covariance matrix may be impractical and non-parametric methods like GSEA may offer better inference properties.

In the next section, we carry out simulation studies to illustrate the difference between the proposed model and the GSEA method. We will also examine the effect of the choice of the contrast vector on the performance of the proposed test statistic.

## 4.Performance Analysis

Three sets of simulation studies are considered in this section. In the first simulation, we study different choices of contrast vectors and compare their performance with GSEA in a simple network. The second simulation study is designed to analyze the combined effect of changes in mean and covariance between control and treatment conditions. In the last simulation, we evaluate the sensitivity of the proposed inference procedure to the presence of noise in the association weights. Note that in the simulation studies of this section, it is assumed that the effect of the gene network is appropriately modeled using the latent variable model of Section 2 and that the topology of the network is correctly specified.

### 4.1.Simulation 1: Different choices of contrast vector

In the first setting, a simple network structure consisting of an eight-level binary tree with 255 nodes is used. It is assumed that there are no interactions in the network under the control condition (Λ^{C} = *I*). Under the treatment condition, genes on the network are assumed to be positively correlated with different association strengths: the association for the first three levels of the genes in the network (top seven genes in the tree) is assumed to be 0.8, genes in the next three levels (56 genes) have association equal to 0.5, and the remainder of the genes are weakly associated with *ρ* = 0.2. Under control, the mean vector for mRNA expression levels of genes is set to zero (*μ*^{C} = 0). Scenarios for mean expression levels under treatment are presented in Table 2, and the gene sets considered in this simulation are given in Table 3. The gene sets are chosen so that for each mean scenario there exist gene sets with highly expressed genes and also gene sets that represent non-differentially expressed genes.
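The network layout described above can be reproduced with a short sketch, which also confirms the gene counts per association level. The heap-style node indexing and the rule that an edge carries the association strength of the child's level are assumptions made here for illustration; the source does not specify how weights are attached to edges.

```python
import numpy as np

def binary_tree_parents(n_levels):
    """Parent index of each node in a complete binary tree (root = node 0, parent -1)."""
    n_nodes = 2 ** n_levels - 1
    parents = np.full(n_nodes, -1)
    parents[1:] = (np.arange(1, n_nodes) - 1) // 2
    return parents

def node_levels(n_nodes):
    """Level of each node, counting the root as level 1."""
    return np.array([(i + 1).bit_length() for i in range(n_nodes)])

def association_strengths(levels):
    """Treatment-condition association: 0.8 (levels 1-3), 0.5 (levels 4-6), 0.2 (levels 7-8)."""
    return np.where(levels <= 3, 0.8, np.where(levels <= 6, 0.5, 0.2))

def weighted_adjacency(parents, rho):
    """Assumption: the edge from a parent into node i carries the strength of node i's level."""
    n = parents.size
    adjacency = np.zeros((n, n))
    children = np.arange(1, n)
    adjacency[children, parents[children]] = rho[children]
    return adjacency

parents = binary_tree_parents(8)
levels = node_levels(parents.size)
rho = association_strengths(levels)
adjacency_treatment = weighted_adjacency(parents, rho)
counts = [int(np.sum(rho == r)) for r in (0.8, 0.5, 0.2)]
print(parents.size, counts)  # 255 [7, 56, 192]
```

The counts match the text: 7 genes in the top three levels, 56 in the next three, and the remaining 192 in the last two levels.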

Table 4 presents the powers, estimated from 1000 simulations, of the GSEA method and of tests based on the three contrast vectors, *l*^{(1)}, *l*^{(2)}, and *l*^{(N)}, introduced in Section 3.3. The powers are calculated based on the FDR controlling procedure of Benjamini and Hochberg (1995) with a *q*-value of 0.05.
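For readers unfamiliar with the FDR step, a minimal sketch of the Benjamini-Hochberg step-up rule is given below. The function name and the example *p*-values are illustrative only and are not taken from the paper's simulations.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # q * i / m for the i-th smallest p-value
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest rank meeting its threshold
        reject[order[: k + 1]] = True           # reject everything up to that rank
    return reject

example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example).sum())  # 2
```

Under this reading, the estimated power of a test for a given gene set is the fraction of the 1000 simulated data sets in which that gene set is rejected after the adjustment.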

The positive correlation structure of the network affects the significance of the subnetworks selected for this comparison. When a specific gene in the network becomes differentially expressed, the other genes in the network that are influenced by that gene will also have modified expression levels in the same direction, and the combined subnetwork becomes strongly significant. This propagation mechanism explains the abundance of powers of 1 in the table. The first mean scenario in this study corresponds to the case that Λ^{C} μ^{C} = Λ^{T} μ^{T}. All the methods achieve the nominal significance level of 0.05 for this test. On the other hand, there are some differences between the tests based on different contrast vectors and the GSEA method. As one expects from the discussion in Section 3.4, the test based on *l*^{(2)} has higher power than the test based on *l*^{(1)}. It can also be seen that in all but one case, the power of the test based on *l*^{(2)} is higher than the power of the GSEA method, in agreement with the discussion of Section 3.5. A few cases deserve special attention. The GSEA method indicates no power for testing all the genes in the network under scenario 2. However, in this case the top 1/3 levels of the tree are significant, and therefore it is natural to expect significant differences in overall expression levels. The same pattern can be observed when comparing the two methods for testing the right branch of the tree under the second scenario and the top 1/3 of genes under the third scenario. On the other hand, the test based on *l*^{(2)} has a high false positive rate for testing the right branch of the tree in the situation where only the left branch is up-regulated (scenario 4), while the GSEA method correctly shows no deviation from the null hypothesis. The same phenomenon can be seen in the results of testing the last level of the tree in the case where the top 2/3 levels of the tree are significant. The test based on *l*^{(2)} is not able to isolate the significance of the genes under consideration from the effect of other genes in the network and can therefore result in high false positive rates. As expected based on Lemma 3.1, the test based on *l*^{(N)} resolves these shortcomings. The power of this test is close to the nominal significance level for the above two cases, while it offers high power in cases where the GSEA method fails to distinguish the significance of the subnetworks.

### 4.2.Simulation 2: Simultaneous changes in mean & covariance

The second simulation study is designed to evaluate simultaneous changes in expression levels as well as associations among genes. The network structure in this simulation consists of three root nodes and seven five-level trees (220 genes total). The network consists of low and high association subnetworks and also includes both positive and negative correlations. Three of the subnetworks are considered to be differentially expressed (the level of expression increases in increments of 0.2), and the other subnetworks have equal values of mean in treatment and control conditions. Figure 4 illustrates the setting of parameters of this simulation study.
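As a quick check of the stated network size, and assuming each five-level tree is a complete binary tree (an assumption that is consistent with the total but is not stated explicitly in the text), the gene count works out as:

```latex
3 + 7\,(2^{5} - 1) = 3 + 7 \cdot 31 = 220
```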

**...**

Table 5 presents the estimates of powers for the GSEA method and the test based on *l*^{(N)} for testing different trees with increasing expression levels in a simulation with 1000 repetitions. It can be seen from the results that both of these methods reject the null hypothesis for tests related to trees with high positive correlation (subtrees 1, 2, and 7 in Fig. 4). The GSEA method can only detect the significance of subtree 3 for large increases in the expression level, while the test based on *l*^{(N)} can detect this change for smaller increases. Subtrees 4 and 5 correspond to cases where the correlation among genes is minimal. Subtree 4 is affected by root genes 1 and 2, which are both up-regulated but have opposite correlations with genes in subtree 4. As one would expect, the powers for subtree 4 are similar to those of subtree 5, which suggests that the combined effect of genes 1 and 2 on subtree 4 is the same as the effect of gene 3 on subtree 5. Subtree 6 illustrates the fact that the test based on *l*^{(N)} takes advantage of the known correlation structure even if the genes in the network are negatively correlated, while the GSEA method cannot detect the change in the correlation structure between control and treatment conditions.

### 4.3.Simulation 3: Effect of noise in network information

In the last simulation, we evaluate the sensitivity of the proposed inference procedure to presence of noise in association weights of the gene network. The network consists of four similar subnetworks, each with 40 genes. Under control, genes have mean *μ*^{C} = 1 and the weights of the adjacency matrix are set to 0.2. The settings of the parameters under treatment are given in Table 6. The estimated powers of tests of significance of each subnetwork using a test based on *l*^{(N)} are plotted in Figures 5 and 6. Figure 5 represents the case where the errors are introduced at random, that is, each weight in the adjacency matrix under treatment is perturbed by a uniform noise in the range [−*e*, *e*] where *e* is a value between 0 and 0.4. On the other hand, Figure 6 represents the estimated powers of tests when a systematic bias is included in the weights of the adjacency matrix under treatment. It can be seen that if the underlying model is correctly specified, presence of random noise in weights of adjacency matrix will not significantly affect the power of the test. However, presence of systematic bias in the estimated weights can introduce both type I, as well as type II errors. This is illustrated by the increase of power of the test as the difference between weights under treatment and control becomes more significant (Fig. 6). It is important to note that the simulation considered here does not include errors in the topology of the network. These errors become more critical if the topology of the network, as well as the association weights, are estimated from expression data, which is beyond the scope of this article.
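The two kinds of error in the association weights can be sketched as follows. The chain topology inside each subnetwork, the placeholder weight of 0.2 for the treatment matrix (the actual treatment settings are in Table 6), and the bias value of 0.1 are all hypothetical choices made only to keep the example self-contained. Only existing edges are perturbed, reflecting the statement that the topology is taken as known.

```python
import numpy as np

def block_chain_weights(n_blocks=4, block_size=40, weight=0.2):
    """Hypothetical topology: within each subnetwork, gene i influences gene i + 1."""
    n = n_blocks * block_size
    weights = np.zeros((n, n))
    for b in range(n_blocks):
        start = b * block_size
        for i in range(start, start + block_size - 1):
            weights[i + 1, i] = weight
    return weights

def add_random_noise(weights, e, rng):
    """Perturb each existing edge weight by independent Uniform[-e, e] noise."""
    noise = rng.uniform(-e, e, size=weights.shape)
    return np.where(weights != 0, weights + noise, 0.0)

def add_systematic_bias(weights, bias):
    """Shift every existing edge weight by the same amount."""
    return np.where(weights != 0, weights + bias, 0.0)

rng = np.random.default_rng(0)
treatment = block_chain_weights()        # placeholder values, not the Table 6 settings
noisy = add_random_noise(treatment, e=0.4, rng=rng)
biased = add_systematic_bias(treatment, bias=0.1)
```

Random noise of this kind averages out across the many edges of a subnetwork, whereas a common bias shifts every weight in the same direction, which is one way to read the contrast between Figures 5 and 6.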

## 5.Analysis Of Yeast Galactose Utilization Pathway Data

In Section 1.1, we analyzed the yeast GAL pathway data (Ideker data) using the GSEA method, which revealed that the Galactose Utilization pathway is significantly activated in the gal+ condition. In that analysis, the external information provided by the network was only used to determine the gene sets of interest. As discussed in Section 1.1, the Ideker data also include the strength of gene interactions in the network. Therefore, it is possible to directly incorporate the network information and use the proposed network-based inference procedure. It is important to note that the Ideker data only include one set of association weights for both gal+ and gal− conditions. In other words, in this section we assume Λ^{T} = Λ^{C} = Λ, and hence the proposed inference procedure cannot test the change in the network structure. Assuming that the latent variable model correctly represents the effect of the underlying network, the increased power of the network-based procedure is mainly due to directly incorporating the network information.

Table 7 compares the results of analyzing the Ideker data using the GSEA method and the network-based method presented in this paper (using *l*^{(N)}). This table also includes the results of analyzing these data using the GSA method of Efron and Tibshirani (2007).^{2} As one may expect, all three methods find the Galactose Utilization pathway to be statistically significant. Although the GSEA and the GSA methods agree on the significance of other subnetworks, it can be seen from Table 7 and Figure 7 that including the underlying network structure in the analysis reveals four additional significant pathways. Although additional experiments are needed to verify the results of Table 7, the biology of yeast cells may offer some insight into the significance of the newly detected pathways. These pathways can be categorized into two groups: the Galactose Utilization and rProtein Synthesis pathways are involved in cell growth in the gal+ environment, while genes in the Stress, Respiration, and Fatty Acid Oxidation pathways are induced in the gal− environment. The Stress pathway has a low nominal *p*-value in both GSEA and GSA results; however, these methods do not consider this pathway significant. The significance of the Stress pathway is not surprising and can be explained by the fact that galactose is a more efficient source of carbon than raffinose. Thus, in the absence of galactose (gal−), the genes in the Environmental Stress Response (ESR) are induced (Gasch et al., 2000; Gasch and Werner-Washburne, 2002). The Fatty Acid Oxidation and Respiration pathways are also up-regulated in the gal− environment. The genes in the Respiration pathway are among the genes that are induced in the ESR.

Many of the stress defense mechanisms consume ATP, and therefore, cellular stress could lead to the induced expression of respiration genes (Hohmann and Mager, 2003). Also, many genes involved in importing and exporting fatty acids are induced in the ESR, and the induction of these genes can increase the local concentration of fatty acids, which in turn may induce the expression of genes in the Fatty Acid Oxidation pathway (Hohmann and Mager, 2003). The induction of Fatty Acid Oxidation and Respiration genes can be further explained by the coregulation of genes in these pathways. It should be noted that two of the genes in the Respiration pathway are directly affected by genes in the GAL pathway (GAL4 regulates CYC1 and HAP4 is regulated by MIG1), and our proposed model can exploit such relationships in order to gain more statistical power. Finally, the significance of the rProtein Synthesis genes can be explained by growth-dependent expression of these genes and the fact that the ESR represses the expression of many protein synthesis genes (Hohmann and Mager, 2003).

## 6.Discussion

Finding significant subnetworks and pathways that are involved in certain biological phenomena has been the focus of many new studies. The main challenge is to formulate null and alternative hypotheses that consider the change in the expression levels of the genes as well as the change in the network structure in response to environmental factors. In this paper, we proposed a model-based approach for testing the significance of biological pathways using the underlying gene network and studied graph theoretic properties of the model. Our approach uses external information available about the underlying network and hence depends on the availability and quality of such data. The method proposed in this paper incorporates the weighted adjacency matrix of the network through a latent variable model and uses a flexible mixed linear representation. We showed that inference based on this method depends on the choice of the contrast vector and proposed a choice that offers improvement in the power of the test compared to the GSEA method of Subramanian et al. (2005). The simulation studies and the analysis of the yeast galactose utilization pathway reveal the ability of the proposed method to identify significant pathways that are otherwise difficult to distinguish. Although the focus of this paper was on testing the significance of subnetworks in the two-population inference problem, the proposed method provides a general framework for studying a variety of phenotypes, including analysis of time series mRNA data and the change in the network over time. More generally, different correlation structures among observations can be implemented in the mixed linear model, and therefore different types of data can be modeled using this framework. Considering parameters for environmental factors and gene-gene and gene-environment interactions is also a straightforward extension of the proposed model.

The model presented in this paper relies on two main assumptions: (a) that the relationship between the expression levels of genes in the network can be represented linearly using the influence matrix of the network and (b) that the data follow a normal distribution. Although the first assumption is a crucial part of this analysis, the second assumption can be relaxed using the Generalized Mixed Linear Model (GMLM) framework. However, this would make the computational aspects of the problem more challenging.

The growth of information available on the underlying biological networks calls for effective methods that can utilize such information efficiently and requires extensions of statistical methods appropriate for the study of network structures. The model presented in this paper requires external information on the weighted adjacency matrix of the network. Although more data are becoming available on gene and protein networks, many available network data sets only include the binary association among genes (network topology) and do not include information about the strength or direction of associations among genes. The problem of estimating the weighted adjacency matrix of the network, which is related to estimation of the covariance matrix, is of separate interest and is beyond the scope of this paper. Chaudhuri et al. (2007) propose an efficient algorithm for estimating the association among genes when the topology of the network is known. The method proposed in this paper can also be extended to cases where only partial information about the network is available.

## Footnotes

^{1}The mean expression levels of the two samples corresponding to each perturbation are subtracted from the two columns of data.

^{2}The maxmean criterion is used as the enrichment function in the GSA method.

## Acknowledgments

We would like to thank the Co-Editor-in-Chief, Professor Sorin Istrail, and two anonymous referees for helpful comments and suggestions. We are also thankful to Professor Trey Ideker for providing the yeast Galactose Utilization data and helpful discussions. The work of George Michailidis was partially supported by the NIH (grant 5P 41RR018627) and the MEDC (grant GR-687).

## Disclosure Statement

No competing financial interests exist.

## References

- Alexa A. Rahnenfuhrer J. Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22:1600–1607.
- Ashburner M. Ball C. Blake J., et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29.
- Bader J.S. Chaudhuri A. Rothberg J.M., et al. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol. 2004;22:78–85.
- Barry W.T. Nobel A.B. Wright F.A. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005;21:1943–1949.
- Benjamini Y. Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. Ser. B. 1995;57:289–300.
- Chaudhuri S. Drton M. Richardson T. Estimation of a covariance matrix with zeros. Biometrika. 2007;94:199–216.
- Diestel R. Graph Theory. Springer-Verlag; New York: 2006.
- Efron B. Tibshirani R. On testing the significance of sets of genes. Ann. Appl. Statist. 2007;1:107–129.
- Friedberg S.H. Insel A.J. Spence L.E. Linear Algebra. Prentice Hall; Englewood Cliffs, NJ: 1996.
- Gasch A.P. Spellman P.T. Kao C.M., et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell. 2000;11:4241–4257.
- Gasch A.P. Werner-Washburne M. The genomics of yeast responses to environmental stress and starvation. Funct. Integr. Genomics. 2002;2:181–192.
- Goeman J.J. Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23:980–987.
- Hohmann S. Mager W. Yeast Stress Responses. Springer; New York: 2003.
- Ideker T. Thorsson V. Ranish J., et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science. 2001:292.
- Jiang Z. Gentleman R. Extensions to gene set enrichment. Bioinformatics. 2007;23:306–313.
- Kanehisa M. Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30.
- Li K.-C. Genome-wide coexpression dynamics: theory and application. Proc. Natl. Acad. Sci. USA. 2002;99:16875–16880.
- McLean R.A. Sanders W.L. Approximating degrees of freedom for standard errors in mixed linear models. Proc. Statist. Comput. Sect. Am. Statist. Assoc. 1988:50–59.
- Rahnenführer J. Domingues F.S. Maydt J., et al. Calculating the statistical significance of changes in pathway activity from gene expression data. Statist. Appl. Genet. Mol. Biol. 2004;3:16.
- Searle S.R. Linear Models. John Wiley & Sons, Inc.; New York: 1971.
- Subramanian A. Tamayo P. Mootha V., et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA. 2005;102:15545–15550.
- Tian L. Greenberg S.A. Kong S.W., et al. Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. USA. 2005;102:13544–13549.
- Wei P. Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics. 2008;24:404–411.
- Wei Z. Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007;23:1537–1544.
- West M. Technical report. Institute of Statistics and Decision Sciences; 2000. Bayesian regression analysis in the large p small n paradigm.

**Mary Ann Liebert, Inc.**

Analysis of Gene Sets Based on the Underlying Regulatory Network. Journal of Computational Biology. Mar 2009; 16(3):407.
