J Comput Biol (PMC3123978)

# Subnetwork State Functions Define Dysregulated Subnetworks in Cancer

Salim A. Chowdhury,^{1} Rod K. Nibbe,^{2,4} Mark R. Chance,^{3,4} and Mehmet Koyutürk^{*,1,4}

^{1}Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, Ohio.

^{2}Department of Pharmacology, Case Western Reserve University, Cleveland, Ohio.

^{3}Department of Physiology and Biophysics, Case Western Reserve University, Cleveland, Ohio.

^{4}Center of Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, Ohio.

^{*}Corresponding author.

*Dr. Mehmet Koyutürk, Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106. E-mail: mxk331@case.edu*

## Abstract

Emerging research demonstrates the potential of protein-protein interaction (PPI) networks in uncovering the mechanistic bases of cancers, through identification of interacting proteins that are coordinately dysregulated in tumorigenic and metastatic samples. When used as features for classification, such coordinately dysregulated subnetworks improve diagnosis and prognosis of cancer considerably over single-gene markers. However, existing methods formulate coordination between multiple genes through additive representation of their expression profiles and utilize fast heuristics to identify dysregulated subnetworks, which may not be well suited to the potentially combinatorial nature of coordinate dysregulation. Here, we propose a combinatorial formulation of coordinate dysregulation and decompose the resulting objective function to cast the problem as one of identifying subnetwork state functions that are indicative of phenotype. Based on this formulation, we show that coordinate dysregulation of larger subnetworks can be bounded using simple statistics on smaller subnetworks. We then use these bounds to devise an efficient algorithm, Crane, that can search the subnetwork space more effectively than existing algorithms. Comprehensive cross-classification experiments show that subnetworks identified by Crane outperform those identified by additive algorithms in predicting metastasis of colorectal cancer (CRC).

**Key words:** combinatorial optimization, computational molecular biology, machine learning

## 1. Introduction

Recent advances in high-throughput screening techniques enable studies of complex phenotypes in terms of their associated molecular mechanisms. While genomic studies provide insights into genetic differences that relate to certain phenotypes, functional genomics (e.g., gene expression, protein expression) helps elucidate the variation in the activity of cellular systems (Schadt, 2005). However, cellular systems are orchestrated through combinatorial organization of thousands of biomolecules (Papin et al., 2005). This complexity is reflected in the diversity of phenotypic effects, which generally present themselves as weak signals in the expression profiles of single molecules. For this reason, researchers increasingly focus on identification of multiple markers that together exhibit differential expression with respect to various phenotypes (Ideker and Sharan, 2008; Rich et al., 2005).

### 1.1. Network-based approaches to identification of multiple markers

High-throughput protein-protein interaction (PPI) data (Ewing et al., 2007) provide an excellent substrate for network-based identification of multiple interacting markers. Network-based analyses of diverse phenotypes show that products of genes that are implicated in similar phenotypes are clustered together into “hot spots” in PPI networks (Goh et al., 2007; Rhodes and Chinnaiyan, 2005). This observation is exploited to identify novel genetic markers based on network connectivity (Franke et al., 2006; Karni et al., 2009; Lage et al., 2007). For the identification of differentially expressed subnetworks with respect to GAL80 deletion in yeast, Ideker et al. (2002) propose a method that is based on searching for connected subgraphs with high aggregate significance of individual differential expression. Variants of this method are shown to be effective in identifying multiple genetic markers in prostate cancer (Guo et al., 2007), melanoma (Nacu et al., 2007), diabetes (Liu et al., 2007), and others (Cabusora et al., 2005; Patil and Nielsen, 2005; Scott et al., 2005).

### 1.2. Coordinate/synergistic dysregulation

Network-based approaches are further elaborated to capture coordinate dysregulation of interacting proteins at a sample-specific resolution (Chowdhury and Koyutürk, 2010). Ulitsky et al. (2008) define dysregulated pathways as subnetworks composed of products of genes that are dysregulated in a large fraction of phenotype samples. Chuang et al. (2007) define subnetwork activity as the aggregate expression of genes in the subnetwork, quantify the dysregulation of a subnetwork in terms of the mutual information between subnetwork activity and phenotype, and develop fast algorithms to identify subnetworks that exhibit significant dysregulation. Subnetworks identified by this approach are also used as features for classification of breast cancer metastasis, providing significant improvement over single-gene markers (Chuang et al., 2007). Nibbe et al. (2009, 2010) show that this notion of coordinate dysregulation is also effective in integrating protein and mRNA expression data to identify important subnetworks in colorectal cancer (CRC). Anastassiou (2007) introduces the concept of synergy to delineate the complementarity of multiple genes in the manifestation of phenotype. While identification of multiple genes with synergistic dysregulation is intractable (Anastassiou, 2007), important insights can still be gained through pairwise assessment of synergy (Watkinson et al., 2008).

### 1.3. Contributions of this study

Despite significant advances, existing approaches to the identification of coordinately dysregulated subnetworks have important limitations, including the following: (i) additive formulation of subnetwork activity can only highlight the coordinate dysregulation of interacting proteins that are dysregulated in the same direction, overlooking the effects of inhibitory and other complex forms of interactions; (ii) simple heuristics that make greedy decisions may not be able to adequately capture the coordination between multiple genes that provide weak individual signals. In this article, with a view to addressing these challenges, we develop a novel algorithm, Crane, for the identification of Combinatorially dysRegulAted subNEtworks. The contributions of the proposed computational framework include the following:

- We formulate coordinate dysregulation combinatorially, in terms of the mutual information between *subnetwork state functions* (specific combinations of quantized mRNA expression levels of proteins in a subnetwork) and phenotype (as opposed to additive *subnetwork activity*).
- We decompose combinatorial coordinate dysregulation into terms associated with individual state functions, to cast the problem as one of identifying state functions that are *informative* about the phenotype.
- Based on this formulation, we show that the information provided on phenotype by a state function can be bounded from above using statistics of subsets of this subnetwork state. Using this bound, we develop bottom-up enumeration algorithms that can effectively prune the subnetwork search space to identify informative state functions efficiently.
- We use subnetworks identified by the proposed algorithms to train neural networks for classification of phenotype; these classifiers are better suited to modeling the combinatorial relationship between the expression levels of genes in a subnetwork, as compared to classifiers that require aggregates of the expression profiles of genes as features (e.g., support vector machines [SVMs]).

We describe these algorithmic innovations in detail in Section 2.

### 1.4. Results

We implement Crane in Matlab and perform comprehensive cross-classification experiments for prediction of metastasis in CRC. These experiments show that subnetworks identified by the proposed framework outperform subnetworks identified by additive algorithms in terms of accuracy of classification. We then conduct comprehensive experiments to evaluate the effect of parameters on the performance of Crane. We also investigate the highly informative subnetworks in detail to assess their potential in highlighting the mechanisms of metastasis in human CRC. We present these results in Section 3 and conclude our discussion in Section 4.

## 2. Methods

In the context of a specific phenotype, a group of genes that exhibit significant differential expression and whose products interact with each other may be useful in understanding the network dynamics of the phenotype. This is because the patterns of (i) collective differential expression and (ii) connectivity in the PPI network are derived from independent data sources (sample-specific mRNA expression and generic protein-protein interactions, respectively). Thus, they provide corroborating evidence indicating that the corresponding subnetwork of the PPI network may play an important role in the manifestation of phenotype. In this article, we refer to the collective differential expression of a group of genes as *coordinate dysregulation*. We call a group of coordinately dysregulated genes that induce a connected subnetwork in a PPI network a *coordinately dysregulated subnetwork*. The terminology and notation in this article are described in Table 1.

### 2.1. Dysregulation of a gene with respect to a phenotype

For a set of genes 𝒢 = {*g*_{1}, …, *g*_{n}} and a set of samples 𝒮 = {*s*_{1}, …, *s*_{m}}, let *E*_{i} denote the properly normalized (Quackenbush, 2002) gene expression vector for gene *g*_{i} ∈ 𝒢, where *E*_{i}(*j*) denotes the relative expression of *g*_{i} in sample *s*_{j} ∈ 𝒮. Assume that the phenotype vector *C* annotates each sample as phenotype or control, such that *C*_{j} = 1 indicates that sample *s*_{j} is associated with the phenotype (e.g., taken from a metastatic sample) and *C*_{j} = 0 indicates that *s*_{j} is a control sample (e.g., taken from a non-metastatic tumor sample). Then, the mutual information *I*(*E*_{i}; *C*) = *H*(*E*_{i}) − *H*(*E*_{i}|*C*) of *E*_{i} and *C* is a measure of the reduction of uncertainty about phenotype *C* due to the knowledge of the expression level of gene *g*_{i}. Here, *H*(*X*) = −∑_{x∈𝒳} *p*(*x*) log *p*(*x*) denotes the Shannon entropy of discrete random variable *X* with support 𝒳. The entropy *H*(*E*_{i}) of the expression profile of gene *g*_{i} is computed by quantizing *E*_{i} properly. Clearly, *I*(*E*_{i}; *C*) provides a reasonable measure of the dysregulation of *g*_{i}, since it quantifies the power of the expression level of *g*_{i} in distinguishing phenotype and control samples.

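As a concrete illustration (a minimal Python sketch rather than the paper's Matlab implementation; the toy expression and phenotype vectors are hypothetical), the single-gene dysregulation score *I*(*E*_{i}; *C*) can be computed from quantized expression values:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sample."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def mutual_information(e, c):
    """I(E; C) = H(E) + H(C) - H(E, C) for paired discrete observations."""
    return entropy(e) + entropy(c) - entropy(list(zip(e, c)))

# Hypothetical quantized expression of one gene across six samples, and the
# phenotype vector C (1 = phenotype, 0 = control).
expr      = [1, 1, 1, 0, 0, 0]
phenotype = [1, 1, 1, 0, 0, 0]
print(mutual_information(expr, phenotype))  # 1.0 bit: expression determines C
```

A gene whose quantized expression is independent of the phenotype would score near zero under the same measure.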
### 2.2. Additive coordinate dysregulation

Now let *G* = (*V*, *E*) denote a PPI network where the product of each gene *g*_{i} is represented by a node *v*_{i} ∈ *V* and each edge *v*_{i}*v*_{j} ∈ *E* represents an interaction between the products of *g*_{i} and *g*_{j}. For a subnetwork *S* of *G* with set of nodes *V*_{S} ⊆ *V*, Chuang et al. (2007) define the *subnetwork activity* of *S* as *E*_{S} = ∑_{g_i∈S} *E*_{i}/√|*V*_{S}|, i.e., the aggregate expression profile of the genes in *S*. Then, the dysregulation of *S* is given by *I*(*E*_{S}; *C*), which is a measure of the reduction in uncertainty on phenotype *C*, due to knowledge of the aggregate expression level of all genes in *S*. In the following discussion, we refer to *I*(*E*_{S}; *C*) as the *additive coordinate dysregulation* of *S*.

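For contrast with the combinatorial formulation developed next, the additive activity computation can be sketched in Python (the 1/√|S| scaling follows the aggregation of Chuang et al. (2007); the z-score-like toy values are hypothetical):

```python
import math

def subnetwork_activity(expr_rows):
    """Additive subnetwork activity: per-sample sum of the member genes'
    normalized expression values, scaled by 1/sqrt(|S|)."""
    k = len(expr_rows)                 # number of genes in subnetwork S
    m = len(expr_rows[0])              # number of samples
    return [sum(row[j] for row in expr_rows) / math.sqrt(k) for j in range(m)]

# Two genes dysregulated in the SAME direction reinforce each other:
up1 = [2.0, 1.8, -1.9, -2.1]
up2 = [1.9, 2.1, -2.0, -1.8]
print(subnetwork_activity([up1, up2]))

# Genes dysregulated in OPPOSITE directions cancel, hiding the coordination:
down = [-2.0, -1.8, 1.9, 2.1]
print(subnetwork_activity([up1, down]))  # [0.0, 0.0, 0.0, 0.0]
```

The second call illustrates the limitation discussed in the next section: the additive aggregate of oppositely dysregulated genes carries no signal.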
### 2.3.Combinatorial coordinate dysregulation

Additive coordinate dysregulation is useful for identifying subnetworks that are composed of genes dysregulated in the same direction (either up- or down-regulated). However, interactions among genes and proteins can also be inhibitory (or more complex), and the dysregulation of genes in opposite directions can also be coordinated, as illustrated in Figure 1. Combinatorial formulation of coordinate dysregulation may be able to better capture such complex coordination patterns.

**FIG. 1.** Genes (*g*) are shown as nodes; interactions between their products are shown as edges. Expression profiles (*E*) of genes are shown by colormaps. Dark red indicates high expression (H); light green **...**

To define combinatorial coordinate dysregulation, we consider binary representation of gene expression data. Binary representation of gene expression is commonly utilized for several reasons, including removal of noise, algorithmic considerations, and tractable biological interpretation of identified patterns. Such approaches are shown to be effective in the context of various problems, ranging from genetic network inference (Akutsu et al., 1999) to clustering (Koyutürk et al., 2004) and classification (Akutsu and Miyano, 2001). Ulitsky et al. (2008) also use binary representation of differential expression to identify dysregulated pathways with respect to a phenotype. There are also many algorithms for effective binarization of gene expression data (Shmulevich and Zhang, 2002).

For our purposes, let *Ê*_{i} denote the binarized expression profile of gene *g*_{i}. We say that gene *g*_{i} has *high expression* (H) in sample *s*_{j} if *Ê*_{i}(*j*) = 1 and *low expression* (L) if *Ê*_{i}(*j*) = 0. Then, the *combinatorial coordinate dysregulation* of subnetwork *S* is defined as

*I*(*F*_{S}; *C*),

where *F*_{S} is the random variable that represents the combination of binary expression states of the genes in *V*_{S}.

The difference between additive and combinatorial coordinate dysregulation is illustrated in Figure 1. Anastassiou (2007) also incorporates this combinatorial formulation to define the synergy between a pair of genes *g*_{1} and *g*_{2} as *I*(*E*_{1}, *E*_{2}; *C*) − (*I*(*E*_{1}; *C*) + *I*(*E*_{2}; *C*)). Generalizing this formulation to the synergy between multiple genes, it can be shown that identification of multiple genes with synergistic dysregulation is an intractable computational problem (Anastassiou, 2007). Here, we define combinatorial coordinate dysregulation as a more general notion than synergistic dysregulation, in that coordinate dysregulation is defined based solely on collective differential expression, whereas synergy explicitly looks for genes that cannot individually distinguish phenotype and control samples.
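The cancellation effect can be made concrete with a small Python sketch (hypothetical toy data; entropies in bits): two interacting genes dysregulated in opposite directions carry no information through their additive activity, while their joint state function remains fully informative:

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def mutual_information(x, c):
    return entropy(x) + entropy(c) - entropy(list(zip(x, c)))

# Hypothetical pair of interacting genes dysregulated in OPPOSITE directions:
# g1 is up and g2 is down in phenotype samples, and vice versa in controls.
c  = [1, 1, 1, 1, 0, 0, 0, 0]
e1 = [+2.0, +1.9, +2.1, +2.0, -2.0, -1.9, -2.1, -2.0]
e2 = [-2.0, -2.1, -1.9, -2.0, +2.0, +2.1, +1.9, +2.0]

# Additive subnetwork activity (sum) cancels out and carries no information.
activity = [round(a + b) for a, b in zip(e1, e2)]  # ~0 in every sample
print(mutual_information(activity, c))             # 0.0

# The combinatorial state function (H/L of g1, H/L of g2) is fully informative.
state = [(int(a > 0), int(b > 0)) for a, b in zip(e1, e2)]
print(mutual_information(state, c))                # 1.0
```

This mirrors the pattern sketched in Figure 1: the joint H/L states separate phenotype from control even though their sum does not.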

Subnetworks that exhibit combinatorial coordinate dysregulation with respect to a phenotype may shed light on the mechanistic bases of that phenotype. However, identification of such subnetworks is intractable, and due to the combinatorial nature of the associated objective function *I*(*F*_{S}; *C*), simple heuristics may not be well suited to this problem. This is because, as also demonstrated by the example in Figure 1, it is not straightforward to bound the combinatorial coordinate dysregulation of a subnetwork in terms of the individual dysregulation of its constituent genes or the coordinate dysregulation of its smaller subnetworks. Motivated by these considerations, we propose to decompose the combinatorial coordinate dysregulation of a subnetwork into individual subnetwork state functions and show that the information provided by state functions of larger subnetworks can be bounded using statistics of their smaller subnetworks.

### 2.4. Subnetwork state functions informative of phenotype

Let *f*_{S} denote an observation of the random variable *F*_{S}, i.e., a specific combination of the expression states of the genes in *S*. By definition of mutual information, we can write the combinatorial coordinate dysregulation of *S* as

*I*(*F*_{S}; *C*) = ∑_{f_S} *J*(*f*_{S}),     (2)

where

*J*(*f*_{S}) = *p*(*f*_{S}) ∑_{c∈{0,1}} *p*(*c*|*f*_{S}) log(*p*(*c*|*f*_{S})/*p*(*c*)).     (3)

Here, *p*(*x*) denotes *P*(*X*=*x*), that is, the probability that random variable *X* is equal to *x* (similarly, *p*(*x*|*y*) denotes *P*(*X*=*x*|*Y*=*y*)). In biological terms, *J*(*f*_{S}) can be considered a measure of the information provided by subnetwork *state function* *f*_{S} on phenotype *C*. Therefore, we say a state function *f*_{S} is *informative* of phenotype if it satisfies the following conditions:

- *J*(*f*_{S}) ≥ *j**, where *j** is an adjustable threshold.
- *J*(*f*_{S}) > *J*(*f*_{S′}) for all *f*_{S′} ⊑ *f*_{S}. Here, *f*_{S′} ⊑ *f*_{S} denotes that *f*_{S′} is a substate of state function *f*_{S}, that is, *V*_{S′} ⊆ *V*_{S} and *f*_{S′} maps each gene in *S*′ to an expression level that is identical to the mapping provided by *f*_{S}.

Here, the first condition ensures that the information provided by the state function is considered high enough with respect to a user-defined threshold. The second condition ensures that informative state functions are non-redundant, that is, a state function is considered informative only if it provides more information on the phenotype than any of its substates can. This restriction ensures that the expression of each gene in the subnetwork provides additional information on the phenotype, capturing the synergy between multiple genes to a certain extent. For a given set of phenotype and control samples and a reference PPI network, the objective of our framework is to identify all informative state functions.
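A small Python sketch of the per-state contribution *J*(*f*) from the decomposition above (base-2 logarithms; the toy joint states are hypothetical):

```python
import math
from collections import Counter

def j_value(states, c, f):
    """J(f) = p(f) * sum_c p(c|f) * log2(p(c|f) / p(c)): the contribution of a
    single observed state function f to the total mutual information I(F_S; C)."""
    n = len(states)
    idx = [i for i, st in enumerate(states) if st == f]
    if not idx:
        return 0.0
    z = len(idx) / n                               # p(f)
    cond = Counter(c[i] for i in idx)              # class counts given f
    prior = Counter(c)                             # overall class counts
    return z * sum((k / len(idx)) * math.log2((k / len(idx)) / (prior[v] / n))
                   for v, k in cond.items())

# Hypothetical joint H/L states of a two-gene subnetwork across six samples.
states = [(1, 0), (1, 0), (0, 1), (0, 0), (0, 1), (0, 0)]
c      = [1,      1,      0,      0,      1,      0]

print(round(j_value(states, c, (1, 0)), 4))  # 0.3333: seen only in phenotype
print(round(j_value(states, c, (0, 1)), 4))  # 0.0: split evenly, no information
```

Summing *J* over all distinct observed states recovers the combinatorial coordinate dysregulation of the subnetwork.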

The following theorem shows that *J*(*f*_{S}) ≤ *J*_{max}(*q*) = max{−*q* log *q*, −(1 − *q*) log(1 − *q*)} for any *f*_{S}, where *q* denotes the fraction of phenotype samples among all available samples.

#### Theorem 1

*For a given gene expression dataset, let the fraction of phenotype samples be q* = *p*(1) = *P*(*C* = 1)*. Then, for any subnetwork* *S* *and state function* *f*_{S} *of* *S*,

*J*(*f*_{S}) ≤ *J*_{max}(*q*) = max{−*q* log *q*, −(1 − *q*) log(1 − *q*)}.     (4)

We use the following conventions for notational convenience:

- *z* denotes *p*(*f*_{S}), that is, the probability that subnetwork *S* is in state *f*_{S} in a given sample.
- *s* denotes *p*(*C* = 1|*f*_{S}), that is, the probability that a sample with state *f*_{S} for the genes in *S* is associated with the phenotype of interest.

#### Proof

Assume that *q* and *z* are fixed. Then we can write *J* as a function of *s*:

*J*(*s*) = *z*[*s* log(*s*/*q*) + (1 − *s*) log((1 − *s*)/(1 − *q*))].     (5)

Taking the derivative of this function with respect to *s*, we obtain

*J*′(*s*) = *z*[log(*s*/*q*) − log((1 − *s*)/(1 − *q*))].

Observe that *J*′(*s*) assumes its zero at *s* = *q*. Furthermore, for *s* > *q*, since *s*/*q* > 1 and (1 − *s*)/(1 − *q*) < 1, *J*′(*s*) is always positive and *J* is an increasing function of *s*. Similarly, for *s* < *q*, *J*′(*s*) is always negative and *J* is a decreasing function of *s*. Consequently, *J*(*s*) is always non-negative and it assumes its maximum at one of the boundaries of the range of values that *s* can take. Therefore, for fixed *q*, if we bound *J*(*s*) at the boundaries that are enforced by *z*, we can write the bound on *J* as a function of *z*. The maxima of this function over all values of *z* will provide a bound on *J* over all possible values of *z* and *s* for fixed *q*. We analyze the cases *z* ≥ *q* and *z* ≤ *q* separately.

**Case A:** *z* ≥ *q*, that is, the state function is observed at least as commonly as the phenotype of interest. In this case, since the number of phenotype samples in which the state function is observed can be at most equal to the number of all phenotype samples, we have *s* ≤ *q*/*z*. On the other hand, if *z* ≤ 1 − *q*, then it is possible that none of the samples that exhibit the state function are associated with the phenotype, and therefore *s* ≥ 0. Finally, when *z* ≥ 1 − *q* (which, since also *z* ≥ *q*, is only possible if *z* ≥ 1/2), *s* will be minimized if all samples that are not associated with the phenotype exhibit the state function, and therefore we have *s* ≥ 1 − (1 − *q*)/*z*. Consequently, we have three boundary cases for *s*:

- *s* = *q*/*z*, subject to *q* ≤ *z* ≤ 1.
- *s* = 0, subject to *q* ≤ *z* ≤ 1 − *q*.
- *s* = 1 − (1 − *q*)/*z*, subject to max{*q*, 1 − *q*} ≤ *z* ≤ 1.

We consider each of these boundary cases separately.

**Case A1:** Letting *s* = *q*/*z* in (5), we obtain

*J*_{A1}(*z*) = *z*[(*q*/*z*) log(1/*z*) + (1 − *q*/*z*) log(((*z* − *q*)/*z*)/(1 − *q*))]

and therefore *J*_{A1}(*z*) = (*z* − *q*) log((*z* − *q*)/(1 − *q*)) − *z* log *z*. Consequently, *J*′_{A1}(*z*) = log((*z* − *q*)/(*z*(1 − *q*))) ≤ 0 for *q* ≤ *z* ≤ 1, and therefore *J*_{A1}(*z*) ≤ *J*_{A1}(*q*) = −*q* log *q*, proving the bound for this case.

**Case A2:** Letting *s* = 0 in (5), we obtain *J*_{A2}(*z*) = −*z* log(1 − *q*) and therefore *J*_{A2}(*z*) ≤ *J*_{A2}(1 − *q*) = −(1 − *q*) log(1 − *q*) for *q* ≤ *z* ≤ 1 − *q*, proving the bound for this case.

**Case A3:** Letting *s* = 1 − (1 − *q*)/*z* in (5), we obtain

*J*_{A3}(*z*) = *z*[(1 − (1 − *q*)/*z*) log((1 − (1 − *q*)/*z*)/*q*) + ((1 − *q*)/*z*) log(1/*z*)]

and therefore *J*_{A3}(*z*) = (*q* + *z* − 1) log((*q* + *z* − 1)/*q*) − *z* log *z*. Consequently, *J*′_{A3}(*z*) = log((*q* + *z* − 1)/(*qz*)) assumes its zero at *z* = 1, corresponding to a minimum at *J*_{A3}(1) = 0; hence *J*_{A3} is a decreasing function of *z* over the range max{*q*, 1 − *q*} ≤ *z* ≤ 1 and attains its maximum at the left boundary of this range. Therefore, if *q* ≤ 1 − *q*, then *J*_{A3}(*z*) attains its maximum at *z* = 1 − *q*, which gives *J*_{A3}(*z*) ≤ *J*_{A3}(1 − *q*) = −(1 − *q*) log(1 − *q*). Otherwise (*q* > 1 − *q* and hence *q* > 1/2), *J*_{A3}(*z*) attains its maximum at *z* = *q*, which gives *J*_{A3}(*z*) ≤ *J*_{A3}(*q*) = (2*q* − 1) log((2*q* − 1)/*q*) − *q* log *q* ≤ −*q* log *q*, since (2*q* − 1)/*q* ≤ 1 for 1/2 ≤ *q* ≤ 1. This proves the bound for this case.

**Case B:** *z* ≤ *q*, that is, the state function is observed at most as commonly as the phenotype of interest. In this case, *s* can attain the value 1 if all samples that exhibit the state function are associated with the phenotype of interest, thus *s* ≤ 1. On the other hand, for *z* ≤ 1 − *q*, *s* can be as low as 0 if all samples that exhibit the state function are samples that are not associated with the phenotype. Finally, if *z* ≥ 1 − *q*, then *s* has to be at least 1 − (1 − *q*)/*z*, since at most a (1 − *q*)/*z* fraction of the samples that exhibit the state function can be samples that are not associated with the phenotype. Consequently, we have three boundary cases for *s*:

- *s* = 1, subject to 0 ≤ *z* ≤ *q*.
- *s* = 0, subject to 0 ≤ *z* ≤ min{1 − *q*, *q*}.
- *s* = 1 − (1 − *q*)/*z*, subject to 1 − *q* ≤ *z* ≤ *q*.

We consider each of these boundary cases separately.

**Case B1:** Letting *s* = 1 in (5), we obtain *J*_{B1}(*z*) = −*z* log *q* and therefore *J*_{B1}(*z*) ≤ −*q* log *q* for 0 ≤ *z* ≤ *q*, proving the bound for this case.

**Case B2:** Letting *s* = 0 in (5), we obtain *J*_{B2}(*z*) = −*z* log(1 − *q*) and therefore *J*_{B2}(*z*) ≤ −(1 − *q*) log(1 − *q*) for 0 ≤ *z* ≤ min{1 − *q*, *q*}, proving the bound for this case.

**Case B3:** Observe that *J*_{B3}(*z*) = *J*_{A3}(*z*). As we know from Case A3, *J*_{A3}(*z*) is a decreasing function of *z* for 1 − *q* ≤ *z* ≤ 1 and *J*_{A3}(1 − *q*) = −(1 − *q*) log(1 − *q*), so *J*_{B3}(*z*) ≤ −(1 − *q*) log(1 − *q*) for 1 − *q* ≤ *z* ≤ *q*, proving the bound for this case.

Based on this result, in practice we allow the user to specify a threshold *j*** in the range [0, 1] and adjust it as *j** = *j*** × *J*_{max}(*q*), to make the scoring criterion interpretable and uniform across all datasets.
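The bound and the threshold scaling can be checked numerically with a short Python sketch (a brute-force scan over feasible (*z*, *s*) pairs; the grid resolution and the choice *q* = 0.3 are arbitrary):

```python
import math

def j_max(q):
    """Theorem 1 bound: J(f) <= max{-q log q, -(1-q) log(1-q)} (base-2 logs)."""
    return max(-q * math.log2(q), -(1 - q) * math.log2(1 - q))

def j_of(z, s, q):
    """J as a function of z = p(f) and s = p(C=1|f) for fixed q = p(C=1)."""
    total = 0.0
    if s > 0:
        total += s * math.log2(s / q)
    if s < 1:
        total += (1 - s) * math.log2((1 - s) / (1 - q))
    return z * total

q = 0.3
worst = 0.0
for zi in range(1, 101):
    z = zi / 100
    s_lo = max(0.0, 1 - (1 - q) / z)   # feasibility bounds on s enforced by z
    s_hi = min(1.0, q / z)
    for si in range(101):
        s = s_lo + (s_hi - s_lo) * si / 100
        worst = max(worst, j_of(z, s, q))

print(worst <= j_max(q) + 1e-9)   # True: the bound holds over the whole grid
print(round(j_max(q), 4))         # scale used for j* = j** * J_max(q)
```

The scan attains the bound at *z* = *q*, *s* = 1, matching Case B1 of the proof.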

### 2.5. Algorithms for the identification of informative state functions

Since the space of state functions is very large, the problem of discovering all informative state functions is intractable. Here, we address this challenge by utilizing a bound on the value of *J* to effectively prune the search space. Our approach is inspired by a similar result by Smyth and Goodman (1992) on information-theoretic identification of association rules in databases. In the following theorem, we show that the information that can be provided by all superstates of a given state function can be bounded based on the statistics of that state function, without any information about the superstate.

#### Theorem 2

*Consider a subnetwork* *S* *and associated state function* *f*_{S}. *For any state function* *f*_{T} ⊒ *f*_{S} *of a subnetwork* *T* ⊇ *S*, *the following bound holds:*

*J*(*f*_{T}) ≤ *J*_{bound}(*f*_{S}) = max{−*zs* log *q*, −*z*(1 − *s*) log(1 − *q*)}.

The proof of this theorem is based on a more general result by Smyth and Goodman (1992) in the context of association rule mining. We first prove two lemmas necessary for the proof of Theorem 2.

#### Lemma 1

*For* 0 ≤ *x* < *a* < *b* ≤ 1, (*a* − *x*)/(*b* − *x*) ≤ *a*/*b*.

#### Proof

Let *x*_{1} < *x*_{2}. Since *b* − *a* > 0, we have *x*_{1}(*b* − *a*) < *x*_{2}(*b* − *a*). Adding *x*_{1}*x*_{2} + *ab* to both sides of the inequality, we obtain (*a* − *x*_{2})(*b* − *x*_{1}) < (*a* − *x*_{1})(*b* − *x*_{2}). Consequently, *x*_{1} < *x*_{2} implies

(*a* − *x*_{2})/(*b* − *x*_{2}) < (*a* − *x*_{1})/(*b* − *x*_{1}),

and therefore the maximum of (*a* − *x*)/(*b* − *x*) for the interval 0 ≤ *x* < *a* occurs at *x* = 0, which is equal to *a*/*b*.

#### Lemma 2

*For* 0 ≤ *b* < *a* < *x* ≤ 1, (*x* − *a*)/(*x* − *b*) ≤ (1 − *a*)/(1 − *b*).

#### Proof

Let *x*_{1} > *x*_{2}. Since *a* − *b* > 0, we have *x*_{1}(*a* − *b*) > *x*_{2}(*a* − *b*). Adding *x*_{1}*x*_{2} + *ab* to both sides of the inequality, we obtain (*x*_{1} − *a*)(*x*_{2} − *b*) > (*x*_{1} − *b*)(*x*_{2} − *a*). Consequently, *x*_{1} > *x*_{2} implies

(*x*_{1} − *a*)/(*x*_{1} − *b*) > (*x*_{2} − *a*)/(*x*_{2} − *b*),

and therefore the maximum of (*x* − *a*)/(*x* − *b*) for the interval *a* < *x* ≤ 1 occurs at *x* = 1, which is equal to (1 − *a*)/(1 − *b*).

To prove Theorem 2, we use the following conventions for notational convenience:

- *r* denotes *p*(*C* = 1|*f*_{T}), that is, the probability that a sample with state *f*_{T} for the genes in *T* is associated with the phenotype of interest.
- *T* ⊇ *S* denotes the larger subnetwork, and *f*_{T} ⊒ *f*_{S} denotes a state function of *T* whose restriction to the genes of *S* is consistent with *f*_{S}.
- γ denotes *p*(*f*_{T}|*f*_{S}), that is, the probability of observing state *f*_{T} for subnetwork *T*, given that subnetwork *S* is in state *f*_{S}.
- θ denotes *p*(*C* = 1|*f*_{S}, ¬*f*_{T}), that is, the probability that a sample is associated with the phenotype of interest, given that subnetwork *S* is in state *f*_{S} but subnetwork *T* is not in state *f*_{T} in that sample.

#### Proof of Theorem 2

We can write *J*(*f*_{T}) and *J*(*f*_{S}) as follows:

*J*(*f*_{T}) = γ*z*[*r* log(*r*/*q*) + (1 − *r*) log((1 − *r*)/(1 − *q*))],     (12)

*J*(*f*_{S}) = *z*[*s* log(*s*/*q*) + (1 − *s*) log((1 − *s*)/(1 − *q*))].     (13)

We will show that, for fixed *z* and *s*, the maximum value that *J*(*f*_{T}) attains cannot exceed *J*_{bound}(*f*_{S}) (for any choice of γ, *r*, and θ). First, by definition of conditional probability, we note the following equality:

*s* = γ*r* + (1 − γ)θ.     (14)

Since *s* is fixed, this equation represents a constraint that must be satisfied by *r*, γ, and θ. Thus, we will bound *J*(*f*_{T}) subject to this constraint. Note also that we can write this constraint as

γ = (*s* − θ)/(*r* − θ).

Without loss of generality, we assume *s* > *q*, that is, the observation of state function *f*_{S} increases the probability of a sample being associated with the phenotype (*f*_{S} “indicates” phenotype). Since we consider only two classes for the samples (phenotype or control), if the assumption does not hold (i.e., if *f*_{S} “indicates” control), then the following arguments still hold if we simply interchange the labels of the sample classes.

Given that *s*>*q*, five different cases are possible: (i) *q*<*s*<*r*, (ii) *q*<*s*=*r*, (iii) *q*<*r*<*s*, (iv) *q*=*r*<*s*, and (v) *r*<*q*<*s*. We consider each case separately.

**Case (i):** *q*<*s*<*r*. In this case, the probability of phenotype given the state of the larger subnetwork is larger than the probability of phenotype given the state of the smaller subnetwork (and thus the additional part of the larger subnetwork provides additional evidence indicating that the sample might be associated with phenotype).

Since *s* < *r*, we have *r* > γ*r* + (1 − γ)θ from (14) and thus *r* > θ. Therefore, since 0 ≤ θ < *s* < *r* ≤ 1, we can write by Lemma 1 that γ = (*s* − θ)/(*r* − θ) ≤ *s*/*r*, without putting any additional constraint on *r*. Consequently, from (12), we obtain

*J*(*f*_{T}) ≤ (*s*/*r*)*z*[*r* log(*r*/*q*) + (1 − *r*) log((1 − *r*)/(1 − *q*))]

and thus

*J*(*f*_{T}) ≤ *zs*[log(*r*/*q*) + ((1 − *r*)/*r*) log((1 − *r*)/(1 − *q*))].

Since *q* < *r* ≤ 1, the second term in the parentheses is negative. Consequently, noting *r* ≤ 1, we obtain

*J*(*f*_{T}) ≤ *zs* log(1/*q*) ≤ *J*_{bound}(*f*_{S}).

This proves the theorem for case (i).

**Case (ii):** *q*<*s*=*r*. In this case, the probability of phenotype given the state of the larger subnetwork is equal to the probability of phenotype given the state of the smaller subnetwork (and thus the additional part of the larger subnetwork does not provide additional information).

Noting γ ≤ 1 and replacing *r* with *s* in (12), we can write

*J*(*f*_{T}) ≤ *z*[*s* log(*s*/*q*) + (1 − *s*) log((1 − *s*)/(1 − *q*))].

Since 1 − *s* < 1 − *q*, the second term in the parentheses is negative, so we have

*J*(*f*_{T}) ≤ *zs* log(*s*/*q*) ≤ *zs* log(1/*q*) ≤ *J*_{bound}(*f*_{S}).

This proves the theorem for case (ii).

**Case (iii):** *q*<*r*<*s*. In this case, the observation of the state of the larger subnetwork increases the probability of phenotype compared to background, but not to the extent that the smaller subnetwork does.

The proof here is very similar to that in case (ii). Let *y*(*x*) = *x* log(*x*/*q*) + (1 − *x*) log((1 − *x*)/(1 − *q*)). Then we have *y*′(*x*) = log(*x*/*q*) − log((1 − *x*)/(1 − *q*)). Therefore, for *x* > *q*, since *x*/*q* > 1 and (1 − *x*)/(1 − *q*) < 1, *y*′(*x*) is always positive and *y* is an increasing function of *x*. Consequently, for *q* < *r* < *s*, we have

*r* log(*r*/*q*) + (1 − *r*) log((1 − *r*)/(1 − *q*)) ≤ *s* log(*s*/*q*) + (1 − *s*) log((1 − *s*)/(1 − *q*)).

Once this inequality is established, the rest of the proof for case (iii) follows the proof for case (ii).

**Case (iv):** *q*=*r*<*s*. In this case, the probability of phenotype given the state of the larger subnetwork is equal to background, thus the additional part of the larger subnetwork takes away all the evidence provided by the smaller subnetwork in favor of phenotype.

By definition of *J*(·), *J*(*f*_{T}) = 0 in this case (both *r*/*q* and (1 − *r*)/(1 − *q*) are equal to 1 in (12)). Thus, *J*(*f*_{T}) trivially satisfies the bound, proving the theorem for this case.

**Case (v):** *r*<*q*<*s*. In this case, the additional part of the larger subnetwork reverses the direction of evidence provided by the smaller subnetwork, that is the state function of the larger subnetwork increases the probability of the sample being associated with control.

The proof in this case is very similar to that for case (i). Since *r* < *s*, using Equation (14) we have *r* < γ*r* + (1 − γ)θ and thus *r* < θ. Therefore, since 0 ≤ *r* < *s* < θ ≤ 1, we can write by Lemma 2 that γ = (θ − *s*)/(θ − *r*) ≤ (1 − *s*)/(1 − *r*), without putting any additional constraint on *r*. Consequently, from (12), we obtain

*J*(*f*_{T}) ≤ ((1 − *s*)/(1 − *r*))*z*[*r* log(*r*/*q*) + (1 − *r*) log((1 − *r*)/(1 − *q*))]

and thus

*J*(*f*_{T}) ≤ *z*(1 − *s*)[(*r*/(1 − *r*)) log(*r*/*q*) + log((1 − *r*)/(1 − *q*))].

Since *r* < *q*, the term involving log(*r*/*q*) is negative, and also 1 − *r* ≤ 1; therefore

*J*(*f*_{T}) ≤ *z*(1 − *s*) log(1/(1 − *q*)) ≤ *J*_{bound}(*f*_{S}).

This proves the theorem for case (v).
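The bound can also be exercised numerically with a Python sketch (random hypothetical data; the gene indices and the choice of *f*_{S} are arbitrary): the *J*-value of any superstate stays below max{−*zs* log *q*, −*z*(1 − *s*) log(1 − *q*)} computed from the substate's statistics alone:

```python
import math, random

def state_stats(rows, c, genes, f):
    """(z, s) statistics of the event 'genes take states f' vs. phenotype c."""
    n = len(c)
    hit = [i for i in range(n) if all(rows[g][i] == v for g, v in zip(genes, f))]
    z = len(hit) / n
    s = sum(c[i] for i in hit) / len(hit) if hit else 0.0
    return z, s

def j_value(rows, c, genes, f):
    """J of a state function (base-2 logs); 0 if the state is never observed."""
    z, s = state_stats(rows, c, genes, f)
    if z == 0:
        return 0.0
    q = sum(c) / len(c)
    total = 0.0
    if s > 0:
        total += s * math.log2(s / q)
    if s < 1:
        total += (1 - s) * math.log2((1 - s) / (1 - q))
    return z * total

def j_bound(z, s, q):
    """Theorem 2 bound on the J of ANY superstate, from f_S's statistics alone."""
    return max(-z * s * math.log2(q), -z * (1 - s) * math.log2(1 - q))

random.seed(0)
n = 200
rows = [[random.randint(0, 1) for _ in range(n)] for _ in range(3)]
c = [random.randint(0, 1) for _ in range(n)]

z, s = state_stats(rows, c, (0, 1), (1, 0))       # smaller state function f_S
bound = j_bound(z, s, sum(c) / n)

# Extending f_S with gene 2 in state H or L never exceeds the bound.
ok = all(j_value(rows, c, (0, 1, 2), (1, 0, v)) <= bound + 1e-12
         for v in (0, 1))
print(ok)  # True
```

Because the bound needs only *z*, *s*, and *q*, it can be evaluated before any superstate is enumerated, which is what makes the pruning below possible.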

Note that this theorem does not state that the *J*-value of a state function is bounded by the *J*-value of its smaller parts; rather, it provides a bound on the *J*-value of the larger state function based on simpler statistics of its smaller parts. Using this bound, we develop an algorithm, Crane, to efficiently search for informative state functions. Crane enumerates state functions in a bottom-up fashion, pruning the search space effectively based on the following principles:

- A state function *f*_{S} is said to be a *candidate* state function if *J*(*f*_{S}) ≥ *j** or *J*(*f*_{S}) > *J*(*f*_{S′}) for all *f*_{S′} ⊑ *f*_{S}.
- A candidate state function *f*_{S} is said to be *extensible* if *J*_{bound}(*f*_{S}) ≥ *j**. This restriction enables pruning of larger state functions using statistics of smaller state functions.
- An extension of state function *f*_{S} is obtained by adding one of the H or L states of a gene *g*_{i} such that *v*_{i}*v*_{j} ∈ *E*, where *g*_{j} is the most recently added gene to *f*_{S}. This ensures network connectivity of the subnetwork associated with the generated state functions.
- For an extensible state function, all possible extensions are considered, and among those that qualify as candidate state functions, the top *b* state functions with maximum *J*(·) are selected as candidate state functions. Here, *b* is an adjustable parameter that determines the breadth of the search, and the case *b* = 1 corresponds to a greedy algorithm.
- An extensible state function is not extended if it already contains *d* genes. Here, *d* is an adjustable parameter that determines the depth of the search.

Crane enumerates all candidate state functions that qualify according to these principles, for given *j**, *b*, and *d*. At the end of the search process, the candidate state functions that are not superseded by another candidate state function (the leaves of the enumeration tree) are identified as informative state functions if their *J*-value exceeds *j**. A detailed pseudo-code for this procedure is given as Algorithm 1.
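Algorithm 1 is not reproduced in this excerpt; the sketch below is a much-simplified Python rendering of the search principles above (beam width *b*, depth *d*, Theorem 2 pruning). The toy PPI network and expression data are hypothetical, and details such as candidate bookkeeping, substate checks, and leaf extraction are omitted:

```python
import math

def j_value(expr, c, genes, states):
    """J of the state function assigning `states` to `genes` (base-2 logs)."""
    n = len(c)
    hit = [i for i in range(n) if all(expr[g][i] == v for g, v in zip(genes, states))]
    if not hit:
        return 0.0
    z, q = len(hit) / n, sum(c) / n
    s = sum(c[i] for i in hit) / len(hit)
    out = 0.0
    if s > 0:
        out += s * math.log2(s / q)
    if s < 1:
        out += (1 - s) * math.log2((1 - s) / (1 - q))
    return z * out

def j_bound(expr, c, genes, states):
    """Theorem 2 bound: no superstate of this state function can exceed it."""
    n = len(c)
    hit = [i for i in range(n) if all(expr[g][i] == v for g, v in zip(genes, states))]
    if not hit:
        return 0.0
    z, q = len(hit) / n, sum(c) / n
    s = sum(c[i] for i in hit) / len(hit)
    return max(-z * s * math.log2(q), -z * (1 - s) * math.log2(1 - q))

def crane(expr, c, adj, j_star, b=2, d=3):
    """Bottom-up beam search: grow state functions one gene at a time along
    PPI edges of the most recently added gene, keep the top-b extensions,
    and stop at depth d or when the bound rules out reaching j_star."""
    results = []
    frontier = [((g,), (v,)) for g in range(len(expr)) for v in (0, 1)]
    for depth in range(d):
        nxt = []
        for genes, states in frontier:
            if j_value(expr, c, genes, states) >= j_star:
                results.append((genes, states))
            if depth + 1 < d and j_bound(expr, c, genes, states) >= j_star:
                exts = [(genes + (h,), states + (v,))
                        for h in adj[genes[-1]] if h not in genes for v in (0, 1)]
                exts.sort(key=lambda e: -j_value(expr, c, *e))
                nxt.extend(exts[:b])       # breadth parameter b
        frontier = nxt
    return results

adj  = {0: [1], 1: [0, 2], 2: [1]}    # toy chain PPI: g0 - g1 - g2
c    = [1, 1, 1, 1, 0, 0, 0, 0]       # phenotype vector
expr = [[1, 1, 0, 0, 1, 1, 0, 0],     # g0: individually uninformative
        [1, 1, 0, 0, 0, 0, 1, 1],     # g1: individually uninformative
        [0, 0, 1, 1, 0, 0, 1, 1]]     # g2: individually uninformative
found = crane(expr, c, adj, j_star=0.2, b=2, d=2)
print(((0, 1), (1, 1)) in found)      # True: (g0=H, g1=H) occurs only in phenotype
```

With *b* = 1 the loop reduces to the greedy search mentioned above; increasing *b* widens the beam at the cost of more *J* evaluations.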

### 2.6. Using state functions to predict metastasis in cancer

An important application of informative state functions is that they can serve as features for classification of phenotype. Since the genes that compose an informative state function are by definition highly discriminative of phenotype and control when considered *together*, they are expected to perform better than single-gene features (Chuang et al., 2007). Note here that Crane discovers specific state functions that are informative of phenotype, as opposed to subnetworks that can discriminate phenotype and control. However, by Equation 2, we expect that a high *J*(*f*_{S}) for a specific state function *f*_{S} is associated with a potentially high *I*(*F*_{S}; *C*) for the corresponding subnetwork *S*. Therefore, for the application of Crane in classification, we sort the subnetworks that are associated with discovered state functions based on their combinatorial coordinate dysregulation and use the top *K* disjoint (non-overlapping in terms of their gene content) subnetworks with maximum *I*(*F*_{S}; *C*) as features for classification. In the next section, we report results of classification experiments for different values of *K*.

Deriving representative features for subnetworks is a challenging task. Using simple aggregates of the individual expression levels of genes along with traditional classifiers (e.g., regression or SVMs) might not be adequate, since such representations may not capture the combinatorial relationship between the genes in the subnetwork. For this reason, we use neural networks that incorporate subnetwork states directly as features. The proposed neural network model is illustrated in Figure 2. In the example of this figure, two subnetworks are used to build the classifier. Each input is the expression level of a gene, and the inputs that correspond to a particular subnetwork are connected together to an input layer neuron. All input layer neurons, each representing a subnetwork, are connected to a single output layer neuron, which produces the output. Each layer's weights and biases are initialized with the Nguyen-Widrow layer initialization method (provided by Matlab's initnw function). Then, for a given gene expression dataset with a range of control and phenotype samples (which, in our experiments, is identical to that used for identification of informative state functions), the network is trained with Levenberg-Marquardt backpropagation (Matlab's trainlm function), so that, given the expression profiles in the training dataset, the output of the second layer matches the associated phenotype vector with minimal mean squared error. This learned model is then used to perform classification tests on a different gene expression dataset for the same phenotype.

Since neural networks exhibit stochastic behavior, we train 30 independent networks with the same training data and use the following voting scheme to consolidate these 30 runs. For each run, we feed both the training and test samples as separate test data to the trained neural network and collect the real-valued predictions for all training and test samples. For each neural network, we then convert its quantitative outputs on the test samples to binary predictions, using a threshold derived from its outputs on the training samples. This procedure generates 30 separate class labels for each sample, each predicted by a particular network. The final class label of each sample is then determined by the majority of the predicted class labels for that sample (i.e., for a particular sample, if more than 50% of the output labels represent phenotype, we declare it a phenotype sample).
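
The voting scheme above can be sketched as follows. The thresholding rule used here (the midpoint between the mean training outputs of the two classes) is an illustrative assumption; the text only specifies that the threshold is derived from the outputs on the training samples.

```python
import numpy as np

def binarize_outputs(train_out, train_labels, test_out):
    """Threshold real-valued test outputs at the midpoint between the mean
    training outputs of the two classes (illustrative choice; the exact
    rule derived from training outputs is not specified in the text)."""
    t = 0.5 * (train_out[train_labels == 1].mean()
               + train_out[train_labels == 0].mean())
    return (test_out > t).astype(int)

def majority_vote(binary_preds):
    """binary_preds: runs x samples matrix of 0/1 labels, one row per NN.
    A sample is declared phenotype (1) when more than 50% of runs say so."""
    return (binary_preds.mean(axis=0) > 0.5).astype(int)

# One network's outputs, thresholded via its training outputs:
train_out = np.array([0.9, 0.8, 0.1, 0.2])
train_labels = np.array([1, 1, 0, 0])
labels = binarize_outputs(train_out, train_labels, np.array([0.7, 0.3]))

# Consolidating three runs (columns = samples):
preds = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [1, 0, 0]])
final = majority_vote(preds)
```
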

## 3. Results and Discussion

In this section, we evaluate the performance of Crane in identifying state functions associated with metastasis of CRC. We first compare the classification performance of the subnetworks associated with these state functions against single gene markers and subnetworks identified by an algorithm that aims to maximize additive coordinate dysregulation. We then present comprehensive experimental results to evaluate the effect of parameters on the performance of Crane. Subsequently, with a view to investigating the benefits of pruning the subnetwork search space, we compare Crane's performance with a version that does not use the bound on *J*(.) value to prune the search space. Finally, we inspect the subnetworks that are useful in classification, and discuss the insights these subnetworks can provide into the metastasis of CRC.

### 3.1. Datasets

In our experiments, we use two CRC-related microarray datasets obtained from GEO (Gene Expression Omnibus; *http://www.ncbi.nlm.nih.gov/geo/index.cgi*). These datasets, referenced by their accession number in the GEO database, include the following relevant data:

- *GSE6988* contains expression profiles of 17,104 genes across 29 *vs.* 51 colorectal tumor samples with and without liver metastasis, respectively (Ki et al., 2007).
- *GSE3964* contains expression profiles of 5,845 genes across 28 *vs.* 18 colorectal tumor samples with and without liver metastasis, respectively (Graudens et al., 2006).

The human protein-protein interaction data used in our experiments is obtained from the Human Protein Reference Database (HPRD; *http://www.hprd.org*). This dataset contains 35,023 binary interactions among 9,299 proteins, as well as 1,060 protein complexes consisting of 2,146 proteins. We integrate the binary interactions and protein complexes using a matrix model (i.e., each complex is represented as a clique of the proteins in the complex), obtaining a PPI network composed of 42,781 binary interactions among 9,442 proteins.
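
The matrix-model integration amounts to expanding each complex into a clique and taking the union with the binary interaction set; a minimal sketch:

```python
from itertools import combinations

def integrate_network(binary_edges, complexes):
    """Matrix model: each protein complex contributes a clique over its
    members; clique edges are merged with the binary interactions
    (duplicates collapse because edges are stored as frozensets)."""
    edges = {frozenset(e) for e in binary_edges}
    for members in complexes:
        edges |= {frozenset(pair) for pair in combinations(members, 2)}
    return edges

# Toy example: one binary interaction plus one three-member complex.
ppi = integrate_network([("A", "B")], [["B", "C", "D"]])
```
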

### 3.2. Experimental design

For each of the datasets mentioned above, we discover informative state functions (in terms of discriminating tumor samples with or without metastasis) using Crane. While state functions that are indicative of either metastatic or non-metastatic phenotype can have high *J*(.) values, we use only those that are indicative of (i.e., knowledge of which increases the likelihood of) metastatic phenotype for classification and further analyses, since such state functions are directly interpretable in terms of their association with metastasis.

In the experiments reported here, we set *b*=10. *d* is set to 3 for *GSE3964* and to 6 for *GSE6988*. The value of *j** is set to 0.15 and 0.40 for discovery of subnetworks on *GSE3964* and *GSE6988*, respectively. The top five non-overlapping subnetworks discovered on *GSE6988* by Crane using these parameter settings are shown in Table 2. Note that these parameters are used to balance the trade-off between the computational cost of subnetwork identification and classification accuracy. The reported values are those that provide reasonable performance while spending a reasonable amount of time on subnetwork identification (a few hours in Matlab for each dataset). The effect of different values of these parameters on Crane's performance is presented later in this section.

**Table 2.** Top five non-overlapping subnetworks identified by Crane on *GSE6988* with *d*=6, and the functional enrichment of these subnetworks (table content not reproduced).

To binarize the gene expression datasets, we first normalize the gene expression profiles so that each gene has an average expression of 0 and standard deviation of 1. Then we set the top *α* fraction of the entries in the normalized gene expression matrix to H (high expression) and the rest to L (low expression). In the reported experiments, we use *α*=0.25 (i.e., on average, 25% of genes are considered highly expressed) as this value is found to optimize classification performance.
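
The binarization step can be sketched as follows. This is a minimal illustration: ties at the cutoff and the exact quantile convention are implementation details the text does not specify.

```python
import numpy as np

def binarize_expression(X, alpha=0.25):
    """Binarize a genes-x-samples expression matrix.

    Each gene (row) is z-score normalized (mean 0, std 1); then entries
    above the global (1 - alpha) quantile of the normalized matrix are
    labeled H (high expression) and the rest L (low expression)."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    cutoff = np.quantile(Z, 1.0 - alpha)  # threshold for the top alpha fraction
    return np.where(Z > cutoff, "H", "L")

# Toy example: 3 genes x 4 samples.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 5.0, 5.0, 9.0],
              [2.0, 1.0, 2.0, 1.0]])
B = binarize_expression(X, alpha=0.25)
```
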

### 3.3. Implementation of other algorithms

We identify *single gene markers* by running Crane with *d*=1 (i.e., by searching for subnetworks composed of a single gene). We also identify coordinately dysregulated subnetworks using an *additive algorithm*, that is, an algorithm that aims to maximize additive coordinate dysregulation (Chuang et al., 2007). The additive algorithm identifies a subnetwork associated with each gene in the network by seeding a greedy search process from that gene. It grows subnetworks by iteratively adding a network neighbor of the genes that are already in the subnetwork. At each iteration, the neighbor that maximizes the coordinate dysregulation of the subnetwork is selected to be added. Once all subnetworks are identified, we sort them according to their additive coordinate dysregulation and use the top *K* disjoint subnetworks to train and test classifiers, for different values of *K*. While quantizing subnetwork activity to compute its mutual information with phenotype, we use ⌊log_{2}(*q*)+1⌋ bins, as suggested in Chuang et al. (2007), where *q* denotes the number of samples. Note that, in Chuang et al. (2007), the subnetworks identified by the greedy algorithm are filtered through three statistical tests. In our experiments, these statistical tests are not performed for the subnetworks discovered by either the additive algorithm or Crane. This is because testing statistical significance based on multiple runs on permuted instances is computationally expensive, since Crane performs an almost exhaustive search of the subnetwork space. It should be noted that this is currently an important limitation of Crane. In this respect, the development of efficient algorithms for testing the statistical significance of subnetworks identified by such exhaustive algorithms remains an important problem.
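
The greedy growth procedure of the additive algorithm can be sketched as follows. Here `score` stands in for additive coordinate dysregulation, and stopping when no neighbor improves the score is a simplification of the original stopping rule.

```python
def greedy_grow(seed, neighbors, score, max_size):
    """Grow a subnetwork from `seed`, at each step adding the network
    neighbor that maximizes `score`; stop when no neighbor improves the
    score or the subnetwork reaches `max_size` genes."""
    sub = {seed}
    best = score(sub)
    while len(sub) < max_size:
        frontier = set().union(*(neighbors[g] for g in sub)) - sub
        if not frontier:
            break
        cand = max(frontier, key=lambda g: score(sub | {g}))
        cand_score = score(sub | {cand})
        if cand_score <= best:  # no improvement: stop growing
            break
        sub.add(cand)
        best = cand_score
    return sub, best

# Toy example: 4-gene network with an additive (sum-of-weights) score.
neighbors = {0: {1, 3}, 1: {0, 2}, 2: {1}, 3: {0}}
weights = {0: 1.0, 1: 5.0, 2: 2.0, 3: -1.0}
sub, best = greedy_grow(0, neighbors,
                        lambda s: sum(weights[g] for g in s), max_size=5)
```
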

For the subnetworks with additive coordinate dysregulation, we compute the subnetwork activity for each subnetwork and use these as features to train and test two different classifiers: (i) an SVM, using Matlab's svmtrain and svmclassify functions (this method is not applicable to combinatorial coordinate dysregulation); (ii) feed-forward neural networks, in which each input represents the subnetwork activity of a subnetwork and these inputs are connected to hidden layer neurons. For the single-gene markers, we rank all genes according to the mutual information of their expression profile with phenotype (*I*(*E _{i}*; *C*)) and use the expression levels of the *K* genes with maximum *I*(*E _{i}*; *C*) as features for classification.
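
Ranking genes by *I*(*E _{i}*; *C*) can be sketched with an empirical mutual-information estimate over the binarized expression values; this is a minimal Python illustration rather than the authors' Matlab code.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete vectors."""
    n = len(x)
    pxy = Counter(zip(x, y))           # joint counts
    px, py = Counter(x), Counter(y)    # marginal counts
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), in count form
        mi += (c / n) * np.log2(c * n / (px[a] * py[b]))
    return mi

def rank_genes(B, labels, K):
    """Rank rows of a binarized genes-x-samples matrix by I(E_i; C)."""
    scores = [mutual_information(tuple(row), tuple(labels)) for row in B]
    return np.argsort(scores)[::-1][:K]

# Toy example: gene 1 perfectly tracks the phenotype, gene 0 does not.
B = np.array([["H", "L", "H", "L"],
              ["H", "H", "L", "L"]])
top = rank_genes(B, (1, 1, 0, 0), K=1)
```
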

### 3.4. Classification performance

We evaluate the cross-classification performance of the subnetworks in the context of predicting metastasis of CRC. Namely, we use subnetworks discovered on the *GSE6988* dataset to train classifiers and we test the resulting classifiers on all the samples of *GSE3964*. Similarly, we use subnetworks discovered on *GSE3964* to train classifiers on that dataset and test these classifiers on 28 metastatic and 20 randomly selected non-metastatic samples of *GSE6988*. The cross-classification performance of subnetworks discovered by an algorithm is indicative not only of the power of the algorithm in discovering subnetworks that are descriptive of phenotype, but also of the reproducibility of these subnetworks across different datasets.

The classification performances of the subnetworks identified by Crane, the additive algorithm, and single gene markers are compared in Figure 3. In the figure, for each 1≤*K*≤10, the precision and recall achieved by each classifier are reported. These performance criteria are defined as follows:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN).

Here, a true positive (TP) is a metastatic sample that is correctly predicted as metastatic, while a false positive (FP) is a non-metastatic sample that is incorrectly predicted as metastatic. A false negative (FN) is a metastatic sample that is incorrectly predicted as non-metastatic. Therefore, precision quantifies the fraction of true positives among all samples predicted as metastatic by the classifier, while recall quantifies the fraction of true positives among all metastatic samples.
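
These definitions translate directly into code; a minimal sketch:

```python
def precision_recall(pred, true):
    """pred, true: 0/1 sequences (1 = metastatic).
    Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Toy example with exactly one TP, one FP, and one FN.
prec, rec = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
```
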

As seen in Figure 3, subnetworks identified by Crane outperform the subnetworks identified by other algorithms in predicting metastasis of colorectal cancer. In fact, in both cases, Crane has the potential to deliver very high accuracy using very few subnetworks (maximum precision of 100% on both *GSE6988* and *GSE3964*, and maximum recall of 95% and 86% for classification of samples in *GSE6988* and *GSE3964*, respectively). While we use a simple feature selection method here for purposes of illustration, the performance of Crane subnetworks is quite consistent, suggesting that these performance figures can indeed be achieved by developing more refined methods for the selection of subnetwork features. These results are rather impressive, given that the best performance that can be achieved by the additive algorithm is 82%/93% precision and 89%/100% recall for the classification of *GSE3964* and *GSE6988*, respectively. Note that, while the performance of the other algorithms improves with an increasing number of subnetwork features, the performance of Crane appears to decline. This is likely because Crane represents subnetwork features as multi-dimensional state functions. Therefore, while a few subnetworks, each containing a few genes, provide sufficient information for accurate classification, accuracy declines as more subnetworks are incorporated because of the growth in dimensionality.

### 3.5. Effect of pruning

An important feature of Crane is the use of a theoretical bound on *J*(.) to prune the search space. In order to verify the effectiveness of this feature in improving the efficiency of Crane, as well as its ability to discover informative subnetworks, we compare Crane with a version that does not apply pruning using the bound on *J*(.). The results of this comparison are shown in Figure 4. These experiments are performed on *GSE6988*, by fixing *b*=10, *j**=0.45, *α*=0.25, and running Crane and its version without pruning for *d* ranging from 1 to 8.

**Figure 4.** Runtimes of Crane and of the version without pruning based on the bound on *J*(.). Note that Crane identifies all subnetworks that are identified by the algorithm without pruning.

The runtimes of Crane and the algorithm without pruning are compared in Figure 4. As seen in the figure, the algorithm without pruning does not scale well with increasing *d*. This is expected, since the algorithm performs exhaustive search with a breadth of *b*=10, making the runtime exponential in *d*. However, by pruning this search space using the bound on *J*(.), Crane reduces this runtime drastically, providing orders of magnitude improvement for larger values of *d*. Note that, if *b*=∞, both Crane and its version without pruning are guaranteed to discover all subnetworks with *J*(.)≥*j**. However, since the breadth of search is limited by parameter *b*, both algorithms may miss some subnetworks. In the experiments reported here, Crane is able to identify all subnetworks that are identified by the version without pruning; i.e., Crane achieves the drastic improvement in runtime without compromising sensitivity. These results clearly demonstrate the value of using the theoretical bound on *J*(.) value while searching for informative subnetworks.
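
The interplay between the breadth limit *b*, depth limit *d*, and the pruning bound can be sketched generically. This is not Crane's actual algorithm (which enumerates state functions and uses the specific bound derived earlier); here `upper_bound` stands in for any valid upper bound on the *J*(.) value reachable by extending a candidate, and the search is shown over subnetworks for brevity. In the toy usage, *J* is non-increasing under extension, so evaluating it on each extension serves as an exact bound.

```python
def bounded_search(seed, neighbors, J, upper_bound, j_star, b, d):
    """Breadth-limited search from `seed`: at each level, candidate
    extensions whose upper bound on J(.) cannot reach j_star are pruned,
    and only the b best-scoring survivors are carried to the next level."""
    frontier = [frozenset([seed])]
    results = [s for s in frontier if J(s) >= j_star]
    for _ in range(d - 1):
        cands = set()
        for sub in frontier:
            for g in set().union(*(neighbors[x] for x in sub)) - sub:
                ext = sub | {g}
                if upper_bound(ext) >= j_star:  # pruning step
                    cands.add(ext)
        frontier = sorted(cands, key=J, reverse=True)[:b]
        results += [s for s in frontier if J(s) >= j_star]
    return results

# Toy example: J = minimum gene weight, which only decreases as the
# subnetwork grows, so J itself is a valid upper bound for extensions.
weights = {0: 0.9, 1: 0.8, 2: 0.1}
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
J = lambda s: min(weights[g] for g in s)
found = bounded_search(0, neighbors, J, J, j_star=0.5, b=2, d=3)
```
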

### 3.6. Effect of parameters

We also investigate the effect of parameters used to configure Crane on classification performance of identified subnetworks, by fixing all but one of the parameters to the above-mentioned values and varying the remaining parameter. The tuneable parameters of Crane are the following:

- *d*: the maximum size of a subnetwork. Crane stops extending a subnetwork when the number of genes in the subnetwork reaches *d*; in other words, *d* determines the depth of the search.
- *b*: the number of state functions with maximum *J*(.) value selected by Crane at each iteration. Thus, *b* determines the breadth of the search.
- *j**: the minimum *J*(.) value of a subnetwork state function to be considered informative.
- *α*: the fraction of the entries in the normalized gene expression matrix that is set to H (high expression). The remaining (1−*α*) fraction of the entries is set to L (low expression).

The results of our experiments on the effect of these parameters on the performance of Crane are shown in Figure 5, where classifiers are trained on subnetworks discovered on *GSE3964* and tested on samples of *GSE6988*. In this figure, for each configuration of the parameters, we report the average F-measure across different values of the number of subnetworks used in classification, ranging from 1 to 10. Here, F-measure is defined as the harmonic mean of precision and recall, i.e., F = 2 × Precision × Recall / (Precision + Recall).

We observe that classification performance is quite robust against variation in *α* ranging from 10% to 50%, while the best performance is observed at *α*=25%. As expected, classification performance improves with increasing *j**. Increasing the breadth of search (*b*) improves classification performance in general, which is also expected, since larger values of *b* enable further exploration of the search space. Note that the special case with *b*=1 is algorithmically equivalent to the additive algorithm with a different objective function (combinatorial coordinate dysregulation as opposed to additive coordinate dysregulation). We observe that, even with *b*=1, Crane outperforms the additive algorithm, indicating that the combinatorial formulation of coordinate dysregulation is potentially more useful than the additive formulation for classification.

As seen in Figure 5, increasing *d* improves performance, as would be expected; however, this improvement saturates for *d*>3 and performance declines for larger subnetworks. This observation can be attributed to the curse of dimensionality, since the number of possible values of the random variable *F* (expression state of a subnetwork) grows exponentially with increasing subnetwork size. We also investigate the effect of parameter *d* on Crane's ability to discover larger subnetworks. For this purpose, we compare the subnetworks identified by Crane on *GSE6988* using *d*=7 and *d*=8 with those identified using *d*=6. The top five non-overlapping subnetworks identified using *d*=7 and *d*=8 are shown in Table 3. Comparison of the subnetworks in Tables 2 and 3 shows that, while there is some overlap in subnetworks discovered using different values of *d*, some subnetworks that can be discovered for larger values of *d* are missed if a smaller value of *d* is used. Note, however, that this does not mean that smaller subnetworks of these subnetworks are not discovered by Crane. Rather, such subnetworks are often eliminated because of their overlap with subnetworks that have higher combinatorial coordinate dysregulation. Indeed, a comprehensive comparison of subnetworks shows that many of the subnetworks composed of seven genes, which are discovered using *d*=7, are identified as different six-gene combinations when *d* is set to 6. In other words, if *d* is set to a smaller value, then a larger “naturally occurring” subnetwork can be “truncated” into smaller subnetworks. For this reason, the parameter *d* needs to be set carefully, possibly by using different values of *d* and inspecting the size and gene content of the subnetworks discovered for each *d*.

### 3.7. Subnetworks and state functions indicative of metastasis in CRC

Cancer metastasis involves the rapid proliferation and invasion of malignant cells into the bloodstream or lymphatic system. The process is driven, in part, by the dysregulation of proteins involved in cell adhesion and motility (Paschos et al., 2009) and the degradation of the extracellular matrix (ECM) at the invasive front of the primary tumor (Zucker and Vacirca, 2004), and is associated with chronic inflammation (McConnell and Yang, 2009). An enrichment analysis of the top five subnetworks identified on *GSE6988* reveals that all of these subnetworks are significantly enriched in the processes underlying these phenotypes (Table 2).

Further, as CRC metastasis is our classification endpoint, we wanted to evaluate our subnetworks in terms of their potential to propose testable hypotheses. In particular, to highlight the power of our approach, we choose a subnetwork for which at least one gene is highly expressed in the state function indicative of CRC metastasis. This subnetwork contains TNFSF11, MMP1, BCAN, MMP2, THBS1, and SPP1, and the state function LLLLLH (in respective order) indicates metastatic phenotype with *J*-value 0.33. The combinatorial coordinate dysregulation of this subnetwork is 0.72, while its additive coordinate dysregulation is 0.37; i.e., this is a subnetwork that would likely have escaped detection by the additive algorithm (this subnetwork is not listed in Table 2, since it is not among the top five scoring subnetworks). Using the genes in this subnetwork as a seed, we construct a small subnetwork diagram for the purpose of more closely analyzing the post-translational interactions involving these proteins. This is done using Metacore, a commercial platform that provides curated, highly reliable interactions. From this subnetwork, we remove all genes indicated by the database to be not expressed in human colon, and then selectively prune it in order to focus clearly on a particular set of interactions (Fig. 6). It merits noting that, although Brevican (BCAN) is in the subnetwork, it is removed for being non-expressed in the human colon, although evidence from the Gene Expression Omnibus (see accession *GDS2609*) (Hong et al., 2007) casts doubt on this, as does the microarray we use for scoring (*GSE6988*).


As seen in the interaction diagram, SPP1 (osteopontin) and THBS1 (thrombospondin 1) interact with a number of the integrin heterodimers to increase their activity (green line). Integrin heterodimers play a major role in mediating cell adhesion and cell motility. SPP1, up-regulated in metastasis (Fig. 6), is a well-studied protein that triggers intracellular signaling cascades upon binding with various integrin heterodimers, promotes cell migration when it binds CD44, and, when binding the alpha-5/beta-3 dimer in particular, promotes angiogenesis, which is associated with the metastatic phenotype of many cancers (Markowitz and Bertagnolli, 2009). MMP proteins are involved in the breakdown of ECM, particularly collagen, which is the primary substrate at the invasive edge of colorectal tumors (Vishnubhotla et al., 2007). MMP-1 has an inhibitory effect on vitronectin (red line); hence, the loss of expression of MMP-1 may “release the brake” on vitronectin, which in turn may increase the activity of the alpha-v/beta-5 integrin heterodimer. Likewise, MMP-2 shows an inhibitory interaction with the alpha-5/beta-3 dimer, which may counteract to some extent the activating potential of SPP1, suggesting that a loss of MMP-2 may exacerbate the metastatic phenotype. Taken together, these interactions suggest a number of perturbation experiments, perhaps by pharmacological inhibition or siRNA knockdown of the integrin dimers or MMP proteins, to evaluate the role of these interactions, individually or synergistically, in maintaining the metastatic phenotype. Note also that alpha-v/beta-5 integrin does not exhibit significant differential expression at the mRNA level, suggesting that the state function identified by Crane may be a signature of its post-translational dysregulation in metastatic cells.

## 4. Conclusion

We present a novel framework for network-based analysis of coordinate dysregulation in complex phenotypes. Experimental results on metastasis of colorectal cancer show that the proposed framework can achieve almost perfect performance when discovered subnetworks are used as features for classification. These results are highly promising in that the state functions found to be informative of metastasis can also be useful in modeling the mechanisms of metastasis in cancer. Detailed investigation of the state functions and the interactions between proteins that together compose state functions might therefore lead to the development of novel hypotheses, which in turn may be useful for the development of therapeutic intervention strategies for late stages of cancer.

## Acknowledgments

We would like to thank Vishal Patel, Jill Barnholtz-Sloan, Xiaowei Guan, and Gurkan Bebek, of Case Western Reserve University, for many useful discussions. This work was supported, in part, by the National Science Foundation (CAREER Award CCF-0953195) and the National Institutes of Health (Grants UL1-RR024989 from the National Center for Research Resources, Clinical and Translational Science Awards; P30-CA043703 from the Case Western Reserve University Cancer Center Proteomics Core; and T32-GM008803 from NIGMS, Institutional National Research Service Award).

## Disclosure Statement

No competing financial interests exist.

## References

- Akutsu T. Miyano S. Selecting informative genes for cancer classification using gene expression data. Proc. IEEE-EURASIP Workshop Nonlinear Signal Image Processing. 2001:3–6.
- Akutsu T. Miyano S. Kuhara S. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput. 1999:17–28. [PubMed]
- Anastassiou D. Computational analysis of the synergy among multiple interacting genes. Mol. Syst. Biol. 2007;3:83. [PMC free article] [PubMed]
- Cabusora L. Sutton E. Fulmer A., et al. Differential network expression during drug and stress response. Bioinformatics. 2005;21:2898–2905. [PubMed]
- Chowdhury S.A. Koyutürk M. Identification of coordinately dysregulated subnetworks in complex phenotypes. Pac. Symp. Biocomput. 2010:133–144. [PubMed]
- Chuang H.-Y. Lee E. Liu Y.-T., et al. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 2007;3:140. [PMC free article] [PubMed]
- Nacu Ş. Critchley-Thorne R. Lee P., et al. Gene expression network analysis and applications to immunology. Bioinformatics. 2007;23:850–858. [PubMed]
- Ewing R.M. Chu P. Elisma F., et al. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 2007;3:89. [PMC free article] [PubMed]
- Franke L. Bakel H. Fokkens L., et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 2006;78:1011–1025. [PMC free article] [PubMed]
- Goh K.-I. Cusick M.E. Valle D., et al. The human disease network. Proc. Natl. Acad. Sci. USA. 2007;104:8685–8690. [PMC free article] [PubMed]
- Graudens E. Boulanger V. Mollard C., et al. Deciphering cellular states of innate tumor drug responses. Genome Biol. 2006;3:R19. [PMC free article] [PubMed]
- Guo Z. Li Y. Gong X., et al. Edge-based scoring and searching method for identifying condition-responsive protein–protein interaction sub-network. Bioinformatics. 2007;23:2121–2128. [PubMed]
- Hong Y. Ho K.S. Eu K.W., et al. A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. Clin. Cancer Res. 2007;13:1107–1114. [PubMed]
- Ideker T. Sharan R. Protein networks in disease. Genome Res. 2008;18:644–652. [PMC free article] [PubMed]
- Ideker T. Ozier O. Schwikowski B., et al. Discovering regulatory and signalling circuits in molecular interaction networks. Proc. ISMB. 2002:233–240. [PubMed]
- Karni S. Soreq H. Sharan R. A network-based method for predicting disease-causing genes. J. Comput. Biol. 2009;16:181–189. [PubMed]
- Ki D.H. Jeung H.-C. Park C.H., et al. Whole genome analysis for liver metastasis gene signatures in colorectal cancer. Int. J. Cancer. 2007;121:2005–2012. [PubMed]
- Koyutürk M. Szpankowski W. Grama A. Biclustering gene-feature matrices for statistically significant dense patterns. Proc. IEEE Comput. Syst. Bioinformatics Conf. (CSB’04) 2004:480–484.
- Lage K. Karlberg O.E. Størling Z.M., et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol. 2007;25:309–316. [PubMed]
- Liu M. Liberzon A. Kong S.W., et al. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet. 2007;3 e96+. [PMC free article] [PubMed]
- Markowitz S. Bertagnolli M. Molecular origins of cancer: molecular basis of colorectal cancer. N. Engl. J. Med. 2009;361:2449–2460. [PMC free article] [PubMed]
- McConnell B. Yang V. The role of inflammation in the pathogenesis of colorectal cancer. Curr. Colorectal Cancer Rep. 2009;5:69–74. [PMC free article] [PubMed]
- Nibbe R.K. Ewing R. Myeroff L., et al. Discovery and scoring of protein interaction sub-networks discriminative of late stage human colon cancer. Mol. Cell Prot. 2009;9:827–845. [PMC free article] [PubMed]
- Nibbe R.K. Koyutürk M. Chance M.R. An integrative -omics approach to identify functional sub-networks in human colorectal cancer. PLoS Comput. Biol. 2010;6 e1000639+. [PMC free article] [PubMed]
- Papin J.A. Hunter T. Palsson B.O., et al. Reconstruction of cellular signalling networks and analysis of their properties. Nat. Rev. Mol. Cell Biol. 2005;6:99–111. [PubMed]
- Paschos K. Canovas D. Bird N. The role of cell adhesion molecules in the progression of colorectal cancer and the development of liver metastasis. Cell Signal. 2009;21:665–674. [PubMed]
- Patil K.R. Nielsen J. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proc. Natl. Acad. Sci. USA. 2005;102:2685–2689. [PMC free article] [PubMed]
- Quackenbush J. Microarray data normalization and transformation. Nat. Genet. 2002;32(Suppl):496–501. [PubMed]
- Rhodes D.R. Chinnaiyan A.M. Integrative analysis of the cancer transcriptome. Nat. Genet. 2005;37(Suppl):S31–S37. [PubMed]
- Rich J. Jones B. Hans C., et al. Gene expression profiling and genetic markers in glioblastoma survival. Cancer Res. 2005;65:4051–4058. [PubMed]
- Schadt E.E. An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet. 2005;37:710–717. [PMC free article] [PubMed]
- Scott M.S. Perkins T. Bunnell S., et al. Identifying regulatory subnetworks for a set of genes. Mol. Cell Prot. 2005;4:683–692. [PubMed]
- Shmulevich I. Zhang W. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics. 2002;18:555–565. [PubMed]
- Smyth P. Goodman R.M. An information theoretic approach to rule induction from databases. IEEE Trans. Knowl. Data Eng. 1992;4:301–316.
- Ulitsky I. Karp R.M. Shamir R. Detecting disease-specific dysregulated pathways via analysis of clinical expression profiles. Proc. RECOMB 2008. 2008:347–359.
- Vishnubhotla R. Sun S. Huq J., et al. Rock-ii mediates colon cancer invasion via regulation of mmp-2 and mmp-13 at the site of invadopodia as revealed by multiphoton imaging. Lab. Invest. 2007;87:1149–1158. [PubMed]
- Watkinson J. Wang X. Zheng T., et al. Identification of gene interactions associated with disease from gene expression data using synergy networks. BMC Syst. Biol. 2008;2:10. [PMC free article] [PubMed]
- Zucker S. Vacirca J. Role of matrix metalloproteinases (MMPS) in colorectal cancer. Cancer Metastasis Rev. 2004;23:101–117. [PubMed]

**Mary Ann Liebert, Inc.**
