- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

# Network-based support vector machine for classification of microarray samples

^{1}Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455, USA

^{2}School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455, USA

^{}Corresponding author.

#### Supplement

#### Conference

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

The importance of network-based approach to identifying biological markers for diagnostic classification and prognostic assessment in the context of microarray data has been increasingly recognized. To our knowledge, there have been few, if any, statistical tools that explicitly incorporate the prior information of gene networks into classifier building. The main idea of this paper is to take full advantage of the biological observation that neighboring genes in a network tend to function together in biological processes and to embed this information into a formal statistical framework.

### Results

We propose a network-based support vector machine for binary classification problems by constructing a penalty term from the *F*_{∞}-norm being applied to pairwise gene neighbors with the hope to improve predictive performance and gene selection. Simulation studies in both low- and high-dimensional data settings as well as two real microarray applications indicate that the proposed method is able to identify more clinically relevant genes while maintaining a sparse model with either similar or higher prediction accuracy compared with the standard and the *L*_{1 }penalized support vector machines.

### Conclusion

The proposed network-based support vector machine has the potential to be a practically useful classification tool for microarrays and other high-dimensional data.

## Background

The past two decades have witnessed rapid advances in gene expression profiling with the microarray technology, which not only brighten the prospect of deciphering the complexity of disease genesis and progression at the genomic level, but also revolutionize the diagnostic, therapeutic, and prognostic approaches. Up to recently, diagnostic classification and prognostic assessment have been based on conventional clinical and pathological risk factors, such as patient age and tumor size, many of which are believed to be secondary manifestation [1]. The advent of microarray technology allows researchers to explore primary disease mechanisms by comparing gene expression profiles for malignant and normal cells. The regularity and aberration in the expression patterns of certain genes shed light on their functions and pathological importance [2]. Studies that seek to identify gene markers to refine diagnostic classification and improve prognostic prediction in the context of gene expression data have enriched the literature [3-5]. In recent years, researchers have realized that gene markers identified from microarrays drawn from difierent studies on the same disease across similar cohorts lack consistency [6,7]. A possibly more effective means to resolve this problem is to employ a network-based approach, that is, to identify markers as gene subnetworks, defined as groups of functionally related genes based on a gene network, instead of treating individual genes as completely independent and identical a priori as in most existing approaches [1]. A novel network-based approach proposed recently [1,8] can be summarized as follows: (1) randomly searching subnetworks and assigning a score to each subnetwork that characterizes the subnetwork-wise gene expression level; (2) identifying significant subnetworks that can well discriminate the clinical outcome; (3) constructing a classifier based on the significant subnetworks with a conventional statistical tool, such as logistic regression. Essentially such a network-based approach aggregates gene expression data at the subnetwork level and then identifies and utilizes some significant subnetworks. It has been shown that such a network-based approach not only improves predictive performance and reproducibility, but also sheds biological insights into molecular mechanisms underlying the clinical outcome. However, the above method is largely heuristic without a formal statistical framework; more importantly, it involves a random search over subnetworks, leading to possibly different results from different runs with no guarantee of the optimality of the final result. Because of the ever-increasing popularity of penalization methods for high-dimensional data, we propose a novel network-based penalty to be used with the hinge loss, leading to a network-based support vector machine. While maintaining some desirable properties of support vector machine (SVM) with the hinge loss function, the network-based penalty directly integrates a biological network to realize more effective variable selection, as compared with generic methods, such as the standard SVM (STD-SVM) or *L*_{1}-penalized SVM (L1-SVM).

The support vector machine (SVM) is one of the most popular supervised learning techniques with wide-ranging applications [9,10]. In particular, previous studies have demonstrated its superior performance in gene expression data analysis, especially its ability to handle high dimensional data [11,12]. Nevertheless, with categorical predictors, both the STD-SVM and the L1-SVM may have some shortcomings. Zou and Yuan [13] applied the concept of grouped variable selection and developed an *F*_{∞}-norm penalized SVM to realize simultaneous selection/elimination of all the features derived from the same categorical factor (or a group of variables). Their numerical examples showed that the *F*_{∞}-norm SVM outperformed the L1-SVM in factor-wise variable selection. We extend the idea of variable grouping to gene networks: rather than grouping all the dummy variables created from the same categorical factor, we treat two neighboring genes in a network as one group. The network-based penalty is constructed as the sum of the *F*_{∞}-norms being applied to the groups of neighboring-gene pairs. With the hinge loss penalized by such a network-based penalty as our objective function, we obtain our network-based SVM. The later sections are organized as follows. We begin with a brief review of the SVM, and then introduce our proposed network-based SVM. We evaluate its performance by simulation studies in both low dimensional and high dimensional data settings as well as two real data applications. The last section concludes the paper with a brief summary.

## Methods

### Existing methods

Suppose we have training data ${\{({x}_{i},{y}_{i})\}}_{i=1}^{N}$ with *x*_{i } ^{p }and *y*_{i } {1, -1}. Define a hyperplane {*x *: *f*(*x*)= *x*^{T}*β *+ *β*_{0 }= 0}. The classification rule induced by *f *(*x*) is *sign *[$\widehat{f}$(*x*)]. SVM searches for such a hyperplane $\widehat{f}(x)={x}^{T}\widehat{\beta}+{\widehat{\beta}}_{0}$ that maximizes the margin between the training data points for class 1 and class -1:

where *ξ*_{i }are slack variables, and *C *is a tuning parameter to be determined. The STD-SVM has an equivalent *hinge loss + penalty *formulation as an optimization problem [13-15]:

where the subscript "+" denotes the positive part, i.e., *z*_{+ }= *max*{*z*, 0}, ${\Vert \beta \Vert}_{2}^{2}={\displaystyle {\sum}_{k\mathrm{=}1}^{p}{\left|{\beta}_{k}\right|}^{2}}$, and *λ *is the tuning parameter. The solution to (1) is the same as that to (2).

The above STD-SVM forces all nonzero coefficient estimates, which leads to the problem of its inability to conduct variable selection. The L1-SVM was proposed to accomplish the goal of variable selection. It can be formulated as

where ${\Vert \beta \Vert}_{1}={\displaystyle {\sum}_{k\mathrm{=}1}^{p}\left|{\beta}_{k}\right|}$. The L1-SVM wins over the STD-SVM when the true model is sparse, while the STD-SVM is preferred if there are not many redundant noise features [16].

Zou and Yuan [13] pointed out the shortcoming of the *L*_{1}-norm penalty: even though it encourages parsimonious models, it fails to guarantee successful models in cases of categorical predictors due to the fact that each dummy variable is selected independently. They applied the concept of grouped variable selection and proposed an *F*_{∞}-norm SVM to realize simultaneous selection/elimination of features derived from the same factor so as to accomplish automatic factor-wise variable selection. Suppose we have *G *factors *F*_{1},...,*F*_{G}. From each factor *F*_{g}, we generate a feature vector ${x}_{(g)}={({x}_{1}^{(g)},\cdots ,{x}_{j}^{(g)},\cdots ,{x}_{{n}_{g}}^{(g)})}^{T}$.

Correspondingly we have the coefficient vector ${\beta}_{(g)}={({\beta}_{1}^{(g)},\cdots ,{\beta}_{j}^{(g)},\cdots ,{\beta}_{{n}_{g}}^{(g)})}^{T}$. Therefore,

Define the *F*_{∞}-norm of *F*_{g }as

The *F*_{∞}-norm SVM is formulated as

The most noteworthy property of the *F*_{∞}-norm SVM is its guarantee of sparsity at the factor level. Due to the singularity property of the infinity norm: || *β*_{(g) }||_{∞ }is not differentiable at *β*_{(g) }= 0, *β*_{(g) }will be exactly zero if the regularization parameter *λ *is properly chosen [13]. Therefore, the *F*_{∞}-norm SVM automatically eliminates factors that are completely irrelevant to the response, and thus achieves the goal of factor-wise selection. The empirical evidence shows that the *F*_{∞}-norm SVM often outperforms both the L1-SVM and the STD-SVM.

### New method

Biological observations reveal that neighboring genes in a network tend to function together in biological processes. To incorporate this prior information, a network-based SVM for binary classification is proposed to facilitate generating models that extract more biological insight from gene expression data. The penalty term that characterizes the network structure can be specified by implanting the *F*_{∞}-norm into the context of known functional interrelationships among genes by considering each pair of the functionally related genes as one group.

Consider a gene network with *S *denoting the set of all edges, i.e., the pair of connected genes.

*S *= {(*j*_{1}, *j*_{2}) : gene *j*_{1 }and gene *j*_{2 }are connected}

Define *w*_{k }as some weight for gene *k*. For example, *w*_{k }= $\sqrt{{d}_{k}}$ where *d*_{k }is the number of direct neighbors of gene *k*, or *w*_{k }= *d*_{k}, or simply *w*_{k }= 1 for all genes. We propose a novel penalty in the form of

Thus the network-based SVM solves the optimization problem as follows.

Four properties of the penalty term are noteworthy. First, the regularization is performed at the level of grouped genes with each group containing two neighboring genes in the network. In the case of penalized linear regression, it has been proven that this penalty achieves the goal of eliminating both ${\beta}_{{j}_{1}}$ and ${\beta}_{{j}_{2}}$ simultaneously if (*j*_{1}, *j*_{2}) *S *[17]. The automatic selection of grouped features is due to the singularity of function max{|*a*|, |*b*|} [13]. This formulation satisfies our assumption that neighboring genes tend to (or not to) contribute to the same biological process at the same time. Second, the choice of the weight depends on the goal of shrinkage and influences the predictive performance. Consider a network comprised of several subnetworks, each with one regulator and ten target genes. Because of the singularity of function max(|*a*|, |*b*|) at *a *= *b*, the weighted penalty in the context of penalized regression, encourages $\left|{\beta}_{{j}_{1}}\right|/{w}_{{j}_{1}}=\left|{\beta}_{{j}_{2}}\right|/{w}_{{j}_{2}}$[17]. Here we examine three weight functions in particular: *w*_{k }= 1, *w*_{k }= $\sqrt{{d}_{k}}$, and *w*_{k }= *d*_{k}, where gene *k *has *d*_{k }direct neighbors. The new method encourages $\left|{\beta}_{{j}_{1}}\right|=\left|{\beta}_{{j}_{2}}\right|$ if *w*_{k }= 1, $\frac{\left|{\beta}_{{j}_{1}}\right|}{\sqrt{{d}_{{j}_{1}}}}=\frac{\left|{\beta}_{{j}_{2}}\right|}{\sqrt{{d}_{{j}_{2}}}}$ if *w*_{k }= $\sqrt{{d}_{k}}$, and $\frac{\left|{\beta}_{{j}_{1}}\right|}{{d}_{{j}_{1}}}=\frac{\left|{\beta}_{{j}_{2}}\right|}{{d}_{{j}_{2}}}$ if *w*_{k }= *d*_{k}. Therefore, heavier weights (from *w*_{k }= 1, *w*_{k }= $\sqrt{{d}_{k}}$, to *w*_{k }= *d*_{k}) favor genes with more direct neighbors to have larger coefficient estimates; in other words, heavier weights relax the shrinkage effect for those regulators, which are known to be biologically more important. Due to this property, the choice of a heavy weight, as a simple strategy, enables us to alleviate the bias in the coefficient estimates from the penalization method and possibly improve the *p *predictive performance. Our default weight is *w*_{k }= $\sqrt{{d}_{k}}$. The weight, considered as another tuning parameter, can be determined from cross-validation or an independent validation data set, though we do not consider it here. Third, the penalty term, under certain conditions, tends to encourage a grouping effect, where highly correlated predictors tend to have similar coefficient estimates [17-20]. Fourth, the penalty is linear, which allows the solution to be found by the linear programming (LP) technique that is computationally convenient.

As usual, the fitted classifier is $\widehat{f}(x)={\widehat{\beta}}_{0}+{x}^{T}\widehat{\beta}$, and the classification rule is *sign*($\widehat{f}$(*x*)). We employ LP to obtain the solutions to (8) by

subject to

where

and ${\beta}_{j}={\beta}_{j}^{+}-{\beta}_{j}^{-}$, in which ${\beta}_{j}^{+}$ and ${\beta}_{j}^{-}$ denote the positive and negative parts of *β*_{j}. The calculation of the new method can be easily implemented by the R package lpsolve, so is the computation of the L1-SVM. The R package e1071 (with linear kernel) is used to obtain the solution to the STD-SVM.

## Results and discussion

### Simulation

We conducted several simulation studies to numerically evaluate the performance of the network-based SVM along with the STD-SVM and L1-SVM. The simulation setups were similar to those in [18]. We started from a simple network consisting of 5 subnetworks, each having a regulator gene *t *(*t *= 1,...,5) that regulated 10 target genes, leading to a total of 55 genes (*p *= 55). We assumed that two out of the five subnetworks were informative; that is, the coefficients of 22 genes were nonzero and thus informative to the outcome, while the remaining 33 noise genes had no effect on the outcome. We generated a simulated data set by the following steps:

• Generate the expression level of regulator gene *t*, *X*_{t }~ *N *(0, 1), *t *= 1,..., 5, independently.

• Assume that the expression level of regulator gene *t *and each of its regulated genes follow a bivariate normal distribution with correlation 0.7. Thus, the expression level of each target gene regulated by gene *t*, ${X}_{l}^{(t)}$ ~ *N*(0.7*X*_{t}, 0.51), *l *= 1,..., 10 and *t *= 1,..., 5.

• Generate the outcome *Y *from a logistic regression model: Logit (*Pr*(*Y *= 1|*X*)) = *X*^{T}*β *+ *β*_{0}, *β*_{0}= 2, where *X *is the vector of the expression levels of all the genes, and coefficient vector $\beta =({\beta}_{1}^{(1)},\mathrm{...},{\beta}_{10}^{(1)},\mathrm{...},{\beta}_{1}^{(5)},\mathrm{...},{\beta}_{10}^{(5)})$.

Four sets of true coefficients, *β *'s, were specified to reflect four scenarios:

1. $\beta =(5,\underset{10}{\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}},-5,\underset{10}{\underbrace{\frac{-5}{\sqrt{10}},\cdots ,\frac{-5}{\sqrt{10}}}},0,\cdots ,0).$.

The effect of one informative subnetwork was the same as the other in magnitude but with an opposite direction.

2. $\beta =(5,\underset{10}{\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}},3,\underset{10}{\underbrace{\frac{3}{\sqrt{10}},\cdots ,\frac{3}{\sqrt{10}}}},0,\cdots ,0).$.

Both informative subnetworks had positive effects but in different magnitudes.

3. $\beta =(5,\underset{7}{\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}},\frac{-5}{\sqrt{10}},\frac{-5}{\sqrt{10}},\frac{-5}{\sqrt{10}},3,\underset{7}{\underbrace{\frac{3}{\sqrt{10}},\cdots ,\frac{3}{\sqrt{10}}}},\frac{-3}{\sqrt{10}},\frac{-3}{\sqrt{10}},\frac{-3}{\sqrt{10}}0,\cdots ,0).$.

Target genes in the same informative subnetworks had both positive and negative effects.

4. $\beta =(5,\underset{6}{\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}},\underset{4}{\underbrace{\frac{-5}{\sqrt{10}},\cdots ,\frac{-5}{\sqrt{10}}}},-3,\underset{6}{\underbrace{\frac{-3}{\sqrt{10}},\cdots ,\frac{-3}{\sqrt{10}}}},\underset{4}{\underbrace{\frac{3}{\sqrt{10}},\cdots ,\frac{3}{\sqrt{10}}}},0,\cdots ,0).$.

It was similar to but more extreme than scenario 3.

Five methods, STD-SVM, L1-SVM, and network-based SVM with *w*_{k }= 1, *w*_{k }= $\sqrt{{d}_{k}}$, and *w*_{k }= *d*_{k}, were compared based on the results averaged over 100 runs under each of the above four scenarios. For each run, 100 observations were simulated as training data to build a classifier (with any given *λ*), another 100 for tuning the regularization parameter *λ*, and the last 10,000 as test data. Each predictor was normalized to have mean 0 and standard deviation 1. Given any value of *λ*, we obtained the coefficient estimates from the training set, then applied the classifier to the tuning set to find the classification error. We searched for $\widehat{\lambda}$, from a wide range of prespecified values, which produced the smallest classification error. The classifier corresponding to $\widehat{\lambda}$ was identified as the fitted classifier $\widehat{f}$. Then we applied $\widehat{f}$ to the test set and calculated the test error, the number of misclassifications divided by the test sample size. Table Table11 reports the mean classification error of the test set and its standard error (SE in parentheses), the standard deviation of the classification errors divided by the square root of the number of runs, for each method over 100 runs under each scenario. To evaluate each method's ability to select informative genes, we examined the false negatives, defined as the number of informative genes whose coefficients were estimated to be zero. In addition, we also considered a smaller sample size: we repeated the entire process with 50 training data points, 50 tuning data points, and again 10,000 test data points. The network-based SVM is named as "New" in the table.

*p*= 55. The simulation results were averaged over 100 runs for

*p*= 55 (22 informative and 33 noise genes).

According to our simulation setups, the correct weight function should be *w *= $\sqrt{d}$. However, we find that the new method with *w *= *d *overwhelmingly beat all other methods in all the setups. It consistently made the most accurate classifications and missed no informative genes. The new method with *w *= $\sqrt{d}$ performed the second best: in most cases, it improved the classification accuracy over STD-SVM and L1-SVM; and under all the settings, it produced models that identified more informative genes than the L1-SVM. In contrast, *w *= 1 did not bring much gains over the STD-SVM or the L1-SVM. The L1-SVM led to models that were too sparse, missing about 14 and 11 informative genes for *n *= 50 and *n *= 100 respectively. The superior performance and the larger model size of the heavy weight (*w *= *d*) compared with its counterparts (*w *= 1 and *w *= $\sqrt{d}$) is presumably due to its relaxation of the shrinkage effect. The penalization methods shrink the $\widehat{\beta}$ toward zero by imposing the constraints (the penalty term) and therefore introduces bias to $\widehat{\beta}$. By grouping neighboring genes, the new method encourages the pairwise weighted absolute coefficients to be equal. Therefore, a heavy weight leads to larger |$\widehat{\beta}$| for regulator genes. By choosing a heavier weight, we may overcome over-shrinkage, alleviate biases, and achieve better classification accuracy to some extent at the expense of model sparsity. As shown by Table Table2,2, *w *= *d *produced the largest |$\widehat{\beta}$| for regulators than its two counterparts. The L1-SVM estimates were treated as a yardstick for comparison as to provide an idea of the extent of shrinkage by each weight function. For example, *w *= 1 and *w *= $\sqrt{d}$ overly shrank all the regulators under all scenarios as compared with the L1-SVM estimates. Note that the binary outcome *Y *was generated from a logistic regression model while $\widehat{\beta}$ was estimated from a linear model, hence *E*($\widehat{\beta}$) may be different from *β *even for an unbiased estimator $\widehat{\beta}$ of the linear model.

*p*= 55 and

*n*= 100. The mean and the standard deviation (SD) of the coefficient estimates for selected informative genes were calculated from 100 runs.

Next, we evaluated the performance of the new method for high-dimensional data with large *p*. We used the setup of 50 observations for training, 50 for tuning, and 10,000 for test data. We assumed that (1) the network was composed of either 50 or 100 subnetworks, each having one gene regulating 10 target genes; (2) the first 2 subnetworks were informative resulting in 22 informative genes; (3) the rest of the genes had no effect on the outcome, leading to 528 noise genes when *p *= 550 and 1,078 noise genes when *p *= 1, 100; and (4) the true *β *was specified as in scenario 3. Table Table33 shows the simulation results averaged over 100 runs. Again, we see the gains from using a heavy weight (*w *= *d*). It prevailed over all the other methods in making accurate classifications and selecting informative genes. The *w *= $\sqrt{d}$ ranked the second. However, *w *= *d *generated models much larger than those from other methods except STD-SVM. In this case, the performance of *w *= 1 is no better than L1-SVM possibly due to over shrinkage of the effects of the regulator genes.

### Applications to microarray data

To evaluate its performance in the real world, we applied the new method to two microarray gene expression data sets related to the Parkinson's disease (PD) [21] and breast cancer metastasis (BC) [1,4] respectively.

#### Parkinson's disease

The data set includes the Parkinson's disease status and the expression levels of 22,283 genes from 105 patients (50 cases and 55 controls) [22]. We used the same network structure as [18]. The network combines 33 Kyoto Encyclopedia of Genes and Genomes (KEGG) regulatory pathways and contains a total of 1,523 genes and 6,865 edges. The data were randomly split into training (40 observations), tuning (20 observations), and test (45 observations) sets. The expression level of each gene was normalized to have mean 0 and standard deviation 1 across samples. The tuning parameter was identified from the tuning set and the performance of the method was evaluated on the test set by the mean classification error and its standard error averaged over 10 runs. Five methods were compared: STD-SVM, L1-SVM, network-based SVM with *w *= 1, *w *= $\sqrt{d}$, and *w *= *d*. To obtain a final model based on the new method with *w *= $\sqrt{d}$, we combined, for each run, the previous tuning and test data as the new tuning set leading to a sample size as large as 65 observations, on which the classification errors were calculated for wide-ranging values of the tuning parameter. Then after 10 runs, we had an averaged classification error corresponding to each tuning parameter value. The value that generated the minimal averaged error was the one we selected to fit the final model to all the data. Note that the classification error rate from the final model was likely to be biased due to the double use of the data for training/tuning and test; the main purpose of fitting the final model was to see the selected genes at the end.

First, we focused on the 1,070 genes that appeared in the network with the largest variations of expression levels (i.e., SD of expression levels across the 105 samples ≥ 15). According to the KEGG pathway of Parkinson's disease [23], 20 genes play a role in the disease progression, five of which (*UBE1, PARK2, UBB, SEPT5*, and *SNCAIP*) belong to the 1,070 genes. In addition to the classification error, we added two additional criteria for method comparison: the number of disease genes identified, and the number of genes identified. Table Table44 shows that STD-SVM made the most accurate classification, even though the difference with other methods was perhaps non-significant. The *w *= *d *ranked the second in predictive performance while produced a model including 70.6 genes on average. In this case, the *w *= $\sqrt{d}$ gained advantage: it selected more disease genes by a relatively sparse model with a classification error non-significantly larger than STD-SVM. From the 1,070 genes, with the final model the new method identified 75 genes including one disease gene.

**...**

Next, to better integrate the biological observation of the KEGG pathway and the known network structure of [18], we restricted our analysis to the first- and second-order-neighbors of the 8 disease genes on the Parkinson's disease KEGG pathway whose expression levels and network structure are available. The first-order-neighbor subnetwork (PD-1nb-net) was composed of the 8 disease genes and their 8 direct neighbors. The second-order-neighbor subnetwork (PD-2nb-net) comprised the PD-1nb-net as well as the direct neighbors of the 8 direct neighbors of the disease genes, leading to a total of 26 genes. Figure Figure11 displays the two subnetworks. We conducted the analysis in the same way as described above. The only difference resided in that this time only genes appearing in the PD-1nb-net and PD-2nb-net were included in the analysis. Table Table55 shows the results.

**...**

**Parkinson's disease gene subnetworks**. Left: PD-1nb-net, including 8 Parkinson disease genes (gray) and their 8 direct neighbors (white). Right: PD-2nb-net, including 8 Parkinson disease genes (gray), their 8 direct and 10 second-order neighbors (white).

**...**

We see the gains from employing the new method when narrowing down our focus on the PD-1nb-net and PD-2nb-net. For the PD-1nb-net, *w *= 1 and *w *= $\sqrt{d}$ performed equally well. They had the smallest classification error and identified one more disease gene through a model slightly larger than the one obtained from L1-SVM. The new method with *w *= *d *won over in the case of PD-2nb-net with the best accuracy and most selected disease genes. The *w *= $\sqrt{d}$ ranked the second in terms of the prediction accuracy while detecting 3 more disease genes by a model with 3 more genes than that of the L1-SVM. This means that the new method was able to identify more clinically relevant genes while keeping the same number of noise genes in the model as L1-SVM. In both subnetworks, the final models included all the genes.

#### Breast cancer metastasis

The breast cancer metastasis data set [1,4] contains expression levels of 8,141 genes for 286 patients, 106 of whom were detected to develop metastasis within a 5-year follow-up after surgery. *TP53*, *BRCA1*, and *BRCA2 *are three human genes that belong to the class of tumor suppressor genes, which are known to prevent uncontrolled cell proliferation, and to play a critical role in repairing the chromosomal damage. Certain mutations of these genes lead to increasing risk of breast cancer. We explored the protein-protein interaction (PPI) network previously used by [1]. The PPI network comprises 57,235 interactions among 11,203 proteins, obtained by assembling various sources of experimental data and curation of the literature [1]. We confined our analysis to the direct or first-order neighbors (BC-1nb-net) of the three cancer genes, and the subnetwork composed of two parts (BC-2nb-net): the direct neighbors of *TP53*, and the second-order neighbors of *BRCA1 *and *BRCA2*. We fit the final model and compared the four methods in terms of classification error, cancer genes selection, and model sparsity. The cancer genes are the 227 known or putative cancer genes with estimated mutation frequencies in cancer samples ([1]). A total of 294 genes that fell into the BC-1nb-net had observed expression levels, among which were 40 cancer genes and 7 cancer genes (*ABL1, JAK2, p53, PTEN, p14ARF, PTCH*, and *RB*) with mutation frequencies larger than 0.10. The BC-2nb-net was composed of 2,070 genes, 1,718 of them with observed expression levels, including 107 cancer genes. Besides the 7 included in BC-1nb-net, 7 additional cancer genes (*ACH, APC, EGFR, KIT, NICD, RAS*, and *CTNNB1*) that had mutation frequencies larger than 0.10 belonged to BC-2nb-net.

For BC-1nb-net, *w *= *d *had the advantage in selecting cancer genes and those with large mutant frequencies (Table (Table6).6). The *w *= $\sqrt{d}$ detected more clinically relevant genes by a sparser model while reaching a comparable classification error rate to that of L1-SVM. Even though the final model was parsimonious, it included 4 cancer genes, one of which had a large mutation frequency. For BC-2nb-net, the new method with *w *= $\sqrt{d}$ detected more cancer genes with equally accurate predictions while maintaining a sparse model compared with L1-SVM. The final model included only 23 genes out of 1,718, two of which were cancer genes with one having a large mutation frequency.

## Conclusion

The advancement in the microarray technology has enriched the tool kit of researchers to decipher the complexity of disease mechanisms at the genomic level. Studies have been widely conducted to identify genetic markers to better the diagnostic classification and prognostic assessment, largely by ignoring biological knowledge on gene functions and treating individual genes equally and independently a priori. The downside of such an endeavor has been realized; for example, gene markers identified across similar patient cohorts for the same disease in such a way often lack consistency. As a viable alternative, the network-based approach has been gaining popularity. In addition to improving predictive performance and gene selection, the network-based approach extracts more biological insights from high-throughput gene expression data. Here we have proposed a network-based SVM, with a penalty term incorporating gene network information, as a practically useful classification tool for microarray data. Our simulation studies and two real data applications indicate that the proposed method is able to better identify clinically relevant genes and make accurate predictions.

## List of abbreviations used

SVM: support vector machine; STD-SVM: standard support vector machine; L1-SVM: L1-penalized support vector machine; LP: linear programming; PD: Parkinson's disease; BC: Breast cancer; KEGG: Kyoto Encyclopedia of Genes and Genomes; PPI: protein protein interaction

## Competing interests

The authors declare that they have no competing interests.

## Authors' contributions

YZ implemented the methods, did all the experiments and drafted the paper. XS and WP initiated the project. All participated in the writing of the article.

## Acknowledgements

YZ and WP were partially supported by NIH grants HL65462 and GM081535; XS supported by NIH grant GM081535 and NSF grants IIS-0328802 and DMS-0604394. We thank Dr Hongzhe Li and Dr Trey Ideker for providing the KEGG network and PPI network data respectively.

This article has been published as part of *BMC Bioinformatics *Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

## References

- Chuang HY, Lee EJ, Liu YT, Lee DH, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. doi: 10.1038/msb4100180. [PMC free article] [PubMed] [Cross Ref]
- Frolov AE, Godwin AK, Favorova OO. Differential gene expression analysis by DNA microarray technology and its application in molecular oncology. Mol Biol. 2003;37:486–494. doi: 10.1023/A:1025166706481. [PubMed] [Cross Ref]
- Yang TY. The simple classification of multiple cancer types using a small number of significant genes. Mol Diagn Ther. 2007;11:265–275. [PubMed]
- Wang Y, Klijin JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–679. [PubMed]
- Xiong MM, Li WJ, Zhao JY, Li J, Boerwinkle E. Feature (gene) selection in gene expression-based tumor classification. Mol Genet Metab. 2001;73:239–247. doi: 10.1006/mgme.2001.3193. [PubMed] [Cross Ref]
- Ein-Dor L, Kela I, Getz G, Givol D, Domany E. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics. 2005;21:171–178. doi: 10.1093/bioinformatics/bth469. [PubMed] [Cross Ref]
- Ein-Dor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci USA. 2006;103:5923–5928. doi: 10.1073/pnas.0601231103. [PMC free article] [PubMed] [Cross Ref]
- Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet. 2007;3:e96. doi: 10.1371/journal.pgen.0030096. doi:10.1016/S0140-6736(05)17947-1. [PMC free article] [PubMed] [Cross Ref]
- Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20:273–297.
- Vapnik V. The Nature of Statistical Learning Theory. New York: Springer; 1995.
- Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA. 2000;97:262–267. doi: 10.1073/pnas.97.1.262. [PMC free article] [PubMed] [Cross Ref]
- Furey T, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16:906–914. doi: 10.1093/bioinformatics/16.10.906. [PubMed] [Cross Ref]
- Zou H, Yuan M. The
*F*_{∞}-norm Support Vector Machine. Stat Sin. 2008;18:379–398. - Wahba G, Lin Y, Zhang H. GACV for support vector machines. In: Smola A, Bartlett P, Scholkopf B, Schuurmans D, editor. Advances in Large Margin Classifiers. Cambridge, MA: MIT Press; 2000. pp. 297–311.
- Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning. New York: Springer; 2001.
- Friedman JH, Hastie T, Rosset S, Tibshirani R, Zhu J. Discussion of boosting papers. Ann Appl Stat. 2004;32:102–107.
- Pan W, Xie B, Shen X. Incorporating predictor network in penalized regression with application to microarray data. [Manuscript submitted]. [PMC free article] [PubMed]
- Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–1182. doi: 10.1093/bioinformatics/btn081. [PubMed] [Cross Ref]
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Statist Soc B. 2005;67:301–320. doi: 10.1111/j.1467-9868.2005.00503.x. [Cross Ref]
- Wang L, Zhu J, Zou H. The doubly regularized support vector machine. Stat Sin. 2006;16:589–615.
- Gene Expression Omnibus: GSE6613 http://www.ncbi.nlm.nih.gov/geo/
- Scherzer CR, Eklund AC, Morse LJ, Liao Z, Locascio JJ, Fefer D, Schwarzschild MA, Schlossmacher MG, Hauser MA, Vance JM, Sudarsky LR, Standaert DG, Growdon JH, Jensen RV, Gullans SR. Molecular markers of early Parkinson's disease based on gene expression in blood. Proc Natl Acad Sci USA. 2007;104:955–960. doi: 10.1073/pnas.0610204104. [PMC free article] [PubMed] [Cross Ref]
- KEGG: Parkinson's disease http://cgap.nci.nih.gov/Pathways/Kegg/hsa05020

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (448K)

- Mixture classification model based on clinical markers for breast cancer prognosis.[Artif Intell Med. 2010]
*Zeng T, Liu J.**Artif Intell Med. 2010 Feb-Mar; 48(2-3):129-37. Epub 2009 Dec 14.* - Support Vector Machines with Disease-gene-centric Network Penalty for High Dimensional Microarray Data.[Stat Interface. 2009]
*Zhu Y, Pan W, Shen X.**Stat Interface. 2009 Summer; 2(3):257-269.* - Feature selection and classification of MAQC-II breast cancer and multiple myeloma microarray gene expression data.[PLoS One. 2009]
*Liu Q, Sung AH, Chen Z, Liu J, Huang X, Deng Y.**PLoS One. 2009 Dec 11; 4(12):e8250. Epub 2009 Dec 11.* - Identifying cancer biomarkers by network-constrained support vector machines.[BMC Syst Biol. 2011]
*Chen L, Xuan J, Riggins RB, Clarke R, Wang Y.**BMC Syst Biol. 2011 Oct 12; 5:161. Epub 2011 Oct 12.* - Filter versus wrapper gene selection approaches in DNA microarray domains.[Artif Intell Med. 2004]
*Inza I, Larrañaga P, Blanco R, Cerrolaza AJ.**Artif Intell Med. 2004 Jun; 31(2):91-103.*

- Molecular pathway identification using biological network-regularized logistic models[BMC Genomics. ]
*Zhang W, Wan YW, Allen GI, Pang K, Anderson ML, Liu Z.**BMC Genomics. 14(Suppl 8)S7* - Protein Sub-Nuclear Localization Prediction Using SVM and Pfam Domain Information[PLoS ONE. ]
*Kumar R, Jain S, Kumari B, Kumar M.**PLoS ONE. 9(6)e98345* - Biomarker Gene Signature Discovery Integrating Network Knowledge[Biology. ]
*Cun Y, Fröhlich H.**Biology. 1(1)5-17* - Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm[BMC Bioinformatics. ]
*Chen KH, Wang KJ, Tsai ML, Wang KM, Adrian AM, Cheng WC, Yang TS, Teng NC, Tan KP, Chang KS.**BMC Bioinformatics. 1549* - Network and Data Integration for Biomarker Signature Discovery via Network Smoothed T-Statistics[PLoS ONE. ]
*Cun Y, Fröhlich H.**PLoS ONE. 8(9)e73074*