# Gene Expression Data Classification With Kernel Principal Component Analysis

^{1}Bioinformatics Cell, US Army Medical Research and Materiel Command, 110 North Market Street, Frederick, MD 21703, USA

^{2}Department of Preventive Medicine and Biometrics, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814, USA

^{3}Department of Statistics, University of Tennessee, 331 Stokely Management Center, Knoxville, TN 37996, USA

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

One important feature of the gene expression data is that the
number of genes *M* far exceeds the number of samples
*N*. Standard statistical methods do not work well when
*N < M*. Development of new methodologies or modification of
existing methodologies is needed for the analysis of the
microarray data. In this paper, we propose a novel analysis
procedure for classifying the gene expression data. This procedure
involves dimension reduction using kernel principal component
analysis (KPCA) and classification with logistic regression
(discrimination). KPCA is a generalization and nonlinear version
of principal component analysis. The proposed algorithm was
applied to five different gene expression datasets involving human
tumor samples. Comparison with other popular classification
methods such as support vector machines and neural networks shows
that our algorithm is very promising in classifying gene
expression data.

## INTRODUCTION

One important application of gene expression data is the classification of samples into different categories, such as the types of tumor. Gene expression data are characterized by many variables on only a few observations. It has been observed that although there are thousands of genes for each observation, a few underlying gene components may account for much of the data variation. Principal component analysis (PCA) provides an efficient way to find these underlying gene components and reduce the input dimensions (Bicciato et al [1]). This linear transformation has been widely used in gene expression data analysis and compression (Bicciato et al [1], Yeung and Ruzzo [2]). If the data are concentrated in a linear subspace, PCA provides a way to compress data and simplify the representation without losing much information. However, if the data are concentrated in a nonlinear subspace, PCA will fail to work well. In this case, one may need to consider kernel principal component analysis (KPCA) (Rosipal and Trejo [3]). KPCA is a nonlinear version of PCA. It has been studied intensively in the last several years in the field of machine learning and has been applied successfully in many settings (Ng et al [4]). In this paper, we introduce a novel classification algorithm based on KPCA. Computational results show that our algorithm is effective in classifying gene expression data.

## ALGORITHM

A gene expression dataset with *M* genes (features) and *N* mRNA
samples (observations) can be conveniently represented by the
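following gene expression matrix:

$$X=\left[\begin{array}{cccc}{x}_{11}& {x}_{12}& \cdots & {x}_{1N}\\ {x}_{21}& {x}_{22}& \cdots & {x}_{2N}\\ \vdots & \vdots & \ddots & \vdots \\ {x}_{M1}& {x}_{M2}& \cdots & {x}_{MN}\end{array}\right],\phantom{\rule{2em}{0ex}}\left(1\right)$$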

where *x*_{li} is the measurement of the expression level of gene
*l* in mRNA sample *i*. Let **x**_{i} = (*x*_{1i}, *x*_{2i}, ...,
*x*_{Mi})′ denote the *i*th column (sample) of *X* with the prime
′ representing the transpose operation, and *y*_{i} the
corresponding class label (eg, tumor type or clinical outcome).

KPCA is a nonlinear version of PCA. To perform KPCA, one first
transforms the input data **x** from the original input space
*F*_{0} into a higher-dimensional feature space *F*_{1} with the
nonlinear transform **x** → Φ(**x**), where Φ is a
nonlinear function. Then a kernel matrix *K* is formed using the
inner products of new feature vectors. Finally, a PCA is performed
on the centralized *K*, which is the estimate of the covariance
matrix of the new feature vector in *F*_{1}. Such a linear PCA on
*K* may be viewed as a nonlinear PCA on the original data. This
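property is sometimes called the “kernel trick” in the literature.
The concept of a kernel is central to the method, so a simple example
is useful. Suppose we have a two-dimensional input
**x** = (*x*_{1}, *x*_{2})′, and let the nonlinear transform be

$$\Phi \left(\mathbf{x}\right)={\left({x}_{1}^{2},\sqrt{2}{x}_{1}{x}_{2},{x}_{2}^{2}\right)}^{\prime}.\phantom{\rule{2em}{0ex}}\left(2\right)$$

Therefore, given two points **x**_{i} = (*x*_{i1}, *x*_{i2})′
and **x**_{j} = (*x*_{j1}, *x*_{j2})′, the
inner product (kernel) is

$$K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)=\Phi {\left({\mathbf{x}}_{i}\right)}^{\prime}\Phi \left({\mathbf{x}}_{j}\right)={x}_{i1}^{2}{x}_{j1}^{2}+2{x}_{i1}{x}_{i2}{x}_{j1}{x}_{j2}+{x}_{i2}^{2}{x}_{j2}^{2}={\left({{\mathbf{x}}^{\prime}}_{i}{\mathbf{x}}_{j}\right)}^{2},\phantom{\rule{2em}{0ex}}\left(3\right)$$

which is a second-order polynomial kernel. Equation (3)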
clearly shows that the kernel function is an inner product in the
feature space and the inner products can be evaluated without even
explicitly constructing the feature vector Φ(**x**).
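The identity behind this example can be checked numerically. The sketch below assumes the explicit feature map Φ(**x**) = (*x*_{1}^{2}, √2 *x*_{1}*x*_{2}, *x*_{2}^{2})′ for a two-dimensional input, which is one standard choice that yields a second-order polynomial kernel; the function names `phi`, `dot`, and `poly2_kernel` are illustrative and not part of the original method.

```python
import math

def phi(x):
    """Assumed explicit feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    """Ordinary inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def poly2_kernel(xi, xj):
    """Homogeneous second-order polynomial kernel (xi' xj)^2."""
    return dot(xi, xj) ** 2

xi, xj = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(xi), phi(xj))   # inner product computed in the feature space
implicit = poly2_kernel(xi, xj)    # same quantity computed from the inputs only
```

Both quantities agree up to floating-point rounding, which is why the feature vector never has to be formed explicitly.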

The following are among the popular kernel functions:

- first norm exponential kernel $$K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)=\exp \left(-\beta \Vert {\mathbf{x}}_{i}-{\mathbf{x}}_{j}\Vert \right),\phantom{\rule{2em}{0ex}}\left(4\right)$$
- radial basis function (RBF) kernel $$K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)=\exp \left(-\frac{{\left|{\mathbf{x}}_{i}-{\mathbf{x}}_{j}\right|}^{2}}{{\sigma}^{2}}\right),\phantom{\rule{2em}{0ex}}\left(5\right)$$
- power exponential kernel (a generalization of the RBF kernel) $$K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)=\exp \left[-{\left(\frac{{\left|{\mathbf{x}}_{i}-{\mathbf{x}}_{j}\right|}^{2}}{{r}^{2}}\right)}^{\beta}\right],\phantom{\rule{2em}{0ex}}\left(6\right)$$
- sigmoid kernel $$K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)=\tanh \left(\beta {{\mathbf{x}}^{\prime}}_{i}{\mathbf{x}}_{j}\right),\phantom{\rule{2em}{0ex}}\left(7\right)$$
- polynomial kernel $$K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)={\left({{\mathbf{x}}^{\prime}}_{i}{\mathbf{x}}_{j}+{p}_{2}\right)}^{{p}_{1}},\phantom{\rule{2em}{0ex}}\left(8\right)$$
  where *p*_{1} and *p*_{2} = 0, 1, 2, 3, ... are both integers.
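For concreteness, the kernels (4) to (8) can be written as plain functions of two input vectors. This is a minimal sketch: the function names and default parameter values are illustrative assumptions, and the norm in (4) is taken here to be the (unsquared) Euclidean norm, which the text does not state explicitly.

```python
import math

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def _sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def first_norm_exponential(xi, xj, beta=1.0):
    # Equation (4); assumes the unsquared Euclidean norm.
    return math.exp(-beta * math.sqrt(_sqdist(xi, xj)))

def rbf(xi, xj, sigma=1.0):
    # Equation (5).
    return math.exp(-_sqdist(xi, xj) / sigma ** 2)

def power_exponential(xi, xj, r=1.0, beta=1.0):
    # Equation (6); reduces to the RBF kernel when beta = 1 and r = sigma.
    return math.exp(-((_sqdist(xi, xj) / r ** 2) ** beta))

def sigmoid_kernel(xi, xj, beta=1.0):
    # Equation (7).
    return math.tanh(beta * _dot(xi, xj))

def polynomial(xi, xj, p1=2, p2=1):
    # Equation (8).
    return (_dot(xi, xj) + p2) ** p1
```

Setting `p1=1` and `p2=0` in `polynomial` gives the linear kernel, and `p1=2` or `p1=3` with `p2=1` gives the kernels used in the experiments reported later.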

For binary classification, our algorithm, based on KPCA, is stated as follows.

### KPC classification algorithm

Given a training dataset ${\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$ with class labels ${\left\{{y}_{i}\right\}}_{i=1}^{n}$ and a test dataset ${\left\{{\mathbf{x}}_{t}\right\}}_{t=1}^{{n}_{t}}$ with labels ${\left\{{y}_{t}\right\}}_{t=1}^{{n}_{t}}$, do the following.

- Compute the kernel matrix for the training data, *K* = [*K*_{ij}]_{n×n}, where *K*_{ij} = *K*(**x**_{i}, **x**_{j}). Compute the kernel matrix for the test data, *K*_{te} = [*K*_{ti}]_{n_t × n}, where *K*_{ti} = *K*(**x**_{t}, **x**_{i}). *K*_{ti} projects the test data **x**_{t} onto the training data **x**_{i} in the high-dimensional feature space in terms of the inner product.
- Centralize *K* and *K*_{te} using $$\begin{array}{l}K=\left({\mathbf{I}}_{n}-\frac{1}{n}{\mathbf{1}}_{n}{{\mathbf{1}}^{\prime}}_{n}\right)K\left({\mathbf{I}}_{n}-\frac{1}{n}{\mathbf{1}}_{n}{{\mathbf{1}}^{\prime}}_{n}\right),\\ {K}_{\text{te}}=\left({K}_{\text{te}}-\frac{1}{n}{\mathbf{1}}_{{n}_{t}}{{\mathbf{1}}^{\prime}}_{n}K\right)\left({\mathbf{I}}_{n}-\frac{1}{n}{\mathbf{1}}_{n}{{\mathbf{1}}^{\prime}}_{n}\right).\end{array}\phantom{\rule{2em}{0ex}}\left(9\right)$$
- Form an *n* × *k* matrix *Z* = [*z*_{1} *z*_{2} ... *z*_{k}], where *z*_{1}, *z*_{2}, ..., *z*_{k} are the eigenvectors of *K* that correspond to the largest eigenvalues λ_{1} ≥ λ_{2} ≥ ... ≥ λ_{k} > 0. Also form a diagonal matrix *D* with λ_{i} in position (*i*, *i*).
- Find the projections **V** = *KZD*^{−1/2} and **V**_{te} = *K*_{te}*ZD*^{−1/2} for the training and test data, respectively.
- Build a logistic regression model using **V** and ${\left\{{y}_{i}\right\}}_{i=1}^{n}$ and test the model performance using **V**_{te} and ${\left\{{y}_{t}\right\}}_{t=1}^{{n}_{t}}$.
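The steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: samples are stored as rows (the transpose of the matrix *X* used in the text), the logistic model is fitted by Newton iterations with a small ridge term added purely for numerical stability, and the two-ring toy data and all function names are assumptions introduced for the example.

```python
import numpy as np

def poly_kernel(A, B, p1=2, p2=1.0):
    """Polynomial kernel (a'b + p2)^p1 between the rows of A and the rows of B."""
    return (A @ B.T + p2) ** p1

def center_kernels(K, K_te):
    """Centre the training and test kernel matrices as in equation (9)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_te_c = (K_te - np.ones((K_te.shape[0], n)) @ K / n) @ H
    return H @ K @ H, K_te_c

def kpc_projections(K_c, K_te_c, k):
    """Project onto the k leading kernel principal components (eigenvalues must be > 0)."""
    lam, Z = np.linalg.eigh(K_c)
    idx = np.argsort(lam)[::-1][:k]        # indices of the k largest eigenvalues
    scale = 1.0 / np.sqrt(lam[idx])        # diagonal of D^{-1/2}
    return K_c @ Z[:, idx] * scale, K_te_c @ Z[:, idx] * scale

def _sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))

def fit_logistic(V, y, ridge=1e-3, n_iter=50):
    """Newton iterations for logistic regression; the ridge term is an added assumption."""
    A = np.hstack([np.ones((V.shape[0], 1)), V])   # first column is the intercept
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = _sigmoid(A @ w)
        grad = A.T @ (y - p) - ridge * w
        hess = (A * (p * (1.0 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        w = w + np.linalg.solve(hess, grad)
    return w

def predict_proba(V, w):
    return _sigmoid(w[0] + V @ w[1:])

# Toy data (illustrative only): two concentric rings, not linearly separable.
angles = np.arange(8) * np.pi / 4
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X_train = np.vstack([0.5 * ring, 2.0 * ring])
y_train = np.array([0.0] * 8 + [1.0] * 8)
X_test = np.array([[0.4, 0.0], [0.0, 2.2]])

K, K_te = center_kernels(poly_kernel(X_train, X_train), poly_kernel(X_test, X_train))
V, V_te = kpc_projections(K, K_te, k=5)
w = fit_logistic(V, y_train)
train_pred = (predict_proba(V, w) > 0.5).astype(int)
test_pred = (predict_proba(V_te, w) > 0.5).astype(int)
```

With a second-order polynomial kernel the two rings become linearly separable in the feature space, so a logistic model on the kernel principal components can separate them even though no straight line in the input plane does.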

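We can show that the above KPC classification algorithm is a
nonlinear version of the logistic regression. From our KPC
classification algorithm, the probability of the label
*y*, given the projection **v**, is expressed as

$$P\left(y=1\mid \mathbf{w},\mathbf{v}\right)=g\left({w}_{0}+\sum _{i=1}^{k}{w}_{i}{v}_{i}\right),\phantom{\rule{2em}{0ex}}\left(10\right)$$

where the coefficients **w** are adjustable parameters
and *g* is the logistic function

$$g\left(u\right)=\frac{1}{1+\exp \left(-u\right)}.\phantom{\rule{2em}{0ex}}\left(11\right)$$

Let *n* be the number of training samples and Φ the
nonlinear transform function. We know each eigenvector *z*_{i} lies
in the span of Φ(**x**_{1}), Φ(**x**_{2}), ...,
Φ(**x**_{n}) for *i* = 1, ..., *n* (Rosipal and Trejo
[3]). Therefore one can write, for constants *z*_{ij},

$${z}_{i}=\sum _{j=1}^{n}{z}_{ij}\Phi \left({\mathbf{x}}_{j}\right).\phantom{\rule{2em}{0ex}}\left(12\right)$$

Given a test data point **x**, let *v*_{i} denote the projection of
Φ(**x**) onto the *i*th nonlinear component with a
normalizing factor $1/\sqrt{{\lambda}_{i}}$; we have

$${v}_{i}=\frac{1}{\sqrt{{\lambda}_{i}}}{{z}^{\prime}}_{i}\Phi \left(\mathbf{x}\right)=\frac{1}{\sqrt{{\lambda}_{i}}}\sum _{j=1}^{n}{z}_{ij}K\left({\mathbf{x}}_{j},\mathbf{x}\right).\phantom{\rule{2em}{0ex}}\left(13\right)$$

Substituting (13) into (10), we have

$$P\left(y=1\mid \mathbf{w},\mathbf{x}\right)=g\left({w}_{0}+\sum _{j=1}^{n}{c}_{j}K\left({\mathbf{x}}_{j},\mathbf{x}\right)\right),\phantom{\rule{2em}{0ex}}\left(14\right)$$

where

$${c}_{j}=\sum _{i=1}^{k}\frac{{w}_{i}}{\sqrt{{\lambda}_{i}}}{z}_{ij},\phantom{\rule{1em}{0ex}}j=1,\dots ,n.\phantom{\rule{2em}{0ex}}\left(15\right)$$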

When *K*(**x**_{i}, **x**_{j}) = **x**_{i}′**x**_{j}, (14) becomes
logistic regression. *K*(**x**_{i}, **x**_{j}) = **x**_{i}′**x**_{j} is the
linear kernel (polynomial kernel with *p*_{1} = 1 and *p*_{2} = 0). When we first
standardize the input data by subtracting the mean and then
dividing by the standard deviation, the linear kernel matrix is the
Gram matrix of the standardized data and has the same nonzero
eigenvalues, up to a constant factor, as the covariance matrix of
the input data. Therefore the KPC classification
algorithm is a generalization of logistic regression.
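The reduction to logistic regression can be made explicit. Writing the argument of the logistic function as a kernel expansion over the training samples with weights *c*_{j} and an intercept *w*_{0} (notation assumed here for illustration), the linear kernel collapses the expansion to a single linear form in **x**:

$${w}_{0}+\sum _{j=1}^{n}{c}_{j}K\left({\mathbf{x}}_{j},\mathbf{x}\right)={w}_{0}+\sum _{j=1}^{n}{c}_{j}{{\mathbf{x}}^{\prime}}_{j}\mathbf{x}={w}_{0}+{\left(\sum _{j=1}^{n}{c}_{j}{\mathbf{x}}_{j}\right)}^{\prime}\mathbf{x}={w}_{0}+{\mathit{\beta}}^{\prime}\mathbf{x},\phantom{\rule{2em}{0ex}}\mathit{\beta}=\sum _{j=1}^{n}{c}_{j}{\mathbf{x}}_{j}.$$

The model is then a logistic function of a linear combination of the original gene expression values, with the coefficient vector constrained to lie in the span of the training samples.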

Described in terms of binary classification, our classification
algorithm can be readily employed for multiclass classification
tasks. Typically, two-class problems tend to be much easier to
learn than multiclass problems. While for two-class problems only
one decision boundary
must be inferred, the general *c*-class setting
requires us to apply a strategy for coupling decision rules. For
a *c*-class problem, we employ the standard approach where
two-class classifiers are trained in order to separate each of
the classes against all others. The decision rules are then
coupled by voting, that is, sending the sample to the class with
the largest probability.
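The one-against-the-rest coupling described above amounts to relabelling the data once per class and then taking the class with the largest fitted probability. A minimal sketch follows; the function names are illustrative, and the probabilities are assumed to come from *c* separately trained two-class classifiers.

```python
def one_vs_rest_labels(labels, target):
    """Binary labels for the classifier that separates `target` from all other classes."""
    return [1 if lab == target else 0 for lab in labels]

def predict_class(class_probs):
    """Return the 1-based index of the class whose classifier reports the largest probability."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best + 1
```

For example, `predict_class([0.1, 0.7, 0.2])` assigns the sample to class 2, because the second classifier reports the largest probability.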

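Mathematically, we build *c* two-class classifiers
based on the KPC classification algorithm in the form of
(14) with the scheme “one against the rest”, each with its own
intercept ${w}_{0}^{\left(i\right)}$ and kernel coefficients ${c}_{j}^{\left(i\right)}$:

$${P}_{i}\left(y=i\mid \mathbf{x}\right)=g\left({w}_{0}^{\left(i\right)}+\sum _{j=1}^{n}{c}_{j}^{\left(i\right)}K\left({\mathbf{x}}_{j},\mathbf{x}\right)\right),\phantom{\rule{2em}{0ex}}\left(16\right)$$

where *i* = 1, 2, ..., *c*. Then for a test data point **x**_{t},
we have the predicted class

$${\hat{y}}_{t}=\underset{i=1,\dots ,c}{\mathrm{arg}\,\mathrm{max}}\;{P}_{i}\left(y=i\mid {\mathbf{x}}_{t}\right).\phantom{\rule{2em}{0ex}}\left(17\right)$$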

### Feature and model selections

Since many genes show little variation across samples, gene
(feature) selection is required. We chose the most informative
genes with the highest likelihood ratio scores, described below
(Ideker et al [5]). Given a two-class problem with an
expression matrix *X* = [*x*_{li}]_{M × N}, we have, for each
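gene *l*,

$$T\left({x}_{l}\right)=\log \left(\frac{{\sigma}^{2}}{{{\sigma}^{\prime}}^{2}}\right),\phantom{\rule{2em}{0ex}}\left(18\right)$$

where

$${\sigma}^{2}=\sum _{i=1}^{N}{\left({x}_{li}-\mu \right)}^{2},\phantom{\rule{1em}{0ex}}{{\sigma}^{\prime}}^{2}=\sum _{i\in \text{Class 0}}{\left({x}_{li}-{\mu}_{0}\right)}^{2}+\sum _{i\in \text{Class 1}}{\left({x}_{li}-{\mu}_{1}\right)}^{2}.\phantom{\rule{2em}{0ex}}\left(19\right)$$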

Here *μ*,
*μ*_{0}, and *μ*_{1} are the whole sample mean, the
Class 0 mean, and the Class 1 mean, respectively. We selected
the most informative genes with the largest *T* values.
This selection procedure is based on the likelihood ratio and is used
in our classification.
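A gene-ranking step of this kind can be sketched as below. The exact score is an assumption: the sketch uses *T* = log(σ²/σ′²), where σ² is the sum of squared deviations from the whole sample mean and σ′² is the sum of squared deviations from the two class means, which is one common form of a likelihood ratio score for a two-class problem. The function names are illustrative, and the within-class sum is assumed to be nonzero.

```python
import math

def likelihood_ratio_score(values, labels):
    """Assumed score T = log(total SS / within-class SS) for one gene (labels are 0 or 1)."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    mu = sum(values) / len(values)
    mu0 = sum(g0) / len(g0)
    mu1 = sum(g1) / len(g1)
    total = sum((v - mu) ** 2 for v in values)
    within = sum((v - mu0) ** 2 for v in g0) + sum((v - mu1) ** 2 for v in g1)
    return math.log(total / within)   # assumes within > 0

def top_genes(expression_rows, labels, m):
    """Indices of the m genes (rows) with the largest scores."""
    scores = [likelihood_ratio_score(row, labels) for row in expression_rows]
    return sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:m]
```

A gene whose class means are far apart relative to its within-class spread receives a large score, while a gene with similar class means scores close to zero.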

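On the other hand, the dimension of projection (the number of
eigenvectors) *k*
used in the model can be selected based on Akaike's information
criterion (AIC):

$$\text{AIC}=-2\log \left(\hat{L}\right)+2\left(k+1\right),\phantom{\rule{2em}{0ex}}\left(20\right)$$

where $\hat{L}$
is the maximum likelihood and *k* is the
dimension of the projection in (10). The maximum
likelihood $\hat{L}$ can also be calculated using (10):

$$\hat{L}=\prod _{i=1}^{n}{\hat{p}}_{i}^{{y}_{i}}{\left(1-{\hat{p}}_{i}\right)}^{1-{y}_{i}},\phantom{\rule{2em}{0ex}}\left(21\right)$$

where ${\hat{p}}_{i}$ is the fitted probability that training sample *i* belongs to Class 1.
We choose the *k* with the minimum AIC value.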
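Selecting *k* by AIC can be sketched as follows, assuming the usual form AIC = −2 log(L̂) + 2(number of fitted parameters), with *k* + 1 parameters for *k* projection coefficients plus an intercept. The parameter count and the function names are assumptions made for this example.

```python
import math

def log_likelihood(probs, labels):
    """Bernoulli log-likelihood of 0/1 labels given fitted probabilities P(y = 1)."""
    return sum(math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, labels))

def aic(log_lik, k):
    """AIC for a model with k projection coefficients plus an intercept (assumed count)."""
    return -2.0 * log_lik + 2.0 * (k + 1)

def best_k(log_lik_by_k):
    """Pick the k with minimum AIC from a dict {k: maximised log-likelihood}."""
    return min(log_lik_by_k, key=lambda k: aic(log_lik_by_k[k], k))
```

For instance, with maximised log-likelihoods of −20.0, −12.0, and −11.9 for *k* = 1, 2, 3, the AIC values are 44.0, 30.0, and 31.8, so *k* = 2 is selected: the third component improves the fit too little to justify the extra parameter.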

## COMPUTATIONAL RESULTS

To illustrate the algorithm proposed in the previous section, we considered five gene expression datasets: leukemia (Golub et al [6]), colon (Alon et al [7]), lung cancer (Garber et al [8]), lymphoma (Alizadeh et al [9]), and NCI (Ross et al [10]). Classification performance is assessed using leave-one-out (LOO) cross validation for all of the datasets except leukemia, for which a single training set and a single test set are used. LOO cross validation gives a more realistic assessment of how well a classifier generalizes to unseen data. For clarity of presentation, we report the number of LOO errors in all of the figures and tables.
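The LOO error count used throughout this section can be sketched generically. In the sketch below, `fit_predict` is a hypothetical callback that trains on the retained samples and returns a predicted label for the held-out one; `nearest_mean` is only a toy stand-in classifier on one-dimensional data, not the KPC algorithm.

```python
def loo_errors(samples, labels, fit_predict):
    """Number of held-out samples misclassified under leave-one-out cross validation."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # all samples except the i-th
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors

def nearest_mean(train_x, train_y, x):
    """Toy classifier: assign x to the class with the nearest training mean."""
    means = {}
    for c in set(train_y):
        vals = [v for v, y in zip(train_x, train_y) if y == c]
        means[c] = sum(vals) / len(vals)
    return min(means, key=lambda c: abs(x - means[c]))

clean = loo_errors([0.0, 0.2, 0.1, 1.0, 1.2, 1.1], [0, 0, 0, 1, 1, 1], nearest_mean)
noisy = loo_errors([0.0, 0.2, 1.15, 1.0, 1.2, 1.1], [0, 0, 0, 1, 1, 1], nearest_mean)
```

Each sample is held out exactly once, so the returned count ranges from 0 to the number of samples.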

### Leukemia

The leukemia dataset consists of expression profiles of 7129
genes from 38 training samples (27 ALL and 11 AML) and 34
testing samples (20 ALL and 14 AML). For classification of
leukemia using a KPC classification algorithm, we chose the
polynomial kernel *K*(**x**_{i},
**x**_{j}) =
(**x**_{i}′**x**_{j} + 1)^{2} and
15 eigenvectors corresponding to the first 15 largest
eigenvalues with AIC. Using 150 informative genes, we obtained
0 training error and 1 test error. This is the best result
compared with those reported in the literature. The plot for the
output of the test data is given in Figure 1, which
shows that all the test data points are classified correctly
except for the last data point.

### Colon

The colon dataset consists of expression profiles of 2000 genes
from 22 normal tissues and 40 tumor samples. We calculated the
classification result using a KPC classification algorithm with
a kernel *K*(**x**_{i}, **x**_{j}) = (**x**_{i}′**x**_{j} + 1)^{2}. We selected 150 genes and, using AIC, 25 eigenvectors. The result is compared with that from the linear principal component (PC) logistic regression. The classification errors were calculated with the LOO method. The average error with linear PC logistic regression is 2 and the error with KPC classification is 0. The detailed results are given in Figure 2.

### Lung cancer

The lung cancer dataset has 918 genes, 73 samples, and 7
classes. The number of samples per class for this dataset is
small (less than 10) and unevenly distributed with 7 classes,
which makes the classification task more challenging. A
third-order polynomial kernel *K*(**x**_{i}, **x**_{j}) = (**x**_{i}′**x**_{j}
+1)^{3}, and an RBF kernel with *σ* = 1 were used in the
experiments. We chose the 100 most informative genes and 20
eigenvectors with our gene and model selection methods. The
computational results of KPC classification and other methods are
shown in Table 1. The results from SVMs for lung
cancer, lymphoma, and NCI shown in this paper are those from Ding
and Peng [11]. Six misclassifications with KPC and a polynomial
kernel are given in Table 2. Table 1
shows that KPC with a polynomial kernel performs better than
KPC with an RBF kernel.

### Lymphoma

The lymphoma dataset has 4026 genes, 96 samples, and 9
classes. A third-order polynomial kernel *K*(**x**_{i},
**x**_{j}) =
(**x**_{i}′**x**_{j} + 1)^{3} and an RBF kernel with σ = 1 were used in
our analysis. The 300 most informative genes and 21
eigenvectors corresponding to the largest eigenvalues were
selected with the gene selection method and AIC criteria. A
comparison of KPC with other methods is shown in
Table 3. Misclassifications of
lymphoma using KPC with a polynomial kernel are given
in Table 4. There are only 2 misclassifications of
class 1 using our KPC algorithm with a polynomial kernel, as
shown in Table 4. The KPC with a polynomial kernel
outperformed that with an RBF kernel in this experiment.

### NCI

The NCI dataset has 9703 genes, 60 samples, and 9 classes.
The third-order polynomial kernel *K*(**x**_{i}, **x**_{j}) =
(**x**_{i}′**x**_{j} + 1)^{3} and an RBF kernel with σ = 1 were chosen in
this experiment. The 300 most informative genes and 23
eigenvectors were selected with our simple gene selection method
and AIC criteria. A comparison of computational results is
summarized in Table 5 and the details of
misclassification are listed in Table 6. KPC
classification performs comparably to other
popular tools.

## DISCUSSIONS

We have introduced a nonlinear method, based on KPCA, for classifying gene expression data. The algorithm involves nonlinear transformation, dimension reduction, and logistic classification. We have illustrated the effectiveness of the algorithm in real-life tumor classifications. Computational results show that the procedure is able to distinguish different classes with high accuracy. Our experiments also show that KPC classification with second- and third-order polynomial kernels usually performs better than KPC classification with an RBF kernel. This phenomenon may be explained by the special structure of gene expression data. Our future work will focus on providing a rigorous theory for the algorithm and on explaining why KPC with a polynomial kernel performed better than KPC with other kernels.

## DISCLAIMER

The opinions expressed herein are those of the authors and do not necessarily represent those of the Uniformed Services University of the Health Sciences and the Department of Defense.

## ACKNOWLEDGMENT

D. Chen was supported by the National Science Foundation Grant CCR-0311252. The authors thank Dr. Hanchuan Peng of the Lawrence Berkeley National Laboratory for providing the NCI, lung cancer, and lymphoma data.

## References

*Bioinformatics*. 2003;19(5):571–578.

*Bioinformatics*. 2001;17(9):763–774.

*J Comput Biol*. 2000;7(6):805–817.

*Science*. 1999;286(4539):531–537.

*Proc Natl Acad Sci USA*. 1999;96(12):6745–6750.

*Proc Natl Acad Sci USA*. 2001;98(24):13784–13789.

*Nature*. 2000;403(6769):503–511.

*Nat Genet*. 2000;24(3):227–235.

**Hindawi Publishing Corporation**

*Journal of Biomedicine and Biotechnology*. 2005;2005(2):155.
