# Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data

^{1}Center for Toxicoinformatics, Division of Systems Toxicology, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA

^{*}To whom correspondence should be addressed. Tel: +1 310 4237363; Fax: +1 310 4237452; Email: charles.wang@cshs.org

*Nucleic Acids Research*, Vol. 33 No. 1 © Oxford University Press 2005; all rights reserved

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions@oupjournals.org.

## Abstract

DNA microarray technology provides a promising approach to the diagnosis and prognosis of tumors on a genome-wide scale by monitoring the expression levels of thousands of genes simultaneously. One problem arising from the use of microarray data is the difficulty of analyzing high-dimensional gene expression data, typically with thousands of variables (genes) and far fewer observations (samples), in which severe collinearity is often observed. This makes it difficult to apply classical statistical methods directly to microarray data. In this paper, total principal component regression (TPCR) is proposed to classify human tumors by extracting the latent variable structure underlying microarray data from the augmented subspace of both independent variables and dependent variables. One of the salient features of our method is that it takes into account not only the latent variable structure but also the errors in the microarray gene expression profiles (independent variables). The prediction performance of TPCR was evaluated by both leave-one-out and leave-half-out cross-validation using four well-known microarray datasets. The stability and reliability of the classification models were further assessed by re-randomization and permutation studies. A fast kernel algorithm was applied to decrease the computation time dramatically. (MATLAB source code is available upon request.)

## INTRODUCTION

Improvements in cancer classification are of great importance to cancer treatment. Tumors that have a similar histopathological appearance but a different clinical course and response to therapy are difficult to distinguish by traditional cancer diagnostic methods, which are based primarily on the morphological appearance of tumors (1). DNA microarray technology provides a powerful approach to the diagnosis and prognosis of various tumors on a genome-wide scale. By simultaneously monitoring the expression of thousands of genes in cells to obtain quantitative information about the complete transcription profile of cells, microarray technology makes therapeutics tailored to specific pathologies possible (1–4). Despite the usefulness of microarray technology, analyzing and understanding the resulting data remains a complex and challenging task. Microarray data analysis methods can be categorized roughly into unsupervised learning, including various clustering techniques such as self-organizing maps (5) and hierarchical clustering (6), and supervised learning, including various classification and prediction techniques (7,8). Some recent applications of supervised learning techniques include molecular classification of acute leukemia (1), classification of human cancer cell lines (9), support vector machine classification of cancer tissue samples (10), classifying cancers using artificial neural networks (11), mapping of the physiological state of cells and tissues and identification of important genes using Fisher discriminant analysis (12), tumor classification by polychotomous discrimination and quadratic discriminant analysis after dimension reduction using principal component analysis (PCA) or partial least squares (PLS) (13), PCA disjoint models for cancer classification (14), multi-class tumor classification by discriminant PLS and assessment of classification models (15), classification by incorporating PLS within the iteratively re-weighted least squares steps for multinomial or binary logistic regression (16) and classification using PLS with penalized logistic regression (17).

DNA microarray gene expression data are usually characterized by thousands of variables (genes) with far fewer observations (samples), resulting in a high degree of multicollinearity. This makes it difficult or even impossible to apply classical statistical methods directly to the analysis of microarray data. To tackle this kind of collinearity problem, latent variable methods, such as PCA (18) and PLS (19), have been developed to reduce the dimensionality of gene expression data and mitigate the collinearity. These methods assume that the independent variables (gene expression profiles) are inherently located in a low-dimensional linear subspace, i.e. they have an intrinsic latent variable structure. PCA attempts to find a set of orthogonal principal components (linear combinations of the original independent variables) that account for the maximum variation in the independent variables (18). Since the information about the sample classification provided by the dependent variable (class membership) is not taken into account in the extraction of the principal components, the performance of PCA in classification or prediction may not be satisfactory. This is one of the reasons why PLS was developed several decades ago to provide better performance in calibration and prediction than PCA (19). In addition, it is well known that microarray experiments are influenced by many potential sources of variation/error, which can be roughly classified into three categories: biological variation (genetic or environmental factors, pooled or non-pooled samples), technical variation (during extraction, labeling and hybridization) and measurement error (signal detection, etc.) (20–22). These kinds of variability have been analyzed using the ANOVA model to determine the sources and magnitudes of error/variation in gene expression profiles (22–25). However, in classification or prediction using microarray data, most of the statistical methods (e.g. PCA and PLS) relating the gene expression profiles, **X**-independent variables, to other information of interest, **Y**-dependent variables (e.g. tumor type or survival time), do not account for errors in the independent variables, which is one of the important characteristics of measured data.

To tackle the problems mentioned above, a novel method, total principal component regression (TPCR), is proposed that not only incorporates the information of the dependent variables into the construction of the latent structure but also takes into account the errors in both independent variables and dependent variables. A salient feature of TPCR is that it extracts the latent structure from the augmented subspace of both independent and dependent variables, using a weighted least squares fitting. This enables the proposed method to construct latent variables that optimally approximate the actual latent structure and to eliminate the collinearity to a certain degree.

## METHODS

### Total principal component regression

The basic goal of various projection or dimension-reduction approaches, for example PCA (18) and PLS (19), is to project the observations (samples) from the high-dimensional variables (genes) space to a low-dimensional subspace spanned by several linear combinations of the original variables, in order to satisfy a certain criterion. PCA attempts to find a set of orthogonal principal components to explain as much variance as possible in independent variables (**X**). The performance of PCA in classification may not be satisfactory from the predictive point of view, because there is no guarantee that the principal component representing the large variance in **X** should necessarily be the component strongly related to dependent variables (**Y**). To solve this problem, the information of dependent variables should be taken into account during the construction of orthogonal components. One way to do so is to maximize the sample covariance between the linear combination of dependent variables and the orthogonal component of independent variables, which is the essence of PLS (15,26,27), as shown by the objective criterion of PLS:
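In its usual form, with unit-norm weight vectors (the normalization is assumed here, following the standard PLS formulation), this criterion can be written as:

$$
\max_{w,\,c}\ \operatorname{cov}(\mathbf{X}w,\ \mathbf{Y}c) \quad \text{subject to } w^{T}w = 1,\ c^{T}c = 1
$$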

where *w* and *c* denote the weight vectors of **X** and **Y**, respectively. It has been proved that *w* and *c* are related to the following eigenvalue problems (26,27):
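Consistent with the singular-vector characterization given next, these eigenvalue problems can be written as follows (a reconstruction from the standard PLS derivation, assuming column-centered **X** and **Y**):

$$
\mathbf{X}^{T}\mathbf{Y}\mathbf{Y}^{T}\mathbf{X}\,w = a\,w, \qquad \mathbf{Y}^{T}\mathbf{X}\mathbf{X}^{T}\mathbf{Y}\,c = a\,c
$$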

where *a* is the maximum eigenvalue, so that the weight vectors *w* and *c* can be calculated as the first left and right singular vectors of **X**^{T}**Y**, respectively.

Another simple way to make use of the information in the dependent variables is to construct the orthogonal components from the augmented subspace of both **X** and **Y**, rather than from the subspace of **X** alone; that is, to find a low-dimensional subspace that best fits the subspace spanned by both **X** and **Y**. This is one of the motivations of TPCR.

In classical regression/prediction models, the independent variables are usually assumed to be non-stochastic; in other words, there is no error in the independent variables, or at least the error is negligible. However, it is well known that various kinds of variation/error may be introduced during a multi-step microarray experiment, and these are usually not negligible. In order to account for errors in both independent variables and dependent variables, the error-in-variables (EIV) model (28–32) was used in this paper:
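In the notation defined just below, the standard EIV model takes the following form (a sketch of the usual additive-error formulation; the linear relation is numbered 5 to match the reference in the text):

$$
\mathbf{X} = \tilde{\mathbf{X}} + \mathbf{E}_{X}, \qquad \mathbf{Y} = \tilde{\mathbf{Y}} + \mathbf{E}_{Y}
$$

$$
\tilde{\mathbf{Y}} = \tilde{\mathbf{X}}\mathbf{B} \tag{5}
$$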

where $\tilde{\text{X}}$ and $\tilde{\text{Y}}$ denote the systematic or unobservable true values for independent variables **X**_{N×P} and dependent variables **Y**_{N×M}, respectively; **E**_{X} and **E**_{Y} represent the random error matrices whose rows are assumed to be independently, identically distributed (i.i.d.) with common mean vector **0** and common covariance matrices σ^{2}_{X}**I**_{P} and σ^{2}_{Y}**I**_{M} (**I**_{P} and **I**_{M} denote appropriate identity matrices), respectively; **B** is a *P* × *M* matrix. Equation 5 implies an assumption of the EIV model, i.e. there exists a linear functional relationship between the systematic or true values of **X** and **Y** (29,30,32). In the case of microarray data analysis, in our opinion, the errors in the independent variables (gene expression profiles) may include the random fluctuations introduced by microarray technology itself, including measurement error and/or technical variation, while the biological variation (the true difference in gene expression) is embedded in the systematic part of independent variables.

Suppose the systematic or true values of the independent variables under observation are actually driven by a set of unobservable latent variables, i.e. they lie in a lower-dimensional linear subspace spanned by the latent variables. We can then define a column-wise orthonormal matrix **T**_{N×K} (*K* < *P* and **T**^{T}**T** = **I**, where superscript T denotes the transpose of a matrix), whose columns provide the basis for the subspace of both $\tilde{\text{X}}$ and $\tilde{\text{Y}}$:
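With the loading matrices **G** and **F** defined just below, this common-basis assumption reads (numbered to match the references that follow):

$$
\tilde{\mathbf{X}} = \mathbf{T}\mathbf{G} \tag{6}
$$

$$
\tilde{\mathbf{Y}} = \mathbf{T}\mathbf{F} \tag{7}
$$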

where **G**_{K×P} and **F**_{K×M} are the corresponding loading matrices for $\tilde{\text{X}}$ and $\tilde{\text{Y}}$, respectively. **T** can be seen as the common latent structure for both $\tilde{\text{X}}$ and $\tilde{\text{Y}}$. On substitution of Equations 6 and 7 into the previous EIV model, we obtain the EIV latent variable model (33–35):
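Carrying out that substitution in the additive-error form of the EIV model (assumed here to be **X** = $\tilde{\text{X}}$ + **E**_{X} and **Y** = $\tilde{\text{Y}}$ + **E**_{Y}) gives:

$$
\mathbf{X} = \mathbf{T}\mathbf{G} + \mathbf{E}_{X}, \qquad \mathbf{Y} = \mathbf{T}\mathbf{F} + \mathbf{E}_{Y}
$$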

The assumption behind this model is that there is a linear functional or structural relationship between the systematic or true parts of **X** and **Y**, and that this relationship can be linked by a set of unobservable underlying latent variables **T** (33–35). Just as the assumptions of ordinary least squares are not strictly satisfied in practical applications, the i.i.d. assumption about the error structure in the TPCR model may not be rigorously valid in some cases; however, this should not degrade the performance of the model much if the violation is not too severe. It is worth pointing out that a model taking into account the error information in the independent variables, even if incomplete, would provide more insight into the realistic characteristics and structure of the data, and better performance, than models incorporating no such information. Moreover, a violation of this assumption may be corrected partially by preprocessing techniques such as transformation and scaling.

A criterion for solving this EIV latent variable model is:
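Given i.i.d. errors with variances σ^{2}_{X} and σ^{2}_{Y}, a natural weighted least squares criterion, assumed here to be the intended form, is:

$$
\min_{\mathbf{T},\,\mathbf{G},\,\mathbf{F}}\ \frac{\|\mathbf{X}-\mathbf{T}\mathbf{G}\|_{F}^{2}}{\sigma_{X}^{2}} + \frac{\|\mathbf{Y}-\mathbf{T}\mathbf{F}\|_{F}^{2}}{\sigma_{Y}^{2}} \tag{10}
$$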

where ‖**M**‖_{F} denotes the Frobenius norm of a matrix, that is, ‖**M**‖_{F} = [tr(**MM**^{T})]^{1/2}.

To deal with the error in both independent variables and dependent variables, let meta parameter λ (≥0) be:
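A definition consistent with the augmented matrix **A** = (**X**, λ**Y**) used later, and with λ = 0 corresponding to σ^{2}_{X} = 0, is the ratio of the error standard deviations (stated here as an assumption):

$$
\lambda = \frac{\sigma_{X}}{\sigma_{Y}}
$$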

Then Equation 10 can be rewritten as
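Multiplying a variance-weighted criterion by σ^{2}_{X}, which does not change the minimizer, and taking λ = σ_{X}/σ_{Y} (both assumed forms), one obtains:

$$
\min_{\mathbf{T},\,\mathbf{G},\,\mathbf{F}}\ \|\mathbf{X}-\mathbf{T}\mathbf{G}\|_{F}^{2} + \lambda^{2}\,\|\mathbf{Y}-\mathbf{T}\mathbf{F}\|_{F}^{2}
$$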

Noting that **TT**^{T} is the projection matrix, it is easy to see from least square analysis (36–38) that:
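For a fixed orthonormal **T**, the least squares loadings follow from projecting onto the columns of **T** (a standard result, given here as the presumed content of this step):

$$
\mathbf{G} = \mathbf{T}^{T}\mathbf{X}, \qquad \mathbf{F} = \mathbf{T}^{T}\mathbf{Y}
$$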

Thus, Equation 13 is equivalent to
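Substituting the least squares loadings **G** = **T**^{T}**X** and **F** = **T**^{T}**Y** into the λ-weighted criterion, the chain of equalities can be sketched as follows (a reconstruction; the final line is numbered 19 to match the reference below):

$$
\begin{aligned}
&\min_{\mathbf{T}}\ \|\mathbf{X}-\mathbf{T}\mathbf{T}^{T}\mathbf{X}\|_{F}^{2} + \lambda^{2}\,\|\mathbf{Y}-\mathbf{T}\mathbf{T}^{T}\mathbf{Y}\|_{F}^{2} \\
&\quad= \min_{\mathbf{T}}\ \|(\mathbf{I}-\mathbf{T}\mathbf{T}^{T})\mathbf{X}\|_{F}^{2} + \|(\mathbf{I}-\mathbf{T}\mathbf{T}^{T})\,\lambda\mathbf{Y}\|_{F}^{2} \\
&\quad= \min_{\mathbf{T}}\ \|(\mathbf{I}-\mathbf{T}\mathbf{T}^{T})\mathbf{A}\|_{F}^{2}
\end{aligned}
\tag{19}
$$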

where **A** is the *N* × (*P* + *M*) augmented matrix of **X** and **Y**, that is, **A** = (**X**, **λY**).

Since (**I**−**TT**^{T}) is the *N* × *N* projection matrix onto the orthogonal complement of the subspace spanned by **T**, Equation 19 is minimized when the columns of **T** are the first *K* principal components of the augmented matrix **A**, i.e. those associated with the *K* largest singular values (18). Let the singular value decomposition (SVD) of **A** be
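In the notation defined just below, the decomposition has the usual form (**Σ** is then *N* × (*P* + *M*), an inference from the stated sizes of **U** and **V**):

$$
\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}
$$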

where left singular vectors **U** = (**u**_{1},…,**u**_{N}) ∈ *R*^{N×N} with **U**^{T}**U** = **I**_{N}; right singular vectors **V** = (**v**_{1},…,**v**_{(P+M)}) ∈ *R*^{(P+M)×(P+M)} with **V**^{T}**V** = **I**_{(P+M)}, **Σ** is a diagonal matrix containing singular values. Then let **T** be the first *K* columns of **U**:
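In symbols, taking the singular values in decreasing order (the usual SVD convention, assumed here):

$$
\mathbf{T} = (\mathbf{u}_{1},\dots,\mathbf{u}_{K})
$$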

Thus the estimates of $\tilde{\text{X}}$ and $\tilde{\text{Y}}$ can be obtained by:
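Since **TT**^{T} projects onto the latent subspace, the natural estimates, assumed here to be the intended ones, are:

$$
\hat{\mathbf{X}} = \mathbf{T}\mathbf{T}^{T}\mathbf{X}, \qquad \hat{\mathbf{Y}} = \mathbf{T}\mathbf{T}^{T}\mathbf{Y}
$$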

The regression coefficients estimated using TPCR are given by
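Given a linear relation between the true values of **X** and **Y** and the generalized inverse mentioned just below, a form consistent with the text (an assumption, since only the notation is described) is:

$$
\hat{\mathbf{B}} = \hat{\mathbf{X}}^{+}\,\hat{\mathbf{Y}}
$$

where $\hat{\mathbf{X}}$ and $\hat{\mathbf{Y}}$ denote the estimates of $\tilde{\text{X}}$ and $\tilde{\text{Y}}$.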

where superscript + denotes the generalized inverse of a matrix.
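The fitting steps of this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' MATLAB implementation: the function name `tpcr_fit`, the toy data, and the specific form of the coefficients (generalized inverse of the projected **X** times the projected **Y**) are assumptions, and **X** and **Y** are taken to be preprocessed already.

```python
import numpy as np

def tpcr_fit(X, Y, lam, K):
    """Sketch of a TPCR fit; X (N x P) and Y (N x M) are assumed preprocessed."""
    A = np.hstack([X, lam * Y])                        # augmented matrix A = (X, lambda*Y)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular values in decreasing order
    T = U[:, :K]                                       # first K left singular vectors
    X_hat = T @ (T.T @ X)                              # projection of X onto span(T)
    Y_hat = T @ (T.T @ Y)                              # projection of Y onto span(T)
    # Assumed coefficient form: generalized inverse of X_hat times Y_hat.
    # rcond discards the numerically zero singular values of the rank-K matrix X_hat.
    B_hat = np.linalg.pinv(X_hat, rcond=1e-10) @ Y_hat
    return T, X_hat, Y_hat, B_hat

# Hypothetical toy data: 12 samples, 40 "genes", 3 classes
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 4)
X = rng.normal(size=(12, 40))
Y = np.eye(3)[labels]                                  # one-hot class memberships
T, X_hat, Y_hat, B_hat = tpcr_fit(X, Y, lam=1.0, K=3)
```

With this choice the fitted values satisfy `X_hat @ B_hat == Y_hat` up to rounding, which is the sample analogue of a linear relation between the true parts of **X** and **Y**.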

### Fast kernel EVD algorithm for wide data

Computation time has always been a practical concern when implementing algorithms for multivariate data analysis. A microarray dataset typically consists of thousands of variables and fewer than 100 samples (*P* ≫ *N*). For such ‘wide’ data (39), the computation time needed for matrix decomposition using the classical SVD algorithm (e.g. in MATLAB: [**U**, **S**, **V**] = **svd** (**A**), where **svd** is the built-in function that calculates the singular vectors, **U** and **V**, and singular values **S**) may be quite long. The situation becomes even worse when cross-validation is applied to evaluate the methods, which is typically the case in tumor classification. Therefore, an efficient and fast algorithm is needed to calculate the singular vectors from microarray data.

Since the kernel matrix **AA**^{T} carries the same information (e.g. the same non-zero eigenvalues) as the covariance matrix **A**^{T}**A**, while its size (*N* × *N*) is much smaller than that of **A**_{N×(P+M)} and (**A**^{T}**A**)_{(P+M)×(P+M)} when *P* ≫ *N*, it is much faster to calculate the left singular vectors or eigenvectors **U** from (**AA**^{T})_{N×N} by eigenvalue decomposition (EVD) than from **A**_{N×(P+M)} by SVD (39).

Thus, the modified fast kernel EVD algorithm to calculate **T** is:
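Following the MATLAB-style notation used for the speed comparison later in the paper, the two steps can be written as follows (the explicit sorting is an assumption added for clarity, since **eig** does not guarantee descending eigenvalue order):

$$
[\mathbf{U},\ \boldsymbol{\Sigma}^{2}] = \mathbf{eig}(\mathbf{A}\mathbf{A}^{T}), \qquad \mathbf{T} = (\mathbf{u}_{1},\dots,\mathbf{u}_{K})
$$

with the columns of **U** ordered by decreasing eigenvalue.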

where **eig** is the built-in MATLAB function that calculates the eigenvectors and eigenvalues of a matrix, **U** denotes the eigenvectors of **AA**^{T} (equivalently, the left singular vectors of **A**) and **Σ**^{2} is a diagonal matrix containing the eigenvalues of **AA**^{T} (**Σ** contains the singular values of **A**).
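The kernel trick above can be checked numerically. The sketch below is a NumPy analogue, not the MATLAB code of the paper: it uses `numpy.linalg.eigh` (a symmetric eigensolver, swapped in for **eig**) on the *N* × *N* kernel matrix and compares the result with a classical SVD. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def scores_by_kernel_evd(A, K):
    """First K left singular vectors of A, computed from the N x N kernel matrix A A^T."""
    kernel = A @ A.T                                   # N x N, cheap when P >> N
    evals, evecs = np.linalg.eigh(kernel)              # symmetric solver, ascending eigenvalues
    order = np.argsort(evals)[::-1][:K]                # indices of the K largest eigenvalues
    singular_values = np.sqrt(np.clip(evals[order], 0.0, None))
    return evecs[:, order], singular_values

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 500))                         # "wide" toy matrix: N = 10, P + M = 500
T_evd, sv_evd = scores_by_kernel_evd(A, K=3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)       # reference: classical SVD
```

Individual eigenvectors may differ from the singular vectors by a sign, so the comparison is best made on the projection matrices `T_evd @ T_evd.T` and `U[:, :3] @ U[:, :3].T`, which agree to rounding error.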

### TPCR for discrimination

When TPCR is used for classification, the matrix of dependent variables (**Y**) contains the information about the class memberships, with element *y*_{ik} = 0 or 1 (*i* = 1,…,*N*; *k* = 1,…,*M*; where *N* and *M* are the number of samples and the number of tumor classes, respectively). If the *i*-th sample belongs to class *k*, then *y*_{ik} = 1, otherwise *y*_{ik} = 0.

The prediction of dependent variables on a new set of samples is made by:
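With $\hat{\mathbf{B}}$ denoting the TPCR regression coefficients (notation assumed here), the prediction is the linear map:

$$
\mathbf{Y}_{new} = \mathbf{X}_{new}\,\hat{\mathbf{B}}
$$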

where **X**_{new} is the gene expression profiles for the new set of samples, and **Y**_{new} is the predicted values for these samples. The identity of the class membership of each new sample (each row in **Y**_{new}) is assigned as the column index of the element with the largest predicted value in this row.
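The class-membership coding and the assignment rule can be illustrated as follows. This is a small sketch with hypothetical names (`one_hot`, `assign_classes`) and made-up coefficients; it assumes **X**_{new} has been preprocessed in the same way as the training data, and it returns 0-based column indices, whereas the text counts classes from 1.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Class-membership matrix Y with y_ik = 1 if sample i belongs to class k, else 0."""
    Y = np.zeros((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def assign_classes(X_new, B_hat):
    """Predict Y_new = X_new @ B_hat and pick the column with the largest value per row."""
    Y_new = X_new @ B_hat
    return Y_new, np.argmax(Y_new, axis=1)             # 0-based column indices

Y = one_hot([0, 2, 1], n_classes=3)
B_hat = np.array([[0.2, 0.9, 0.1],
                  [0.7, 0.1, 0.3]])                    # hypothetical 2-gene, 3-class coefficients
X_new = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
Y_new, classes = assign_classes(X_new, B_hat)          # classes -> [1, 0]
```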

### Meta parameter λ

As stated above, the relative magnitude of errors in the independent variables and dependent variables is given by meta parameter λ in TPCR model. It is important to choose the appropriate λ to obtain the best prediction performance. Considering two extreme cases for λ (≥0):

- If λ = 0, i.e. ${\sigma}_{\text{x}}^{2}=0$, which means no error is considered in **X** matrix, then the augmented matrix **A** = (**X**, λ**Y**) = **X**; therefore, **T** will be the principal components of **X** itself. In other words, the TPCR model degenerates to the classical principal component regression (PCR) model in this extreme case. Note that no information about **Y** is taken into account in the construction of latent variable in this case.
- If λ is very large, the variation of the columns of λ**Y** would be much larger than that of the columns of **X** in the augmented matrix **A**. In this case, the major principal components **T** will come largely from λ**Y**, since the PCA always projects to the directions showing the largest variation. Therefore, the prediction performance of **T** for the new sample **X**_{new} will be poor.
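Both extreme cases can be verified numerically on toy data. The sketch below uses hypothetical names and random data; it checks that λ = 0 reproduces the leading principal subspace of **X**, and that a very large λ yields scores that essentially span the columns of **Y**.

```python
import numpy as np

def tpcr_scores(X, Y, lam, K):
    """First K left singular vectors of the augmented matrix A = (X, lambda*Y)."""
    A = np.hstack([X, lam * Y])
    return np.linalg.svd(A, full_matrices=False)[0][:, :K]

def projector(T):
    """Projection matrix onto the column space of an orthonormal T."""
    return T @ T.T

rng = np.random.default_rng(2)
labels = np.repeat([0, 1, 2], 5)
X = rng.normal(size=(15, 60))
X = X - X.mean(axis=0)                                 # centered toy "expression" matrix
Y = np.eye(3)[labels]                                  # one-hot class memberships
K = 3

# lambda = 0: same subspace as the first K principal components of X (PCR)
T_zero = tpcr_scores(X, Y, 0.0, K)
U_x = np.linalg.svd(X, full_matrices=False)[0][:, :K]
same_as_pcr = np.allclose(projector(T_zero), projector(U_x), atol=1e-8)

# very large lambda: the scores are dictated by Y, not by X
T_big = tpcr_scores(X, Y, 1e6, K)
residual = np.linalg.norm(Y - projector(T_big) @ Y)    # close to zero
```

In the second case the scores carry almost no information about **X**, which is why predictions for new samples degrade.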

The choice of λ depends on prior experience with the data and on the computing power available. In practice, the optimal meta parameter λ (≥0) is chosen from an appropriate candidate set according to the leave-one-out cross-validation (LOOCV) procedure.
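As an illustration of this grid search, the sketch below builds the candidate set quoted later in the paper for the leukemia analysis and picks the value with the smallest LOOCV error. The function `select_lambda`, the tie-breaking rule (first minimum) and the stand-in error curve are assumptions; in practice `loocv_error` would run the complete LOOCV for each candidate.

```python
import numpy as np

# Candidate values: 0, 0.001, 0.01, 0.1, 1-20 (step 1), 40-100 (step 20), 200, 1000, 10 000, 100 000
lambda_grid = np.concatenate([
    [0, 0.001, 0.01, 0.1],
    np.arange(1, 21),
    np.arange(40, 101, 20),
    [200, 1000, 10_000, 100_000],
]).astype(float)

def select_lambda(grid, loocv_error):
    """Return the grid value with the smallest LOOCV error (first one on ties) and all errors."""
    errors = np.array([loocv_error(lam) for lam in grid])
    return grid[int(np.argmin(errors))], errors

# Hypothetical error curve standing in for a real LOOCV run (minimum near lambda = 10)
best_lam, errors = select_lambda(lambda_grid, lambda lam: abs(np.log10(lam + 1e-3) - 1.0))
```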

### Gene selection

Variable (gene) selection is important for the successful analysis of gene expression data since most of the genes do not provide useful information for classification, and including all of them in the modeling process will degrade the performance of the model. Therefore, non-informative genes should be removed before building a classification model. There exist different approaches for gene selection such as neighborhood analysis (1), significance analysis of microarrays (40), Wilks' lambda (12), *t*-score and critical score (13,41) and classifier feedback approach (14).

In this paper, the sum of squared correlation coefficients between the expression of a gene and each of the dependent variables is used to select the genes for analysis (15). For example, *g** = 100 means that the 100 genes with the largest sums of squared correlation coefficients are selected.
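The selection score can be sketched as follows, assuming Pearson correlation between each gene (column of **X**) and each column of the class-membership matrix **Y**. The function names and the toy data are hypothetical, and giving constant columns a score of zero is an added convention.

```python
import numpy as np

def gene_scores(X, Y):
    """Sum over classes of the squared Pearson correlation between each gene and y_k."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    denom = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Yc, axis=0))
    denom[denom == 0] = np.inf                         # constant columns get a score of zero
    R = (Xc.T @ Yc) / denom                            # P x M matrix of correlations
    return (R ** 2).sum(axis=1)

def select_genes(X, Y, g):
    """Indices of the g genes with the largest scores, best first."""
    return np.argsort(gene_scores(X, Y))[::-1][:g]

rng = np.random.default_rng(4)
labels = np.array([0, 0, 0, 1, 1, 1])
Y = np.eye(2)[labels]
X = rng.normal(size=(6, 10))
X[:, 3] = labels                                       # gene 3 tracks the class exactly
top = select_genes(X, Y, g=2)                          # gene 3 is ranked first
```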

### Assessment of prediction method by leave-one-out and leave-half-out CV

LOOCV has become a standard procedure to evaluate the performance of various classification methods in microarray data analysis. When gene selection or dimension reduction is used together with LOOCV, a common mistake in tumor classification using microarray gene expression profiles is to perform gene selection or dimension reduction before the CV loop. Such an incomplete LOOCV procedure is well known to be substantially biased and prone to generating spuriously good results, since information about all the samples, including the left-out one, is used for gene selection or dimension reduction before the CV loop (15,42–44). In this paper, the complete LOOCV (gene selection and dimension reduction within the CV loop) was applied.

LOOCV is nearly unbiased but often has unacceptably high variability, especially for datasets with a small number of samples (45–48). It may therefore give unreliably good prediction results (just as ordinary least squares, although unbiased, may give unreliable results because of large variance when there is severe collinearity in the data). External validation can provide some protection against over-fitting caused by LOOCV, but there may not be enough samples for an external test, especially in microarray data analysis (owing to the relatively high cost, etc.). In this case, a re-randomization study, leave-half-out cross-validation (LHOCV), provides an alternative approach to evaluate the prediction method more realistically (14,15,41,48). Briefly, the whole dataset, including the original training and test datasets, is pooled and split randomly (half/half) into a new training dataset and a new test dataset; the randomly generated training dataset is used to derive a classification model (select genes, reduce dimension and calculate regression coefficients) that is then applied to classify the corresponding new test dataset. This LHOCV procedure is repeated 100 times (splitting the pooled dataset randomly each time) to reduce the influence of chance.
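A complete LHOCV loop, with gene selection and model fitting carried out on the training half only, might look like the sketch below. It is an illustration under stated assumptions rather than the authors' code: the function names, the toy data, the centering of **Y** (with the class means added back at prediction time) and the coefficient form (generalized inverse of the projected **X** times the projected **Y**) are all assumptions.

```python
import numpy as np

def fit_predict(X_tr, y_tr, X_te, n_classes, g, K, lam):
    """Gene selection, centering, TPCR fit and prediction, all from training data only."""
    Y = np.eye(n_classes)[y_tr]                        # one-hot class memberships
    x_mean, y_mean = X_tr.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X_tr - x_mean, Y - y_mean                 # centering Y is an assumption
    # Gene selection: sum of squared correlations with the columns of Y
    denom = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Yc, axis=0))
    denom[denom == 0] = np.inf                         # constant columns score zero
    score = (((Xc.T @ Yc) / denom) ** 2).sum(axis=1)
    genes = np.argsort(score)[::-1][:g]
    Xs = Xc[:, genes]
    # TPCR: scores from the augmented matrix, then regression coefficients
    U = np.linalg.svd(np.hstack([Xs, lam * Yc]), full_matrices=False)[0]
    T = U[:, :K]
    B = np.linalg.pinv(T @ (T.T @ Xs), rcond=1e-10) @ (T @ (T.T @ Yc))
    Y_pred = y_mean + (X_te[:, genes] - x_mean[genes]) @ B
    return np.argmax(Y_pred, axis=1)

def lhocv_error_rates(X, y, n_classes, g, K, lam, n_splits=100, seed=0):
    """Misclassification rate on the held-out half for each random half/half split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    rates = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        train, test = perm[: n // 2], perm[n // 2:]
        pred = fit_predict(X[train], y[train], X[test], n_classes, g, K, lam)
        rates.append(np.mean(pred != y[test]))
    return np.array(rates)

# Hypothetical toy data: 3 classes, 20 samples each, 5 informative genes per class
rng = np.random.default_rng(3)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 100))
for c in range(3):
    X[y == c, 5 * c: 5 * c + 5] += 4.0
rates = lhocv_error_rates(X, y, n_classes=3, g=20, K=3, lam=1.0, n_splits=20)
```

Keeping gene selection inside `fit_predict` is what makes the procedure complete: the held-out half never influences which genes are chosen.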

## RESULTS AND DISCUSSION

### Acute leukemia data

The well-known leukemia dataset was measured by Golub *et al*. (1) using Affymetrix high-density oligonucleotide microarrays containing probes for 6817 human genes and has become a benchmark for the evaluation of various cancer classification algorithms. The original training dataset consisted of 38 bone marrow samples from acute leukemia patients, including 19 B-cell acute lymphoblastic leukemia (B-ALL), 8 T-cell acute lymphoblastic leukemia (T-ALL) and 11 acute myeloid leukemia (AML). The original independent (test) dataset consisted of 24 bone marrow and 10 peripheral blood samples (19 B-ALL, 1 T-ALL and 14 AML). The gene expression data were log-transformed and centered to have zero mean across samples during the cross-validation process.

The original 38 training samples and 34 test samples were pooled together and then classified by TPCR using LOOCV with *g** = 50, 100, 200, 500 and 1000 genes selected (λ was taken from the values: 0, 0.001, 0.01, 0.1, 1–20 with interval of 1, 40–100 with interval of 20, 200, 1000, 10 000 and 100 000 in this paper) and the results are shown in Table 1. One ALL sample (#17; error rate 1.39%; λ = 19) was misclassified by TPCR. In our previous study (15), discriminant PLS (D-PLS) was used to classify the same dataset using the same gene selection method and resulted in at least 2 (2.78%) misclassifications. This dataset was also analyzed by Nguyen and Rocke (13) using polychotomous discrimination and quadratic discriminant analysis together with dimension reduction by PCA or PLS and at least 8 (11.1%) and 3 (4.2%) samples were misclassified using PCA and PLS for dimension reduction, respectively (from A2 procedure; note that both A0 and A1 procedures are incomplete LOOCV and should not have been used to compare with our results).

### Hereditary breast cancer data

The gene expression profiles of primary breast tumor samples from seven carriers of the BRCA1 mutation, eight carriers of the BRCA2 mutation and seven patients with sporadic cases of breast cancer were monitored with a microarray of 6512 cDNA clones of 5361 genes, and a total of 3226 genes were selected for analysis (49). The gene expression data were centered to have zero mean across samples during the cross-validation process.

Classification results based on TPCR using LOOCV are presented in Table 2. At best, four samples were misclassified (error rate 18.18%, λ = 9), which is better than the best result (22.7%, 5 misclassifications) of D-PLS (15) and the result (5 misclassifications) obtained by Hedenfalk *et al*. (49). In the study of Nguyen and Rocke (13), at least six (27.3%) misclassifications were found using the A2 procedure. The percentage of samples misclassified by TPCR and other methods was rather high, reflecting the difficulty of accurately classifying these 22 tumor samples based on gene expression profiles. This may be due to the inherent lack of discriminating power of the dataset (e.g. the small sample size, diagnosis error or the lack of differentiating power for the expression profiles of 3226 genes to separate these 22 samples) (15,49). Another possibility could be that the underlying structure of the data (e.g. inner non-linear relationship) was not characterized by the methods used (15). As pointed out before by Hedenfalk *et al*. (49), ‘the use of microarray covering a larger proportion of the genome and the analysis of larger numbers of tumor may make possible a more precise molecular classification of breast cancer’.

### Small, round blue cell tumor data

The small, round blue cell tumors (SRBCTs) of childhood, including neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS), are difficult to distinguish due to their similar appearances in routine histology. Khan *et al*. (11) monitored the gene expression profiles of 6567 genes for these four types of malignancies using cDNA microarrays and reduced the number of genes to 2308 by quality filtering for a minimal level of expression. The original 63 training samples included both tumor biopsy materials (13 EWS and 10 RMS) and cell lines (10 EWS, 10 RMS, 12 NB and 8 Burkitt lymphomas (BL, a subset of NHL)). The original test samples contained both tumors (5 EWS, 5 RMS and 4 NB) and cell lines (1 EWS, 2 NB and 3 BL). This dataset was centered to have zero mean across samples for analysis.

The pooled-together 83 samples were classified by TPCR using LOOCV with *g** = 50, 100, 200, 500 and 1000 genes selected and the results are presented in Table 3. All the samples were correctly classified using 3–6 TPCR components with *g** = 100, 200 and 500 genes selected (λ = 200), indicating a good prediction performance of TPCR on this dataset. The same result (0 misclassification) was also obtained in other studies (15,50–53), indicating good class separability of this dataset. Khan *et al*. (11) also found 0 misclassification using artificial neural networks for 88 samples, including 5 non-SRBCT samples. However, the dimension reduction by PCA was performed using all the 88 samples before the validation procedure, which may suffer from bias.

### NCI60 data

Using cDNA microarrays containing 9703 cDNA clones representing ∼8000 unique genes, Ross *et al*. (9) and Scherf *et al*. (54) studied gene expression in the 60 human cancer cell lines used in the NCI anticancer drug screening program. The 60 human cell lines were derived from tumors from a variety of tissues and organs and, in contrast to clinical tumors, have been characterized pharmacologically by treatment with >70 000 different agents, one at a time and independently (54). As in the study of Nguyen and Rocke (13), five cancer types were used for multi-class classification: eight melanoma, eight renal, six leukemia, seven colon and six CNS. A subset of 1376 genes selectively filtered from the initial 9703 genes and 40 molecular characteristics (targets) individually assessed by various laboratories was used to discriminate the different types of cancers (54). Since there are some missing gene expression values in this dataset, genes with ≤2 missing values were used for classification, with the missing values (1 or 2) replaced by the median of the gene expression (13). This resulted in a gene expression dataset with 35 samples and 1299 genes. This dataset was centered to have zero mean across samples in the analysis. TPCR was then applied to this dataset with *g** = 50, 100, 200, 500 and 1000 genes selected.

The misclassification results using TPCR are given in Table 4. The best result was one misclassified sample (#15 ME:LOXIMVI, 2.86%, λ = 20), the same as the best result using D-PLS (15), while the best results obtained by Nguyen and Rocke (13) were 2 (5.7%) and 3 (8.6%) misclassifications using PCA and PLS for dimension reduction, respectively (A2 procedure). The classification performance of TPCR on this dataset is good, given the small sample-to-class ratio (35 samples and 5 classes). It is worth noting that ME:LOXIMVI, although supposedly of melanoma origin, was reported to lack melanin and other useful markers for the identification of melanoma cells (55) and showed a characteristic pattern different from that of the other seven melanoma lines (9,54). Furthermore, in an earlier study involving the clustering of the 60 cancer cell lines based purely on their sensitivity to tens of thousands of potential anticancer compounds, ME:LOXIMVI was also found to be different from other melanoma cell lines; instead, it was found to be more similar to a group of colon cancer cell lines (56).

### Evaluation of TPCR more realistically by LHOCV

As demonstrated in the above LOOCV studies with four well-publicized microarray datasets, TPCR showed better or at least comparable prediction performance compared with other published methods. However, LOOCV may provide unreliably good prediction results because of its high variance (45,46). To assess the prediction performance of TPCR more realistically, TPCR was further compared, using the LHOCV procedure, with the well-known PLS method, which has been widely used in microarray analysis (13,15–17,41,57–60), including tumor classification. This LHOCV procedure was repeated 100 times and the average misclassification error rates over 100 re-randomizations for the four microarray datasets are shown in Table 5.

It is obvious that the error rates using LHOCV were higher than those using LOOCV since the size of the dataset is small and only half of the total samples were used to construct the classification model under LHOCV procedure. On the other hand, it can be observed that the minimum LHOCV error rate obtained by TPCR is consistently lower than that obtained by PLS for each of the four microarray datasets. Actually, with the same number of genes selected, the minimum LHOCV error rate by TPCR (bold number in Table 5) is, in most cases, lower than that by PLS, indicating that the prediction performance of TPCR is better than or at least comparable to the well-known PLS method.

### Assessment of the reliability of classification models by permutation analysis

Given the relatively small sample size of microarray datasets in cancer classification, especially for hereditary breast cancer dataset and NCI60 dataset, it is important to evaluate the stability and reliability of a classification model. There are various statistical methods to assess the reliability when there are not enough samples available to perform external validation (13–15,41). In this paper, so-called permutation or shuffle studies were performed to compare the misclassification error rates using TPCR with those expected at random. Initially, the class memberships of all the samples were permuted (the rows of **Y** matrix were shuffled) while keeping the gene expression profiles (**X** matrix) unchanged; then the newly generated random dataset with shuffled **Y** and unchanged **X** was analyzed by TPCR using exactly the same LOOCV procedure as applied before to the original dataset (gene number, TPCR component number and meta parameter λ were the same as those chosen to obtain the minimum error rates for original datasets, as shown in Tables 1–4). This procedure was carried out 100 times and the distributions of the error rates over 100 permutations for the four datasets are plotted in Figure 1 and compared with the minimum misclassification error rates obtained from original datasets. It is obvious that, in all cases, the estimated error rate obtained by TPCR for original dataset is significantly lower than what would be expected at random. This kind of permutation analysis can be used to test whether a complete cross-validation is performed as well as whether there is some real structure or classification information inside the dataset. If an incomplete LOOCV is applied or a dataset with no classification information (e.g. random dataset) is analyzed, the estimated error rate obtained from original dataset will be close to that calculated from the shuffled dataset.
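The shuffle study can be sketched generically: the class labels are permuted while **X** is left unchanged, and the same cross-validation routine is rerun on each shuffled dataset. In the sketch below, `permutation_study` accepts any cross-validation error function; the nearest-centroid LOOCV used in the demonstration is a stand-in classifier chosen for brevity, not TPCR, and all names and data are hypothetical.

```python
import numpy as np

def loocv_nearest_centroid(X, y):
    """Stand-in classifier (not TPCR): LOOCV error rate of a nearest-centroid rule."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                  # leave sample i out
        classes = np.unique(y[keep])
        centroids = np.array([X[keep][y[keep] == c].mean(axis=0) for c in classes])
        pred = classes[np.argmin(np.linalg.norm(centroids - X[i], axis=1))]
        errors += int(pred != y[i])
    return errors / len(y)

def permutation_study(X, y, cv_error, n_perm=100, seed=0):
    """Error rate on the real labels and on n_perm label shuffles (X is left unchanged)."""
    rng = np.random.default_rng(seed)
    observed = cv_error(X, y)
    shuffled = np.array([cv_error(X, rng.permutation(y)) for _ in range(n_perm)])
    return observed, shuffled

# Hypothetical toy data: two well-separated classes of 10 samples each
rng = np.random.default_rng(5)
y = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 5))
X[y == 1] += 6.0
observed, shuffled = permutation_study(X, y, loocv_nearest_centroid, n_perm=30)
```

When the data carry real class structure and the cross-validation is complete, `observed` lies well below the bulk of `shuffled`; if the two are similar, either the dataset has no classification information or information has leaked into the CV loop.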

**...**

A second randomization analysis for assessing the stability and reliability of a classification model examines the distribution of the error rates over 100 re-randomizations obtained using LHOCV (14,15), which is especially useful when the available data are limited. If the classification model is unstable under the perturbations introduced by the 100 re-randomizations, the estimate of its predictive ability is unlikely to be reliable (46). The distributions of the misclassification error rates over 100 re-randomizations using LHOCV for the four microarray datasets are shown in Figure 2 (the genes, the number of TPCR components and the meta parameter λ were chosen according to the minimum averaged LHOCV error rate for each dataset). The classification models on the Leukemia (Figure 2a) and SRBCT (Figure 2c) datasets are substantially stable, with small averaged error rates and variances, whereas the classification model on the hereditary breast cancer dataset (Figure 2b) is unstable, with a relatively large averaged error rate and variance, possibly because of the small sample size or the inherent difficulty of discriminating these tumors. Taken together, the reliability of the four classification models decreases in the order SRBCT > Leukemia > NCI60 > Hereditary breast cancer, the same order as obtained previously using PLS (15).
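
The LHOCV re-randomization can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not the authors' code: a nearest-centroid rule again stands in for TPCR, the toy data are hypothetical, the random half-half split is not stratified by class (the text does not say whether the original splits were), and `lhocv_errors` is a name introduced here.

```python
import random
import statistics


def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT TPCR): pick the class whose mean profile is closest."""
    sums, counts = {}, {}
    for row, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        dist = sum((s / counts[label] - v) ** 2 for s, v in zip(acc, sample))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def lhocv_errors(x, y, n_runs=100, seed=0):
    """Repeat a random half-half split n_runs times; return the test error of each run."""
    rng = random.Random(seed)
    n = len(x)
    errors = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[: n // 2], idx[n // 2:]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        wrong = sum(nearest_centroid_predict(train_x, train_y, x[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return errors


# Hypothetical toy data: 20 samples, 2 "genes", two well-separated classes.
X = [[0.1 * i, 0.2 * i] for i in range(10)] + [[5 + 0.1 * i, 5 + 0.2 * i] for i in range(10)]
Y = [0] * 10 + [1] * 10

errors = lhocv_errors(X, Y)
mean_error = statistics.mean(errors)
error_variance = statistics.pvariance(errors)
print(mean_error, error_variance)
```

A small mean together with a small variance over the runs corresponds to the "substantial stability" described for the Leukemia and SRBCT models, while a wide spread of error rates would indicate a model that depends strongly on which samples happen to fall in the training half.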

### Comparison of the speed of classic SVD algorithm and fast kernel EVD algorithm

The classic SVD algorithm ([**U**, **S**, **V**] = **svd** (**X**), a built-in function in MATLAB) and a fast kernel EVD algorithm derived from the work of Wu *et al*. (39), [**U**, Σ^{2}] = **eig** (**XX**^{T}), were compared in terms of the time needed to calculate the left singular vectors **U** from the four microarray gene expression profiles (**X** matrix); the results are shown in Table 6 (Intel Pentium 4 1.70 GHz CPU; 512 MB RAM; Windows 2000 Pro; MATLAB R14). The fast kernel EVD algorithm is much faster than the SVD algorithm, with a speed increase ranging from 147 times (NCI60) to 696 times (Leukemia). In general, the speed-up grows as the size of the dataset increases. Because the time-consuming cross-validation procedure was used both to select the optimal meta parameter λ and to evaluate the method, this speed-up is substantial, especially for LHOCV, in which 100 re-randomization runs were performed.
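
The reason the eigendecomposition of **XX**^{T} yields the same **U** can be written out in one line. The following is a standard identity, stated here under the assumption that **X** has *n* rows (samples) and *p* columns (genes) with *n* ≪ *p*, and that a thin SVD is used so that Σ is square and diagonal (Σ corresponds to **S** in the MATLAB call above); the symbols *n* and *p* are introduced here for illustration.

$$
\mathbf{X} = \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{T},\quad \mathbf{V}^{T}\mathbf{V} = \mathbf{I}
\;\;\Longrightarrow\;\;
\mathbf{X}\mathbf{X}^{T}
= \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{T}\mathbf{V}\,\boldsymbol{\Sigma}^{T}\mathbf{U}^{T}
= \mathbf{U}\,\boldsymbol{\Sigma}^{2}\,\mathbf{U}^{T}
$$

The eigenvectors of the *n* × *n* matrix **XX**^{T} are therefore the left singular vectors **U**, and its eigenvalues are the squared singular values Σ^{2}. If the right singular vectors are also needed, they can be recovered for the non-zero singular values as

$$
\mathbf{V} = \mathbf{X}^{T}\,\mathbf{U}\,\boldsymbol{\Sigma}^{-1}
$$

Since the eigenproblem involves only an *n* × *n* matrix, its cost depends on the number of samples rather than the number of genes, which is consistent with the large speed-ups reported in Table 6.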

## CONCLUSIONS

In this paper, based on the EIV model, we proposed a novel method called TPCR to perform multi-class classification of tumor samples by taking into account the errors in microarray gene expression profiles. In addition, our method takes into account the latent variable structure underlying the microarray data to mitigate the collinearity that results from high-dimensional microarray data, a salient advantage over the classical EIV model. Four well-known microarray datasets were used to demonstrate that the performance of TPCR is better than or at least comparable to that of other published methods in the classification of tumors. A major advantage of TPCR over other methods is that it takes into account not only the errors in both independent variables and dependent variables but also the latent structure from the augmented subspace of independent and dependent variables. A fast kernel EVD algorithm was applied to dramatically decrease the time needed to extract the latent variables. The error rates of the classification models obtained by TPCR from the original datasets were shown to be significantly lower than what would be expected by chance from shuffling or permuting the class memberships of the samples. By randomly splitting each original dataset half-and-half into training and test samples 100 times, the stability and reliability of the classification models were assessed from the distributions of the error rates over the 100 re-randomizations and were shown to decrease in the order SRBCT > Leukemia > NCI60 > Hereditary breast cancer.

## Acknowledgments

We are grateful to Dr Robert R. Delongchamp of the FDA's National Center for Toxicological Research for carefully reviewing the manuscript. Y.T. and C.W. were partially supported by General Clinical Research Center grant M01-RR00425 from the National Center for Research Resources. Funding to pay the Open Access publication charges for this article was provided by Burns and Allen Research Institute Microarray core, Cedars-Sinai Medical Center.

## REFERENCES

*k*-nearest neighbor classifiers. IEEE Trans. Patt. Anal. Mach. Intell. 1991;13:285–289.
