# Exploring the within- and between-class correlation distributions for tumor classification

^{a}Department of Statistics, University of California, 8125 Math Sciences Building, Box 951554, Los Angeles, CA, 90095-1554; and

^{b}Institute of Statistical Science, Academia Sinica, 128, Academia Road, Sec. 2, Taipei 115, Taiwan

^{1}To whom correspondence should be addressed. E-mail: kcli@stat.ucla.edu.

## Abstract

To many biomedical researchers, effective tumor classification methods such as the support vector machine often appear to be a black box, not only because the procedures are complex but also because the required specifications, such as the choice of a kernel function, lack clear guidance either mathematically or biologically. As commonly observed, samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. But can this well-received observation lead to a useful procedure for classification and prediction? To address this issue, we first conceived a statistical framework and derived general conditions to serve as the theoretical foundation supporting the aforementioned empirical observation. Then we constructed a classification procedure that fully utilizes the information obtained by comparing the distributions of within-class correlations with between-class correlations via Kullback–Leibler divergence. We compared our approach with many machine-learning techniques by applying them to 22 binary- and multiclass gene-expression datasets involving human cancers. The results showed that our method performed as efficiently as the support vector machine and the naive Bayesian classifier and outperformed other learning methods (decision trees, linear discriminant analysis, and *k*-nearest neighbor). In addition, we conducted a simulation study and showed that our method would be more effective if the arriving new samples are subject to the often-encountered problems of baseline shift or increased noise level. Our method can be extended to general classification problems in which only the similarity scores between samples are available.

**Keywords:** cancer research, gene expression

Classification and prediction of cancer type and stage are important aspects of cancer diagnosis and treatment. Recent research suggests that molecular diagnostics via high-throughput data could make cancer prediction more objective and accurate and may help clinicians choose the most appropriate treatments (1–10). Various cancer classification and prediction methods for gene-expression data, including Fisher’s linear discriminant analysis, the support vector machine, and various Bayesian classification procedures, are off-the-shelf procedures taken from the rich statistics and computer science literature. The rationale behind such methods is seldom backed up by biological intuition. Consequently, the intriguing and subtle nature of each procedure often makes it appear like a black box to most biological practitioners. Although such practices have led to a number of important discoveries in microarray studies, end-users are often confused about when and why they should prefer one procedure to another, and theoreticians cannot offer clear guidance on the choice of methods. For example, the successful use of the support vector machine requires the proper specification of a kernel function. Unfortunately, the critical decision of kernel selection has often been made somewhat arbitrarily via rounds of trial and error, reinforcing the black-box image of an already complex procedure for most biologists.

To conceptualize the problem, suppose there are *K* tumor classes and that there are *P* genes under consideration. These genes may come from a list specified by biologists to meet their aims of study. For example, a team specializing in metastasis may be interested in knowing how informative a given set of metastasis-related genes is for predicting cancer types. Alternatively, the list may come from screening by some available marker gene selection approach. For a tumor sample, we denote the relevant gene-expression data by the vector $\mathbf{X}$. We would like to use $\mathbf{X}$ to predict the class label *Y* of this sample. From the Bayesian point of view, a new tumor sample should be assigned to the class that achieves the highest posterior probability:

$$P(Y = j \mid \mathbf{X} = \mathbf{x}) = \frac{p_j f_j(\mathbf{x})}{\sum_{l=1}^{K} p_l f_l(\mathbf{x})},$$

where $f_j(\mathbf{x})$ is the *p*-dimensional conditional density function of the predictor variable in the *j*th class and *p*_{j} is the proportion of the *j*th tumor class in the target population (*j* = 1,2,…,*K*).

In the above conditional-density-driven formulation of cancer classification, accurately and efficiently estimating the high-dimensional conditional density function of each class from a relatively small number of samples is challenging. Linear discriminant analysis (LDA) (11) assumes that the conditional densities of each class are multivariate Gaussian (normal) and uses the sample mean vector and sample covariance matrix calculated from the training samples to estimate these population parameters. However, the multivariate Gaussian assumption is hard to justify. Alternatively, nonparametric density estimation methods, such as kernel estimation or *K* nearest neighbor (KNN) (12, 13), can be used, but they are known to suffer from the curse of dimensionality.

Here, we present an alternative formulation of the cancer classification problem that is based on the biologists’ intuition that samples within the same tumor class tend to have more similar gene-expression profiles. The correlation between sample profiles tends to be higher if two samples come from the same tumor class (Fig. S1). Indeed, this is the same intuition behind many clustering methods used for discovering new cancer subtypes (2, 4, 6, 9, 14–21). However, to the best of our knowledge, this biological intuition has not been exploited in the classification problem. Our goal is to lay the theoretical groundwork and present a method, termed “distribution based classification” (DBC), that fully utilizes the information obtained by comparing the distribution of the correlations computed within the same class with the distribution of those computed between classes (*SI Theoretical Justification of DBC* and Fig. S2).

In DBC, the prediction of which class a new sample belongs to is based on a comparison of correlation distributions. Denote by *f*_{k}^{new} the distribution of the correlations between the new sample profile and each training sample profile in class *k*. This distribution is compared with two types of correlation distributions generated from the training samples only: *f*_{kk}, the distribution of correlations between any two training samples from class *k*, and *f*_{kj}, the distribution of correlations between one sample from class *k* and another sample from class *j*, where *k* ≠ *j*. The Kullback–Leibler (KL) distance is used to measure the similarity between two distributions. We assign the new sample to the *k*th class when the KL distance between *f*_{k}^{new} and *f*_{kk} is smaller than the KL distance between *f*_{k}^{new} and *f*_{kj} for every *j* ≠ *k*.
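The comparison of two correlation distributions by KL distance can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not say how the distributions are estimated, so the shared-bin histogram estimator, the number of bins, the smoothing constant `eps`, and the direction of the divergence are all assumptions of this sketch, and the function name `kl_distance` is hypothetical.

```python
import numpy as np

def kl_distance(sample_p, sample_q, bins=20, eps=1e-6):
    """Approximate KL(p || q) from two samples of correlations in [-1, 1]."""
    # Shared bin edges make the two histograms directly comparable.
    edges = np.linspace(-1.0, 1.0, bins + 1)
    p, _ = np.histogram(sample_p, bins=edges)
    q, _ = np.histogram(sample_q, bins=edges)
    # Additive smoothing keeps the logarithm finite for empty bins
    # (an assumption of this sketch, not part of the source).
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Synthetic correlations: within-class values sit to the right of between-class values.
rng = np.random.default_rng(0)
within = np.clip(rng.normal(0.7, 0.1, 500), -1, 1)     # plays the role of f_kk
between = np.clip(rng.normal(0.2, 0.1, 500), -1, 1)    # plays the role of f_kj
new_vs_k = np.clip(rng.normal(0.68, 0.1, 50), -1, 1)   # new sample versus class k

d_within = kl_distance(new_vs_k, within)
d_between = kl_distance(new_vs_k, between)
print(d_within < d_between)  # class k would claim the new sample
```

In this toy setting the new sample's correlations resemble the within-class distribution, so the within-class KL distance is the smaller of the two.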

## Application on Cancer Classification Using Microarray Data

The performance of DBC on cancer classification using real microarray gene-expression data is investigated in this section. A total of 22 publicly available microarray datasets related to studies of human cancer were collected. Basic information and references for these data are summarized in Tables S1 and S2 for binary-class and multiple-class datasets, respectively (*SI Supplementary Tables for 22 Gene Expression Datasets*).

Five commonly used classification methods were investigated and compared with our proposed DBC method: *K* nearest neighbor (KNN), support vector machines (SVM), linear discriminant analysis (LDA), the decision tree method (DT), and the naive Bayesian method (NB) (*SI Other Machine Learning Methods*).

### Data Preprocessing and Marker Gene Selection.

A large number of genes have nearly constant expression across samples. We applied a preprocessing procedure to all 22 gene-expression datasets for a preliminary selection of genes before the cross-validation test.

For each gene-expression dataset, 100 simulations of 3-fold cross-validation (*SI Cross-Validation Test*) were used to compare the performance of different classification methods. In each simulation, two-thirds of the samples were used as the training set and the remaining one-third were retained as the testing set. For each class in the training set, the most representative genes were selected simply based on the t-statistic. The top *P* (*P* = 20 was used in this study) genes with the largest absolute value of the t-statistic were selected as the marker genes for this class. The representative genes for all classes were then pooled to construct a gene set for classification. Detailed information about data preprocessing and marker gene selection is available in *SI Data Preprocessing and Marker Gene Selection*; see also Fig. S3.
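The per-class marker gene selection described above can be sketched as follows. The text states only that genes are ranked by the absolute t-statistic for each class, so comparing each class against all remaining samples and using the unequal-variance (Welch) form of the statistic are assumptions of this sketch; `select_marker_genes` is a hypothetical name. In a cross-validation run the function would be applied to the training fold only.

```python
import numpy as np

def select_marker_genes(X, y, top_p=20):
    """Pool, over classes, the top_p genes with the largest |t| (class versus rest).

    X: (n_samples, n_genes) training expression matrix; y: class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    selected = set()
    for label in np.unique(y):
        a, b = X[y == label], X[y != label]
        # Welch two-sample t-statistic, computed gene by gene.
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        t = (a.mean(axis=0) - b.mean(axis=0)) / np.where(se > 0, se, np.inf)
        top = np.argsort(-np.abs(t))[:top_p]
        selected.update(top.tolist())
    return sorted(selected)

# Toy data: 3 classes, 200 genes, with one planted marker for classes 0 and 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))
y = np.repeat([0, 1, 2], 10)
X[y == 0, 5] += 4.0
X[y == 2, 17] += 4.0
markers = select_marker_genes(X, y, top_p=3)
print(markers)
```

Because the per-class lists are pooled, the final gene set contains at most `top_p` times the number of classes genes, and fewer when the lists overlap.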

### Prediction Results on 22 Real Microarray Datasets.

The average accuracy rates over the 100 simulations of 3-fold cross-validation are reported in Table 1 for binary-class datasets and in Table 2 for multiple-class datasets. On average, on the 10 binary-class gene-expression datasets, SVM (84.36%), NB (84.05%), and KNN (84.44%) performed slightly worse than, but comparably with, DBC (84.69%), and all four outperformed the other classification methods, LDA (75.35%) and DT (78.36%). For the 12 multiple-class datasets, DBC achieved an average accuracy rate of 91.75%, which was comparable with NB (92.80%) and SVM (93.61%) and better than the remaining classification methods (LDA 81.15%, DT 81.55%, and KNN 90.90%).

### Simulation Results for Classifying Independent Testing Samples.

Quite often, samples in an independent test set may have a higher noise level than those in the training set, and the baseline level may also be shifted by an unknown amount. In this section, we describe two simulation studies conducted to demonstrate that DBC performed much better than other methods when encountering such problems.

We considered the two-class (e.g., normal tissue versus cancer tissue) problem. In the first study, we simulated gene-expression data from the following model:

$$y_{jk} = a_j + b_j x_k + e_{jk},$$

where *j* is the index for marker genes, *j* = 1,2,…,*J*; *x*_{k} ∈ {0,1} is the class label of the *k*th sample, *k* = 1,2,…,*K*; and *e*_{jk} is the error term following a normal distribution with mean zero and standard deviation (SD) *σ*.

The training dataset included 50 marker genes and 50 samples in each class and was simulated with *a*_{j} and *b*_{j} uniformly distributed in [-10,10] and [-2,2], resp., and *σ*_{training} = 2. The testing dataset included 10 samples in each class. Four types of testing datasets were simulated to reflect four different situations in practice. First, the testing data share exactly the same parameters as the training data, as if the testing experiment were done under exactly the same conditions as the training experiment. Second, the testing data have an increased noise level due to either chemical or electrical reasons; hence the variance of the error term increases (e.g., *σ*_{testing} = 4). Third, the testing data have an overall baseline shift due to overloading or other reasons; hence a random shift (e.g., one following a normal distribution with mean five and SD two) is added to each expression value. Finally, testing data with both an increased noise level and an overall baseline shift were simulated. Table 3 shows the average accuracy rates over 100 simulations for classifying testing samples simulated under each situation. Generally speaking, KNN (93.97% and 94.10%), NB (95.44% and 96.00%), and SVM (93.55% and 94.40%) performed slightly worse than, but comparably with, DBC (96.57% and 97.25%) in the 3-fold cross-validation test and on independent testing data I. When the independent testing data had an increased noise level, an overall baseline shift, or both, as frequently occurs in practice, the performance of all methods declined, but DBC outperformed the other methods by an even larger margin. In particular, DBC was quite robust when classifying samples with an overall baseline shift.
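The four testing scenarios can be generated with a short simulation along the following lines. This is a sketch under stated assumptions: the response is taken to be gene intercept plus class effect plus normal error, as the surrounding definitions suggest, and the baseline shift is drawn once per sample and added to all of that sample's genes, which is one reading of "overall baseline shifting". The function name `simulate` and the scenario labels are hypothetical.

```python
import numpy as np

def simulate(a, b, n_per_class, sigma, rng, shift_mean=None, shift_sd=None):
    """Draw y[j, k] = a[j] + b[j] * x[k] + e[j, k] for two balanced classes."""
    x = np.repeat([0, 1], n_per_class)                    # class label of each sample
    noise = rng.normal(0.0, sigma, size=(a.size, x.size))
    y = a[:, None] + b[:, None] * x[None, :] + noise
    if shift_mean is not None:
        # Assumption: one random baseline offset per sample, shared by all genes.
        y = y + rng.normal(shift_mean, shift_sd, size=(1, x.size))
    return y, x

rng = np.random.default_rng(0)
J = 50                                                     # number of marker genes
a = rng.uniform(-10, 10, J)
b = rng.uniform(-2, 2, J)

train, x_train = simulate(a, b, 50, 2.0, rng)              # sigma_training = 2
tests = {
    "I: same conditions": simulate(a, b, 10, 2.0, rng),
    "II: more noise": simulate(a, b, 10, 4.0, rng),        # sigma_testing = 4
    "III: baseline shift": simulate(a, b, 10, 2.0, rng, 5.0, 2.0),
    "IV: noise and shift": simulate(a, b, 10, 4.0, rng, 5.0, 2.0),
}
for name, (y_test, _) in tests.items():
    print(name, y_test.shape, round(float(y_test.mean()), 2))
```

The gene-level parameters `a` and `b` are drawn once and reused, so every testing set differs from the training set only in its noise level and baseline.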

For the second simulation study, we started with a real dataset, the prostate gene-expression data (2). During the 3-fold cross-validation, the one-third of the data used for testing was disturbed in the following ways:

1. adding random normal noise with mean 0 and SD $\hat{\sigma}$, where $\hat{\sigma}$ was estimated by the sample standard deviation of the testing data,
2. creating a shifted baseline expression level by adding a constant, generated from the normal distribution with mean 4 and SD $\hat{\sigma}$, to all samples in the training set, and
3. doing both (1) and (2).

The results from 100 simulations are summarized in Table 4. We found that DBC performed best on all three disturbed testing datasets.

## Conclusion and Discussion

In this paper, we introduced the classification method DBC and compared it with several methods, including KNN, SVM, LDA, DT, and NB, on 22 different microarray gene-expression datasets related to human cancer. Our work supports the idea that biological intuition can be an important impetus for constructing effective statistical methods.

The DBC method for cancer classification is rooted in the appealing empirical observation that samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. Fig. S1 shows such an example for the leukemia dataset. The mathematical foundation of this empirical observation is provided in *SI Theoretical Justification of DBC*.

The successful application of DBC may depend on an implicit assumption of homogeneity within each tumor class. Before using our method, researchers should examine the pattern of each within-class correlation distribution closely. Intuitively, multiple modes or clusters may indicate the existence of tumor subtypes within a tumor class. We may conduct cluster analysis within each class to detect subclasses. A multiclass classifier should then be constructed to distinguish each subclass, and we may modify our method by constructing the distribution of correlations for each subclass.

One extension of our method is to situations in which only the similarity scores between members are available. In the setting presented in the introduction, the characteristics of a sample are measured by a *p*-dimensional variable. But there are also situations in which researchers are able to make sample-to-sample comparisons and obtain similarity scores between samples. Suppose the information about the *n* samples in the training set has been summarized by a similarity score matrix of dimension *n* by *n*, denoted as *S*. When a new sample arrives, the similarity scores between the new sample and all the samples in the training set are obtained and presented as an *n*-dimensional vector ***s***. For such situations, many popular classification methods, such as LDA, DT, and SVM, which build their classifiers on the characteristics associated with each member, may no longer be applicable. One may apply the *K*-nearest neighbor (KNN) method of classification, which assigns the new sample to the class with the majority of votes among the members with the *K* largest similarity scores. Clearly, KNN utilizes only the information in ***s*** for prediction and totally ignores the information in *S*. In contrast, DBC fully utilizes the information in both *S* and ***s***: it compares the distribution of the scores in ***s*** with the distribution of within-class similarity scores and the distribution of between-class similarity scores.
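To make the contrast concrete, here is a minimal sketch of KNN operating on similarity scores alone. The names `knn_from_similarity`, `s_new`, and `labels` are hypothetical, and ties are resolved by whichever class the counter happens to list first, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def knn_from_similarity(s_new, labels, k=3):
    """Majority vote among the k training samples most similar to the new sample."""
    # Rank training samples by similarity to the new sample, largest first.
    order = sorted(range(len(s_new)), key=lambda i: s_new[i], reverse=True)
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

labels = ["A", "A", "A", "B", "B", "B"]
s_new = [0.9, 0.2, 0.3, 0.8, 0.7, 0.6]   # similarity of the new sample to each training sample
pred = knn_from_similarity(s_new, labels, k=3)
print(pred)  # the three largest scores belong to one "A" and two "B" samples
```

The function never looks at the training similarity matrix: only the new sample's score vector enters the vote, which is the limitation noted in the text.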

DBC can sidestep the difficult high-dimensional density estimation problem frequently encountered because of the abundance of predictor variables. It utilizes the difference between two types of one-dimensional distributions: the distribution of within-class sample-correlations is shifted more to the right than the distribution of between-class sample-correlations. Hence, instead of estimating the posterior probability that the new sample belongs to the *k*th class given the observed new sample profile, we assign the new sample to the *k*th class if the KL distance between *f*_{kk} and the distribution of the correlations between the new sample and the class-*k* training samples is smaller than the KL distance between that distribution and each *f*_{kj} (*j* ≠ *k*). Here, we do not need any parametric assumption about the conditional densities of each class, and we do not even need to prespecify any parameter to compute the above KL distances. Thus, DBC can be viewed as a parameter-free algorithm. However, there is still room for modification. For instance, the KL distance may be replaced by other equally popular measures of distance between probability distributions.

From a theoretical standpoint, it is not easy to demonstrate a clear, significant improvement over existing methods. One anonymous referee suggested a starting point: examining a basic two-sample problem with unequal means and variances from normal distributions (*SI Application of DBC on 1-d Example*). Our method is seen to behave like a compromise between the LDA rule and the quadratic discriminant analysis (QDA) rule. Alternatively, we may use some idealized examples to challenge the existing methods. For this purpose, we constructed a geometric family of classification problems with increasing dimensionality (*SI Classification on Overlapped Circles or Spheres* and Fig. S4). Our method outperformed all existing ones by a large margin (Tables S3, S4, S5, and S6). These examples also highlight a weakness of SVM, and they can themselves serve as useful counterexamples for understanding the limitations of SVM.

In the study of gene-expression data, marker gene selection is at least as important as the classification procedure. Good marker genes not only help predict class identities accurately but also provide information for understanding the underlying development of cancer or other diseases. Many methods have been proposed to find such marker genes. The most commonly used methods are based on univariate ranking using either the t-statistic or the ratio of between- to within-group sums of squares. One concern is whether the results we reported are sensitive to the number of top genes selected. We conducted a simulation analysis (*SI The Number of Marker Genes Selected for Each Class*) to examine how different methods perform when different numbers of marker genes are selected. The analysis showed that classification methods that build their classifiers on similarity scores, such as KNN and DBC, are more robust and less dependent on the number of marker genes selected than methods based on the absolute gene-expression values, such as SVM, DT, and LDA.

Quite often researchers may want to study only a selective set of genes. Here is a prostate-cancer example. Prostate cancer is the third most common cancer and causes 6% of cancer deaths in men, and it has been extensively studied (22–26). Listed below are 40 well-known prostate-related gene symbols:

KLK3; AR; AZGP1; CDKN1A; COPEB; CYPEB; CYP3A4; HPN; KAI1; MSR1; MUC1; MXI1; MYC; PIM; HPC1; PTEN; RNASEL; SRD5A2; PCAP; MAD1L1; HPCX; CHEK2; ELAC2; CAPB; HPC2; PRCA1; KLF6; CHEK2; LFS2; EPHB2; P53; RB; RAS; CDKN2; POLB; hPMS2; P27; CDKN1B; KAI-1; GSTPI.

Among them, the first 23 genes come from Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM). The remaining gene symbols come from *Principles and Practice of Oncology* (27). These genes are believed to be related to prostate cancer. We used these symbols as keywords to search the prostate gene-expression data (2) and found 45 entries because some genes have multiple probes. We applied DBC to these probes and compared its performance with that of the other procedures. Fig. 1 shows the accuracy of these classification methods based on 100 simulations of 3-fold cross-validation. On average, DBC achieved the highest accuracy rate (92.91%), followed by SVM and NB with average accuracy rates of 90.08% and 90.68%, respectively. KNN and DT performed somewhat worse, with average accuracy rates of 88.82% and 87.88%. LDA performed worst, with the lowest average accuracy rate of 84.87%. Furthermore, NB and DBC were the most robust to the random split into test and training sets: the standard deviations of their accuracy rates were as low as 0.0129 and 0.0149, resp., whereas the standard deviations of the other methods were almost double or triple those values (KNN 0.0213, SVM 0.0237, DT 0.0299, and LDA 0.0346). Hence, DBC appears to be the most reliable for classification using disease-related genes alone.

## Methods

### Theoretical Justification.

In this section, we derive the general conditions under which the expected within-class sample-correlation is higher than the expected between-class sample-correlation. These conditions will serve as the theoretical justification for our proposed method.

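For simplicity, suppose we have only two classes (e.g., normal tissue versus cancer tissue). The class indicator variable is *Y* ∈ {0,1}. Let $f_0$ represent the conditional density function of the expression profiles $\mathbf{X}$ of the *p* genes for samples from the normal tissue class, and let the mean vector be $\boldsymbol{\mu} = E(\mathbf{X})$ and the covariance matrix be $\Sigma_0 = \mathrm{Cov}(\mathbf{X})$. For the tumor tissue class, the notations are changed to $f_1$, $\boldsymbol{\theta}$, and $\Sigma_1$, resp., where $E(\cdot)$ and $\mathrm{Cov}(\cdot)$ calculate the mean vector and the covariance matrix of $\mathbf{X}$ when treating $\mathbf{X}$ as a *p*-dim random variable. Now let $\mathbf{X}_1$ and $\mathbf{X}_2$ be the gene-expression profiles of any two samples and consider their Pearson correlation:

$$r(\mathbf{X}_1, \mathbf{X}_2) = \frac{\sum_{i=1}^{p} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum_{i=1}^{p} (X_{1i} - \bar{X}_1)^2 \, \sum_{i=1}^{p} (X_{2i} - \bar{X}_2)^2}},$$

where $\bar{X}_1$ and $\bar{X}_2$ are the averages of the two profiles over the *p* genes.

The distribution of $r(\mathbf{X}_1, \mathbf{X}_2)$, where both $\mathbf{X}_1$ and $\mathbf{X}_2$ come from the normal class, is called the distribution of within-normal-class sample-correlation. Similarly, we can define the distribution of within-cancer-class sample-correlation and the distribution of between-class sample-correlation, resp., when both $\mathbf{X}_1$ and $\mathbf{X}_2$ come from the cancer class and, resp., when $\mathbf{X}_1$ and $\mathbf{X}_2$ come from different classes. Then we can derive that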

- The first-order approximation to the expected value of the distribution of within-normal-class sample-correlations is denoted *ρ*_{0}.
- The first-order approximation to the expected value of the distribution of within-cancer-class sample-correlations is denoted *ρ*_{1}.
- The first-order approximation to the expected value of the distribution of between-class sample-correlations is denoted *ρ*_{01}.

Intuitively, the expected within-class sample-correlation depends on the ratio of two variances in each class: one measures the variance of the error, i.e., how far a sample profile departs from its mean vector, and the other measures the variance of the mean vector. Hence, if the variance of the error is relatively large compared with the variance of the mean vector, the within-class sample-correlation tends to be lower. On the other hand, if the variance of the error is relatively small, the within-class sample-correlation tends to be higher. After some simple mathematical derivation, we have:

If, and only if,

we have *ρ*_{01} < min{*ρ*_{0}, *ρ*_{1}}.

This theorem provides conditions under which the expected between-class sample-correlation is smaller than the expected within-class sample-correlations.

Because *ρ*_{μ,θ} measures the correlation between two mean vectors, it has a natural bound between -1 and 1. First, when these two mean vectors are negatively correlated or uncorrelated, i.e., -1 ≤ *ρ*_{μ,θ} ≤ 0, the above constraint is naturally satisfied. Second, even when these two mean vectors are positively correlated, if the two expected within-class sample-correlations are close to each other (i.e., if the ratios of the error variance to the mean variance are about the same in different classes), the above condition is still easy to satisfy. Please refer to *SI Theoretical Justification of DBC* for the detailed derivation.
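A heuristic sketch shows how a condition of this kind can arise. It assumes that each profile equals its class mean vector plus independent errors with a common variance within each class. The symbols $v_\mu$, $v_\theta$ (across-gene variances of the two mean vectors) and $\sigma_0^2$, $\sigma_1^2$ (error variances) are introduced here purely for illustration. The expressions are this sketch's own simplification and are not taken from the derivation in the supporting information.

```latex
% Hypothetical notation: v_\mu, v_\theta are the across-gene variances of the
% two mean vectors; \sigma_0^2, \sigma_1^2 are the error variances per class.
\begin{align*}
\rho_0 &\approx \frac{v_\mu}{v_\mu + \sigma_0^2}, \qquad
\rho_1 \approx \frac{v_\theta}{v_\theta + \sigma_1^2}, \\
\rho_{01} &\approx \frac{\rho_{\mu,\theta}\sqrt{v_\mu v_\theta}}
  {\sqrt{(v_\mu + \sigma_0^2)(v_\theta + \sigma_1^2)}}
  = \rho_{\mu,\theta}\sqrt{\rho_0\,\rho_1}, \\
\rho_{01} < \min\{\rho_0, \rho_1\}
  &\iff \rho_{\mu,\theta} <
  \min\left\{\sqrt{\rho_0/\rho_1},\ \sqrt{\rho_1/\rho_0}\right\}.
\end{align*}
```

Under these simplifying assumptions the right-hand bound is positive, so the inequality holds automatically when *ρ*_{μ,θ} ≤ 0, and the bound approaches 1 when *ρ*_{0} and *ρ*_{1} are close, which agrees with the two cases discussed in the text.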

### Statistical Model and Hypothesis.

Without loss of generality, assume there are *K* classes and *N* training samples. For each sample, *P* variables are measured. A similarity score is specified to measure the similarity between two sample profiles; for example, the correlation coefficient is a commonly used similarity score in microarray data analysis. In the learning step, the distribution of the similarity scores between sample pairs from the *i*th and the *j*th class is estimated and denoted as *f*_{ij}. A reference similarity distribution matrix is then constructed and denoted as {*f*_{ij}}_{K×K} (*i*,*j* = 1,2,…,*K* and *f*_{ij} = *f*_{ji}). Intuitively, samples in the same class should share certain common features; hence the similarity scores between them tend to be higher than the similarity scores for samples from different classes. In other words, the distribution *f*_{ii} (*i* = 1,2,…,*K*) is more likely to shift to the right, whereas *f*_{ij} (*i* ≠ *j*) is more likely to shift to the left.

In the prediction step, to predict the class label of a new testing sample, *K* parallel hypothesis tests are performed simultaneously, i.e., H_{0}: the new sample belongs to the *k*th class, versus H_{1}: the new sample does not belong to the *k*th class, where *k* = 1,…,*K*. For the *k*th hypothesis test, the distribution of the similarity scores between the new sample and the reference training samples in the *k*th class is computed and denoted as *f*_{k}^{new}. Intuitively, if the new sample is really from the *k*th class, then *f*_{k}^{new} should be close to *f*_{kk} by the construction of *f*_{kk}. Otherwise, *f*_{k}^{new} should be close to one of the *f*_{kj} (*j* ≠ *k*). Hence, the decision rule is to assign the new sample to the *k*th class if the difference between *f*_{k}^{new} and *f*_{kk} (measured in KL distance) is smaller than the difference between *f*_{k}^{new} and each *f*_{kj} (*j* ≠ *k*); otherwise, the sample is declared to “not belong to the *k*th class.” Furthermore, we combine the discriminating power of the above *K* “weaker” tests to make a more reliable prediction. When the results of these *K* tests are consistent, a class label is assigned (based on the testing results); otherwise, an “unclassified” label is assigned.

For unclassified samples, a more sophisticated weighted KL decision rule can be constructed. The weights, denoted as *w*_{k}, are equally distributed among the classes that claim this “unclassified” sample, or among all classes if no class claims it. Hence, the decision rule is to assign the new sample to the class that achieves the smallest weighted KL distance, either among all classes claiming this “unclassified” sample or among all classes if no class claims it. Fig. 2 illustrates the flowchart of DBC.
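The learning and prediction steps can be put together in a short sketch. It is an illustration under stated assumptions rather than the authors' software: distributions are estimated with shared-bin histograms (20 bins and a small smoothing constant, both arbitrary), the divergence is taken from the new-sample distribution to the reference distribution, and the fallback for "unclassified" samples simply picks the candidate class with the smallest within-class KL distance instead of reproducing the weighted rule, whose formula is not given in the text. The names `kl` and `dbc_predict` are hypothetical.

```python
import numpy as np

def kl(p_scores, q_scores, bins=20, eps=1e-6):
    """Histogram-based KL(p || q) for similarity scores in [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, bins + 1)
    p = np.histogram(p_scores, bins=edges)[0] + eps
    q = np.histogram(q_scores, bins=edges)[0] + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def dbc_predict(X_train, y_train, x_new):
    """X_train: (n_samples, n_genes); x_new: (n_genes,). Returns a class label."""
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    S = np.corrcoef(X_train)                      # training sample-by-sample correlations
    s_new = np.array([np.corrcoef(x_new, row)[0, 1] for row in X_train])

    def ref(i, j):
        # Reference scores f_ij taken from the training similarity matrix.
        block = S[np.ix_(y_train == i, y_train == j)]
        if i == j:                                # within class: distinct pairs only
            return block[np.triu_indices_from(block, k=1)]
        return block.ravel()

    d_within, claims = {}, []
    for k in classes:
        f_new = s_new[y_train == k]               # new sample versus class k
        d_within[k] = kl(f_new, ref(k, k))
        d_between = min(kl(f_new, ref(k, j)) for j in classes if j != k)
        if d_within[k] < d_between:               # the k-th test accepts H0
            claims.append(k)

    if len(claims) == 1:                          # the K tests are consistent
        return claims[0]
    # Simplified fallback (assumption): smallest within-class KL distance among
    # the claiming classes, or among all classes if none claims the sample.
    candidates = claims if claims else list(classes)
    return min(candidates, key=lambda k: d_within[k])

# Toy example: two classes whose mean profiles are unrelated.
rng = np.random.default_rng(0)
mu = rng.normal(0, 3, size=(2, 40))
y_train = np.repeat([0, 1], 15)
X_train = mu[y_train] + rng.normal(0, 1, size=(30, 40))
x_new = mu[1] + rng.normal(0, 1, size=40)
label = dbc_predict(X_train, y_train, x_new)
print(label)
```

Because the procedure touches the data only through `S` and `s_new`, the same logic applies when a precomputed similarity matrix is supplied in place of correlations.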

### Similarity Score Measurement and Data Transformation.

We have used correlation to construct similarity scores. Because correlation is invariant under scale and intercept changes, the effect of an overall shift or rescaling in the new sample is eliminated. Furthermore, a transformation is needed to limit the effect of outliers on the similarity score. In the cancer classification examples, each sample profile is standardized with a normal score transformation, i.e., the expression value of the *i*th gene is replaced by the normal quantile Φ^{-1}(·) of its rescaled rank, where Φ(.) is the cumulative normal distribution, *R*_{i} is the rank of the *i*th gene within the profile, and *P* is the total number of marker genes. With the normal score transformation, DBC is invariant to any monotone transformation of the data, including most data normalization and baseline correction methods. The simulation study showed that the method with this normal score transformation was quite robust when classifying new samples with an increased noise level, an overall shift, or both.
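A normal score transformation and its invariance property can be illustrated as follows. The exact rescaling of the ranks is not recoverable from the text, so this sketch assumes the common choice *R*_{i}/(*P* + 1); other offsets are also in use. The function name `normal_scores` is hypothetical, and ties are broken by position.

```python
from statistics import NormalDist
import numpy as np

def normal_scores(profile):
    """Replace each value by the normal quantile of its rank within the profile."""
    profile = np.asarray(profile, dtype=float)
    P = profile.size
    ranks = np.empty(P)
    ranks[np.argsort(profile, kind="stable")] = np.arange(1, P + 1)
    inv = NormalDist().inv_cdf                    # inverse of the standard normal CDF
    # Assumption: ranks are mapped into (0, 1) as R_i / (P + 1).
    return np.array([inv(r / (P + 1)) for r in ranks])

x = np.array([2.0, 150.0, 7.5, 0.3, 31.0])
z = normal_scores(x)
z_shifted = normal_scores(3.0 * x + 10.0)         # baseline shift and rescaling
z_logged = normal_scores(np.log(x))               # another increasing transformation
print(np.allclose(z, z_shifted), np.allclose(z, z_logged))
```

Only the ordering of the genes within a profile survives the transformation, which is why any increasing transformation of the raw values, including an extreme outlier such as the value 150 above, leaves the scores unchanged.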

## Acknowledgments.

This work was supported in part by National Science Foundation Grants DMS0406091 and DMS-0707160 and by National Science Council Grants from Taiwan NSC95-3114-P-002-005-Y, NSC97-2627-P-001-003, and NSC98-2314-B-001-001-MY3.

## Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0910140107/DCSupplemental.

Data deposition footnote: The software and datasets are available at http://www.stat.ucla.edu/~wxl/research/microarray/DBC/index.htm

Published in *Proceedings of the National Academy of Sciences of the United States of America*, 2010 Apr 13; 107(15):6737.
