Proc Natl Acad Sci U S A. Apr 13, 2010; 107(15): 6737–6742.
Published online Mar 25, 2010. doi:  10.1073/pnas.0910140107
PMCID: PMC2872377
Statistics, Biophysics and Computational Biology

Exploring the within- and between-class correlation distributions for tumor classification

Abstract

To many biomedical researchers, effective tumor classification methods such as the support vector machine often appear like a black box, not only because the procedures are complex but also because the required specifications, such as the choice of a kernel function, lack clear guidance, either mathematical or biological. As commonly observed, samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. But can this well-received observation lead to a useful procedure for classification and prediction? To address this issue, we first conceived a statistical framework and derived general conditions that serve as the theoretical foundation for the aforementioned empirical observation. We then constructed a classification procedure that fully utilizes the information obtained by comparing the distributions of within-class correlations with those of between-class correlations via the Kullback–Leibler divergence. We compared our approach with many machine-learning techniques by applying them to 22 binary- and multiclass gene-expression datasets involving human cancers. The results showed that our method performed as efficiently as the support vector machine and naïve Bayes and outperformed the other learning methods (decision trees, linear discriminant analysis, and k-nearest neighbor). In addition, we conducted a simulation study showing that our method is more effective when the arriving new samples are subject to the often-encountered baseline-shift or increased-noise problems. Our method can be extended to general classification problems in which only the similarity scores between samples are available.

Keywords: cancer research, gene expression

Cancer type and stage classification/prediction is an important aspect of cancer diagnosis and treatment. Recent research suggests that molecular diagnostics based on high-throughput data can make cancer prediction more objective and accurate and may help clinicians choose the most appropriate treatments (1–10). The various cancer classification and prediction methods for gene-expression data, including Fisher's linear discriminant analysis, the support vector machine, and various Bayesian classification procedures, are off-the-shelf procedures taken from the rich statistics and computer science literature. The rationale behind such methods is seldom backed up by biological intuition. Consequently, the intriguing and subtle nature of each procedure often appears like a black box to most biological practitioners. While such practices have led to a number of important discoveries in microarray studies, not only are the end-users often confused about when and why to prefer one procedure over another, but the theoreticians also cannot offer a clear spectrum to guide the choice of methods. For example, the successful use of the support vector machine requires the proper specification of a kernel function. Yet, unfortunately, the critical decision of kernel selection has often been made somewhat arbitrarily via rounds of trial and error, reinforcing the black-box image of an already complex procedure for most biologists.

To conceptualize the problem, suppose there are K tumor classes and P genes under consideration. These genes may come from a list specified by biologists to meet their aims of study. For example, a team specializing in metastasis may be interested in knowing how informative a given set of metastasis-related genes is for predicting cancer types. Alternatively, the list may come from screening by some available marker gene selection approach. For a tumor sample, we denote the relevant gene-expression data by $X = (X_1, X_2, \ldots, X_P)$. We would like to use $X$ to predict the class label $Y$ of this sample. From the Bayesian point of view, a new tumor sample should be assigned to the class that achieves the highest posterior probability:

$$ P(Y = j \mid X) = \frac{p_j f_j(X)}{\sum_{l=1}^{K} p_l f_l(X)}, \qquad j = 1, 2, \ldots, K, $$

where $f_j(x)$ is the $P$-dimensional conditional density function of the predictor variable in the $j$th class and $p_j$ is the proportion of the $j$th tumor class in the target population ($j = 1, 2, \ldots, K$).
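As a concrete illustration of this rule, here is a minimal sketch of a plug-in version; the Gaussian class densities and the helper name bayes_assign are illustrative assumptions, not part of the method developed in this paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_assign(x_new, class_means, class_covs, class_priors):
    """Assign x_new to the class j that maximizes p_j * f_j(x_new) (plug-in Bayes rule)."""
    scores = [prior * multivariate_normal.pdf(x_new, mean=m, cov=c)
              for m, c, prior in zip(class_means, class_covs, class_priors)]
    return int(np.argmax(scores))
```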

In the above conditional-density-driven formulation of cancer classification, accurately and efficiently estimating the high-dimensional conditional density function of each class from a relatively small number of samples is a major challenge. Linear discriminant analysis (LDA) (11) assumes the conditional density of each class is multivariate Gaussian (normal) and uses the sample mean vector and sample covariance matrix calculated from the training samples to estimate these population parameters. However, the multivariate Gaussian assumption is hard to justify. Alternatively, nonparametric density estimation methods, such as kernel estimation or K-nearest neighbor (KNN) (12, 13), can be used, but they are known to suffer from the curse of dimensionality.

Here, we present an alternative formulation of the cancer classification problem that is based on the biologists' intuition that samples within the same tumor class tend to have more similar gene-expression profiles. The correlation between sample profiles tends to be higher if two samples come from the same tumor class (Fig. S1). Indeed, this is the same intuition behind many clustering methods used for discovering new cancer subtypes (2, 4, 6, 9, 14–21). However, to the best of our knowledge, this biological intuition has not been exploited in the classification problem. Our goal is to lay down the theoretical groundwork and present a method, termed "distribution based classification" (DBC), that fully utilizes the information obtained by comparing the distributions of the correlations computed within the same class and those computed between classes (SI Theoretical Justification of DBC and Fig. S2).

In DBC, the prediction of which class a new sample belongs to is based on the comparison between correlation distributions. Denote by $f_k^{\text{new}}$ the distribution of the correlations between the new sample profile and the training sample profiles in class k. This distribution is compared with two types of correlation distributions generated from the training samples only: $f_{kk}$, the distribution of the correlations between any two training samples from class k, and $f_{kj}$, the distribution of the correlations between one sample from class k and one sample from class j, where k ≠ j. The Kullback–Leibler (KL) distance is applied here to measure the similarity between two distributions. We assign the new sample to the kth class when the KL distance between $f_k^{\text{new}}$ and $f_{kk}$ is smaller than the KL distances between $f_k^{\text{new}}$ and $f_{kj}$ ($j \neq k$).
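The pieces of this rule can be sketched as follows. This is a minimal illustration under simplifying assumptions (Pearson correlation as the similarity score and a simple histogram estimate of the KL distance); the helper names are illustrative, and this is not the authors' released implementation.

```python
import numpy as np

def within_correlations(A):
    """Pearson correlations between all distinct pairs of rows (samples) of A."""
    C = np.corrcoef(A)                          # samples in rows, genes in columns
    return C[np.triu_indices_from(C, k=1)]

def between_correlations(A, B):
    """Pearson correlations between every row of A and every row of B."""
    return np.array([np.corrcoef(a, b)[0, 1] for a in A for b in B])

def kl_hist(p_samples, q_samples, bins=20, eps=1e-6):
    """Crude histogram estimate of the KL distance D(p || q) between two 1-d correlation samples."""
    p, _ = np.histogram(p_samples, bins=bins, range=(-1, 1), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(-1, 1), density=True)
    p, q = (p + eps) / (p + eps).sum(), (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def class_k_claims(x_new, k, classes):
    """kth test: does class k claim the new sample?  classes: dict label -> samples x genes array."""
    f_new_k = between_correlations(x_new[None, :], classes[k])
    d_within = kl_hist(f_new_k, within_correlations(classes[k]))          # distance to f_kk
    d_between = min(kl_hist(f_new_k, between_correlations(classes[k], classes[j]))
                    for j in classes if j != k)                           # distance to closest f_kj
    return d_within < d_between, d_within
```

How the K per-class tests are combined, including the handling of "unclassified" samples, is described under Methods, where a sketch of that decision step is also given.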

Application to Cancer Classification Using Microarray Data

The performance of DBC on cancer classification using real microarray gene-expression data is investigated in this section. A total of 22 publicly available microarray datasets related to studies of human cancer were collected. The basic information and references for these data are summarized in Tables S1 and S2 for the binary-class and multiple-class datasets, respectively (SI Supplementary Tables for 22 Gene Expression Datasets).

Five commonly used classification methods were investigated and compared with our proposed DBC method: K-nearest neighbor (KNN), the support vector machine (SVM), linear discriminant analysis (LDA), decision trees (DT), and naïve Bayes (NB) (SI Other Machine Learning Methods).

Data Preprocessing and Marker Gene Selection.

A large number of genes have nearly constant expression across samples. We therefore applied a preprocessing procedure to all 22 gene-expression datasets for a preliminary selection of genes before the cross-validation test.

For each gene-expression dataset, 100 simulations of 3-fold cross-validation (SI Cross-Validation Test) were used to compare the performance of the different classification methods. In each simulation, two-thirds of the samples were used as the training set and the remaining one-third was retained as the testing set. For each class in the training set, the most representative genes were selected simply on the basis of the t-statistic: the top P genes (P = 20 in this study) with the largest absolute t-statistics were selected as the marker genes for that class. The representative genes for all classes were then pooled to construct a gene set for classification. Detailed information about data preprocessing and marker gene selection is available in SI Data Preprocessing and Marker Gene Selection; see also Fig. S3.
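A minimal sketch of this selection step, assuming a one-versus-rest contrast for the per-class t-statistic (the exact contrast is described in SI Data Preprocessing and Marker Gene Selection):

```python
import numpy as np
from scipy.stats import ttest_ind

def select_markers(X, y, top_p=20):
    """X: samples x genes expression matrix; y: class labels. Pool the top genes per class."""
    y = np.asarray(y)
    selected = set()
    for k in np.unique(y):
        t, _ = ttest_ind(X[y == k], X[y != k], axis=0, equal_var=False)  # one class vs. the rest
        selected.update(np.argsort(-np.abs(t))[:top_p])                  # largest |t| first
    return sorted(selected)                                              # indices of marker genes
```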

Prediction Results on 22 Real Microarray Datasets.

The average accuracy rates over the 100 simulations of 3-fold cross-validation are reported in Table 1 for the binary-class datasets and in Table 2 for the multiple-class datasets. On the 10 binary-class gene-expression datasets, SVM (84.36%), NB (84.05%), and KNN (84.44%) on average performed slightly worse than but comparably with DBC (84.69%), and all outperformed the other classification methods, LDA (75.35%) and DT (78.36%). For the 12 multiple-class datasets, DBC achieved an average accuracy rate of 91.75%, which was comparable with NB (92.80%) and SVM (93.61%) and outperformed the remaining classification methods (LDA 81.15%, DT 81.55%, and KNN 90.90%).

Table 1.
Average accuracy rates for the binary-class gene-expression datasets over 100 simulations of the 3-fold cross-validation test.
Table 2.
Average accuracy rates for the multiple-class gene-expression datasets over 100 simulations of the 3-fold cross-validation test.

Simulation Results for Classifying Independent Testing Samples.

Quite often, samples in an independent test set have a higher noise level than the training set, and the baseline level may also be shifted by an unknown amount. In this section, we conducted two simulation studies to demonstrate that DBC performs much better than the other methods when such problems are encountered.

We considered the two-class (e.g., normal tissue versus cancer tissue) problem. In the first study, we simulated gene-expression data from the following model:

$$ y_{jk} = a_j + b_j x_k + e_{jk}, \qquad [1] $$

where j is the index for marker genes, j = 1, 2, …, J, and $x_k \in \{0,1\}$ is the class label of the kth sample, k = 1, 2, …, K. The error term $e_{jk}$ follows a normal distribution with mean zero and standard deviation (SD) σ.

The training dataset included 50 marker genes and 50 samples in each class and was simulated with $a_j$ and $b_j$ uniformly distributed in [-10, 10] and [-2, 2], respectively, and σtraining = 2. The testing dataset included 10 samples in each class. Four types of testing datasets were simulated to reflect four situations encountered in practice. First, the testing data share exactly the same parameters as the training data, as if the testing experiment were performed under exactly the same conditions as the training experiment. Second, the testing data have an increased noise level due to chemical or electrical reasons, so the variance of the error term increases (e.g., σtesting = 4). Third, the testing data have an overall baseline shift due to overloading or other reasons, so a random shift (e.g., one following a normal distribution with mean five and SD two) is added to each expression value. Finally, testing data with both an increased noise level and an overall baseline shift were simulated. Table 3 shows the average accuracy rates over 100 simulations for classifying testing samples simulated under each situation. Generally speaking, KNN (93.97% and 94.10%), NB (95.44% and 96.00%), and SVM (93.55% and 94.40%) performed slightly worse than but comparably with DBC (96.57% and 97.25%) in the 3-fold cross-validation test and on independent testing data I. When the independent testing data have an increased noise level, an overall baseline shift, or both, which frequently occurs in practice, the performance of all methods declined, but DBC outperformed the other methods by an even larger margin. In particular, DBC was quite robust when classifying samples with an overall baseline shift.
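The simulation in Eq. 1 and the four testing scenarios can be sketched as follows; the parameter values follow the text above, and applying one random shift per sample is an assumption about how the baseline shift was implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_per_class, n_genes=50, a=None, b=None, sigma=2.0, shift_mean=0.0):
    """Simulate y_jk = a_j + b_j * x_k + e_jk, optionally adding a per-sample baseline shift."""
    a = rng.uniform(-10, 10, n_genes) if a is None else a
    b = rng.uniform(-2, 2, n_genes) if b is None else b
    x = np.repeat([0, 1], n_per_class)                       # class labels
    Y = a[:, None] + b[:, None] * x[None, :]                 # genes x samples
    Y += rng.normal(0.0, sigma, Y.shape)                     # measurement noise
    if shift_mean:
        Y += rng.normal(shift_mean, 2.0, x.size)[None, :]    # baseline shift, one per sample
    return Y.T, x, a, b                                      # samples x genes

# Training set and the four testing scenarios described above
X_tr, y_tr, a, b = simulate(50)
X_te1, y_te, *_ = simulate(10, a=a, b=b, sigma=2.0)                   # same conditions
X_te2, _, *_ = simulate(10, a=a, b=b, sigma=4.0)                      # increased noise
X_te3, _, *_ = simulate(10, a=a, b=b, sigma=2.0, shift_mean=5.0)      # baseline shift
X_te4, _, *_ = simulate(10, a=a, b=b, sigma=4.0, shift_mean=5.0)      # both
```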

Table 3.
The average accuracy rates for classifying different types of independent testing samples in 100 simulations.

For the second simulation study, we started with a real dataset, the prostate gene-expression data (2). During the 3-fold cross-validation, however, the one-third of the data used for testing was disturbed in the following ways (a code sketch of these perturbations follows the list):

  1. adding random normal noise with mean 0 and SD $\hat{\sigma}$, where $\hat{\sigma}$ was estimated by the sample standard deviation of the testing data,
  2. creating a shifted baseline expression level by adding a constant, generated from a normal distribution with mean 4 and SD $\hat{\sigma}$, to all samples in the training set, and
  3. doing both (1) and (2).
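A sketch of these perturbations, with $\hat{\sigma}$ estimated from the test split as described in item 1; the function name and the choice of applying one shared constant in item 2 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb(X_train, X_test, add_noise=True, shift_baseline=True):
    """Disturb a cross-validation split as in items 1-3: noise on the test set, shift on training."""
    sigma_hat = X_test.std()                                 # sample SD of the testing data
    X_tr, X_te = X_train.copy(), X_test.copy()
    if add_noise:
        X_te += rng.normal(0.0, sigma_hat, X_te.shape)       # item 1: added noise
    if shift_baseline:
        X_tr += rng.normal(4.0, sigma_hat)                   # item 2: one constant, all training samples
    return X_tr, X_te
```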

The results from 100 simulations are summarized in Table 4. We found that DBC performed best on all three types of disturbed testing datasets.

Table 4.
The average accuracy rates for classifying different types of disturbed testing samples in 100 simulations using prostate gene-expression data.

Conclusion and Discussion

In this paper, we introduced the classification method DBC and compared it with several methods, including KNN, SVM, LDA, DT, and NB, on 22 different microarray gene-expression datasets related to human cancer. Our work supports the idea that biological intuition can be an important impetus for constructing effective statistical methods.

The DBC method for cancer classification is rooted in the appealing empirical observation that samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. Fig. S1 shows such an example for the leukemia dataset. The mathematical foundation of this empirical observation is provided in SI Theoretical Justification of DBC.

The successful application of DBC may depend on an implicit assumption of homogeneity within each tumor class. Before using our method, researchers should examine the pattern of each within-class correlation distribution closely. Intuitively, multiple modes or clusters may indicate the existence of subtypes within a tumor class. In that case, cluster analysis can be conducted within each class to detect subclasses, a multiclass classifier can be constructed to distinguish the subclasses, and our method can be modified by constructing a distribution of correlations for each subclass.

One extension of our method concerns situations where only the similarity scores between members are available. In the setting presented in the introduction, the characteristics of a sample are measured by a P-dimensional variable. But there are also situations where researchers can only make sample-to-sample comparisons and obtain similarity scores between samples. Suppose the information about the n samples in the training set has been summarized by an n × n similarity score matrix, denoted S. When a new sample arrives, the similarity scores between the new sample and all samples in the training set are obtained and presented as an n-dimensional vector $s^{\text{new}}$. In such situations, many popular classification methods, such as LDA, DT, and SVM, which build their classifiers on the characteristics associated with each member, may no longer be applicable. One may apply the K-nearest neighbor (KNN) method of classification, which assigns the new sample to the class with the majority vote among the members with the K largest similarity scores. Clearly, KNN uses only the information in $s^{\text{new}}$ for prediction and entirely ignores the information contained in S. In contrast, DBC fully utilizes the information in both S and $s^{\text{new}}$ by comparing the distribution of $s^{\text{new}}$ with the distributions of the within-class and between-class similarity scores.
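In this setting DBC needs only S and the new sample's similarity vector; a minimal sketch follows (the function and argument names are illustrative, and kl stands for any divergence estimator, such as the histogram one sketched earlier):

```python
import numpy as np

def dbc_from_similarities(S, labels, s_new, kl):
    """S: n x n similarity matrix; labels: length-n class labels; s_new: length-n similarity vector;
    kl: a function estimating a divergence between two 1-d samples."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    best, best_d = None, np.inf
    for k in classes:
        idx_k = labels == k
        n_k = int(idx_k.sum())
        f_new_k = s_new[idx_k]                                             # new sample vs. class k
        f_kk = S[np.ix_(idx_k, idx_k)][np.triu_indices(n_k, k=1)]          # within-class scores
        d_within = kl(f_new_k, f_kk)
        d_between = min(kl(f_new_k, S[np.ix_(idx_k, labels == j)].ravel())
                        for j in classes if j != k)                        # closest between-class
        if d_within < d_between and d_within < best_d:                     # class k claims the sample
            best, best_d = k, d_within
    return best                                                            # None if no class claims it
```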

DBC sidesteps the difficult high-dimensional density estimation problem frequently caused by the abundance of predictor variables. It utilizes the difference between two types of one-dimensional distributions: the distribution of within-class sample-correlations is shifted farther to the right than the distribution of between-class sample-correlations. Hence, instead of estimating the posterior probability that the new sample belongs to the kth class given the observed profile, we assign the new sample to the kth class if the KL distance between $f_k^{\text{new}}$ and $f_{kk}$ is smaller than the KL distances between $f_k^{\text{new}}$ and $f_{kj}$ ($j \neq k$). Here, we need no parametric assumption about the conditional densities of each class, and we do not even need to prespecify any parameter to compute the above KL distances. Thus, DBC can be viewed as a parameter-free algorithm. However, there is still room for modification. For instance, the KL distance may be replaced by other, equally popular, measures of distance between probability distributions.
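For example, a symmetric alternative such as the Jensen–Shannon divergence could be dropped in without changing the rest of the procedure; a sketch under the same histogram representation used earlier:

```python
import numpy as np

def js_hist(p_samples, q_samples, bins=20, lo=-1.0, hi=1.0, eps=1e-6):
    """Jensen-Shannon divergence between two 1-d samples, as a drop-in replacement for the KL distance."""
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi), density=True)
    p, q = (p + eps) / (p + eps).sum(), (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```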

From the theoretical perspective, it is not easy to demonstrate a clear, significant improvement over existing methods. One anonymous referee suggested a starting point: examining a basic two-sample problem with unequal means and variances from normal distributions (SI Application of DBC on 1-d Example). There, our method behaves like a compromise between the LDA rule and the QDA rule. Alternatively, we may use idealized examples to challenge the existing methods. For this purpose, we constructed a geometric family of classification problems with increasing dimensionality (SI Classification on Overlapped Circles or Spheres and Fig. S4). Our method outperformed all existing ones by a large margin (Tables S3, S4, S5, and S6). These examples also highlight the weakness of SVM, and they can by themselves serve as useful counterexamples for understanding the limitations of SVM.

In the study of gene-expression data, marker gene selection is at least as important as the classification procedure. Good marker genes not only help predict class identities accurately but also provide information for understanding the underlying development of cancer and other diseases. Many methods have been proposed to find such marker genes; the most commonly used are based on univariate ranking using either the t-statistic or the between- to within-sum-of-squares ratio. One concern is whether the results we reported are sensitive to the number of top genes selected. We conducted a simulation analysis (SI The Number of Marker Genes Selected for Each Class) to examine how the different methods performed when different numbers of marker genes were selected. The classification methods that build their classifiers on similarity scores, such as KNN and DBC, turn out to be more robust and less dependent on the number of marker genes selected than the methods based on absolute gene-expression values, such as SVM, DT, and LDA.

Quite often researchers may want to study only a selected set of genes. Here is a prostate-cancer example. Prostate cancer is the third most common cancer and causes 6% of cancer deaths in men, and it has been studied extensively (22–26). Listed below are 40 well-known prostate-related gene symbols:

KLK3; AR; AZGP1; CDKN1A; COPEB; CYPEB; CYP3A4; HPN; KAI1; MSR1; MUC1; MXI1; MYC; PIM; HPC1; PTEN; RNASEL; SRD5A2; PCAP; MAD1L1; HPCX; CHEK2; ELAC2; CAPB; HPC2; PRCA1; KLF6; CHEK2; LFS2; EPHB2; P53; RB; RAS; CDKN2; POLB; hPMS2; P27; CDKN1B; KAI-1; GSTPI.

Among them, the first 23 genes come from Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM). The remaining gene symbols come from Principles and Practice of Oncology (27). These genes are believed to be prostate-cancer related. We used these symbols as keywords to search the prostate gene-expression data (2) and found 45 entries, because some genes have multiple probes. We applied DBC to these probes and compared its performance with the other procedures. Fig. 1 shows the accuracy of these classification methods based on 100 3-fold cross-validations. On average, DBC achieved the highest accuracy rate (92.91%), followed by SVM and NB with average accuracy rates of 90.08% and 90.68%, respectively. KNN and DT performed relatively worse, with average accuracy rates of 88.82% and 87.88%, and LDA performed worst, with the lowest average accuracy rate of 84.87%. Furthermore, NB and DBC were the most robust to the random split into test and training sets: the standard deviations of their accuracy rates were as low as 0.0129 and 0.0149, respectively, whereas the other methods roughly doubled or tripled that standard deviation (KNN 0.0213, SVM 0.0237, DT 0.0299, and LDA 0.0346). Hence, DBC appears most reliable for classification using disease-related genes alone.

Fig. 1.
Box plot for accuracy rates from 100 3-fold cross-validation tests using disease-related genes alone for different methods.

Methods

Theoretical Justification.

In this section, we derive the general conditions under which the expected within-class sample-correlation is higher than the expected between-class sample-correlation. These conditions will serve as the theoretical justification for our proposed method.

For simplicity, suppose we have only two classes (e.g., normal tissue versus cancer tissue). The class indicator variable is $Y \in \{0,1\}$. Let $f_0(x)$ represent the conditional density function of the expression profiles of the $p$ genes for samples from the normal tissue class, with mean vector $\mu = E(X \mid Y = 0)$ and covariance matrix $\Sigma_0 = \mathrm{Cov}(X \mid Y = 0)$. For the tumor tissue class, the notations change to $f_1(x)$, $\theta = E(X \mid Y = 1)$, and $\Sigma_1 = \mathrm{Cov}(X \mid Y = 1)$, respectively, where the mean vector and the covariance matrix are computed by treating $X = (X_1, \ldots, X_p)$ as a $p$-dimensional random variable. Now let $X = (X_1, \ldots, X_p)$ and $W = (W_1, \ldots, W_p)$ be the gene-expression profiles of any two samples and consider their Pearson correlation:

$$ r(X, W) = \frac{\sum_{i=1}^{p} (X_i - \bar{X})(W_i - \bar{W})}{\sqrt{\sum_{i=1}^{p} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{p} (W_i - \bar{W})^2}}. $$

The distribution of $r(X, W)$ when both $X$ and $W$ come from the normal class is called the distribution of within-normal-class sample-correlations. Similarly, we define the distribution of within-cancer-class sample-correlations and the distribution of between-class sample-correlations, respectively, when both samples come from the cancer class and when they come from different classes. Then we can derive the following.

Lemma.
  1. The first-order approximation to the expected value of the distribution of within-normal-class sample-correlations is
     $$ \rho_0 \approx \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_0^2}, $$
     where $\sigma_\mu^2$ is the variance of the components of the mean vector $\mu$ and $\sigma_0^2$ is the error variance in the normal class.
  2. The first-order approximation to the expected value of the distribution of within-cancer-class sample-correlations is
     $$ \rho_1 \approx \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_1^2}, $$
     with $\sigma_\theta^2$ and $\sigma_1^2$ defined analogously for the cancer class.
  3. The first-order approximation to the expected value of the distribution of between-class sample-correlations is
     $$ \rho_{01} \approx \frac{\rho_{\mu,\theta}\,\sigma_\mu \sigma_\theta}{\sqrt{(\sigma_\mu^2 + \sigma_0^2)(\sigma_\theta^2 + \sigma_1^2)}}, $$
     where $\rho_{\mu,\theta}$ is the correlation between the two mean vectors $\mu$ and $\theta$.

Intuitively, the expected within-class sample-correlation depends on the ratio of two variances, $\sigma_0^2/\sigma_\mu^2$ (or $\sigma_1^2/\sigma_\theta^2$), where $\sigma_0^2$ (or $\sigma_1^2$) measures the variance of the error, i.e., how far a sample profile departs from its mean vector, and $\sigma_\mu^2$ (or $\sigma_\theta^2$) measures the variance of the mean vector. Hence, if the variance of the error is relatively large compared with the variance of the mean vector, the within-class sample-correlation tends to be lower; if the variance of the error is relatively small, the within-class sample-correlation tends to be higher. After some simple mathematical derivation, we have:

Theorem.

If, and only if,

$$ \rho_{\mu,\theta} < \min\left\{ \sqrt{\rho_0/\rho_1},\; \sqrt{\rho_1/\rho_0} \right\}, $$

we have $\rho_{01} < \min\{\rho_0, \rho_1\}$.

This theorem provides conditions under which the expected between-class sample-correlation is smaller than the expected within-class sample-correlations.

Because $\rho_{\mu,\theta}$ measures the correlation between the two mean vectors, it is naturally bounded between -1 and 1. First, when the two mean vectors are negatively correlated or uncorrelated, i.e., $-1 \le \rho_{\mu,\theta} \le 0$, the above constraint is automatically satisfied. Second, even when the two mean vectors are positively correlated, if the two expected within-class sample-correlations are close to each other (i.e., if the ratios of the error variance to the mean variance are about the same in the two classes), the condition is still easy to satisfy. Please refer to SI Theoretical Justification of DBC for the detailed derivation.
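The approximations above can be checked numerically. The sketch below simulates the two-class model with independent gene-level errors (an assumption made only for this check) and compares empirical mean correlations with the first-order formulas stated in the Lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 2000, 100                                   # genes, samples per class
sd0, sd1 = 1.0, 1.5                                # error SDs in the two classes

mu = rng.normal(0, 3.0, p)                         # class-0 mean vector
theta = 0.3 * mu + rng.normal(0, 2.0, p)           # class-1 mean vector, positively correlated with mu
X0 = mu + rng.normal(0, sd0, (n, p))               # class-0 samples
X1 = theta + rng.normal(0, sd1, (n, p))            # class-1 samples

def mean_corr(A, B):
    """Average Pearson correlation over sample pairs drawn from A and B."""
    return float(np.mean([np.corrcoef(A[i], B[j])[0, 1]
                          for i in range(30) for j in range(30) if A is not B or i < j]))

v_mu, v_theta = mu.var(), theta.var()
rho_mu_theta = np.corrcoef(mu, theta)[0, 1]
print("rho_0 :", mean_corr(X0, X0), "approx", v_mu / (v_mu + sd0**2))
print("rho_1 :", mean_corr(X1, X1), "approx", v_theta / (v_theta + sd1**2))
print("rho_01:", mean_corr(X0, X1), "approx",
      rho_mu_theta * np.sqrt(v_mu * v_theta / ((v_mu + sd0**2) * (v_theta + sd1**2))))
```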

Statistical Model and Hypothesis.

Without loss of generality, assume there are K classes and N training samples. For each sample, P variables are measured. A similarity score is specified to measure the similarity between two sample profiles; for example, the correlation coefficient is a commonly used similarity score in microarray data analysis. In the learning step, the distribution of the similarity scores between sample pairs from the ith and the jth classes is estimated and denoted $f_{ij}$. A reference similarity distribution matrix is then constructed and denoted $\{f_{ij}\}_{K \times K}$ ($i, j = 1, 2, \ldots, K$ and $f_{ij} = f_{ji}$). Intuitively, samples in the same class should share certain common features; hence the similarity scores between them tend to be higher than the similarity scores for samples from different classes. In other words, the distribution $f_{ii}$ ($i = 1, 2, \ldots, K$) is more likely to be shifted to the right, whereas $f_{ij}$ ($i \neq j$) is shifted to the left.

In the prediction step, to predict the class label of a new testing sample, K parallel hypothesis tests are performed simultaneously: H0, the new sample belongs to the kth class, versus H1, the new sample does not belong to the kth class, for k = 1, …, K. For the kth hypothesis test, the distribution of the similarity scores between the new sample and the reference training samples in the kth class is computed and denoted $f_k^{\text{new}}$. Intuitively, if the new sample really comes from the kth class, then $f_k^{\text{new}}$ should be close to $f_{kk}$ by the construction of $f_{kk}$; otherwise, $f_k^{\text{new}}$ should be close to one of the $f_{kj}$ ($j \neq k$). Hence, the decision rule is to assign the new sample to the kth class if the difference between $f_k^{\text{new}}$ and $f_{kk}$ (measured by the KL distance) is smaller than the differences between $f_k^{\text{new}}$ and the $f_{kj}$ ($j \neq k$); otherwise, the decision is "does not belong to the kth class." Furthermore, we combine the discriminating power of the above K "weaker" tests to make more reliable predictions. When the results of these K tests are consistent, a class label is assigned (based on the testing results); otherwise, an "unclassified" label is assigned.

For unclassified samples, a more sophisticated weighted KL decision rule can be constructed. The weights, denoted $w_k$, are distributed equally among the classes that claim this "unclassified" sample, or among all classes if no class claims it. The decision rule then assigns the new sample to the kth class achieving the smallest weighted KL distance, $w_k \cdot \mathrm{KL}(f_k^{\text{new}}, f_{kk})$, either among the classes claiming the "unclassified" sample or among all classes if no class claims it. Fig. 2 illustrates the flow chart of DBC.
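A sketch of this decision step, where each distribution is represented by its sample of similarity scores and kl is any divergence estimator (such as the histogram one sketched earlier); note that with equal weights the weighted rule reduces to the smallest within-class KL distance among the candidate classes.

```python
def dbc_decide(f_new, f_ref, kl):
    """f_new: dict k -> similarity scores between the new sample and class-k training samples.
    f_ref: dict (k, j) -> similarity scores between training samples of classes k and j."""
    classes = sorted(f_new)
    claims, d_within = [], {}
    for k in classes:
        d_within[k] = kl(f_new[k], f_ref[(k, k)])
        d_between = min(kl(f_new[k], f_ref[(k, j)]) for j in classes if j != k)
        if d_within[k] < d_between:
            claims.append(k)                       # the kth test votes "belongs to class k"
    if len(claims) == 1:
        return claims[0]                           # the K tests are consistent
    pool = claims if claims else classes           # otherwise: weighted KL among the claimants,
    w = 1.0 / len(pool)                            # or among all classes if none claims it
    return min(pool, key=lambda k: w * d_within[k])
```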

Fig. 2.
A cartoon example to illustrate the flow chart of DBC. Step I (Top): feature detection by estimating the reference similarity distribution matrix {fij}K×K. Step II (Middle): prediction based on K parallel hypothesis tests. Step III (Bottom): decision ...

Similarity Score Measurement and Data Transformation.

We have used correlation to construct similarity scores. Because correlation is invariant under scale and intercept changes, the effect of an overall shift or rescaling in the new sample is eliminated. Furthermore, a transformation is necessary to reduce the effect of outliers on the similarity score. In the following cancer classification examples, each sample profile is standardized with a normal score transformation, i.e., the transformed value of the ith gene is $\Phi^{-1}\!\left(R_i/(P+1)\right)$, where $\Phi(\cdot)$ is the cumulative normal distribution function, $R_i$ is the rank of the ith gene, and P is the total number of marker genes. With this normal score transformation, DBC is invariant to any monotone transformation of the data, including most data normalization and baseline correction methods. The simulation study showed that the method with this normal score transformation is quite robust for classifying new samples with an increased noise level, an overall shift, or both.
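A minimal sketch of this transformation applied to a single sample profile:

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score(profile):
    """Map a sample's marker-gene profile to normal scores; invariant to monotone transformations."""
    ranks = rankdata(profile)                      # R_i, ranks among the P marker genes
    P = profile.size
    return norm.ppf(ranks / (P + 1.0))             # Phi^{-1}(R_i / (P + 1))
```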


Acknowledgments.

This work was supported in part by National Science Foundation Grants DMS-0406091 and DMS-0707160 and by National Science Council (Taiwan) Grants NSC95-3114-P-002-005-Y, NSC97-2627-P-001-003, and NSC98-2314-B-001-001-MY3.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0910140107/DCSupplemental.

Data deposition: The software and datasets are available at http://www.stat.ucla.edu/~wxl/research/microarray/DBC/index.htm.

References

1. Garber ME, et al. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA. 2001;98(24):13784–13789.
2. Lapointe J, et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci USA. 2004;101(3):811–816.
3. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95(25):14863–14868.
4. Golub TR, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537.
5. Wong DJ, et al. Revealing targeted therapy for human cancer by gene module maps. Cancer Res. 2008;68(2):369–378.
6. Pomeroy SL, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415(6870):436–442.
7. Shipp MA, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002;8(1):68–74.
8. Alizadeh AA, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403(6769):503–511.
9. Perou CM, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–752.
10. Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA. 2000;97(18):10101–10106.
11. Fukunaga K. Introduction to Statistical Pattern Recognition. 2nd Ed. San Diego, CA: Academic Press; 1990. p. 592.
12. Patrick EA. Fundamentals of Pattern Recognition. Englewood Cliffs, NJ: Prentice Hall; 1972.
13. Cover T, Hart P. Nearest neighbor pattern classification. IEEE T Inform Theory. 1967;13:21–27.
14. Welsh JB, et al. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res. 2001;61(16):5974–5978.
15. Alon U, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999;96(12):6745–6750.
16. Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001;98(24):13790–13795.
17. Armstrong SA, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2002;30(1):41–47.
18. Wigle DA, et al. Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res. 2002;62(11):3005–3008.
19. Yeoh EJ, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1(2):133–143.
20. Stuart RO, et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proc Natl Acad Sci USA. 2004;101(2):615–620.
21. Qiu P, Wang ZJ, Liu KJ. Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics. 2005;21(14):3114–3121.
22. Dong JT, et al. KAI1, a metastasis suppressor gene for prostate cancer on human chromosome 11p11.2. Science. 1995;268(5212):884–886.
23. Berthon P, et al. Predisposing gene for early-onset prostate cancer, localized on chromosome 1q42.2-43. Am J Hum Genet. 1998;62(6):1416–1424.
24. Berry R, et al. Evidence for a prostate cancer-susceptibility locus on chromosome 20. Am J Hum Genet. 2000;67(1):82–91.
25. Witte JS, et al. Genomewide scan for prostate cancer-aggressiveness loci. Am J Hum Genet. 2000;67(1):92–99.
26. Xu J, et al. Linkage of prostate cancer susceptibility loci to chromosome 1. Hum Genet. 2001;108(4):335–345.
27. Vincent TD, Hellman S, Rosenberg SA. Cancer: Principles and Practice of Oncology. 2005.
