• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptNIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Ann Stat. Author manuscript; available in PMC Jan 23, 2009.
Published in final edited form as:
Ann Stat. 2008; 36(6): 2605–2637.
doi:  10.1214/07-AOS504
PMCID: PMC2630123

High Dimensional Classification Using Features Annealed Independence Rules


Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classifications is largely poorly understood. In a seminal paper, Bickel and Levina (2004) show that the Fisher discriminant performs poorly due to diverging spectra and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as the random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as bad as the random guessing. Thus, it is paramountly important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics are proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.

Keywords: Classification, feature extraction, high dimensionality, independence rule, misclassification rates

1 Introduction

With rapid advance of imaging technology, high-throughput data such as microarray and proteomics data are frequently seen in many contemporary statistical studies. For instance, in the analysis of Microarray data, the dimensionality is frequently thousands or more, while the sample size is typically in the order of tens (West et al., 2001; Dudoit et al., 2002). See Fan and Ren (2006) for an overview. The large number of features presents an intrinsic challenge to classification problems. For an overview of statistical challenges associated with high dimensionality, see Fan and Li (2006).

Classical methods of classification break down when the dimensionality is extremely large. For example, even when the covariance matrix is known, Bickel and Levina (2004) demonstrate convincingly that the Fisher discriminant analysis performs poorly in a minimax sense due to the diverging spectra (e.g., the condition number goes to infinity as dimensionality diverges) frequently encountered in the high-dimensional covariance matrices. Even if the true covariance matrix is not ill conditioned, the singularity of the sample covariance matrix will make the Fisher discrimination rule inapplicable when the dimensionality is larger than sample size. Bickel and Levina (2004) show that the independence rule overcomes the above two problems. However, in tumor classification using microarray data, we hope to find tens of genes that have high discriminative power. The independence rule, studied by Bickel and Levina (2004), does not possess this kind of properties.

The diffculty of high-dimensional classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of misclassification rate. Though the importance of dimension reduction and feature selection has been stressed and many methods have been proposed in the literature, very little research has been done on theoretical analysis of the impacts of high dimensionality on classification. For example, using most discrimination rules such as the linear discriminants, we need to estimate the population mean vectors from the sample. When the dimensionality is high, even though each component of the population mean vectors can be estimated with accuracy, the aggregated estimation error can be very large and this has adverse effects on the misclassification rate. Therefore, when there is only a fraction of features that account for most of the variation in the data such as tumor classification using gene expression data, using all features will increase the misclassification rate.

To illustrate the idea, we study independence classification rule. Specifically, we give an explicit formula on how the signal and noise affect the misclassification rates. We show formally how large the signal to noise ratio can be such that the effect of noise accumulation can be ignored, and how small this ratio can be before the independence classifier performs as bad as the random guessing. Indeed, as demonstrated in Section 2, the impact of the dimensionality can be very drastic. For the independence rule, the misclassification rate can be as high as the random guessing even when the problem is perfectly classifiable. In fact, we demonstrate that almost all linear discriminants can not perform any better than random guessing, due to the noise accumulation in the estimation of the population mean vectors, unless the signals are very strong, namely the population mean vectors are very far apart.

The above discussion reveals that feature selection is necessary for high-dimensional classification problems. When the independence rule is applied to selected features, the resulting Feature Annealed Independent Rules (FAIR) overcome both the issues of interpretability and the noise accumulation. One can extract the important features via variable selection techniques such as the penalized quasi-likelihood function. See Fan and Li (2006) for an overview. One can also employ a simple two-sample t-test as in Tibshirani et al.(2002) to identify important genes for the tumor classification, resulting in the nearest shrunken centroids method. Such a simple method corresponds to a componentwise regression method or a ridge regression method with ridge parameters tending to ∞ (Fan and Lv, 2007). Hence, it is a specific and useful example of the penalized quasi-likelihood method for feature selection. It is surprising that such a simple proposal can indeed extract all important features. Indeed, we demonstrate that under suitable conditions, the two sample t-statistic can identify all the features that efficiently characterize both classes.

Another popular class of the dimension reduction methods is projection. They have been widely applied to the classification based on the gene expression data. See, for example, principal component analysis in Ghosh (2002), Zou et al. (2004), and Bair et al.(2004); partial least squares in Nguyen and Rocke (2002), Huang and Pan (2003), and Boulesteix (2004); and sliced inverse regression in Chiaromonte and Martinelli (2002), Antoniadis et al.(2003), and Bura and Pfeiffer (2003). These projection methods attempt to find directions that can result in small classification errors. In fact, the directions found by these methods usually put much more weights on features that have large classification power. In general, however, linear projection methods are likely to perform poorly unless the projection vector is sparse, namely, the effective number of selected features is small. This is due to the aforementioned noise accumulation prominently featured in high-dimensional problems, recalling discrimination based on linear projections onto almost all directions can perform as bad as the random guessing.

As direct application of the independence rule is not efficient, we propose a specific form of FAIR. Our FAIR selects the statistically most significant m features according to the componentwise two-sample t-statistics between two classes, and applies the independence classifiers to these m features. Interesting questions include how to choose the optimal m, or equivalently, the threshold value of t-statistic, such that the classification error is minimized, and how this classifier performs compared with the independence rule without feature selection and the oracle-assisted FAIR. All these questions will be formally answered in this paper. Surprisingly, these results are similar to those for the adaptive Neyman test in Fan (1996). The theoretical results also indicate that FAIR without oracle information performs worse than the one with oracle information, and the difference of classification error depends on the threshold value, which is consistent with the common sense.

There is a huge literature on classification. To name a few in addition to those mention before, Bai and Saranadasa (1996) dealt with the effect of high dimensionality in a two-sample problem from a hypothesis testing viewpoint; Friedman (1989) proposed a regularized discriminant analysis to deal with the problems associated with high dimension while performing computations in the regular way; Dettling and Bühlmann (2003) and Bühlmann and Yu (2003) study boosting with logit loss and L2 loss, respectively, and demonstrate the good performances of these methods in high-dimensional setting; Greenshtein and Ritov (2004), Greenshtein (2006) and Meinshausen (2005) introduced and studied the concept of persistence, which places more emphasis on misclassification rates or expected loss rather than the accuracy of estimated parameters.

This article is organized as follows. In Section 2, we demonstrate the impact of dimensionality on the independence classification rule, and show that discrimination based on projecting observations onto almost all linear directions is nearly the same as random guessing. We establish, in Section 3, the conditions under which two sample t-test can identify all the important features with probability tending to 1. In Section 4, we propose FAIR and give an upper bound of its classification error. Simulation studies and real data analyses are conducted in Section 5. The conclusion of our study is summarized in Section 6. All proofs are given in the Appendix.

2 Impact of High Dimensionality

Consider the p-dimensional classification problem between two classes C1 and C2. Suppose that from class Ck, we have nk observations Y k1, (...), Y knk in Rp. The j-th feature of the i-th sample from class Ck satisfies the model


where μkj is the mean effect of the j-th feature in class Ck and ϵkij is the corresponding Gaussian random noise for i-th observation. In matrix notation, the above model can be written as


where μk = (μk1, (...), μkp)′ is the mean vector of class Ck and ϵki = (ϵki1, (...), ϵkip)′ has the distribution N (0, Σk). We assume that all observations are independent across samples and in addition, within class Ck, observations Yk1, (...), Yknk are also identically distributed. Throughout this paper, we make the assumption that the two classes have compatible sample sizes, i.e., c1n1/n2c2 with c1 and c2 some positive constants.

We first investigate the impact of high dimensionality on classification. For simplicity, we temporarily assume that the two classes C1 and C2 have the same covariance matrix Σ. To illustrate our idea, we consider the independence classification rule, which classifies the new feature vector x into class C1 if


where μ = (μ1 + μ2)/2 and D = diag(Σ). This classifier has been thoroughly studied in Bickel and Levina (2004). They showed that in the classification of two normal populations, this independence rule greatly outperforms the Fisher linear discriminant rule under broad conditions when the number of variables is large.

The independence rule depends on the marginal parameters μ1, μ2 and D=diag{σ12,,σp2}. They can easily be estimated from the samples:




where Skj2=i=1nk(YkijYkj)2(nk1) is the sample variance of the j-th feature in class k and Ykj=i=1nkYkink. Hence, the plug-in discrimination function is


Denote the parameter by θ = (μ1, μ2, Σ). If we have a new observation X from class C1, then the misclassification rate of δ^ is




and Φ(·) is the standard Gaussian distribution function. The worst case classification error is


where Γ is some parameter space to be defined. Let n = n1 + n2. In our asymptotic analysis, we always consider the misclassification rate of observations from C1, since the misclassification rate of observations from C2 can be easily obtained by interchanging n1 with n2 and μ1 with μ2. The high dimensionality is modeled through its dependence on n, namely pn → ∞. However, we will suppress its dependence on n whenever there is no confusion.

Let R = D-1/2 ΣD-1/2 be the correlation matrix, and λmax(R) be its largest eigenvalue, and α [equivalent]1, (...), αp)′ = μ1 - μ2. Consider the parameter space


where Cp is a deterministic positive sequence that depends only on the dimensionality p, and b0 is a positive constant. Note that α′D-1α corresponds to the overall strength of signals, and the first condition α′D-1α ≥ Cp imposes a lower bound on the strength of signals. The second condition λmax(R) ≤ b0 requires that the maximum eigenvalue of R should not exceed a positive constant. But since there are no restrictions on the smallest eigenvalue of R, the condition number can still diverge. The third condition min1jp,k=1,2σkj2>0 ensures that there are no deterministic features that make classification trivial and the diagonal matrix D is always invertible. We will consider the asymptotic behavior of W(δ^,θ) and W(δ^).

Theorem 1

Suppose that log p = o(n), n = o(p) and nCp → ∞. Then (i) The classification error W (δ, θ) with θ ϵ Γ is bounded from above as


(ii) Suppose p/(nCp) → 0. For the worst case classification error W (δ), we have


Specifically, when {n1n2pn}12CpC0 with C0 a nonnegative constant, then


In particular, if C0 = 0, then W(δ^)P12.

Theorem 1 reveals the trade-off between the signal strength Cp and the dimensionality, reflected in the term Cpp when all features are used for classification. It states that the independence rule δ^ would be no better than the random guessing due to noise accumulation, unless the signal levels are extremely high, say, {np}12CpB for some B > 0. Indeed, discrimination based on linear projections to almost all directions performs nearly the same as random guessing, as shown in the theorem below. The poor performance is caused by noise accumulation in the estimation of μ1 and μ2.

Theorem 2

Suppose that a is a p-dimensional uniformly distributed unit random vector on a (p - 1)-dimensional sphere. Let λ1, (...), λp be the eigenvalues of the covariance matrix Σ. Suppose limp1p2j=1pλj2< and limp1pj=1pλj=τ with τ a positive constant. Moveover, assume that p-1α′α → 0. Then if we project all the observations onto the vector a and use the classifier


the misclassification rate of δ^a satisfies


where the probability is taken with respect to a and X ϵ C1.

3 Feature Selection by Two-Sample t-Test

To extract salient features, we appeal to the two sample t-test statistics. Other componentwise tests such as the rank sum test can also be used, but we do not pursue those in detail. The two-sample t-statistic for feature j is defined as


where Ykj and Skj2 are the same as those defined in Section 1. We work under more relaxed technical conditions: the normality assumption is not needed. Instead, we assume merely that the noise vectors ϵki,i = 1, (...), nk are i.i.d. within class Ck with mean 0 and covariance matrix Σk, and are independent between classes. The covariance matrix Σ1 can also differ from Σ2.

To show that the t-statistic can select all the important features with probability 1, we need the following condition.

Condition 1

  1. Assume that the vector α = μ1 - μ2 is sparse and without loss of generality, only the first s entries are nonzero.
  2. Suppose that ϵkij and ϵkij21 satisfy the Cramér's condition, i.e., there exist constants ν1, ν2, M1 and M2, such that Eϵkijmm!M1m2ν12 and Eϵkij2σkj2mm!M2m2ν22 for all m = 1, 2, (...).
  3. Assume that the diagonal elements of both Σ1 and Σ2 are bounded away from 0.

The following theorem describes the situation under which the two sample t-test can pick up all important features by choosing an appropriate critical value. Recall that c1n1/n2c2 and n = n1 + n2.

Theorem 3

Let s be a sequence such that log(p - s) = o(nγ) and logs=o(n12γβn) for someβn → ∞ and 0<γ<13. Suppose that min1jsαjσ1j2+σ2j2=nγβn. Then under Condition 1, for x ~ cnγ/2 with c some positive constant, we have


In the proof of Theorem 3, we used the moderate deviation results of the two-sample t-statistic (see Cao, 2007 or Shao, 2005). Theorem 3 allows the lowest signal level to decay with sample size n. As long as the rate of decay is not too fast and the sample size is not too small, the two sample t-test can pick up all the important features with probability tending to 1.

4 Features Annealed Independence Rules

We apply the independence classifier to the selected features, resulting in a Features Annealed Independence Rule (FAIR). In many applications such as tumor classification using gene expression data, we would expect that elements in the population mean difference vector α are sparse: most entries are small. Thus, even if we could use t-test to correctly extract out all these features, the resulting choice is not necessarily optimal, since the noise accumulation can even exceed the signal accumulation for faint features. This can be seen from Theorem 1. Therefore, it is necessary to further single out the most important features that help reduce misclassification rate.

To help us select the number of features, or the critical value of the test statistic, we first consider the ideal situation that the important features are located at the first m coordinates and our task is to merely select m to minimize the misclassification rate. This is the case when we have the ideal information about the relative importance of features, as measured by |αj|/σj, say. When such an oracle information is unavailable, we will learn it from the data. In the situation that we have vague knowledge about the importance of features such as tumor classification using gene expression data, we can give high ranks to features with large |αj|/σj.

In the presentation below, unless otherwise specified, we assume that the two classes C1 and C2 are both from Gaussian distributions and the common covariance matrix is the identity, i.e., Σ1 = Σ2 = I. If this common covariance matrix is known, the independence classifier δ^ becomes the nearest centroids classifier


If only the first m dimensions are used in the classification, the corresponding features annealed independence classifier becomes


where the superscript m means that the vector is truncated after the first m entries. This is indeed the same as the nearest shrunken centroids method of Tibshirani et al. (2002).

Theorem 4

Consider the truncated classifier δ^NCmn for a given sequence mn. Suppose that nmnj=1mnαj2 as mn → ∞. Then the classification error of δ^NCmn is


where n = n1 + n2 as defined in Section 2.

In the following, we suppress the dependence of m on n when there is no confusion. The above theorem reveals that the ideal choice on the number of features is


It can be estimated as


where α^j=μ^1jμ^2j. The expression for m0 quantifies how the signal and the noise affect the misclassification rates as the dimensionality m increases. In particular, when n1 = n2, the express reduces to m0=argmax1mp[m12j=1mαj2]22n+j=1mαj2m. The term m12j=1mαj2 reflects the trade-off between the signal and noise as dimensionality m increases.

The good performance of the classifier δ^NCm depends on the assumption that the largest entries of α cluster at the first m dimensions. An ideal version of the classifier δ^NC is to select a subset A={j:αj>a} and use this subset to construct independence classifier. Let m be the number of elements in A. The oracle classifier can be written as


The misclassification rate is approximately


when nmjAαj2 and m → ∞. This is straightforward from Theorem 4. In practice, we do not have such an oracle, and selecting the subset A is difficult. A simple procedure is to use the feature annealed independence rule based on the hard thresholding:


We study the classification error of FAIR and the impact of the threshold b on the classification result in the following theorem.

Theorem 5

Suppose that maxjAcαj<bn and log(pm)[n(bnmaxjAcαj)2]0 with m=A. Moreover, assume that nmjAαj2 and jAαj[njAαj2]0. Then


Notice that the upper bound of W(δ^FAIRbn,θ) in Theorem 5 is greater than the classification error in Theorem 4, and the magnitude of difference depends on mbn2. This is expected as estimating the set A increases the classification error. These results are similar to those in Fan (1996) for high-dimensional hypothesis testing.

When the common covariance matrix is different from the identity, FAIR takes a slightly different form to adapt to the unknown componentwise variance:


where Tj is the two sample t-statistic. It is clear from (4.2) that FAIR works the same way as that we first sort the features by the absolute values of their t-statistics in the descending order, and then take out the first m features to classify the data. The number of features can be selected by minimizing the upper bound of the classification error given in Theorem 1. The optimal m in this sense is


where λmaxm is the largest eigenvalue of the correlation matrix Rm of the truncated observations. It can be estimated from the samples:


Note that the factor λmaxm in (4.3) increases with m, which makes m^1 usually smaller than m^0.

5 Numerical Studies

In this section we use a simulation study and three real data analyses to illustrate our theoretical results and to verify the performance of our newly proposed classifier FAIR.

5.1 Simulation Study

We first introduce the model. The covariance matrices Σ1 and Σ2 for the two classes are chosen to be the same. For the distribution of the error ϵij in (2.1), we use the same model as that in Fan, Hall and Yao (2006). Specifically, features are divided into three groups. Within each group, features share one unobservable common factor with different factor loadings. In addition, there is an unobservable common factor among all the features across three groups. For simplicity, we assume that the number of features p is a multiple of 3. Let Zij be a sequence of independent N(0, 1) random variables, and χij2 be a sequence of independent random variables of the same distribution as (χd2d)2d with χd2 the Chi-square distribution with degrees of freedom d. In the simulation we set d = 6.

Let {aj} and {bj} be factor loading coefficients. Then the error in (2.1) is defined as


where aij = 0 except that a1j = aj for j = 1, (...), p/3, a2j = aj for j = (p/3) + 1, (...), 2p/3, and a3j = aj for j = (2p/3)+1, (...), p. Therefore, ϵij = 0 and var(ϵij) = 1, and in general, within group correlation is greater than the between group correlation. The factor loadings aj and bj are independently generated from uniform distributions U(0, 0.4) and U(0, 0.2). The mean vector μ1 for class C1 is taken from a realization of the mixture of a point mass at 0 and a double-exponential distribution:


where c ϵ (0, 1) is a constant. In the simulation, we set p = 4, 500 and c = 0.02. In other words, there are around 90 signal features on an average, many of which are weak signals. Without loss of generality, μ2 is set to be 0. Figure 1 shows the true mean difference vector α, which is fixed across all simulations. It is clear that there are only very few features with signal levels exceeding 1 standard deviation of the noise.

Figure 1
True mean difference vector α. x-axis represents the dimensionality, and y-axis shows the values of corresponding entries of α.

With the parameters and model above, for each simulation, we generate n1 = 30 training data from class C1 and n2 = 30 training data from C2. In addition, separate 200 samples are generated from each of the two classes in each simulation, and these 400 vectors are used as test samples. We apply our newly proposed classifier FAIR to the simulated data. Specifically, for each feature, the t-test statistic in (3.1) is calculated using the training sample. Then the features are sorted in the decreasing order of the absolute values of their t-statistics. We then examine the impact of the number of features m on the misclassification rate. In each simulation, with m ranging from 1 to 4500, we construct the feature annealed independence classifiers using the training samples, and then apply these classifiers to the 400 test samples. The classification errors are compared to those of the independence rule with the oracle ordering information, which is constructed by repeating the above procedure except that in the first step the features are ordered by their true signal levels, |α|, instead of by their t-statistics

The above procedure is repeated 100 times, and averages and standard errors of the misclassification rates (based on 400 test samples in each simulation) are calculated across the 100 simulations. Note that the average of the 100 misclassification rates is indeed computed based on 100 × 400 testing samples.

Figure 2 depicts the misclassification rate as a function of the number of features m. The solid curves represent the average of classification rates across the 100 simulations, and the corresponding dashed curves are 2 standard errors (i.e. the standard deviation of 100 misclassification rates divided by 10) away from the solid one. The misclassification rates using the first 80 features in Figure 2(a) are zoomed in Figure 2(b). Figures 2(c) and 2(d) are the same as 2(a) and 2(b) except that the features are arranged in the decreasing order of |α|, i.e., the results are based on the oracle-assisted feature annealed independence classifier. From these plots we see that the classification results of FAIR are close to those of the oracle-assisted independence classifier. Moreover, as the dimensionality m grows, the misclassification rate increases steadily due to the noise accumulation. When all the features are included, i.e. m = 4500, the misclassification rate is 0.2522, whereas the minimum classification errors are 0.0128 in plot 2(b) and 0.0020 in plot 2(d). These results are consistent with Theorem 1. We also tried to decrease the signal levels, i.e., the mean of the double exponential distribution, or to increase the dimensionality p, and found that the classification error tend to 0.5 when all the dimensions are included. Comparing Figures 2(a) and 2(b) to Figures 2(c) and 2(d), we see that the features ordered by t-statistics has higher misclassification rates than those ordered by the oracle. Also, using t-statistics results in larger minimum classification errors (see plots 2(b) and 2(d)), but the differences are not very large.

Figure 2
Number of features versus misclassification rates. The solid curves represent the averages of classification errors across 100 simulations. The dashed curves are 2 standard errors away from the solid curves. The x-axis represents the number of features ...

Figure 3 shows the classification errors of the independence rule based on projected samples onto randomly chosen directions across 100 simulations. Specifically, in each of the simulations in Figure 2, we generate a direction vector a randomly from the (p - 1)-dimensional unit sphere, then project all the data in that simulation onto the direction a, and finally apply the Fisher discriminant to the projected data (see (2.3)). The average of these misclassification rates is 0.4986 and the corresponding standard deviation is 0.0318. These results are consistent with our Theorem 2.

Figure 3
Classification errors of the independence rule based on projected samples onto randomly chosen directions over 100 simulations.

Finally, we examine the effectiveness of our proposed method (4.3) for selecting features in FAIR. In each of the 100 simulations, we apply (4.3) to choose the number of features and compute the resulting misclassification rate based on 400 test samples. We also use the nearest shrunken centroids of Tibshirani et al. (2003) to select the important features. Figure 4 summarizes these results. The thin curves correspond to the nearest shrunken centroids method, and the thick curves correspond to FAIR. Figure 4(a) presents the number of features calculated from these two methods, and Figure 4(b) shows the corresponding misclassification rates. For our newly proposed classifier FAIR, the average of the optimal number of features over 100 simulations is 29.71, which is very close to the smallest number of features with the minimum misclassification rate in Figure 2(d). The misclassification rates of FAIR in Figure 4(b) have average 0.0154 and standard deviation 0.0085, indicating the outstanding performance of FAIR. Nearest shrunken centroids method is unstable in selecting features. Over the 100 simulations, there are several realizations in which it chooses plenty of features. We truncated Figure 4 to make it easier to view. The average number of features chosen by the nearest shrunken centroids is 28.43, and the average classification error is 0.0216, with corresponding standard deviation 0.0179. It is clear that nearest shrunken centroids method tends to choose less features than FAIR, but the misclassification rates are larger.

Figure 4
The thick curves correspond to FAIR, while the thin curves correspond to the nearest shrunken centroids method. (a) The numbers of features chosen by (4.3) and by the nearest shrunken centroids method over 100 simulations. (b) Corresponding classification ...

5.2 Real Data Analysis

5.2.1 Leukemia Data

Leukemia data from high-density Affymetrix oligonucleotide arrays were previously analyzed in Golub et al.(1999), and are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. There are 7129 genes and 72 samples coming from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute mylogenous leukemia). Among these 72 samples, 38 (27 in class ALL and 11 in class AML) are set to be training samples and 34 (20 in class ALL and 14 in class AML) are set as test samples.

Before classification, we standardize each sample to zero mean and unit variance as done by Dudoit et al. (2002). The classification results from the nearest shrunken centroids (NSC hereafter) method and FAIR are shown in Table 1. The nearest shrunken centroids method picks up 21 genes and makes 1 training error and 3 test errors, while our method chooses 11 genes and makes 1 training error and 1 test error. Tibshirani et al.(2002) proposed and applied the nearest shrunken centroids method to the unstandardized Leukemia dataset. They chose 21 genes and made 1 training error and 2 test errors. Our results are still superior to theirs.

Table 1
Classification errors of Leukemia data set

To further evaluate the performance of the two classifiers, we randomly split the 72 samples into training and test sets. Specifically, we set approximately 100γ% of the observations from class ALL and 100γ% of the observations from class AML as training samples, and the rest as test samples. FAIR and NSC are applied to the training data, and their performances are evaluated by the test samples. The above procedure is repeated 100 times for γ = 0.4, 0.5 and 0.6, respectively, and the distributions of test errors of FAIR, NSC and the independence rule without feature selection are summarized in Figure 5. In each of the splits, we also calculated the difference of test errors between NSC and FAIR, i.e., the test error of FAIR minus that of NSC, and the distribution is summarized in Figure 5. The top panel of Figure 6 shows the number of features selected by FAIR and NSC for γ = 0.4. The results for the other two values of γ are similar so we do not present here to save the space. From these figures we can see that the performance of independence rule improves significantly after feature selection. The classification errors of NSC and FAIR are approximately the same. As we have already noticed in the simulation study, NSC is not good with feature selection, that is, the number of features selected by NSC is very large and unstable, while the number of features selected by FAIR is quite reasonable and stable over different random splits. Clearly, the independent rule without feature selection performs poorly.

Figure 5
Leukemia data. Boxplots of test errors of FAIR, NSC and the independence rule without feature selection over 100 random splits of 72 samples, where 100γ% of the samples from both classes are set as training samples. The three plots from left to ...
Figure 6
Leukemia, Lung Cancer, and Prostate data sets. The number of features selected by FAIR and NSC over 100 random splits of the total samples. In each split, 100γ% of the samples from both class are set as training samples, and the rest are used ...

5.2.2 Lung Cancer Data

We evaluate our method by classifying between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. Lung cancer data were analyzed by Gordon et al.(2002) and are available at http://www.chestsurg.org. There are 181 tissue samples (31 MPM and 150 ADCA). The training set contains 32 of them, with 16 from MPM and 16 from ADCA. The rest 149 samples are used for testing (15 from MPM and 134 from ADCA). Each sample is described by 12533 genes.

As in the Leukemia data set, we first standardize the data to zero mean and unit variance, and then apply the two classification methods to the standardized data set. Classification results are summarized in Table 2. Although FAIR uses 5 more genes than the nearest shrunken centroids method, it has better classification results: both methods perfectly classify the training samples, while our classification procedure has smaller test error.

Table 2
Classification errors of Lung Cancer data

We follow the same procedure as that in Leukemia example to randomly split the 181 samples into training and test sets. FAIR and NSC are applied to the training data, and the test errors are calculated using the test data. The procedure is repeated 100 times with γ = 0.4, 0.5 and 0.6, respectively, and the test error distributions of FAIR, NSC and the independence rule without feature selection can be found in Figure 7. We also present the difference of the test errors between FAIR and NSC in Figure 7. The numbers of features used by FAIR and NSC with γ = 0.5 are shown in the middle panel of Figure 6. Figure 7 shows again that feature selection is very important in high dimensional classification. The performance of FAIR is close to NSC in terms of classification error (Figure 7), but FAIR is stable in feature selection, as shown in the middle panel of Figure 6. One possible reason of Figure 7 might be that the signal strength in this Lung Cancer dataset is relatively weak, and more features are needed to obtain the optimal performance. However, the estimate of the largest eigenvalue is not accurate anymore when the number of features is large, which results in inaccurate estimates of m1 in (4.3).

Figure 7
Lung cancer data. The same as Figure 5 except that the data set is different.

5.2.3 Prostate Cancer Data

The last example uses the prostate cancer data studied in Singh et al.(2002). The data set is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The training data set contains 102 patient samples, 52 of which (labeled as “tumor”) are prostate tumor samples and 50 of which (labeled as “Normal”) are prostate samples. There are around 12600 genes. An independent set of test samples is from a different experiment and has 25 tumor and 9 normal samples.

We preprocess the data by standardizing the gene expression data as before. The classification results are summarized in Table 3. We make the same test error as and a bit larger training error than the nearest shrunken centroids method, but the number of selected genes we use is much less.

Table 3
Classification errors of Prostate Cancer data set

The samples are randomly split into training and test sets in the same way as before, the test errors are calculated, and the number of features used by these two methods are recorded. Figure 8 shows the test errors of FAIR, NSC and the independence rule without feature selection, and the difference of the test errors of FAIR and NSC. The bottom panel of Figure 6 presents the numbers of features used by FAIR and NSC in each random split for γ = 0.6. As we mentioned before, the plots for γ = 0.4 and 0.5 are similar so we omit them in the paper. The performance of FAIR is better than that of NSC both in terms of classification error and in terms of the selection of features. The good performance of FAIR might be caused by the strong signal level of few features in this data set. Due to the strong signal level, FAIR can attain the optimal performance with small number of features. Thus, the estimate of m1 in (4.3) is accurate and hence the actual performance of FAIR is good.

Figure 8
Prostate cancer data. The same as Figure 5 except that the data set is different.

6 Conclusion

This paper studies the impact of high dimensionality on classifications. To illustrate the idea, we have considered the independence classification rule, which avoids the difficulty of estimating large covariance matrix and the diverging condition number frequently associated with the large covariance matrix. When only a subset of the features capture the characteristics of two groups, classification using all dimensions would intrinsically classify the noises. We prove that classification based on linear projections onto almost all directions performs nearly the same as random guessing. Hence, it is necessary to choose direction vectors which put more weights on important features.

The two-sample t-test can be used to choose the important features. We have shown that under mild conditions, the two sample t-test can select all the important features with probability one. The features annealed independence rule using hard thresholding, FAIR, is proposed, with the number of features selected by a data-driven rule. An upper bound of the classification error of FAIR is explicitly given. We also give suggestions on the optimal number of features used in classification. Simulation studies and real data analysis support our theoretical results convincingly.


Financial support from the NSF grants DMS-0354223 and DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged. The authors acknowledge gratefully the helpful comments of referees that led to the improvement of the presentation and the results of the paper.

7 Appendix

Proof of Theorem 1

For θ [set membership] Γ, Ψ defined in (2.2) can be bounded as


where we have used the assumption that λmax(R) ≤ b0. Denote by


We next study the asymptotic behavior of Ψ~.

Since Condition 1(b) in Section 3 is satisfied automatically for normal distribution, by Lemma 2 below we have D^=D(1+oP(1)), where oP(1) holds uniformly across all diagonal elements. Thus, the right hand side of (7.1) can be written as


We first consider the denominator. Notice that it can be decomposed as


where σj2 is the j-th diagonal entry of D, σ^j2, is the j-th diagonal entry of D^, and ϵ^kj=i=1nkϵkijnk,k=1,2. Notice that ϵ^1ϵ^2~N(0,nn1n2Σ). By singular value decomposition we have


where QR is orthogonal matrix and VR=diag{λR,1,,λR,p} be the eigenvalues of the correlation matrix R. Define ϵ~=n1n2nVR12QRD12(ϵ^1ϵ^2), then ϵ~~N(0,I). Hence,


Since i=1pλR,i=p and λR,i0 for all i = 1, (...), p, we have 1p2i=1pλR,i2<. By weak law of large number we have


Next, we consider I1. Note that I1 has the distribution I1N(0,nn1n2αD1ΣD1α). Since λmaxb0,nαD1αnCp and


we have I1=αD1αoP(1). This together with (7.2) and (7.3) yields


Now, we consider the numerator. It can be decomposed as


Denote by I~3=αjσj2(ϵ^2j). Note that


Define Fj=n2αjσj2ϵ^2jαD1α, then σFj2var(Fj)1 for all j. For the normal distribution, we have the following tail probability inequality


Since FjN(0,σFj2), by the above inequality we have


with C some positive constant, for all x > 0 and j = 1,(...), p. By Lemma 2.2.10 of van de Vaart and Wellner (1996, P102), we have


where K is some universal constant. This together with (7.5) ensures that




Now we only need to consider I~3. Note that I~3=αjσj2ϵ^2jN(0,1n2αD1ΣD1α). Since the variance term can be bounded as


By the assumption that nαD1α and λmax(R) is bounded, we have I~3=12αD1αoP(1). Combining this with (7.6) leads to


We now examine I4 and I5. By the similar proof to (7.3) above we have


Thus the numerator can be written as


and by (7.4)


Since ax1+a2x is an increasing function of x and αj2σj2Cp, in view of (7.1) and the definition of the parameter space Γ, we have


If p/(nCp) → 0, then W(δ^)=1Φ(12[n1n2(pnb0)]12Cp{1+oP(1)}). Furthermore, if {n1n2pn}12CpC0 with C0 some constant, then


This completes the proof.

Proof of Theorem 2

suppose we have a new observation X from class C1. Then the posterior classification error of using δ^a() is


Where Ψa=aμ1aμ^aΣa,Φ() is the standard Gaussian distribution function, and Ea means expectation taken with respect to a. We are going to show that


which together with the continuity of ·(.) and the dominated convergence theorem gives


Therefore, the posterior error W(δ^a,θ) is no better than the random guessing.

Now, let us prove (7.7). Note that the random vector a can be written as


where Z is a p-dimensional standard Gaussian distributed random vector, independent of all the observations Y ki and X. Therefore,


where α=μ1μ2 and ϵ^k=1nki=1nkϵki,k=1,2. By the singular value decomposition we have


where Q is an orthogonal matrix and V=diag{λ1,,λp} is a diagonal matrix. Let Z~=QZ, then Z~ is also a p-dimensional standard Gaussian random vector. Hence the denominator of Ψa can be written as


where Z~j is the j-th entry of Z~. Since it is assumed that limp1p2j=1pλj2< and limp1pj=1pλj=τ for some positive constant τ, by the weak law of large numbers, we have


Next, we study the numerator of Ψa in (7.8). Since 1pj=1pαj20, the first term of the numerator converges to 0 in probability, i.e.,


Let ε=n1n2n(ϵ^1+ϵ^2) and ε~=V12Qε, then ε~ has distribution N(0, I) and is independent of Z~. The second term of the numerator can be written as


Since nn1n2pj=1pλj0<, it follows from the weak law of large number that


This together with (7.8), (7.9), and (7.10) completes the proof.

We need the the following two lemmas to prove Theorem 3.

Lemma 1

[Cao (2005)] Let n = n1 + n2. Assume that there exist 0 < c1c2 < 1 such that c1n1/n2c2. Let T~j=Tjμj1μj2S1j2n1+S1j2n2. Then for any xx(n1,n2) satisfying x → ∞ and x = o(n1/2),


If in addition, if we have only EY1ij3< and EY2ij3<, then


where d=(EY1ij3+EY2ij3)(var(Y1ij)+var(Y2ij))32 and O(1) is a finite constant depending only on c1 and c2. In particular,


uniformly in x(0,o(n16)).

Lemma 2

Suppose Condition 1(b) holds and log p = o(n). Let Skj2 be the sample variance defined in Section 1, and σkj2 be the variance of the j-th feature in class Ck. Suppose min σkj2 is bounded away from 0. Then we have the following uniform convergence result


Proof of Lemma 2

For any ε>0, we know when nk is very large,


It follows from Bernstein's inequality that




Since log p = o(n), we have I1 = o(1) and I2 = o(1). These together with (7.11) completes the proof of Lemma 2.

Proof of Theorem 3

We devidte the proof into two parts. (a) Let us first look at the probability P(maxj>sTj>x). Clearly,


Note that for all j > s, αj = μj1 - μj2 = 0. By Condition 1(b) and Lemma 1, the following inequality holds for 0 ≤ xn1/6/d,


where C is a constant that only depends on c1 and c2, and


with σkj2 the j-th diagonal element of Σk. For the normal distribution, we have the following tail probability inequality


This together with the symmetry of Tj gives


Combining the above inequality with (7.12), we have


Since log(p - s) = o(nγ) with 0<γ<13, if we let xcnγ2, then


which along with (7.12) yields


(b) Next, we consider P (minjs |Tj| ≤ x). Notice that for js, αj = μ1j - μ2j ≠ 0. Let ηj=αjS1j2n1+S1j2n2 and define


Then following the same lines as those in (a), we have


It follows from Lemma 2 that,


Hence, uniformly over j = 1, (...), s, we have




with c2 defined in Theorem 3. Let α0=minjsμj1μj2σ1j2+c2σ2j2. Then it follows that


By part (a), we know that x ~ cnγ/2 and log(p-s) = o(nγ). Thus if α0minjsμj1μj2σj12+σj22=nγβn for some βn, then similarly to part (a), we have


Combination of part (a) and part (b) completes the proof.

Proof of Theorem 4

The classification error of the truncated classifier δ^NCm is


We first consider the denominator. Note that α^jN(αj,nn1n2). It can be shown that


which together with the assumptions nmj=1mαj2 gives


Next, let us look at the numerator. We decompose it as


Since the second term above has the distribution N(0,j=1mαj2n2), it follows from the assumption nj=1mαj2 that


The third term in (7.13) can be written as


Hence the numerator is


Therefore, the classification error is


This concludes the proof.

Proof of Theorem 5

Note that the classification error of δ^FAIRbn is


We divide the proof into two parts: the numerator andthe denominator.

(a) First, we study the numerator of ΨH. It can be decomposed as


where I1=jA(μ1jμ^j)α^j1{α^jbn} and I2=jAc(μ1jμ^j)α^j1{α^jbn} with Ac the complementary of the set A. Note that


Since α^jN(αj,nn1n2), it follows from the normal tail probability inequality that for every jAc and bn>maxjAcαj,


where M is a generic constant. Thus for every ε > 0, if log(pm)[n(bnmaxjAcαj)2]0 and maxjAcαj<bn, we have


which tends to zero. Hence,


We next consider I2,2. Since E(ϵ^2j)2=1n2,log(pm)[n(bnmaxjAcαj)2]0, and maxjAcαj<bn, we have


which converges to 0. Therefore,


Then, we consider I2,3. Since c1n1/n2c2 and E(ϵ^1j2ϵ^2j2)2=3n12+3n222n1n2n12n223c2+32c1c1n22, by (7.14) we have for every ε > 0,


where M is some generic constant. Thus, I2,3P0. Combination of this with (7.15) and (7.16) entails


We now deal with I1. Decompose I1 similarly as


We first study I1,2. By using α^jN(αj,nn1n2), it can be shown that


since nmj=1mαj2, we have (4nn1n2jAαj2+2mn2n12n22)12jAαj20. Therefore,


Next, we look at I1,1. For any ε > 0,


When n is large enough, the above probability can be bounded by


which along with the assumption jAαj[njAαj2]0 gives


It follows that the numerator is bounded from below by


(b) Now, we study the denominator of Ψ. Let


We first show that J2P0. Note that Eα^j4=αj4+6n(n1n2)1αj2+3n2(n1n2)2. Thus,


This together with (7.14) and the assumption that log(pm)[n(bnmaxjAcαj)2]0 yields J2P0 as n,p. Now we study term J1. By (7.17), we have


Hence the denominator is bounded from a above by (1+oP(1))jAαj2+mnn1n2. Therefore,


It follows that the classification error is bounded from above by


This completes the proof.

Contributor Information

Jianqing Fan, Princeton University.

Yingying Fan, Harvard University.


  • ANTONIADIS A, LAMBERT-LACROIX S, LEBLANC F. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics. 2003;19:563–570. [PubMed]
  • BAI Z, SARANADASA H. Effect of high dimension : by an example of a two sample problem. Statistica Sinica. 1996;6:311–329.
  • BAIR E, HASTIE T, DEBASHIS P, TIBSHIRANI R. Prediction by supervised principal components. The Annals of Statistics. 2007 to appear.
  • BICKEL PJ, LEVINA E. Some theory for Fisher's linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010.
  • BOULESTEIX A. PLS Dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology. 2004;3:1–33. [PubMed]
  • BÜHLMANN P, YU B. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association. 2003;98:324–339.
  • BURA E, PFEIFFER RM. Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics. 2003;19:1252–1258. [PubMed]
  • CAO HY. Moderate deviations for two sample t-statistics. Probability and Statistics. 2007 Forthcoming.
  • CHIAROMONTE F, MARTINELLI J. Dimension reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences. 2002;176:123–144. [PubMed]
  • DETTLING M, BÜHLMANN P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003;19(9):1061–1069. [PubMed]
  • DUDOIT S, FRIDLYARD J, SPEED TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87.
  • FAN J. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association. 1996;91:674–688.
  • FAN J, HALL P, YAO Q. To how many simultaneous hypothesis tests can normal, student's t or Bootstrap calibration be applied? Manuscript. 2006
  • FAN J, LI R. Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Sanz-Sole M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. Vol. III. 2006. pp. 595–622.
  • FAN J, REN Y. Statistical analysis of DNA microarray data. Clinical Cancer Research. 2006;12:4469–4473. [PubMed]
  • FAN J, LV J. Sure independence screening for ultra-high dimensional feature space. Manuscript. 2007
  • FRIEDMAN JH. Regularized discriminant analysis. Journal of the American Statistical Association. 1989;84:165–175.
  • GHOSH D. Singular value decomposition regression modeling for classification of tumors from microarray experiments. Proceedings of the Pacific Symposium on Biocomputing. 2002:11462–11467. [PubMed]
  • GREENSHTEIN E. Best subset selection, persistence in high dimensional statistical learning and optimization under l1 constraint. Ann. Statist. 2006 to appear.
  • GREENSHTEIN E, RITOV Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
  • HUANG X, PAN W. Linear regression and two-class classification with gene expression data. Bioinformatics. 2003;19:2072–2978. [PubMed]
  • LIN Z, LU C. Limit Theory for Mixing Dependent Random Variables. Kluwer Academic Publishers; Dordrecht: 1996.
  • MEINSHAUSEN N. Relaxed Lasso. Computational Statistics and Data Analysis. 2007 to appear.
  • NGUYEN DV, ROCKE DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18:39–50. [PubMed]
  • SHAO QM. Self-normalized Limit Theorems in Probability and Statistics. Manuscript. 2005
  • TIBSHIRANI R, HASTIE T, NARASIMHAN B, CHU G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 2002;99:6567–6572. [PMC free article] [PubMed]
  • VAN DER VAART AW, WELLNER JA. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996.
  • WEST M, BLANCHETTE C, FRESSMAN H, HUANG E, ISHIDA S, SPANG R, ZUAN H, MARKS JR, NEVINS JR. Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci. 2001;98:11462–11467. [PMC free article] [PubMed]
  • ZOU H, HASTIE T, TIBSHIRANI R. Sparse principal component analysis. Technical report. 2004


Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...


  • PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...