
# High Dimensional Classification Using Features Annealed Independence Rules


## Abstract

Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina (2004) show that the Fisher discriminant performs poorly due to diverging spectra, and they propose the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as random guessing because of noise accumulation in estimating the population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as badly as random guessing. Thus, it is paramount to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample *t*-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.

**Keywords:** Classification, feature extraction, high dimensionality, independence rule, misclassification rates

## 1 Introduction

With the rapid advance of imaging technology, high-throughput data such as microarray and proteomics data arise frequently in contemporary statistical studies. For instance, in the analysis of microarray data, the dimensionality is frequently in the thousands or more, while the sample size is typically on the order of tens (West *et al.*, 2001; Dudoit *et al.*, 2002). See Fan and Ren (2006) for an overview. The large number of features presents an intrinsic challenge to classification problems. For an overview of statistical challenges associated with high dimensionality, see Fan and Li (2006).

Classical methods of classification break down when the dimensionality is extremely large. For example, even when the covariance matrix is known, Bickel and Levina (2004) demonstrate convincingly that Fisher discriminant analysis performs poorly in a minimax sense, due to the diverging spectra (*e.g.*, the condition number going to infinity as the dimensionality diverges) frequently encountered in high-dimensional covariance matrices. Even if the true covariance matrix is not ill conditioned, the singularity of the sample covariance matrix makes the Fisher discrimination rule inapplicable when the dimensionality is larger than the sample size. Bickel and Levina (2004) show that the independence rule overcomes both of these problems. However, in tumor classification using microarray data, we hope to find tens of genes that have high discriminative power. The independence rule, as studied by Bickel and Levina (2004), does not possess this property.

The difficulty of high-dimensional classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of the misclassification rate. Though the importance of dimension reduction and feature selection has been stressed and many methods have been proposed in the literature, very little theoretical analysis has been done on the impact of high dimensionality on classification. For example, most discrimination rules, such as the linear discriminants, require estimating the population mean vectors from the sample. When the dimensionality is high, even though each component of the population mean vectors can be estimated accurately, the aggregated estimation error can be very large, and this has an adverse effect on the misclassification rate. Therefore, when only a fraction of the features accounts for most of the variation in the data, as in tumor classification using gene expression data, using all the features will increase the misclassification rate.

To illustrate the idea, we study the independence classification rule. Specifically, we give an explicit formula for how the signal and noise affect the misclassification rate. We show formally how large the signal-to-noise ratio must be for the effect of noise accumulation to be ignorable, and how small this ratio can be before the independence classifier performs as badly as random guessing. Indeed, as demonstrated in Section 2, the impact of the dimensionality can be very drastic. For the independence rule, the misclassification rate can be as high as that of random guessing even when the problem is perfectly classifiable. In fact, we demonstrate that almost all linear discriminants cannot perform better than random guessing, due to the noise accumulation in the estimation of the population mean vectors, unless the signals are very strong, namely the population mean vectors are very far apart.

The above discussion reveals that feature selection is necessary for high-dimensional classification problems. When the independence rule is applied to selected features, the resulting Features Annealed Independence Rules (FAIR) overcome both the issue of interpretability and that of noise accumulation. One can extract the important features via variable selection techniques such as the penalized quasi-likelihood; see Fan and Li (2006) for an overview. One can also employ a simple two-sample *t*-test, as in Tibshirani *et al.* (2002), to identify important genes for tumor classification, resulting in the nearest shrunken centroids method. Such a simple method corresponds to a componentwise regression method, or a ridge regression method with ridge parameters tending to ∞ (Fan and Lv, 2007). Hence, it is a specific and useful example of the penalized quasi-likelihood method for feature selection. It is surprising that such a simple proposal can indeed extract all important features: we demonstrate that, under suitable conditions, the two-sample *t*-statistic can identify all the features that efficiently characterize both classes.

Another popular class of dimension reduction methods is projection. Projection methods have been widely applied to classification based on gene expression data; see, for example, principal component analysis in Ghosh (2002), Zou *et al.* (2004), and Bair *et al.* (2004); partial least squares in Nguyen and Rocke (2002), Huang and Pan (2003), and Boulesteix (2004); and sliced inverse regression in Chiaromonte and Martinelli (2002), Antoniadis *et al.* (2003), and Bura and Pfeiffer (2003). These methods attempt to find directions that result in small classification errors, and the directions they find usually put much more weight on features with large classification power. In general, however, linear projection methods are likely to perform poorly unless the projection vector is sparse, namely, the effective number of selected features is small. This is due to the aforementioned noise accumulation prominently featured in high-dimensional problems: recall that discrimination based on linear projections onto almost all directions can perform as badly as random guessing.

As direct application of the independence rule is not efficient, we propose a specific form of FAIR. Our FAIR selects the *m* statistically most significant features according to the componentwise two-sample *t*-statistics between the two classes, and applies the independence classifier to these *m* features. Interesting questions include how to choose the optimal *m*, or equivalently the threshold value of the *t*-statistic, so that the classification error is minimized, and how this classifier performs compared with the independence rule without feature selection and with the oracle-assisted FAIR. All these questions are formally answered in this paper. Surprisingly, the results are similar to those for the adaptive Neyman test in Fan (1996). The theoretical results also indicate that FAIR without oracle information performs worse than FAIR with oracle information, with a difference in classification error that depends on the threshold value, which is consistent with intuition.

There is a huge literature on classification. To name a few works in addition to those mentioned before: Bai and Saranadasa (1996) dealt with the effect of high dimensionality in a two-sample problem from a hypothesis testing viewpoint; Friedman (1989) proposed a regularized discriminant analysis to deal with the problems associated with high dimension while performing computations in the regular way; Dettling and Bühlmann (2003) and Bühlmann and Yu (2003) studied boosting with the logit loss and the *L*_{2} loss, respectively, and demonstrated the good performance of these methods in high-dimensional settings; Greenshtein and Ritov (2004), Greenshtein (2006) and Meinshausen (2005) introduced and studied the concept of persistence, which places more emphasis on misclassification rates or expected loss than on the accuracy of estimated parameters.

This article is organized as follows. In Section 2, we demonstrate the impact of dimensionality on the independence classification rule, and show that discrimination based on projecting observations onto almost all linear directions is nearly the same as random guessing. We establish, in Section 3, the conditions under which two sample *t*-test can identify all the important features with probability tending to 1. In Section 4, we propose FAIR and give an upper bound of its classification error. Simulation studies and real data analyses are conducted in Section 5. The conclusion of our study is summarized in Section 6. All proofs are given in the Appendix.

## 2 Impact of High Dimensionality

Consider the *p*-dimensional classification problem between two classes *C*_{1} and *C*_{2}. Suppose that from class *C*_{k} we have *n*_{k} observations $\mathbf{Y}_{k1}, \ldots, \mathbf{Y}_{kn_k}$ in ${\mathbb{R}}^{p}$. The *j*-th feature of the *i*-th sample from class *C*_{k} satisfies the model

$$Y_{kij} = \mu_{kj} + \epsilon_{kij}, \qquad (2.1)$$

where μ_{kj} is the mean effect of the *j*-th feature in class *C*_{k} and ε_{kij} is the corresponding Gaussian random noise for the *i*-th observation. In matrix notation, the above model can be written as

$$\mathbf{Y}_{ki} = \mu_k + \epsilon_{ki},$$

where $\mu_k = (\mu_{k1}, \ldots, \mu_{kp})'$ is the mean vector of class *C*_{k} and $\epsilon_{ki} = (\epsilon_{ki1}, \ldots, \epsilon_{kip})'$ has the $N(\mathbf{0}, \mathbf{\Sigma}_k)$ distribution. We assume that all observations are independent across samples and, in addition, that within class *C*_{k} the observations $\mathbf{Y}_{k1}, \ldots, \mathbf{Y}_{kn_k}$ are identically distributed. Throughout this paper, we make the assumption that the two classes have comparable sample sizes, i.e., $c_1 \le n_1/n_2 \le c_2$ with *c*_{1} and *c*_{2} some positive constants.

We first investigate the impact of high dimensionality on classification. For simplicity, we temporarily assume that the two classes *C*_{1} and *C*_{2} have the same covariance matrix **Σ**. To illustrate our idea, we consider the independence classification rule, which classifies the new feature vector **x** into class *C*_{1} if

$$\delta(\mathbf{x}) = (\mu_1 - \mu_2)'\mathbf{D}^{-1}(\mathbf{x} - \mu) > 0,$$

where μ = (μ_{1} + μ_{2})/2 and **D** = diag(**Σ**). This classifier has been thoroughly studied in Bickel and Levina (2004). They showed that in the classification of two normal populations, this independence rule greatly outperforms the Fisher linear discriminant rule under broad conditions when the number of variables is large.

The independence rule depends on the marginal parameters μ_{1}, μ_{2} and $\mathbf{D}=\mathrm{diag}\{{\sigma}_{1}^{2},\cdots ,{\sigma}_{p}^{2}\}$. They can easily be estimated from the samples:

$$\hat{\mu}_{kj} = \bar{Y}_{kj}, \qquad k = 1, 2,$$

and

$$\hat{\mathbf{D}} = \mathrm{diag}\{\hat{\sigma}_1^2, \cdots, \hat{\sigma}_p^2\} \quad \text{with} \quad \hat{\sigma}_j^2 = \frac{(n_1-1)S_{1j}^2 + (n_2-1)S_{2j}^2}{n_1+n_2-2},$$

where ${S}_{kj}^{2}={\sum}_{i=1}^{{n}_{k}}{(Y_{kij}-\bar{Y}_{kj})}^{2}/({n}_{k}-1)$ is the sample variance of the *j*-th feature in class *k* and $\bar{Y}_{kj}={\sum}_{i=1}^{{n}_{k}}{Y}_{kij}/{n}_{k}$. Hence, the plug-in discrimination function is

$$\hat{\delta}(\mathbf{x}) = (\hat{\mu}_1 - \hat{\mu}_2)'\hat{\mathbf{D}}^{-1}(\mathbf{x} - \hat{\mu}) \quad \text{with} \quad \hat{\mu} = (\hat{\mu}_1 + \hat{\mu}_2)/2.$$
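As a concrete illustration, the plug-in rule can be sketched in a few lines of Python. This is only an illustrative sketch: the pooled variance estimator for $\hat{\mathbf{D}}$ and the function name `independence_rule` are our choices, not prescribed by the text.

```python
import numpy as np

def independence_rule(Y1, Y2):
    """Fit the plug-in independence rule from two training samples.

    Y1, Y2: arrays of shape (n_k, p), one row per observation.
    Returns a function mapping x to class label 1 (C1) or 2 (C2).
    Note: the pooled variance estimate below is an assumption; the
    text only requires some componentwise variance estimate for D.
    """
    n1, n2 = len(Y1), len(Y2)
    mu1, mu2 = Y1.mean(axis=0), Y2.mean(axis=0)
    # componentwise pooled sample variances estimating D = diag(Sigma)
    s2 = ((n1 - 1) * Y1.var(axis=0, ddof=1)
          + (n2 - 1) * Y2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    mu = (mu1 + mu2) / 2.0

    def classify(x):
        # delta-hat(x) = (mu1 - mu2)' D^{-1} (x - mu)
        delta = np.dot((mu1 - mu2) / s2, x - mu)
        return 1 if delta > 0 else 2

    return classify
```

Note that the rule uses only the *p* marginal means and variances, never the off-diagonal entries of **Σ**, which is what makes it feasible when *p* far exceeds *n*.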

Denote the parameter by θ = (μ_{1}, μ_{2}, **Σ**). If we have a new observation **X** from class *C*_{1}, then the misclassification rate of $\hat{\delta}$ is

$$W(\hat{\delta}, \theta) = P(\hat{\delta}(\mathbf{X}) \le 0 \mid \mathbf{X} \in C_1) = \bar{\Phi}(\Psi),$$

where

$$\Psi = \frac{(\hat{\mu}_1 - \hat{\mu}_2)'\hat{\mathbf{D}}^{-1}(\mu_1 - \hat{\mu})}{\{(\hat{\mu}_1 - \hat{\mu}_2)'\hat{\mathbf{D}}^{-1}\mathbf{\Sigma}\hat{\mathbf{D}}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)\}^{1/2}},$$

Φ(·) is the standard Gaussian distribution function, and $\bar{\Phi} = 1 - \Phi$. The worst case classification error is

$$W(\hat{\delta}) = \max_{\theta \in \Gamma} W(\hat{\delta}, \theta),$$

where Γ is some parameter space to be defined. Let *n* = *n*_{1} + *n*_{2}. In our asymptotic analysis, we always consider the misclassification rate of observations from *C*_{1}, since the misclassification rate of observations from *C*_{2} can easily be obtained by interchanging *n*_{1} with *n*_{2} and μ_{1} with μ_{2}. The high dimensionality is modeled through its dependence on *n*, namely *p*_{n} → ∞; however, we suppress the dependence on *n* whenever there is no confusion.

Let **R** = **D**^{-1/2}**ΣD**^{-1/2} be the correlation matrix, λ_{max}(**R**) its largest eigenvalue, and α = (α_{1}, …, α_{p})′ = μ_{1} - μ_{2}. Consider the parameter space

$$\Gamma = \left\{ \theta : \alpha'\mathbf{D}^{-1}\alpha \ge C_p,\; \lambda_{\max}(\mathbf{R}) \le b_0,\; \min_{1\le j\le p,\, k=1,2} \sigma_{kj}^2 > 0 \right\},$$

where *C*_{p} is a deterministic positive sequence that depends only on the dimensionality *p*, and *b*_{0} is a positive constant. Note that α′**D**^{-1}α corresponds to the overall strength of the signals, and the first condition α′**D**^{-1}α ≥ *C*_{p} imposes a lower bound on this strength. The second condition λ_{max}(**R**) ≤ *b*_{0} requires that the maximum eigenvalue of **R** not exceed a positive constant; since there are no restrictions on the smallest eigenvalue of **R**, the condition number can still diverge. The third condition $\min_{1\le j\le p,\, k=1,2}{\sigma}_{kj}^{2}>0$ ensures that there are no deterministic features that make classification trivial and that the diagonal matrix **D** is always invertible. We will consider the asymptotic behavior of $W(\hat{\delta},\theta)$ and $W(\hat{\delta})$.

### Theorem 1

*Suppose that* log *p* = *o*(*n*), *n* = *o*(*p*) *and nC*_{p} → ∞. *Then:*

(i) *The classification error* $W(\hat{\delta}, \theta)$ *with* θ ∈ Γ *is bounded from above as*

$$W(\hat{\delta}, \theta) \le 1 - \Phi\left( \frac{C_p(1 + o_P(1)) + p\,(1/n_2 - 1/n_1)}{2\sqrt{b_0}\,\{C_p + pn/(n_1n_2)\}^{1/2}(1 + o_P(1))} \right).$$

(ii) *Suppose* $p/(nC_p) \to 0$. *For the worst case classification error* $W(\hat{\delta})$, *we have*

$$W(\hat{\delta}) \le 1 - \Phi\left( \frac{\{n_1n_2/(pn)\}^{1/2} C_p}{2\sqrt{b_0}}\,(1 + o_P(1)) \right).$$

*Specifically, when* ${\{n_1n_2/(pn)\}}^{1/2}C_p \to C_0$ *with C*_{0} *a nonnegative constant, then*

$$W(\hat{\delta}) \stackrel{P}{\to} 1 - \Phi\left( \frac{C_0}{2\sqrt{b_0}} \right).$$

*In particular, if C*_{0} = 0, *then* $W(\hat{\delta}) \stackrel{P}{\to} \frac{1}{2}$.

Theorem 1 reveals the trade-off between the signal strength *C*_{p} and the dimensionality, reflected in the term ${C}_{p}/\sqrt{p}$ when all features are used for classification. It states that the independence rule $\hat{\delta}$ would be no better than random guessing due to noise accumulation, unless the signal levels are extremely high, say, ${\{n/p\}}^{1/2}{C}_{p}\ge B$ for some *B* > 0. Indeed, discrimination based on linear projections onto almost all directions performs nearly the same as random guessing, as shown in the theorem below. The poor performance is caused by noise accumulation in the estimation of μ_{1} and μ_{2}.
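The noise-accumulation phenomenon is easy to reproduce numerically. The sketch below is an illustration, not the paper's exact experiment: it assumes **Σ** = **I** with known unit variances, plants a fixed sparse signal, and shows the Monte Carlo error of the plug-in independence rule climbing toward 1/2 as *p* grows.

```python
import numpy as np

def indep_rule_error(p, n1=30, n2=30, s=10, signal=1.0, n_test=2000, seed=0):
    """Monte Carlo misclassification rate of the plug-in independence
    rule (Sigma = I) when only the first s of p features carry signal.
    Illustrative sketch of the noise accumulation in Theorem 1."""
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(p)
    mu1[:s] = signal                            # sparse mean difference
    Y1 = rng.normal(mu1, 1.0, size=(n1, p))     # training data, class C1
    Y2 = rng.normal(0.0, 1.0, size=(n2, p))     # training data, class C2
    m1, m2 = Y1.mean(0), Y2.mean(0)
    w, mu = m1 - m2, (m1 + m2) / 2.0            # plug-in centroid rule
    X = rng.normal(mu1, 1.0, size=(n_test, p))  # test points from C1
    return np.mean((X - mu) @ w <= 0.0)         # fraction misclassified

low_dim = indep_rule_error(p=20)      # few noise features: small error
high_dim = indep_rule_error(p=5000)   # many noise features: near 1/2
```

With the same ten signal coordinates, the estimated centroids of the 4,990 pure-noise coordinates dominate the discriminant in the high-dimensional run, so its error is far worse despite identical signal strength.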

### Theorem 2

*Suppose that* **a** *is a p-dimensional uniformly distributed unit random vector on the* (*p* - 1)*-dimensional sphere. Let* λ_{1}, …, λ_{p} *be the eigenvalues of the covariance matrix* **Σ**. *Suppose* ${\lim}_{p}\, \frac{1}{p^2}\sum_{j=1}^{p}\lambda_j^2 < \infty$ *and* ${\lim}_{p}\, \frac{1}{p}\sum_{j=1}^{p}\lambda_j = \tau$ *with* τ *a positive constant. Moreover, assume that p*^{-1}α′α → 0. *Then, if we project all the observations onto the vector* **a** *and use the classifier*

$$\hat{\delta}_{\mathbf{a}}(\mathbf{x}) = \mathbf{a}'(\hat{\mu}_1 - \hat{\mu}_2)\left\{\mathbf{a}'\mathbf{x} - \mathbf{a}'(\hat{\mu}_1 + \hat{\mu}_2)/2\right\}, \qquad (2.3)$$

*the misclassification rate of* $\hat{\delta}_{\mathbf{a}}$ *satisfies*

$$P\big(\hat{\delta}_{\mathbf{a}}(\mathbf{X}) \le 0\big) \to \frac{1}{2},$$

*where the probability is taken with respect to* **a** *and* **X** ∈ *C*_{1}.

## 3 Feature Selection by Two-Sample *t*-Test

To extract salient features, we appeal to the two-sample *t*-test statistics. Other componentwise tests such as the rank sum test can also be used, but we do not pursue them in detail. The two-sample *t*-statistic for feature *j* is defined as

$$T_j = \frac{\bar{Y}_{1j} - \bar{Y}_{2j}}{\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}}, \qquad j = 1, \ldots, p, \qquad (3.1)$$

where $\bar{Y}_{kj}$ and ${S}_{kj}^{2}$ are the same as those defined in Section 2. We work under more relaxed technical conditions: the normality assumption is not needed. Instead, we assume merely that the noise vectors ε_{ki}, *i* = 1, …, *n*_{k}, are i.i.d. within class *C*_{k} with mean **0** and covariance matrix **Σ**_{k}, and are independent between classes. The covariance matrix **Σ**_{1} can also differ from **Σ**_{2}.
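The componentwise statistic in (3.1) vectorizes directly over all *p* features; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def two_sample_t(Y1, Y2):
    """Componentwise two-sample t-statistics, one per feature:
    T_j = (Ybar_1j - Ybar_2j) / sqrt(S^2_1j/n1 + S^2_2j/n2).

    Y1, Y2: arrays of shape (n_k, p). Returns an array of length p.
    """
    n1, n2 = len(Y1), len(Y2)
    num = Y1.mean(axis=0) - Y2.mean(axis=0)
    den = np.sqrt(Y1.var(axis=0, ddof=1) / n1 + Y2.var(axis=0, ddof=1) / n2)
    return num / den
```

Because the statistic is formed feature by feature, no estimate of the full covariance matrix is needed, which is what makes the screening step practical for *p* in the thousands.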

To show that the *t*-statistic can select all the important features with probability tending to 1, we need the following condition.

### Condition 1

- The vector *α* = *μ*_{1} - *μ*_{2} is sparse and, without loss of generality, only the first *s* entries are nonzero.
- The variables ε_{kij} and ${\epsilon}_{kij}^{2}-{\sigma}_{kj}^{2}$ satisfy Cramér's condition; that is, there exist constants ν_{1}, ν_{2}, *M*_{1} and *M*_{2} such that $E{\mid {\epsilon}_{kij}\mid}^{m}\le m!{M}_{1}^{m-2}{\nu}_{1}/2$ and $E{\mid {\epsilon}_{kij}^{2}-{\sigma}_{kj}^{2}\mid}^{m}\le m!{M}_{2}^{m-2}{\nu}_{2}/2$ for all *m* = 1, 2, ….
- The diagonal elements of both **Σ**_{1} and **Σ**_{2} are bounded away from 0.

The following theorem describes the situation under which the two-sample *t*-test can pick up all important features by choosing an appropriate critical value. Recall that *c*_{1} ≤ *n*_{1}/*n*_{2} ≤ *c*_{2} and *n* = *n*_{1} + *n*_{2}.

### Theorem 3

*Let s be a sequence such that* log(*p* - *s*) = *o*(*n*^{γ}) *and* $\mathrm{log}\; s = o\left(n^{\frac{1}{2}-\gamma}\beta_n\right)$ *for some* β_{n} → ∞ *and* $0 < \gamma < \frac{1}{3}$. *Suppose that* $\min_{1\le j\le s}\frac{\mid \alpha_j\mid}{\sqrt{\sigma_{1j}^2 + \sigma_{2j}^2}} = n^{-\gamma}\beta_n$. *Then under Condition 1, for x* ~ *cn*^{γ/2} *with c some positive constant, we have*

$$P\left( \min_{1\le j\le s} \mid T_j\mid > x \ \text{ and }\ \max_{s < j \le p} \mid T_j\mid < x \right) \to 1.$$

In the proof of Theorem 3, we use the moderate deviation results for the two-sample *t*-statistic (see Cao, 2007, or Shao, 2005). Theorem 3 allows the lowest signal level to decay with the sample size *n*. As long as the rate of decay is not too fast and the sample size is not too small, the two-sample *t*-test can pick up all the important features with probability tending to 1.

## 4 Features Annealed Independence Rules

We apply the independence classifier to the selected features, resulting in a Features Annealed Independence Rule (FAIR). In many applications, such as tumor classification using gene expression data, we expect the elements of the population mean difference vector *α* to be sparse: most entries are small. Thus, even if we could use the *t*-test to correctly extract all these features, the resulting choice is not necessarily optimal, since the noise accumulation can exceed the signal accumulation for faint features, as can be seen from Theorem 1. Therefore, it is necessary to further single out the most important features to reduce the misclassification rate.

To help us select the number of features, or the critical value of the test statistic, we first consider the ideal situation in which the important features are located at the first *m* coordinates, so that our task is merely to select *m* to minimize the misclassification rate. This is the case when we have ideal information about the relative importance of the features, as measured by |α_{j}|/σ_{j}, say. When such oracle information is unavailable, we will learn it from the data. In situations where we have vague knowledge about the importance of features, such as tumor classification using gene expression data, we can give high ranks to features with large |α_{j}|/σ_{j}.

In the presentation below, unless otherwise specified, we assume that the two classes *C*_{1} and *C*_{2} are both from Gaussian distributions and that the common covariance matrix is the identity, i.e., **Σ**_{1} = **Σ**_{2} = **I**. If this common covariance matrix is known, the independence classifier $\hat{\delta}$ becomes the nearest centroids classifier

$$\hat{\delta}_{\mathrm{NC}}(\mathbf{x}) = (\hat{\mu}_1 - \hat{\mu}_2)'(\mathbf{x} - \hat{\mu}).$$

If only the first *m* dimensions are used in the classification, the corresponding features annealed independence classifier becomes

$$\hat{\delta}_{\mathrm{NC}}^{m}(\mathbf{x}) = (\hat{\mu}_1^m - \hat{\mu}_2^m)'(\mathbf{x}^m - \hat{\mu}^m),$$

where the superscript *m* means that the vector is truncated after the first *m* entries. This is indeed the same as the nearest shrunken centroids method of Tibshirani *et al.* (2002).
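When the feature ordering is learned from the data rather than given, the procedure amounts to ranking features by |*T*_{j}| and applying the nearest-centroid rule to the top *m*. A minimal sketch (illustrative only; it assumes **Σ** = **I** as above, and the function name is ours):

```python
import numpy as np

def fair_classify(Y1, Y2, X, m):
    """FAIR sketch: keep the m features with the largest two-sample
    |t|-statistics, then apply the nearest-centroid independence rule
    to those features only. Assumes unit feature variances (Sigma = I).

    Y1, Y2: training samples, shape (n_k, p). X: test points, (n, p).
    Returns an array of predicted labels (1 = class C1, 2 = class C2).
    """
    n1, n2 = len(Y1), len(Y2)
    t = (Y1.mean(0) - Y2.mean(0)) / np.sqrt(
        Y1.var(0, ddof=1) / n1 + Y2.var(0, ddof=1) / n2)
    keep = np.argsort(-np.abs(t))[:m]          # top-m features by |T_j|
    mu1, mu2 = Y1.mean(0)[keep], Y2.mean(0)[keep]
    alpha, mu = mu1 - mu2, (mu1 + mu2) / 2.0
    scores = (X[:, keep] - mu) @ alpha         # truncated centroid rule
    return np.where(scores > 0, 1, 2)
```

Keeping only the top-ranked coordinates is precisely what controls the noise accumulation that Theorem 1 attributes to the full-dimensional rule.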

### Theorem 4

*Consider the truncated classifier* ${\hat{\delta}}_{NC}^{m_n}$ *for a given sequence m*_{n}. *Suppose that* $\frac{n}{\sqrt{m_n}}\sum_{j=1}^{m_n}\alpha_j^2 \to \infty$ *as m*_{n} → ∞. *Then the classification error of* ${\hat{\delta}}_{NC}^{m_n}$ *is*

$$W({\hat{\delta}}_{NC}^{m_n}, \theta) = 1 - \Phi\left( \frac{\sum_{j=1}^{m_n}\alpha_j^2 + m_n(1/n_2 - 1/n_1)}{2\{\sum_{j=1}^{m_n}\alpha_j^2 + m_n n/(2n_1 n_2)\}^{1/2}}\,(1 + o_P(1)) \right),$$

*where n* = *n*_{1} + *n*_{2} *as defined in Section 2.*

In the following, we suppress the dependence of *m* on *n* when there is no confusion. The above theorem reveals that the ideal choice of the number of features is

$$m_0 = \operatorname*{argmax}_{1\le m\le p}\, \frac{\left[m^{-1/2}\left\{\sum_{j=1}^{m}\alpha_j^2 + m(1/n_2 - 1/n_1)\right\}\right]^2}{n/(2n_1 n_2) + \sum_{j=1}^{m}\alpha_j^2/m}.$$

It can be estimated as

$$\hat{m}_0 = \operatorname*{argmax}_{1\le m\le p}\, \frac{\left[m^{-1/2}\left\{\sum_{j=1}^{m}\hat{\alpha}_j^2 + m(1/n_2 - 1/n_1)\right\}\right]^2}{n/(2n_1 n_2) + \sum_{j=1}^{m}\hat{\alpha}_j^2/m},$$

where ${\hat{\alpha}}_{j}={\hat{\mu}}_{1j}-{\hat{\mu}}_{2j}$. The expression for *m*_{0} quantifies how the signal and the noise affect the misclassification rate as the dimensionality *m* increases. In particular, when *n*_{1} = *n*_{2}, the expression reduces to ${m}_{0}={\mathrm{argmax}}_{1\le m\le p}\frac{{\left[{m}^{-1/2}{\sum}_{j=1}^{m}{\alpha}_{j}^{2}\right]}^{2}}{2/n+{\sum}_{j=1}^{m}{\alpha}_{j}^{2}/m}$. The term ${m}^{-1/2}{\sum}_{j=1}^{m}{\alpha}_{j}^{2}$ reflects the trade-off between the signal and noise as the dimensionality *m* increases.
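For equal sample sizes *n*_{1} = *n*_{2}, the reduced criterion just displayed can be evaluated in one vectorized pass over the candidate values of *m*. The helper below (our naming) assumes the entries of $\hat{\alpha}$ have been estimated and sorts them internally in decreasing magnitude, so that "the first *m* features" are the *m* strongest ones:

```python
import numpy as np

def choose_m(alpha_hat, n):
    """Select the number of features by the equal-sample-size criterion
    m_0 = argmax_m [m^{-1/2} sum_{j<=m} a_j^2]^2 / (2/n + sum_{j<=m} a_j^2 / m),
    with features sorted so that |alpha_hat| is decreasing."""
    a2 = np.sort(np.abs(alpha_hat))[::-1] ** 2   # sorted squared differences
    csum = np.cumsum(a2)                          # partial signal sums
    m = np.arange(1, len(a2) + 1)
    crit = (csum / np.sqrt(m)) ** 2 / (2.0 / n + csum / m)
    return int(np.argmax(crit)) + 1               # 0-based index -> m
```

With ten equal strong signals and the rest exactly zero, the criterion rises until the signals are exhausted and then decays, so the maximizer sits at the true signal count.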

The good performance of the classifier ${\hat{\delta}}_{\mathrm{NC}}^{m}$ depends on the assumption that the largest entries of **α** cluster in the first *m* dimensions. An ideal version of the classifier $\hat{\delta}_{\mathrm{NC}}$ is to select a subset $\mathcal{A}=\{j : \mid\alpha_j\mid > a\}$ and use this subset to construct the independence classifier. Let *m* be the number of elements in $\mathcal{A}$. The oracle classifier can be written as

$$\hat{\delta}_{\mathcal{A}}(\mathbf{x}) = \sum_{j\in\mathcal{A}}(\hat{\mu}_{1j} - \hat{\mu}_{2j})(x_j - \hat{\mu}_j).$$

The misclassification rate is approximately

$$1 - \Phi\left( \frac{\sum_{j\in\mathcal{A}}\alpha_j^2 + m(1/n_2 - 1/n_1)}{2\{\sum_{j\in\mathcal{A}}\alpha_j^2 + mn/(2n_1 n_2)\}^{1/2}} \right)$$

when $\frac{n}{\sqrt{m}}\sum_{j\in\mathcal{A}}\alpha_j^2 \to \infty$ and *m* → ∞. This is straightforward from Theorem 4. In practice, we do not have such an oracle, and selecting the subset $\mathcal{A}$ is difficult. A simple procedure is to use the features annealed independence rule based on hard thresholding:

$$\hat{\delta}_{\mathrm{FAIR}}^{b}(\mathbf{x}) = \sum_{j=1}^{p}\hat{\alpha}_j\, \mathbf{1}\{\mid\hat{\alpha}_j\mid > b\}\,(x_j - \hat{\mu}_j).$$

We study the classification error of FAIR and the impact of the threshold *b* on the classification result in the following theorem.

### Theorem 5

*Suppose that* $\max_{j\in\mathcal{A}^c}\mid\alpha_j\mid < b_n$ *and* $\log(p-m)/\big[n\,(b_n - \max_{j\in\mathcal{A}^c}\mid\alpha_j\mid)^2\big] \to 0$ *with* $m = \mid\mathcal{A}\mid$. *Moreover, assume that* $\frac{n}{\sqrt{m}}\sum_{j\in\mathcal{A}}\alpha_j^2 \to \infty$ *and* $\sum_{j\in\mathcal{A}}\mid\alpha_j\mid/\big[\sqrt{n}\sum_{j\in\mathcal{A}}\alpha_j^2\big] \to 0$. *Then*

$$W(\hat{\delta}_{\mathrm{FAIR}}^{b_n}, \theta) \le 1 - \Phi\left( \frac{\sum_{j\in\mathcal{A}}\alpha_j^2 - m b_n^2 + m(1/n_2 - 1/n_1)}{2\{\sum_{j\in\mathcal{A}}\alpha_j^2 + mn/(2n_1 n_2)\}^{1/2}}\,(1 + o_P(1)) \right).$$

Notice that the upper bound of $W({\hat{\delta}}_{\mathrm{FAIR}}^{{b}_{n}},\theta )$ in Theorem 5 is greater than the classification error in Theorem 4, and the magnitude of difference depends on $m{b}_{n}^{2}$. This is expected as estimating the set $\mathcal{A}$ increases the classification error. These results are similar to those in Fan (1996) for high-dimensional hypothesis testing.

When the common covariance matrix is different from the identity, FAIR takes a slightly different form to adapt to the unknown componentwise variances:

$$\hat{\delta}_{\mathrm{FAIR}}(\mathbf{x}) = \sum_{j=1}^{p}\frac{\hat{\alpha}_j}{\hat{\sigma}_j^2}\, \mathbf{1}\{\mid T_j\mid > b\}\,(x_j - \hat{\mu}_j), \qquad (4.2)$$

where *T*_{j} is the two-sample *t*-statistic. It is clear from (4.2) that FAIR works in the same way as first sorting the features by the absolute values of their *t*-statistics in descending order and then taking out the first *m* features to classify the data. The number of features can be selected by minimizing the upper bound of the classification error given in Theorem 1. The optimal *m* in this sense is

$$m_1 = \operatorname*{argmax}_{1\le m\le p}\, \frac{1}{\lambda_{\max}^{m}}\, \frac{\left[m^{-1/2}\left\{\sum_{j=1}^{m}\alpha_j^2/\sigma_j^2 + m(1/n_2 - 1/n_1)\right\}\right]^2}{n/(2n_1 n_2) + \sum_{j=1}^{m}\alpha_j^2/(m\sigma_j^2)}, \qquad (4.3)$$

where ${\lambda}_{\mathrm{max}}^{m}$ is the largest eigenvalue of the correlation matrix **R**^{m} of the truncated observations. It can be estimated from the samples as

$$\hat{m}_1 = \operatorname*{argmax}_{1\le m\le p}\, \frac{1}{\hat{\lambda}_{\max}^{m}}\, \frac{\left[m^{-1/2}\left\{\sum_{j=1}^{m}\hat{\alpha}_j^2/\hat{\sigma}_j^2 + m(1/n_2 - 1/n_1)\right\}\right]^2}{n/(2n_1 n_2) + \sum_{j=1}^{m}\hat{\alpha}_j^2/(m\hat{\sigma}_j^2)},$$

with the features arranged in descending order of $\mid T_j\mid$.

Note that the factor ${\lambda}_{\mathrm{max}}^{m}$ in (4.3) increases with *m*, which makes ${\hat{m}}_{1}$ usually smaller than ${\hat{m}}_{0}$.

## 5 Numerical Studies

In this section we use a simulation study and three real data analyses to illustrate our theoretical results and to verify the performance of our newly proposed classifier FAIR.

### 5.1 Simulation Study

We first introduce the model. The covariance matrices **Σ**_{1} and **Σ**_{2} for the two classes are chosen to be the same. For the distribution of the error ε_{ij} in (2.1), we use the same model as that in Fan, Hall and Yao (2006). Specifically, the features are divided into three groups. Within each group, features share one unobservable common factor with different factor loadings. In addition, there is an unobservable common factor among all the features across the three groups. For simplicity, we assume that the number of features *p* is a multiple of 3. Let *Z*_{ij} be a sequence of independent *N*(0, 1) random variables, and χ_{ij} be a sequence of independent random variables with the same distribution as $({\chi}_{d}^{2}-d)/\sqrt{2d}$, where ${\chi}_{d}^{2}$ denotes the chi-square distribution with *d* degrees of freedom. In the simulation we set *d* = 6.

Let {*a*_{j}} and {*b*_{j}} be factor loading coefficients. The error ε_{ij} in (2.1) is then built from these loadings and the common factors, where *a*_{ij} = 0 except that *a*_{1j} = *a*_{j} for *j* = 1, …, *p*/3, *a*_{2j} = *a*_{j} for *j* = (*p*/3) + 1, …, 2*p*/3, and *a*_{3j} = *a*_{j} for *j* = (2*p*/3) + 1, …, *p*. Therefore *E*ε_{ij} = 0 and var(ε_{ij}) = 1, and in general the within-group correlation is greater than the between-group correlation. The factor loadings *a*_{j} and *b*_{j} are independently generated from the uniform distributions *U*(0, 0.4) and *U*(0, 0.2), respectively. The mean vector μ_{1} for class *C*_{1} is taken from a realization of the mixture of a point mass at 0 and a double-exponential distribution, where the mixture weight *c* ∈ (0, 1) on the double-exponential component is a constant. In the simulation, we set *p* = 4,500 and *c* = 0.02. In other words, there are around 90 signal features on average, many of which are weak signals. Without loss of generality, μ_{2} is set to be **0**. Figure 1 shows the true mean difference vector α, which is fixed across all simulations. It is clear that there are only very few features with signal levels exceeding 1 standard deviation of the noise.
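A sketch of how such a sparse mean vector can be drawn is given below. The Laplace (double-exponential) scale parameter is an assumption, since the text does not report it; `simulate_mean` is a hypothetical helper name.

```python
import numpy as np

def simulate_mean(p=4500, c=0.02, scale=1.0, seed=0):
    """Draw the class-1 mean vector from the point-mass /
    double-exponential mixture described above: each entry is 0 with
    probability 1 - c and Laplace-distributed with probability c.
    The scale parameter is an assumption (not given in the text)."""
    rng = np.random.default_rng(seed)
    signal = rng.random(p) < c                      # ~ c * p nonzero entries
    mu1 = np.where(signal, rng.laplace(0.0, scale, size=p), 0.0)
    return mu1
```

With *p* = 4,500 and *c* = 0.02 this yields roughly 90 nonzero entries per realization, matching the "around 90 signal features on average" described above.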

**Figure 1.** The *x*-axis represents the dimensionality, and the *y*-axis shows the values of the corresponding entries of α.

With the parameters and model above, for each simulation we generate *n*_{1} = 30 training data from class *C*_{1} and *n*_{2} = 30 training data from *C*_{2}. In addition, 200 separate samples are generated from each of the two classes in each simulation, and these 400 vectors are used as test samples. We apply our newly proposed classifier FAIR to the simulated data. Specifically, for each feature, the *t*-statistic in (3.1) is calculated using the training sample. Then the features are sorted in decreasing order of the absolute values of their *t*-statistics. We then examine the impact of the number of features *m* on the misclassification rate. In each simulation, with *m* ranging from 1 to 4500, we construct the features annealed independence classifiers using the training samples and then apply these classifiers to the 400 test samples. The classification errors are compared to those of the independence rule with the oracle ordering information, which is constructed by repeating the above procedure except that in the first step the features are ordered by their true signal levels |α_{j}| instead of by their *t*-statistics.

The above procedure is repeated 100 times, and averages and standard errors of the misclassification rates (based on 400 test samples in each simulation) are calculated across the 100 simulations. Note that the average of the 100 misclassification rates is indeed computed based on 100 × 400 testing samples.

Figure 2 depicts the misclassification rate as a function of the number of features *m*. The solid curves represent the average of the misclassification rates across the 100 simulations, and the corresponding dashed curves are 2 standard errors (i.e., the standard deviation of the 100 misclassification rates divided by 10) away from the solid ones. The misclassification rates using the first 80 features in Figure 2(a) are magnified in Figure 2(b). Figures 2(c) and 2(d) are the same as 2(a) and 2(b) except that the features are arranged in decreasing order of |α_{j}|, i.e., the results are based on the oracle-assisted features annealed independence classifier. From these plots we see that the classification results of FAIR are close to those of the oracle-assisted independence classifier. Moreover, as the dimensionality *m* grows, the misclassification rate increases steadily due to the noise accumulation. When all the features are included, i.e., *m* = 4500, the misclassification rate is 0.2522, whereas the minimum classification errors are 0.0128 in plot 2(b) and 0.0020 in plot 2(d). These results are consistent with Theorem 1. We also tried decreasing the signal levels, i.e., the mean of the double-exponential distribution, or increasing the dimensionality *p*, and found that the classification error tends to 0.5 when all the dimensions are included. Comparing Figures 2(a) and 2(b) to Figures 2(c) and 2(d), we see that the features ordered by *t*-statistics have higher misclassification rates than those ordered by the oracle. Also, using *t*-statistics results in larger minimum classification errors (see plots 2(b) and 2(d)), but the differences are not very large.


Figure 3 shows the classification errors of the independence rule based on samples projected onto randomly chosen directions across the 100 simulations. Specifically, in each of the simulations in Figure 2, we generate a direction vector **a** uniformly from the (*p* - 1)-dimensional unit sphere, project all the data in that simulation onto the direction **a**, and finally apply the Fisher discriminant to the projected data (see (2.3)). The average of these misclassification rates is 0.4986 and the corresponding standard deviation is 0.0318. These results are consistent with our Theorem 2.

Finally, we examine the effectiveness of our proposed method (4.3) for selecting features in FAIR. In each of the 100 simulations, we apply (4.3) to choose the number of features and compute the resulting misclassification rate based on the 400 test samples. We also use the nearest shrunken centroids method of Tibshirani *et al.* (2003) to select the important features. Figure 4 summarizes these results. The thin curves correspond to the nearest shrunken centroids method, and the thick curves correspond to FAIR. Figure 4(a) presents the number of features calculated from these two methods, and Figure 4(b) shows the corresponding misclassification rates. For our newly proposed classifier FAIR, the average of the optimal number of features over the 100 simulations is 29.71, which is very close to the smallest number of features attaining the minimum misclassification rate in Figure 2(d). The misclassification rates of FAIR in Figure 4(b) have average 0.0154 and standard deviation 0.0085, indicating the outstanding performance of FAIR. The nearest shrunken centroids method is unstable in selecting features: over the 100 simulations, there are several realizations in which it chooses a very large number of features, and we truncated Figure 4 to make it easier to view. The average number of features chosen by the nearest shrunken centroids is 28.43, and the average classification error is 0.0216, with corresponding standard deviation 0.0179. It is clear that the nearest shrunken centroids method tends to choose fewer features than FAIR, but its misclassification rates are larger.

### 5.2 Real Data Analysis

#### 5.2.1 Leukemia Data

Leukemia data from high-density Affymetrix oligonucleotide arrays were previously analyzed by Golub *et al.* (1999) and are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. There are 7129 genes and 72 samples coming from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute myelogenous leukemia). Among these 72 samples, 38 (27 in class ALL and 11 in class AML) are designated as training samples and 34 (20 in class ALL and 14 in class AML) as test samples.

Before classification, we standardize each sample to zero mean and unit variance, as done by Dudoit *et al.* (2002). The classification results of the nearest shrunken centroids (hereafter NSC) method and FAIR are shown in Table 1. The NSC method selects 21 genes and makes 1 training error and 3 test errors, while our method selects 11 genes and makes 1 training error and 1 test error. Tibshirani *et al.* (2002) proposed and applied the nearest shrunken centroids method to the unstandardized leukemia dataset; they chose 21 genes and made 1 training error and 2 test errors. Our results are still superior to theirs.
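The per-sample standardization step can be sketched as follows; the function name and array sizes are illustrative, not taken from the paper.

```python
import numpy as np

def standardize_samples(X):
    """Standardize each sample (row) to zero mean and unit variance across genes."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd

# Toy expression matrix: 6 samples by 500 genes (sizes illustrative).
rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=3.0, size=(6, 500))
Z = standardize_samples(X)
```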

To further evaluate the performance of the two classifiers, we randomly split the 72 samples into training and test sets. Specifically, we set approximately 100γ% of the observations from class ALL and 100γ% of the observations from class AML as training samples, and the rest as test samples. FAIR and NSC are applied to the training data, and their performance is evaluated on the test samples. The above procedure is repeated 100 times for γ = 0.4, 0.5 and 0.6, respectively, and the distributions of the test errors of FAIR, NSC and the independence rule without feature selection are summarized in Figure 5. In each split, we also calculate the difference of the test errors between NSC and FAIR, i.e., the test error of FAIR minus that of NSC; its distribution is also summarized in Figure 5. The top panel of Figure 6 shows the number of features selected by FAIR and NSC for γ = 0.4. The results for the other two values of γ are similar, so we omit them to save space. From these figures we can see that the performance of the independence rule improves significantly after feature selection. The classification errors of NSC and FAIR are approximately the same. As already noticed in the simulation study, NSC is unstable in feature selection: the number of features it selects is very large and variable, while the number of features selected by FAIR is reasonable and stable over the different random splits. Clearly, the independence rule without feature selection performs poorly.
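The stratified random splitting described above can be sketched as follows. The function name and the use of `numpy` are our choices for illustration, not the authors' implementation.

```python
import numpy as np

def stratified_split(labels, gamma, rng):
    """Assign about 100*gamma% of each class to training, the rest to test."""
    labels = np.asarray(labels)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = int(round(gamma * idx.size))
        train.extend(idx[:n_train].tolist())
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test

# Example with the leukemia class sizes: 47 ALL (label 0) and 25 AML (label 1).
rng = np.random.default_rng(3)
labels = np.array([0] * 47 + [1] * 25)
train, test = stratified_split(labels, 0.5, rng)
```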

**...**

#### 5.2.2 Lung Cancer Data

We evaluate our method by classifying malignant pleural mesothelioma (MPM) versus adenocarcinoma (ADCA) of the lung. The lung cancer data were analyzed by Gordon *et al.* (2002) and are available at http://www.chestsurg.org. There are 181 tissue samples (31 MPM and 150 ADCA). The training set contains 32 of them, with 16 from MPM and 16 from ADCA; the remaining 149 samples (15 from MPM and 134 from ADCA) are used for testing. Each sample is described by 12533 genes.

As with the leukemia data, we first standardize each sample to zero mean and unit variance, and then apply the two classification methods to the standardized data set. Classification results are summarized in Table 2. Although FAIR uses 5 more genes than the nearest shrunken centroids method, it has better classification results: both methods classify the training samples perfectly, while our procedure has a smaller test error.

We follow the same procedure as in the Leukemia example to randomly split the 181 samples into training and test sets. FAIR and NSC are applied to the training data, and the test errors are calculated on the test data. The procedure is repeated 100 times with γ = 0.4, 0.5 and 0.6, respectively, and the test error distributions of FAIR, NSC and the independence rule without feature selection can be found in Figure 7. We also present the difference of the test errors between FAIR and NSC in Figure 7. The numbers of features used by FAIR and NSC with γ = 0.5 are shown in the middle panel of Figure 6. Figure 7 shows again that feature selection is very important in high-dimensional classification. The performance of FAIR is close to that of NSC in terms of classification error (Figure 7), but FAIR is stable in feature selection, as shown in the middle panel of Figure 6. One possible explanation for Figure 7 is that the signal strength in this lung cancer dataset is relatively weak, so more features are needed to attain the optimal performance. However, the estimate of the largest eigenvalue is no longer accurate when the number of features is large, which results in inaccurate estimates of *m*_{1} in (4.3).

#### 5.2.3 Prostate Cancer Data

The last example uses the prostate cancer data studied by Singh *et al.* (2002). The data set is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The training set contains 102 patient samples, 52 of which (labeled "tumor") are prostate tumor samples and 50 of which (labeled "Normal") are normal prostate samples. There are around 12600 genes. An independent set of test samples, from a different experiment, contains 25 tumor and 9 normal samples.

We preprocess the data by standardizing the gene expression values as before. The classification results are summarized in Table 3. FAIR makes the same test error as, and a slightly larger training error than, the nearest shrunken centroids method, but uses far fewer genes.

The samples are randomly split into training and test sets in the same way as before, the test errors are calculated, and the numbers of features used by the two methods are recorded. Figure 8 shows the test errors of FAIR, NSC and the independence rule without feature selection, together with the difference of the test errors of FAIR and NSC. The bottom panel of Figure 6 presents the numbers of features used by FAIR and NSC in each random split for γ = 0.6. As mentioned before, the plots for γ = 0.4 and 0.5 are similar, so we omit them. The performance of FAIR is better than that of NSC both in terms of classification error and in terms of feature selection. The good performance of FAIR might be due to the strong signal level of a few features in this data set. Thanks to this strong signal, FAIR can attain the optimal performance with a small number of features. Thus, the estimate of *m*_{1} in (4.3) is accurate and the actual performance of FAIR is good.

## 6 Conclusion

This paper studies the impact of high dimensionality on classification. To illustrate the idea, we have considered the independence classification rule, which avoids the difficulty of estimating a large covariance matrix and the diverging condition number frequently associated with it. When only a subset of the features captures the characteristics of the two groups, classification using all dimensions would intrinsically classify the noise. We prove that classification based on linear projections onto almost all directions performs nearly the same as random guessing. Hence, it is necessary to choose direction vectors that put more weight on the important features.

The two-sample *t*-test can be used to choose the important features. We have shown that, under mild conditions, the two-sample *t*-test can select all the important features with probability tending to one. The features annealed independence rule using hard thresholding, FAIR, is proposed, with the number of features selected by a data-driven rule. An upper bound on the classification error of FAIR is given explicitly, along with suggestions on the optimal number of features to use in classification. Simulation studies and real data analysis convincingly support our theoretical results.
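The procedure summarized above can be sketched as follows. This is a hedged sketch, not the paper's exact implementation: the data-driven choice of the number of features is replaced by a user-supplied threshold, and the class name `FAIRSketch` and all sizes are hypothetical.

```python
import numpy as np

class FAIRSketch:
    """Sketch of FAIR: two-sample t-statistic hard thresholding followed by the
    independence rule (diagonal covariance) on the selected features. The
    threshold is user-supplied here; the paper selects it by a data-driven rule."""

    def fit(self, X1, X2, threshold):
        n1, n2 = X1.shape[0], X2.shape[0]
        self.mu1, self.mu2 = X1.mean(axis=0), X2.mean(axis=0)
        s2 = ((n1 - 1) * X1.var(axis=0, ddof=1) +
              (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)  # pooled variance
        t = (self.mu1 - self.mu2) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
        self.keep = np.abs(t) >= threshold          # hard thresholding of features
        self.s2 = s2
        return self

    def predict(self, X):
        k = self.keep
        # Independence rule: compare diagonal-covariance distances to the centroids.
        d1 = (((X[:, k] - self.mu1[k]) ** 2) / self.s2[k]).sum(axis=1)
        d2 = (((X[:, k] - self.mu2[k]) ** 2) / self.s2[k]).sum(axis=1)
        return np.where(d1 <= d2, 1, 2)

# Toy example: 10 informative features out of 200.
rng = np.random.default_rng(4)
p = 200
X1 = rng.standard_normal((30, p)) + np.r_[1.5 * np.ones(10), np.zeros(p - 10)]
X2 = rng.standard_normal((30, p))
clf = FAIRSketch().fit(X1, X2, threshold=3.0)
# In-sample accuracy, for illustration only.
acc = 0.5 * (clf.predict(X1) == 1).mean() + 0.5 * (clf.predict(X2) == 2).mean()
```

With a sparse, strong signal, the thresholding step keeps mostly the informative coordinates, which is exactly the noise-accumulation argument for selecting features before applying the independence rule.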

## Acknowledgments

Financial support from NSF grants DMS-0354223 and DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged. The authors also gratefully acknowledge the helpful comments of the referees, which led to improvements in the presentation and results of the paper.

## 7 Appendix

#### Proof of Theorem 1

For θ ∈ Γ, Ψ defined in (2.2) can be bounded as

where we have used the assumption that λ_{max}(**R**) ≤ *b*_{0}. Denote by

We next study the asymptotic behavior of $\stackrel{~}{\Psi}$.

Since Condition 1(b) in Section 3 is satisfied automatically for normal distribution, by Lemma 2 below we have $\hat{D}=\mathbf{D}(1+{o}_{P}\left(1\right))$, where *o _{P}*(1) holds uniformly across all diagonal elements. Thus, the right hand side of (7.1) can be written as

We first consider the denominator. Notice that it can be decomposed as

where ${\sigma}_{j}^{2}$ is the *j*-th diagonal entry of **D**, ${\hat{\sigma}}_{j}^{2}$ is the *j*-th diagonal entry of $\hat{D}$, and ${\hat{\u03f5}}_{kj}={\sum}_{i=1}^{{n}_{k}}{\u03f5}_{kij}\u2215{n}_{k},\phantom{\rule{thinmathspace}{0ex}}k=1,2$. Notice that ${\hat{\u03f5}}_{1}-{\hat{\u03f5}}_{2}\sim N(0,{\scriptstyle \frac{n}{{n}_{1}{n}_{2}}}\Sigma )$. By the singular value decomposition, we have

where **Q**_{R} is an orthogonal matrix and ${\mathbf{V}}_{R}=\mathrm{diag}\{{\lambda}_{R,1},\cdots ,{\lambda}_{R,p}\}$ is the diagonal matrix of eigenvalues of the correlation matrix **R**. Define $\stackrel{~}{\u03f5}=\sqrt{{n}_{1}{n}_{2}\u2215n}{\mathbf{V}}_{R}^{-1\u22152}{\mathbf{Q}}_{R}^{\prime}{\mathbf{D}}^{-1\u22152}({\hat{\u03f5}}_{1}-{\hat{\u03f5}}_{2})$; then $\stackrel{~}{\u03f5}\sim N(0,\mathbf{I})$. Hence,

Since ${\sum}_{i=1}^{p}{\lambda}_{R,i}=p$ and ${\lambda}_{R,i}\ge 0$ for all *i* = 1, …, *p*, we have ${\scriptstyle \frac{1}{{p}^{2}}}{\sum}_{i=1}^{p}{\lambda}_{R,i}^{2}<\infty $. By the weak law of large numbers, we have

Next, we consider *I*_{1}. Note that *I*_{1} has the distribution ${I}_{1}\sim N(0,{\scriptstyle \frac{n}{{n}_{1}{n}_{2}}}{\alpha \prime D}^{-1}{\Sigma D}^{-1}\alpha )$. Since ${\lambda}_{\mathrm{max}}(\mathbf{R})\le {b}_{0},\phantom{\rule{thickmathspace}{0ex}}n{\alpha \prime D}^{-1}\alpha \ge n{C}_{p}\to \infty $ and

we have ${I}_{1}={\alpha \prime D}^{-1}\alpha {o}_{P}\left(1\right)$. This together with (7.2) and (7.3) yields

Now, we consider the numerator. It can be decomposed as

Denote ${\stackrel{~}{I}}_{3}=\sum {\scriptstyle \frac{{\alpha}_{j}}{{\sigma}_{j}^{2}}}{\hat{\u03f5}}_{2j}$. Note that

Define ${F}_{j}=\sqrt{{n}_{2}}{\scriptstyle \frac{{\alpha}_{j}}{{\sigma}_{j}^{2}}}{\hat{\u03f5}}_{2j}\u2215{\alpha \prime D}^{-1}\alpha $, then ${\sigma}_{{F}_{j}}^{2}\equiv \mathrm{var}\left({F}_{j}\right)\le 1$ for all *j*. For the normal distribution, we have the following tail probability inequality
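A standard form of such an inequality, stated here for completeness on the assumption that it is the classical Mills-ratio bound intended at this step, is

```latex
1 - \Phi(x) \;\le\; \frac{1}{x\sqrt{2\pi}}\, e^{-x^{2}/2}, \qquad x > 0.
```

In particular, for a $N(0,\sigma^{2})$ variable with $\sigma \le 1$, this yields a bound of the form $C e^{-x^{2}/2}$ on the two-sided tail probability, valid for all $x > 0$ with some positive constant $C$.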

Since ${F}_{j}\sim N(0,{\sigma}_{{F}_{j}}^{2})$, by the above inequality we have

with *C* some positive constant, for all *x* > 0 and *j* = 1, …, *p*. By Lemma 2.2.10 of van der Vaart and Wellner (1996, p. 102), we have

where *K* is some universal constant. This together with (7.5) ensures that

Hence,

Now we only need to consider ${\stackrel{~}{I}}_{3}$. Note that ${\stackrel{~}{I}}_{3}=\sum {\scriptstyle \frac{{\alpha}_{j}}{{\sigma}_{j}^{2}}}{\hat{\u03f5}}_{2j}\sim N(0,{\scriptstyle \frac{1}{{n}_{2}}}{\alpha \prime D}^{-1}{\Sigma D}^{-1}\alpha )$. The variance term can be bounded as

By the assumption that $n{\alpha \prime D}^{-1}\alpha \to \infty $ and λ_{max}(**R**) is bounded, we have ${\stackrel{~}{I}}_{3}={\scriptstyle \frac{1}{2}}{\alpha \prime D}^{-1}\alpha {o}_{P}\left(1\right)$. Combining this with (7.6) leads to

We now examine *I*_{4} and *I*_{5}. By a proof similar to that of (7.3), we have

Thus the numerator can be written as

and by (7.4)

Since ${\scriptstyle \frac{ax}{\sqrt{1+{a}^{2}x}}}$ is an increasing function of *x* and $\sum {\scriptstyle \frac{{\alpha}_{j}^{2}}{{\sigma}_{j}^{2}}}\ge {C}_{p}$, in view of (7.1) and the definition of the parameter space Γ, we have

If *p*/(*nC*_{p}) → 0, then $W\left(\hat{\delta}\right)=1-\Phi \left({\scriptstyle \frac{1}{2}}{[{n}_{1}{n}_{2}\u2215\left(pn{b}_{0}\right)]}^{1\u22152}{C}_{p}\{1+{o}_{P}\left(1\right)\}\right)$. Furthermore, if ${\left\{{\scriptstyle \frac{{n}_{1}{n}_{2}}{pn}}\right\}}^{1\u22152}{C}_{p}\to {C}_{0}$ with *C*_{0} some constant, then

This completes the proof.

#### Proof of Theorem 2

Suppose we have a new observation **X** from class *C*_{1}. Then the posterior classification error of using ${\hat{\delta}}_{\mathbf{a}}(\cdot )$ is

where ${\Psi}_{\mathbf{a}}={\scriptstyle \frac{\mathbf{a}\prime {\mu}_{1}-\mathbf{a}\prime \hat{\mu}}{\sqrt{\mathbf{a}\prime \Sigma \mathbf{a}}}},\phantom{\rule{thickmathspace}{0ex}}\Phi (\cdot )$ is the standard Gaussian distribution function, and *E*^{a} means expectation taken with respect to **a**. We are going to show that

which together with the continuity of $\Phi (\cdot )$ and the dominated convergence theorem gives

Therefore, the posterior error $W({\hat{\delta}}_{\mathbf{a}},\theta )$ is no better than random guessing.

Now, let us prove (7.7). Note that the random vector **a** can be written as

where **Z** is a *p*-dimensional standard Gaussian distributed random vector, independent of all the observations *Y*_{ki} and ** X**. Therefore,

where $\alpha ={\mu}_{1}-{\mu}_{2}$ and ${\hat{\u03f5}}_{k}={\scriptstyle \frac{1}{{n}_{k}}}{\sum}_{i=1}^{{n}_{k}}{\u03f5}_{ki},\phantom{\rule{thickmathspace}{0ex}}k=1,2$. By the singular value decomposition we have

where **Q** is an orthogonal matrix and $\mathbf{V}=\mathrm{diag}\{{\lambda}_{1},\cdots ,{\lambda}_{p}\}$ is a diagonal matrix. Let $\stackrel{~}{Z}=\mathbf{QZ}$, then $\stackrel{~}{Z}$ is also a *p*-dimensional standard Gaussian random vector. Hence the denominator of Ψ_{a} can be written as

where ${\stackrel{~}{Z}}_{j}$ is the *j*-th entry of $\stackrel{~}{Z}$. Since it is assumed that ${\mathrm{lim}}_{p}{\scriptstyle \frac{1}{{p}^{2}}}{\sum}_{j=1}^{p}{\lambda}_{j}^{2}<\infty $ and ${\mathrm{lim}}_{p}{\scriptstyle \frac{1}{p}}{\sum}_{j=1}^{p}{\lambda}_{j}=\tau $ for some positive constant τ, by the weak law of large numbers, we have

Next, we study the numerator of Ψ_{a} in (7.8). Since ${\scriptstyle \frac{1}{p}}{\sum}_{j=1}^{p}{\alpha}_{j}^{2}\to 0$, the first term of the numerator converges to 0 in probability, i.e.,

Let $\epsilon =\sqrt{{\scriptstyle \frac{{n}_{1}{n}_{2}}{n}}}({\hat{\u03f5}}_{1}+{\hat{\u03f5}}_{2})$ and $\stackrel{~}{\epsilon}={\mathbf{V}}^{-1\u22152}\mathbf{Q}\epsilon $, then $\stackrel{~}{\epsilon}$ has distribution *N*(0, **I**) and is independent of $\stackrel{~}{Z}$. The second term of the numerator can be written as

Since ${\scriptstyle \frac{n}{{n}_{1}{n}_{2}p}}{\sum}_{j=1}^{p}{\lambda}_{j}\to 0$, it follows from the weak law of large numbers that

This together with (7.8), (7.9), and (7.10) completes the proof.

We need the following two lemmas to prove Theorem 3.

##### Lemma 1

*[Cao (2007)] Let n* = *n*_{1} + *n*_{2}. *Assume that there exist* 0 < *c*_{1} ≤ *c*_{2} < 1 *such that c*_{1} ≤ *n*_{1}/*n*_{2} ≤ *c*_{2}. *Let* ${\stackrel{~}{T}}_{j}={T}_{j}-{\scriptstyle \frac{{\mu}_{j1}-{\mu}_{j2}}{\sqrt{{S}_{1j}^{2}\u2215{n}_{1}+{S}_{2j}^{2}\u2215{n}_{2}}}}$. *Then for any* $x\equiv x({n}_{1},{n}_{2})$ *satisfying x* → ∞ *and x* = *o*(*n*^{1/2}),

*If, instead, we only have* $E{\mid {Y}_{1ij}\mid}^{3}<\infty $ *and* $E{\mid {Y}_{2ij}\mid}^{3}<\infty $, *then*

*where* $d=(E{\mid {Y}_{1ij}\mid}^{3}+E{\mid {Y}_{2ij}\mid}^{3})\u2215{(\mathrm{var}\left({Y}_{1ij}\right)+\mathrm{var}\left({Y}_{2ij}\right))}^{3\u22152}$ *and O*(1) *is a finite constant depending only on c*_{1} *and c _{2}*.

*In particular*,

*uniformly in* $x\in (0,o\left({n}^{1\u22156}\right))$.

##### Lemma 2

Suppose Condition 1(b) holds and log *p* = *o*(*n*). Let ${S}_{kj}^{2}$ be the sample variance defined in Section 1, and ${\sigma}_{kj}^{2}$ the variance of the *j*-th feature in class *C*_{k}. Suppose ${\mathrm{min}}_{k,j}\phantom{\rule{thinmathspace}{0ex}}{\sigma}_{kj}^{2}$ is bounded away from 0. Then we have the following uniform convergence result

#### Proof of Lemma 2

For any $\epsilon >0$, when *n*_{k} is sufficiently large,

It follows from Bernstein's inequality that

and

Since log *p* = *o*(*n*), we have *I*_{1} = *o*(1) and *I*_{2} = *o*(1). These together with (7.11) complete the proof of Lemma 2.

#### Proof of Theorem 3

We divide the proof into two parts. (a) Let us first look at the probability $P({\mathrm{max}}_{j>s}\mid {T}_{j}\mid >x)$. Clearly,

Note that for all *j* > *s*, α_{j} = μ_{j1} - μ_{j2} = 0. By Condition 1(b) and Lemma 1, the following inequality holds for 0 ≤ *x* ≤ *n*^{1/6}*/d*,

where *C* is a constant that only depends on *c*_{1} and *c*_{2}, and

with ${\sigma}_{kj}^{2}$ the *j*-th diagonal element of ${\Sigma}_{k}$. For the normal distribution, we have the following tail probability inequality

This together with the symmetry of *T*_{j} gives

Combining the above inequality with (7.12), we have

Since log(*p* - *s*) = *o*(*n*^{γ}) with $0<\gamma <{\scriptstyle \frac{1}{3}}$, if we let $x\sim c{n}^{\gamma \u22152}$, then

which along with (7.12) yields

(b) Next, we consider *P* (min_{j≤s} |*T*_{j}| ≤ *x*). Notice that for *j* ≤ *s*, α_{j} = μ_{1j} - μ_{2j} ≠ 0. Let ${\eta}_{j}={\scriptstyle \frac{{\alpha}_{j}}{\sqrt{{S}_{1j}^{2}\u2215{n}_{1}+{S}_{2j}^{2}\u2215{n}_{2}}}}$ and define

Then following the same lines as those in (a), we have

It follows from Lemma 2 that,

Hence, uniformly over *j* = 1, …, *s*, we have

Therefore,

with *c*_{2} defined in Theorem 3. Let ${\alpha}_{0}={\mathrm{min}}_{j\le s}\mid {\mu}_{j1}-{\mu}_{j2}\mid \u2215\sqrt{{\sigma}_{1j}^{2}+{c}_{2}{\sigma}_{2j}^{2}}$. Then it follows that

By part (a), we know that *x* ~ *cn*^{γ/2} and log(*p*-*s*) = *o*(*n*^{γ}). Thus if ${\alpha}_{0}\sim \underset{j\le s}{\mathrm{min}}{\scriptstyle \frac{\mid {\mu}_{j1}-{\mu}_{j2}\mid}{\sqrt{{\sigma}_{j1}^{2}+{\sigma}_{j2}^{2}}}}={n}^{-\gamma}{\beta}_{n}$ for some ${\beta}_{n}\to \infty $, then similarly to part (a), we have

Combining parts (a) and (b) completes the proof.

#### Proof of Theorem 4

The classification error of the truncated classifier ${\hat{\delta}}_{\mathrm{NC}}^{m}$ is

We first consider the denominator. Note that ${\hat{\alpha}}_{j}\sim N({\alpha}_{j},{\scriptstyle \frac{n}{{n}_{1}{n}_{2}}})$. It can be shown that

which together with the assumption ${\scriptstyle \frac{n}{\sqrt{m}}}{\sum}_{j=1}^{m}{\alpha}_{j}^{2}\to \infty $ gives

Next, let us look at the numerator. We decompose it as

Since the second term above has the distribution $N(0,{\sum}_{j=1}^{m}{\alpha}_{j}^{2}\u2215{n}_{2})$, it follows from the assumption $n{\sum}_{j=1}^{m}{\alpha}_{j}^{2}\to \infty $ that

The third term in (7.13) can be written as

Hence the numerator is

Therefore, the classification error is

This concludes the proof.

#### Proof of Theorem 5

Note that the classification error of ${\hat{\delta}}_{\mathrm{FAIR}}^{{b}_{n}}$ is

We divide the proof into two parts: the numerator and the denominator.

(a) First, we study the numerator of Ψ^{H}. It can be decomposed as

where ${I}_{1}={\sum}_{j\in \mathcal{A}}({\mu}_{1j}-{\hat{\mu}}_{j}){\hat{\alpha}}_{j}1\{\mid {\hat{\alpha}}_{j}\mid \ge {b}_{n}\}$ and ${I}_{2}={\sum}_{j\in {\mathcal{A}}^{c}}({\mu}_{1j}-{\hat{\mu}}_{j}){\hat{\alpha}}_{j}1\{\mid {\hat{\alpha}}_{j}\mid \ge {b}_{n}\}$ with ${\mathcal{A}}^{c}$ the complement of the set $\mathcal{A}$. Note that

Since ${\hat{\alpha}}_{j}\sim N({\alpha}_{j},{\scriptstyle \frac{n}{{n}_{1}{n}_{2}}})$, it follows from the normal tail probability inequality that for every $j\in {\mathcal{A}}^{c}$ and ${b}_{n}>{\mathrm{max}}_{j\in {\mathcal{A}}^{c}}\phantom{\rule{thickmathspace}{0ex}}\mid {\alpha}_{j}\mid $,

where *M* is a generic constant. Thus for every ε > 0, if $\mathrm{log}(p-m)\u2215\left[n{({b}_{n}-{\mathrm{max}}_{j\in {\mathcal{A}}^{c}}\mid {\alpha}_{j}\mid )}^{2}\right]\to 0$ and ${\mathrm{max}}_{j\in {\mathcal{A}}^{c}}\mid {\alpha}_{j}\mid <{b}_{n}$, we have

which tends to zero. Hence,

We next consider *I*_{2,2}. Since $E{\left({\hat{\u03f5}}_{2j}\right)}^{2}={\scriptstyle \frac{1}{{n}_{2}}},\phantom{\rule{thickmathspace}{0ex}}\mathrm{log}(p-m)\u2215\left[n{({b}_{n}-{\mathrm{max}}_{j\in {\mathcal{A}}^{c}}\mid {\alpha}_{j}\mid )}^{2}\right]\to 0$, and ${\mathrm{max}}_{j\in {\mathcal{A}}^{c}}\mid {\alpha}_{j}\mid <{b}_{n}$, we have

which converges to 0. Therefore,

Then, we consider *I*_{2,3}. Since *c*_{1} ≤ *n*_{1}/*n*_{2} ≤ *c*_{2} and $E{({\hat{\u03f5}}_{1j}^{2}-{\hat{\u03f5}}_{2j}^{2})}^{2}={\scriptstyle \frac{3{n}_{1}^{2}+3{n}_{2}^{2}-2{n}_{1}{n}_{2}}{{n}_{1}^{2}{n}_{2}^{2}}}\le {\scriptstyle \frac{3{c}_{2}+3-2{c}_{1}}{{c}_{1}{n}_{2}^{2}}}$, by (7.14) we have for every ε > 0,

where *M* is some generic constant. Thus, ${I}_{2,3}\stackrel{\mathrm{P}}{\to}0$. Combination of this with (7.15) and (7.16) entails

We now deal with *I*_{1}. Decompose *I*_{1} similarly as

We first study *I*_{1,2}. By using ${\hat{\alpha}}_{j}\sim N({\alpha}_{j},{\scriptstyle \frac{n}{{n}_{1}{n}_{2}}})$, it can be shown that

since ${\scriptstyle \frac{n}{\sqrt{m}}}{\sum}_{j=1}^{m}{\alpha}_{j}^{2}\to \infty $, we have ${(\frac{4n}{{n}_{1}{n}_{2}}{\sum}_{j\in \mathcal{A}}{\alpha}_{j}^{2}+{\scriptstyle \frac{2m{n}^{2}}{{n}_{1}^{2}{n}_{2}^{2}}})}^{1\u22152}\u2215{\sum}_{j\in \mathcal{A}}{\alpha}_{j}^{2}\to 0$. Therefore,

Next, we look at *I*_{1,1}. For any ε > 0,

When *n* is large enough, the above probability can be bounded by

which along with the assumption ${\sum}_{j\in \mathcal{A}}\mid {\alpha}_{j}\mid \u2215\left[\sqrt{n}{\sum}_{j\in \mathcal{A}}{\alpha}_{j}^{2}\right]\to 0$ gives

It follows that the numerator is bounded from below by

(b) Now, we study the denominator of Ψ^{H}. Let

We first show that ${J}_{2}\stackrel{\mathrm{P}}{\to}0$. Note that $E{\hat{\alpha}}_{j}^{4}={\alpha}_{j}^{4}+6n{\left({n}_{1}{n}_{2}\right)}^{-1}{\alpha}_{j}^{2}+3{n}^{2}{\left({n}_{1}{n}_{2}\right)}^{-2}$. Thus,

This together with (7.14) and the assumption that $\mathrm{log}(p-m)\u2215\left[n{({b}_{n}-{\mathrm{max}}_{j\in {\mathcal{A}}^{c}}\phantom{\rule{thickmathspace}{0ex}}\mid {\alpha}_{j}\mid )}^{2}\right]\to 0$ yields ${J}_{2}\stackrel{\mathrm{P}}{\to}0$ as $n\to \infty ,\phantom{\rule{thickmathspace}{0ex}}p\to \infty $. Now we study term *J*_{1}. By (7.17), we have

Hence the denominator is bounded from above by $(1+{o}_{P}\left(1\right)){\sum}_{j\in \mathcal{A}}{\alpha}_{j}^{2}+{\scriptstyle \frac{mn}{{n}_{1}{n}_{2}}}$. Therefore,

It follows that the classification error is bounded from above by

This completes the proof.

## Contributor Information

Jianqing Fan, Princeton University.

Yingying Fan, Harvard University.

## REFERENCES

- ANTONIADIS A, LAMBERT-LACROIX S, LEBLANC F. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics. 2003;19:563–570.
- BAI Z, SARANADASA H. Effect of high dimension: by an example of a two sample problem. Statistica Sinica. 1996;6:311–329.
- BAIR E, HASTIE T, DEBASHIS P, TIBSHIRANI R. Prediction by supervised principal components. The Annals of Statistics. 2007, to appear.
- BICKEL PJ, LEVINA E. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010.
- BOULESTEIX A. PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology. 2004;3:1–33.
- BÜHLMANN P, YU B. Boosting with the *L*_{2} loss: regression and classification. Journal of the American Statistical Association. 2003;98:324–339.
- BURA E, PFEIFFER RM. Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics. 2003;19:1252–1258.
- CAO HY. Moderate deviations for two sample t-statistics. Probability and Statistics. 2007, forthcoming.
- CHIAROMONTE F, MARTINELLI J. Dimension reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences. 2002;176:123–144.
- DETTLING M, BÜHLMANN P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003;19(9):1061–1069.
- DUDOIT S, FRIDLYARD J, SPEED TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87.
- FAN J. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association. 1996;91:674–688.
- FAN J, HALL P, YAO Q. To how many simultaneous hypothesis tests can normal, Student's *t* or bootstrap calibration be applied? Manuscript. 2006.
- FAN J, LI R. Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Sanz-Solé M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. Vol. III. 2006. pp. 595–622.
- FAN J, REN Y. Statistical analysis of DNA microarray data. Clinical Cancer Research. 2006;12:4469–4473.
- FAN J, LV J. Sure independence screening for ultra-high dimensional feature space. Manuscript. 2007.
- FRIEDMAN JH. Regularized discriminant analysis. Journal of the American Statistical Association. 1989;84:165–175.
- GHOSH D. Singular value decomposition regression modeling for classification of tumors from microarray experiments. Proceedings of the Pacific Symposium on Biocomputing. 2002:11462–11467.
- GREENSHTEIN E. Best subset selection, persistence in high dimensional statistical learning and optimization under *l*_{1} constraint. The Annals of Statistics. 2006, to appear.
- GREENSHTEIN E, RITOV Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
- HUANG X, PAN W. Linear regression and two-class classification with gene expression data. Bioinformatics. 2003;19:2072–2978.
- LIN Z, LU C. Limit Theory for Mixing Dependent Random Variables. Kluwer Academic Publishers; Dordrecht: 1996.
- MEINSHAUSEN N. Relaxed Lasso. Computational Statistics and Data Analysis. 2007, to appear.
- NGUYEN DV, ROCKE DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18:39–50.
- SHAO QM. Self-normalized Limit Theorems in Probability and Statistics. Manuscript. 2005.
- TIBSHIRANI R, HASTIE T, NARASIMHAN B, CHU G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 2002;99:6567–6572.
- VAN DER VAART AW, WELLNER JA. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996.
- WEST M, BLANCHETTE C, FRESSMAN H, HUANG E, ISHIDA S, SPANG R, ZUAN H, MARKS JR, NEVINS JR. Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci. 2001;98:11462–11467.
- ZOU H, HASTIE T, TIBSHIRANI R. Sparse principal component analysis. Technical report. 2004.
