Biostatistics. Apr 2009; 10(2): 282–296.
Published online Nov 27, 2008. doi:  10.1093/biostatistics/kxn035
PMCID: PMC2733174

A method for constructing a confidence bound for the actual error rate of a prediction rule in high dimensions

Abstract

Constructing a confidence interval for the actual, conditional error rate of a prediction rule from multivariate data is problematic because this error rate is not a population parameter in the traditional sense—it is a functional of the training set. When the training set changes, so does this “parameter.” A valid method for constructing confidence intervals for the actual error rate had been previously developed by McLachlan. However, McLachlan's method cannot be applied in many cancer research settings because it requires the number of samples to be much larger than the number of dimensions (n >> p), and it assumes that no dimension-reducing feature selection step is performed. Here, an alternative to McLachlan's method is presented that can be applied when p >> n, with an additional adjustment in the presence of feature selection. Coverage probabilities of the new method are shown to be nominal or conservative over a wide range of scenarios. The new method is relatively simple to implement and not computationally burdensome.

Keywords: Accuracy, Confidence interval, Error rate, Prediction

1. INTRODUCTION

Recent advances in multiplex and high-throughput assays in cancer research have sparked growing interest in the development of classifiers from multivariate data. Such classifiers could lead to clinical tests that would provide doctors and patients with valuable information on prognosis or likely therapeutic response. Although there are many methods for obtaining valid point estimates of a classifier's accuracy, the number of methods for constructing a confidence bound for a predictor's accuracy (or error rate) remains limited. In this paper, a new method is presented for constructing a confidence bound for the actual (true) accuracy of a prediction rule, which is the accuracy a predictor developed on a data set will have when applied to future samples from the population. This new method overcomes the limitation of previous methods that required n >> p, and it can be applied when p >> n. A further adaptation of the method is developed for the setting where p >> n and a dimension-reducing feature selection step is also performed.

First, a point of clarification seems needed. Confidence intervals or bounds are typically constructed for population parameters. In the classical formulation, a (1−α)100% confidence interval for a parameter θ based on data x satisfies P[θ ∈ I(x)] = 1 − α for every θ, where I(x) is an interval constructed from the observed data x (see, e.g. Rao, 1973). Here, the I(x) notation indicates that the constructed interval is a function of the data x. In words, one typically says that intervals constructed in this way will, in the long run, contain θ (1 − α)100% of the time. But the actual, conditional error rate is not a population parameter like θ in this formulation, because the value of the actual error rate depends on the samples selected for the training set. Therefore, the classical formulation is modified slightly, becoming P[θ(x) ∈ I(x)] = 1 − α. For further discussion of the actual (true) error rate, the reader is referred to McLachlan (1992) or Efron and Tibshirani (1997).

Methods for constructing confidence intervals for population parameters are usually divided into approximate versus exact or parametric versus nonparametric. Approximate methods for interval construction utilize the approximate distribution of a test statistic (usually based on the central limit theorem) and are generally based on normal quantiles, whereas exact methods are based on quantiles of the exact distribution of the test statistic for specific parameter settings. Parametric methods assume that the data come from a specified parametric probability model. Confidence intervals can be based either on mathematical computations or on Monte Carlo; by generating many simulated data sets from the population distribution, quantiles of the test statistic distribution associated with particular parameter settings can be estimated, facilitating interval construction. Nonparametric methods assume only that the sample is random and representative of the population. The most common nonparametric method is the nonparametric bootstrap (Efron, 1985, 1987; Efron and Tibshirani, 1993), which estimates quantiles of the test statistic distribution by simulating data sets based on resampling of the original data rather than on an assumed probability model.

When the quantity of interest is not a population parameter, such as the actual error rate, then many of the methods described in the last paragraph cannot be utilized. For example, suppose one assumes a parametric model and one wants an interval for the actual error rate. Then, a pure Monte Carlo approach would fix settings of the parameters and generate many Monte Carlo simulation data sets to obtain the corresponding distribution of the test statistic given those parameter values. But note that this approach “will not work.” The problem is that, because the actual error rate is a function of the training data, every time a new data set is generated, even though the parameters are fixed, “the quantity of interest changes.” Therefore, simple Monte Carlo results will not produce the distribution of the test statistic conditional on the “quantity of interest” (the actual error rate), because simple Monte Carlo conditions on the wrong thing—namely, some underlying population parameters that are most likely nuisance parameters. Thus, even in the relatively simple case of an assumed parametric probability model, it is not straightforward to construct an interval. Not surprisingly, the situation gets even more complex in the nonparametric setting. Although obtaining a point estimate for the actual error rate is not problematic (e.g. Efron and Tibshirani, 1997), using resampling methods to obtain a valid nonparametric confidence interval for the actual error rate is.

McLachlan (1975, 1992) produced a method for obtaining a confidence interval for the actual error rate. The method is parametric and approximate as described above, being based on the multivariate normal model and estimating test statistic quantiles using normal quantiles and an estimate of the standard deviation. The estimation is based on the empirical Mahalanobis distance between the class means, D = [(x̄1 − x̄2)′S⁻¹(x̄1 − x̄2)]^{1/2} (e.g. Devroye and others, 1991, p. 30), where x̄i is the vector of sample means for class i and S⁻¹ is the inverse sample covariance matrix. The Mahalanobis distance can only be estimated well when n >> p, so McLachlan's method is restricted to this setting. The method presented in this paper replaces McLachlan's statistic, the sample Mahalanobis distance, with the cross-validated accuracy and then uses exact methods to construct the interval. The use of exact methods is made possible by first identifying a subspace associated with a particular error rate, obtaining the conditional distribution of a linear predictor over that subspace, and then the conditional distribution of sample sets of size n given a point on the subspace; Monte Carlo methods can then be applied to estimate the required quantiles. The new method therefore does not require inversion of the sample covariance matrix and can be used in the setting of high-dimensional data. Importantly, the methodology is also extended to the setting of high-dimensional data with a dimension-reducing feature selection step. Feature selection can have important effects on cross validation (see, e.g. Ambroise and McLachlan, 2002; Simon and others, 2003), and ignoring these effects will lead to inadequate statistical procedures, so this extension is critical.
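As an informal illustration of why McLachlan's statistic breaks down as p approaches n, the following R sketch computes the empirical Mahalanobis distance between two class means; the call to solve(S) fails or becomes numerically unstable once the pooled sample covariance matrix is singular or ill-conditioned, which necessarily happens when p is not small relative to n. (Function and variable names are illustrative, not from the paper.)

```r
# Empirical Mahalanobis distance between two class means (illustrative sketch).
# Unreliable when p approaches n because the pooled covariance matrix S becomes
# singular or ill-conditioned.
mahalanobis_distance <- function(x1, x2) {
  # x1, x2: matrices of observations (rows = samples, columns = features)
  d <- colMeans(x1) - colMeans(x2)                 # difference of class mean vectors
  S <- ((nrow(x1) - 1) * cov(x1) +                 # pooled within-class covariance
        (nrow(x2) - 1) * cov(x2)) / (nrow(x1) + nrow(x2) - 2)
  sqrt(drop(t(d) %*% solve(S) %*% d))              # solve(S) requires S nonsingular
}

set.seed(1)
x1 <- matrix(rnorm(30 * 5, mean = 0.5), nrow = 30) # n1 = 30, p = 5: fine
x2 <- matrix(rnorm(30 * 5, mean = 0.0), nrow = 30)
mahalanobis_distance(x1, x2)
```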

There are many types of bounds for classifiers in the pattern recognition literature. The Vapnik–Chervonenkis theory (e.g. Devroye and others, 1991) addresses minimization of the empirical risk of a classifier. The Chernoff and Bhattacharyya bounds (e.g. Duda and others, 2001; Fukunaga, 1990) provide probability bounds for the expected error rate. Rogers and Wagner (Devroye and others, 1991) provide an upper bound on the difference between the actual error rate and the cross-validated error rate. Although these bounds can give important insights into the classification problem, there is no straightforward way to extend them to produce statistical confidence intervals (other than 100% confidence level intervals).

In the particular applied setting of microarray data, Michiels and others (2005) used a resampling procedure to construct error rate confidence intervals on a number of microarray data sets. A comparison of bootstrap methods, and a new bootstrap approach, on microarray data was presented in Jiang and others (2008). Xu and others (2006) examined the joint distribution of the estimated and true error rates for a variety of feature–label distributions and classification and error estimation rules. Confidence interval construction is closely related to sample size determination, and sample size methods for classifier development in microarray studies were presented in Fu and others (2005), Mukherjee and others (2003), and Dobbin and Simon (2007).

This paper develops a novel approach to using cross-validated estimates to construct a confidence interval for prediction error. This general approach is described in Section 2. Section 3 presents mathematical development results for the normal homoscedastic model, the extension to the case of gene selection, and related discussion. Section 4 presents the results of the Monte Carlo simulations assessing the coverage probabilities and of the applications to several high-dimensional microarray data sets. Section 5 briefly summarizes the conclusions.

2. THE GENERAL APPROACH FOR CONSTRUCTING A CONFIDENCE INTERVAL FOR THE ACTUAL ERROR RATE

Suppose a population can be partitioned into 2 classes, C1 and C2. A collection of samples from the population will be used to develop a classification function that produces predictions based on a vector of observations x on an individual. The parametric model stipulates the distribution in each class. Also suppose that a set of n samples has been studied and a specific predictor developed. The resulting predictive function classifies any sample into either C1 or C2 based on the observation x.

Suppose a k-fold cross validation is used to estimate the actual error rate of this predictor, and call the resulting estimate the cross-validated error rate ϵCV. A (1−α)100% upper confidence bound on the actual error rate of the predictor is

ϵUpperbound = sup{ϵ : Ĝϵ(ϵCV) ≥ α},

where Ĝϵ is the estimated cumulative distribution function of the cross-validated error rate for the fixed true error rate ϵ. This is the usual exact approach to confidence bound or interval construction (see, e.g. Stuart and others, 1999). If one knew the distribution function Gϵ of the cross-validated error rates for a fixed actual error rate, then the confidence interval would be truly “exact.” In actual practical implementation with an estimated Ĝϵ, the value of ϵUpperbound is computationally burdensome to estimate precisely, but in real-world settings a rough 2-digit approximation is usually adequate, and this level of precision can be obtained easily. To simplify presentation, discussion will focus on leave-one-out cross validation (LOOCV) for the rest of this paper.
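To make the construction concrete, the following R sketch illustrates the generic search implied by the bound above: for each candidate true error rate ϵ on a coarse grid, a Monte Carlo estimate of Ĝϵ evaluated at the observed cross-validated error rate is formed, and the upper bound is the largest candidate whose estimated Ĝϵ(ϵCV) is still at least α. The simulator simulate_cv_errors is a hypothetical user-supplied function standing in for the model-based generation described in Section 3, and the two-digit grid reflects the remark that coarse precision is usually adequate.

```r
# Generic grid search for the upper confidence bound on the actual error rate (sketch).
# simulate_cv_errors(eps, B) is assumed to return B Monte Carlo draws of the
# cross-validated error rate when the actual error rate is fixed at eps.
upper_bound_error <- function(observed_cv_error, simulate_cv_errors,
                              alpha = 0.05, B = 1000,
                              grid = seq(0.01, 0.50, by = 0.01)) {
  # Estimated G_eps evaluated at the observed CV error rate, for each candidate eps.
  p_le <- sapply(grid, function(eps) {
    draws <- simulate_cv_errors(eps, B)      # B simulated LOOCV error rates
    mean(draws <= observed_cv_error)
  })
  # Upper bound: largest candidate eps with estimated G_eps(observed) >= alpha.
  keep <- grid[p_le >= alpha]
  if (length(keep) == 0) observed_cv_error else max(keep)
}
```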

3. MATHEMATICAL DEVELOPMENT FOR THE MULTIVARIATE NORMAL HOMOSCEDASTIC MODEL

The multivariate normal homoscedastic model for the data vectors is

x ∼ p1 φ(x | μ, Σ) + (1 − p1) φ(x | −μ, Σ),

where φ(·|μ,Σ) is the multivariate normal density with mean vector μ and covariance matrix Σ, and p1 is the proportion of the population in class C1. To simplify presentation, the population is centered around zero, so that the class means are μ and −μ. To further simplify presentation, it will be assumed that p1 = 1/2 and that n/2 samples are drawn at random from each class for the training set.

3.1. The geometry of the subspace of linear classifiers with a specified accuracy

In Appendix A, it is shown that a linear predictor L with actual error rate ϵ must satisfy the equation

L′μ / (L′ΣL)^{1/2} = kϵ,

where kϵ = Φ⁻¹(1 − ϵ) and Φ is the standard normal cumulative distribution function. The subspace of such predictors is denoted 𝒮ϵ. This result is a natural extension of results from Dobbin and Simon (2007). If the covariance matrix Σ is known and nonsingular, then a data transformation can produce an identity covariance matrix, Σ = Ip, where Ip denotes a p by p identity matrix. It is shown in Appendix B that when Σ = Ip, 𝒮ϵ is the cone of points L that form the angle

θϵ = cos⁻¹(kϵ / |μ|)

with the vector μ. This identifies the solution subspace 𝒮ϵ in the p-dimensional space of L. The solution space is diagrammed in Figure 1.

Fig. 1.

Geometry of the solution space.
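A small numerical illustration of this geometry (a sketch with arbitrary values; the function name is illustrative): for |μ| = 2 and a target error rate ϵ = 0.10, the half-angle of the cone is cos⁻¹(kϵ/|μ|).

```r
# Half-angle of the cone of linear predictors with actual error rate eps,
# under Sigma = I and class means +/- mu with |mu| = mu_len.
# Requires k_eps <= mu_len, i.e. the target error rate must be achievable.
cone_angle <- function(eps, mu_len) {
  k_eps <- qnorm(1 - eps)              # k_eps = Phi^{-1}(1 - eps)
  acos(k_eps / mu_len)                 # angle (radians) between L and mu
}
cone_angle(eps = 0.10, mu_len = 2)     # approximately 0.87 radians
```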

3.2. The conditional distribution of linear classifiers over the subspace

For mathematical simplicity, this paper will develop the model using the predictor

L = x̄(1) − x̄(2),    (3.1)

where x̄(c) is the vector of sample means for class Cc. The joint distribution of the sample and L, conditional on μ and Σ, can then be used to calculate the marginal distribution of L, and of L given ϵ.

Given μ, fixing the actual error rate ϵ is equivalent to restricting the distribution of L to the subspace 𝒮ϵ. However, μ is unknown and must be estimated. Moreover, the larger |μ|, the smaller E[ϵ]. The actual error rate will center around the expected error rate with some variation, and if n is reasonably large, this variation will likely be reasonably small; in other words, ϵ ≈ E[ϵ | |μ|]. Therefore, a reasonable approximation for |μ| can be obtained from the relationship between |μ| and E[ϵ | |μ|]. For a range of values of |μ|, Monte Carlo is used to estimate the average error rate corresponding to each value of |μ|. This results in a function g(|μ|) = E[ϵ | |μ|]. Then, g⁻¹ is used to estimate an approximate |μ| for a given ϵ. Alternatively, the approach of Moran (1975) can be used to construct g.
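The following R sketch shows one way the function g might be tabulated and inverted by Monte Carlo. It is a sketch under the Σ = I model with class means ±μ; the predictor is the mean-difference direction of (3.1), and the actual error rate of each simulated predictor is computed from the normal-theory expression with cutoff zero rather than from a test set. The function names, grid of |μ| values, and numbers of repetitions are illustrative.

```r
# Tabulate g(|mu|) = E[ actual error rate | |mu| ] by Monte Carlo, then invert.
# Model: Sigma = I_p, class means +mu and -mu, n/2 samples per class,
# predictor L = xbar_1 - xbar_2 with cutoff 0 (sketch).
estimate_g <- function(mu_grid, n, p, n_rep = 200) {
  sapply(mu_grid, function(mu_len) {
    mu <- c(mu_len, rep(0, p - 1))                     # WLOG mu along the first axis
    mean(replicate(n_rep, {
      x1 <- matrix(rnorm(n / 2 * p), ncol = p) + rep(mu,  each = n / 2)
      x2 <- matrix(rnorm(n / 2 * p), ncol = p) + rep(-mu, each = n / 2)
      L  <- colMeans(x1) - colMeans(x2)
      # Actual error rate of L (cutoff 0) under the normal model:
      pnorm(-sum(L * mu) / sqrt(sum(L^2)))
    }))
  })
}

mu_grid <- seq(0.2, 3, by = 0.2)
g_vals  <- estimate_g(mu_grid, n = 60, p = 500, n_rep = 50)
# Invert g by interpolation to obtain an approximate |mu| for a target error rate:
g_inverse <- approxfun(g_vals, mu_grid)   # assumes g is monotone over the grid
mu_eps    <- g_inverse(0.10)
```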

Define the scalar μϵ = g⁻¹(ϵ). In Appendix C, it is shown using the delta method and the properties of the noncentral chi-squared distribution that the conditional mean and variance of the length |L| for a given ϵ can be approximated by

E[|L| | ϵ] ≈ 2(μϵ² + p/n)^{1/2},    (3.2)

Var(|L| | ϵ) ≈ 2(p + 2nμϵ²) / (n(p + nμϵ²)).    (3.3)

These formulas can be used to generate samples over 𝒮ϵ with the appropriate mean and variance. The generation can be simplified by using a normal distribution and exploiting the symmetry by restricting simulation to a single ray of the cone 𝒮ϵ.
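A sketch of how draws of L over 𝒮ϵ might be generated under these approximations: the length |L| is drawn from a normal distribution with the mean and variance of (3.2) and (3.3), and, exploiting the rotational symmetry about μ, each draw is placed on a single fixed ray of the cone making angle cos⁻¹(kϵ/μϵ) with μ, which is taken along the first coordinate axis. The specific parameterization and function name are assumptions for illustration.

```r
# Generate draws of the linear predictor L restricted to the cone S_eps (Sigma = I).
# |L| is approximated as normal with moments (3.2)-(3.3); the direction is a fixed
# ray at angle theta_eps from mu = (mu_eps, 0, ..., 0). Requires k_eps <= mu_eps.
generate_L_on_cone <- function(B, eps, mu_eps, n, p) {
  k_eps     <- qnorm(1 - eps)
  theta_eps <- acos(k_eps / mu_eps)                      # cone half-angle
  m_len     <- 2 * sqrt(mu_eps^2 + p / n)                # (3.2), approximate E|L|
  v_len     <- 2 * (p + 2 * n * mu_eps^2) /
               (n * (p + n * mu_eps^2))                  # (3.3), approximate Var|L|
  lens      <- rnorm(B, mean = m_len, sd = sqrt(v_len))  # normal approximation for |L|
  # Unit vector on the chosen ray: cos(theta) along mu, sin(theta) along axis 2.
  ray <- c(cos(theta_eps), sin(theta_eps), rep(0, p - 2))
  t(sapply(lens, function(len) len * ray))               # B x p matrix of draws of L
}
```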

Finally, note that it is important to estimate the distribution over 𝒮ϵ adequately because the relation between the actual error rate of the predictor and the LOOCV error rate will be affected by where on the subspace 𝒮ϵ the predictor L falls. In particular, the LOOCV error rate estimate will not be nearly unbiased for all points on the surface 𝒮ϵ. This issue is discussed further in Section 4.

3.3. The conditional distribution of the sample given L

Section 3.2 provides a method for generating L ∈ 𝒮ϵ. In other words, the method generates data from the marginal distribution of L given ϵ, utilizing the joint distribution of L and the sample given μ. The conditional distribution of the sample given L is derived in Appendix D. With the samples ordered so that the first n/2 come from class C1 and the last n/2 from class C2, and with s denoting the n-vector of class signs (+1 for C1 samples, −1 for C2 samples), the sample given L is multivariate normal with mean vector (1/2)s ⊗ L and covariance matrix

In ⊗ Σ − n⁻¹(ss′) ⊗ Σ,

where ss′ is the n by n block matrix with diagonal blocks Jn/2,n/2 and off-diagonal blocks −Jn/2,n/2. Here, Jm,n indicates a matrix of 1s with m rows and n columns, and ⊗ indicates the Kronecker matrix product (Hocking, 1992, p. 672). This formula is used to generate samples conditional on L. In simulations, Σ = I as discussed previously. In some high-dimensional cases, generating samples with this covariance structure may be computationally demanding, and it may be adequate to drop the final variance term, which should be small for large n, as has been done in the high-dimensional examples in Section 4.
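As an illustration (a sketch under the ordering in which the first n/2 rows belong to C1 and the last n/2 to C2, and with Σ = I as in the simulations): the conditional mean of each row is ±L/2 and, dropping the small O(1/n) covariance term as described above, rows can be generated as independent normals around those means.

```r
# Generate a training set conditional on the predictor L (Sigma = I, sketch).
# Rows 1..n/2 are from class C1 with conditional mean +L/2; rows n/2+1..n are from
# class C2 with conditional mean -L/2. The O(1/n) term of the exact conditional
# covariance is dropped, as described in the text for large n.
generate_sample_given_L <- function(L, n) {
  p        <- length(L)
  signs    <- rep(c(1, -1), each = n / 2)      # +1 for C1 rows, -1 for C2 rows
  mean_mat <- outer(signs, L / 2)              # conditional mean of each row is +/- L/2
  mean_mat + matrix(rnorm(n * p), nrow = n)    # identity conditional covariance (approx.)
}
```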

3.4. Calculating the exact error rate when a cutoff is chosen

When estimating coverage probabilities of confidence intervals for the actual error rate, the predictor consists of both L and a cutoff value, or cut point, to be used for classification. The cut point is usually chosen empirically from the data. Under the compound covariate prediction (CCP) used in this paper, the cut point is the midpoint of the mean prediction scores of the 2 classes, k = (m̄1 + m̄2)/2, where m̄i is the mean prediction score for samples from class i. Suppose k is the cut point used for classification. Then, it is easy to show that the actual accuracy for a linear predictor L with this cutoff is

(1/2) Φ((L′μ − k)/(L′ΣL)^{1/2}) + (1/2) Φ((L′μ + k)/(L′ΣL)^{1/2}).    (3.4)
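For concreteness, (3.4) translates directly into a small R function (a sketch for the equal-prevalence, 2-class case; under unequal prevalences the two 1/2 weights would be replaced by the class prevalences, as noted in Section 5). The function name and example values are illustrative.

```r
# Actual accuracy of a linear predictor L with cutoff k, following (3.4):
# classes are N(mu, Sigma) and N(-mu, Sigma) with equal prevalence, and a
# sample is assigned to C1 when L'x > k.
actual_accuracy <- function(L, k, mu, Sigma) {
  s <- sqrt(drop(t(L) %*% Sigma %*% L))   # sd of L'x within each class
  m <- drop(crossprod(L, mu))             # mean of L'x in class C1 (and -m in C2)
  0.5 * pnorm((m - k) / s) + 0.5 * pnorm((m + k) / s)
}

# Example: p = 3, Sigma = I, mu along the first axis, L the true direction.
actual_accuracy(L = c(1, 0, 0), k = 0, mu = c(1, 0, 0), Sigma = diag(3))
# returns pnorm(1), about 0.841
```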

3.5. Generalizing to the context of unknown variance and correlations

The mathematical development is based on simplifying assumptions. Outside of simulations, the covariance matrix is not known and must be estimated. If the sample size is large and the dimension is small, then a good estimate of the covariance matrix is possible, and the new procedure can be used by substituting the estimated S for Σ. If the dimension is large, then estimation of the covariance matrix will generally not be feasible. A common approach in high dimensions is to assume the covariance matrix is diagonal (Dudoit and others, 2002) and proceed under this assumption using a standard predictor such as CCP (Radmacher and others, 2002).

When assessing coverage probabilities for the unknown-covariance Gaussian case, the predictor L described above will be replaced with the predictor

Lu = (d1/s1, d2/s2, …, dp/sp)′,    (3.5)

where di = x̄1,i − x̄2,i and si² = (S1,i² + S2,i²)/2. Here, x̄1,i and S1,i² are the sample mean and variance in class C1 for feature i, respectively. It is not clear a priori what impact estimating the covariance matrix elements will have on coverage probabilities. Monte Carlo will be used to assess the impact.

3.6. Generalizing to the context of feature selection

In high dimensions, it is common to select features for inclusion in the predictor. Suppose the predictor development algorithm stipulates that q features will be selected from the full set of p features, with q < p. These “selected” features may be either linear combinations of the existing features resulting from dimension-reduction methods, such as principal components, or a simple subset of the existing features, such as a list of q genes involved in a particular gene ontology category or with the smallest p-values. Denote the Mahalanobis distance between the class means in the original space by Δp and in the reduced q-dimensional space by Δq. Then, it is easy to see that Δp ≥ Δq. This implies that the optimal performance (often called the Bayes error rate) of a predictor is always as good or better in the full-dimensional space than in the reduced-dimensional space. However, the expected performance associated with a predictor developed on a finite sample may, on the contrary, be better in the reduced-dimensional space.

The geometry in the reduced-dimensional space is affected by dimension reduction. For example, μ may be shorter after feature selection, reflecting missed informative features. In LOOCV with feature selection, the geometry varies during each LOOCV step. Let (a0, q0) be the actual accuracy and the dimension of the predictor developed on the full data set. Similarly, let (a1, q1), …, (an, qn) be the accuracies and dimensions of the predictors developed during LOOCV. For n reasonably large, ai ≈ a0, since removing a single sample from the training set will not have a large impact on the form of the predictor. Similarly, qi ≈ q0, at least to an order of magnitude, depending on how the dimension reduction is done. In both cases, there may be some downward bias from the smaller sample size, but this should be small as long as n is not too small.

The previous method for estimating |μ| needs to be revised in the context of gene selection. Recall that the previous method was based first on a Monte Carlo estimation of the functional relation g(|μ|) = E[ϵ]. Now, let gselection(|μ|) be the corresponding function for a modified algorithm with a gene selection step. This function can be estimated easily by Monte Carlo in the same way g(|μ|) was estimated, except that the Monte Carlo includes the gene selection step. When this was done, it was found generally that gselection(|μ|) ≥ g(|μ|) when μ is in the reduced-dimensional subspace; in other words, the expected error rate associated with the same μ was higher with gene selection than without gene selection. This is intuitively the right direction, since there is some cost associated with looking through a large number of genes p to get down to a selected set of q genes, a cost that must be paid in the gselection case but not in the g case; hence, the former requires larger differences between the populations to achieve the same accuracy.

What really matters to the current methodology is how the switch from g to gselection affects the corresponding percentiles of the LOOCV accuracy distribution, that is, the percentiles on which confidence bounds are based. The methodology uses Monte Carlo and (3.4) to estimate gselection, and thus μϵ, and then generates many data sets over 𝒮ϵ. The empirical distribution function (EDF) of the LOOCV error rates of these data sets serves as the estimated distribution function Ĝϵ of Section 2, and quantiles of this EDF correspond to confidence bounds. Figure 2 shows one example of the impact of gene selection: it shifts the LOOCV accuracy percentiles upward toward 1, indicating that failure to perform this adjustment will result in anti-conservative coverage probabilities.

Fig. 2.

Comparison of estimated 90th percentiles of distribution of LOOCV accuracies using naive binomial method versus new method without gene selection versus new method with gene selection. X-axis is the true accuracy of the predictor. Quantiles based on 1000 ...

These considerations lead to the following approach: in the presence of feature selection, apply the method with the modifications that (1) when estimating |μ|, use μϵ = gselection⁻¹(ϵ) in (3.2) and (3.3) and (2) when estimating Ĝϵ, the dimension should match the dimension of the selected feature space of the predictor developed on the full data set of n samples.
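The key computational point in estimating gselection is that the gene selection step must be repeated inside every leave-one-out step, exactly as it would be in an honest cross validation of a real analysis (Ambroise and McLachlan, 2002). The following R sketch shows this structure for a simple selection rule (the q features with the largest absolute pooled-variance t-statistics) and a CCP-style mean-difference predictor with a midpoint cut point; the specific selection rule, predictor, and names are illustrative stand-ins rather than the paper's exact implementation.

```r
# LOOCV error rate with feature selection performed inside each fold (sketch).
# x: n x p matrix; y: class labels (1 or 2); q: number of features to select.
loocv_with_selection <- function(x, y, q) {
  n <- nrow(x)
  errors <- sapply(seq_len(n), function(i) {
    xtr <- x[-i, , drop = FALSE]; ytr <- y[-i]
    # Feature selection on the training fold only (pooled-variance t-tests):
    tstat <- sapply(seq_len(ncol(xtr)), function(j) {
      t.test(xtr[ytr == 1, j], xtr[ytr == 2, j], var.equal = TRUE)$statistic
    })
    keep <- order(abs(tstat), decreasing = TRUE)[seq_len(q)]
    # Mean-difference (CCP-style) predictor on the selected features:
    L <- colMeans(xtr[ytr == 1, keep, drop = FALSE]) -
         colMeans(xtr[ytr == 2, keep, drop = FALSE])
    # Cut point: midpoint of the mean prediction scores of the two classes.
    k <- mean(c(mean(xtr[ytr == 1, keep, drop = FALSE] %*% L),
                mean(xtr[ytr == 2, keep, drop = FALSE] %*% L)))
    pred <- if (sum(x[i, keep] * L) > k) 1 else 2
    pred != y[i]                              # TRUE if the left-out sample is misclassified
  })
  mean(errors)
}
```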

4. RESULTS

Table 1 shows the coverage probabilities for the actual accuracy of the compound covariate predictor, Lu, in low-dimensional settings. The naive binomial method computes an exact bound from the observed LOOCV accuracy using the exact binomial method implemented in R version 2.6.2. Comparing the new method to the binomial, note that the new method is more conservative overall and provides coverage probabilities that are at or above the nominal level. The binomial does adequately for the most part in these settings but does display some of the expected anti-conservative bias when n = 500 and p = 20. McLachlan's method is also shown for comparison and performs similarly to the new method except as the dimension approaches the sample size, for example, where n = 60 and p = 50. There, the new method maintains the correct coverage whereas McLachlan's is extremely anti-conservative. Even with n = 60 and p = 20, McLachlan's method shows some anti-conservative tendency.
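A naive binomial bound of the kind referred to here can be obtained directly from the standard exact binomial machinery in R, for example as a one-sided Clopper–Pearson bound computed from the observed number of LOOCV misclassifications. This is a sketch: binom.test is one standard implementation of the exact binomial method, and whether it is precisely the routine used for the tables is an assumption.

```r
# Naive binomial upper bound on the error rate from the observed LOOCV errors.
# Treats the n leave-one-out outcomes as if they were n independent Bernoulli
# trials with success probability equal to the actual error rate.
naive_binomial_upper_bound <- function(n_errors, n, conf.level = 0.95) {
  binom.test(n_errors, n, alternative = "less",
             conf.level = conf.level)$conf.int[2]
}
naive_binomial_upper_bound(n_errors = 6, n = 60)   # one-sided 95% upper bound
```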

Table 1.

Table of coverage probabilities for compound covariate predictor Lu of (3.5) in low dimensions. |μ| = 1 for all entries. Coverage probabilities estimated from 1000 Monte Carlo simulations

Table 2 shows the coverage probabilities in high-dimensional settings with p=500 dimensions. In this case, the comparison is only to the naive binomial method because McLachlan's method cannot be used when the sample covariance matrix is not invertible. The new method tends to be more conservative than the binomial and better maintains the nominal coverage probabilities throughout. Indeed, the binomial coverage probabilities are mostly below the nominal.

Table 2.

Table of coverage probabilities for compound covariate predictor Lu of (3.5) in high dimensions. Dimension p = 500 for all entries. Coverage probabilities estimated from 1000 Monte Carlo repeats for n = 60 and 200 Monte ...

Table 3 shows the results from high dimensions with gene selection. Gene selection appears to negatively impact the coverage probabilities of the naive binomial method, which are here badly anti-conservative. This is probably a result of the fact that the gene selection step of choosing only genes with large differences between the classes will tend to result in “longer” predictors, and as was seen above, the relation between the true error rate and the LOOCV error rate is mediated through the predictor's length. The simple binomial method makes no correction for this fact. The new approach does correct for it, and one can see this reflected in the coverage probabilities for the new method, which are all at or above the nominal level. Also presented for comparison is the bootstrap approach of Jiang and others (2008), which performs similarly to the new method. Therefore, the new method provides conservative intervals in the setting of gene selection.

Table 3.

Table of coverage probabilities of Lu in high dimensions with gene selection. In each case, the 12 genes with the most significant p-values from pooled-variance t-tests were selected for use in the predictor. Data are generated with DEG number of informative ...

Table 4 shows the application of the method to several real data sets, along with comparison to the simple binomial method on the same data sets. In each case, there is considerable difference between confidence bounds based on the new method compared to the binomial method. Based on the simulations, this suggests that the binomial intervals for these data sets may be inadequate.

Table 4.

Comparison of lower confidence bounds produced from methods applied to real microarray data sets. For the Bhattacharjee and others (2001) data set, the classes predicted were adenocarcinoma versus other, using 94 genes with > 2-fold mean change ...

5. DISCUSSION AND CONCLUSIONS

This paper presented a new method for constructing confidence bounds for the actual error rate of a predictor. The new method was compared to the naive binomial approach and to 2 previously published methods. The new method appears to attain nominal coverage probabilities in low dimensions, high dimensions, and high dimensions with gene selection, whereas alternative methods sometimes fail to achieve nominal coverage. In high dimensions with gene selection, the new method performed similarly to the bootstrap approach of Jiang and others (2008). Interestingly, the naive binomial method performed particularly poorly in the setting of high dimensions with gene selection, indicating that the method is ill-suited to this setting, which is the most common one in microarray experiments. In contrast, the new method captures the effect of gene selection through a relatively simple adjustment, resulting in nominal or slightly conservative coverage probabilities. The methods were also applied to several real microarray data sets.

Another interesting result from this work is that it provides insight into why the feature selection step results in particularly anti-conservative naive binomial method coverage probabilities. The feature selection step will generally impact the average length of a linear predictor. As a result, the distribution of the linear predictor over the subspace corresponding to a fixed true error rate (the cone in Figure 1) is different if feature selection has occurred. The predictor is longer in the gene selection case. But, most importantly, the Monte Carlo investigations showed that the relation between the actual error rate of a predictor and the LOOCV error rate of a predictor is mediated through the length of the predictor. That is, for a fixed actual error rate (e.g. a point on the cone in Figure 1), say of 5%, longer predictors will on average result in LOOCV error rates < 5% and shorter predictors will on average result in LOOCV error rates >5%. This inflates the anti-conservative tendency of the naive binomial intervals.

Robustness of methodologies is a common concern, particularly in microarray studies. The multivariate normality assumption may be violated in practice. As a result, it makes sense to apply several methodologies to a data set to assess the robustness of the results. For the specific context of constructing a confidence interval for prediction error, one could apply both the parametric method presented here and the bootstrap method of Jiang and others (2008) to get some assurance of the robustness of the interval. Also note that the simulations in this paper focused on n reasonably large in absolute terms (n ≥ 60). The performance of the method when n is small in absolute terms (e.g. n < 40) has not been studied extensively, and application may require further modifications (e.g. second-order Taylor series approximations in (3.2) and (3.3)).

This paper has made some additional simplifying assumptions: that the prevalence of each class in the population is the same, that there are only 2 classes, that the covariance matrix is the same within each class, and that the population is centered around zero. If the prevalences are not equal, then (3.4) should be modified so that the 1/2 fractions are replaced by the true prevalences p1 and p2, and the Monte Carlo procedures should be modified throughout so that the generated data sets reflect the population prevalences. If there are more than 2 classes, then the geometric approach presented here becomes more complex, and it may not be feasible to adapt this approach to that setting. If the diagonal elements of the covariance matrix differed by class in high dimensions, then CCP and the corresponding gene selection would need to be adjusted accordingly; once a substitute for Lu was established, coverage probabilities would need to be evaluated for the new predictor. Finally, note that the assumption of centering around zero is made only to simplify the mathematical presentation; such centering is not critical to either the basic mathematical results or the CCP coverage assessments.

This paper has focused on the predictor development method known as CCP, with or without feature selection. The method can be adapted to diagonal linear discriminant analysis (DLDA) by substituting DLDA predictors for CCP predictors in the Monte Carlo generation of the LOOCV error rates used to estimate percentiles of the LOOCV distribution. Coverage probabilities would need to be reassessed. Adaptation to other linear predictor development algorithms may be possible but would require more extensive modifications.

Finally, note that the computational cost of this approach is higher than that of McLachlan's approach but still reasonable. McLachlan's approach is a function of the sample Mahalanobis distance. The new approach requires some preliminary Monte Carlo work to estimate g or gselection, the functional relation between the Mahalanobis distance and the average error rate. In R, steps of this search took 40.860 s for 50 Monte Carlo repetitions in the dimension-reduction scenario of Table 3 with p = 1000 and n = 60. Then, a simple search algorithm with another Monte Carlo is used to find the true accuracy that places the corresponding percentile of the LOOCV accuracy just to one side of the observed LOOCV accuracy. The C++ execution time for the Monte Carlo program was 60.497 s to perform 200 simulations in the same scenario. This search can start at the percentile corresponding to the binomial confidence bound and should then converge in a handful of steps, assuming high precision is not required.

Acknowledgments

All compound covariate analyses of actual microarray data sets were performed using Biometric Research Branch ArrayTools developed by Dr Richard Simon and Amy Peng Lam. Real data sets were downloaded from the Biometric Research Branch-ArrayTools Data Archive for Human Cancer Gene Expression (http://linus.nci.gov/~brb/DataArchive_New.html). All other calculations were carried out using programming languages R and C++ with Optivec (http://www.optivec.com) shareware version 5.0. Lisa McShane is acknowledged for helpful suggestions regarding the simulations during the manuscript development. Two reviewers are acknowledged for comments that improved the manuscript. Conflict of Interest: None declared.

APPENDIX A: LINEAR PREDICTORS WITH ACTUAL ERROR RATE ϵ

For the predictor to have accuracy 1 − ϵ on the population, it must satisfy

P(L′x > 0 | x ∈ C1) = P(L′x ≤ 0 | x ∈ C2) = 1 − ϵ.

Now, for x from class C1, L′x is normally distributed with mean L′μ and variance L′ΣL, so that

P(L′x > 0 | x ∈ C1) = Φ(L′μ / (L′ΣL)^{1/2}),

and by symmetry the same expression gives P(L′x ≤ 0 | x ∈ C2). So any classifier L with actual error rate ϵ must satisfy

L′μ / (L′ΣL)^{1/2} = Φ⁻¹(1 − ϵ) = kϵ,

as claimed.

APPENDIX B: THE SUBSPACE 𝒮ϵ

When Σ = Ip, the defining equation of 𝒮ϵ becomes

L′μ = kϵ|L|.

Now, the dot product of L and μ is L′μ = |L||μ|cos θ, where θ is the angle between μ and L. Hence,

cos θ = kϵ/|μ|, that is, θ = cos⁻¹(kϵ/|μ|).

APPENDIX C: DISTRIBUTION OF |L| GIVEN ϵ

Without loss of generality, suppose μ has the form (μϵ, 0, …, 0), where μϵ > 0. Then

L = x̄(1) − x̄(2) ∼ N(2μ, (4/n)Σ).

Now, (n/4)L′Σ⁻¹L is noncentral chi squared with p degrees of freedom and noncentrality parameter 4μϵ²n/4 = nμϵ². Therefore, E[(n/4)L′Σ⁻¹L] = p + nμϵ² and Var((n/4)L′Σ⁻¹L) = 2p + 4nμϵ². Hence,

E[L′Σ⁻¹L] = (4/n)(p + nμϵ²)  and  Var(L′Σ⁻¹L) = (16/n²)(2p + 4nμϵ²).

By the delta method, E[g(x)] ≈ g(E[x]) and Var[g(x)] ≈ [g′(E[x])]²Var(x), which, applied to g(x) = x^{1/2} with Σ = Ip, yields

E[|L| | ϵ] ≈ 2(μϵ² + p/n)^{1/2}  and  Var(|L| | ϵ) ≈ 2(p + 2nμϵ²) / (n(p + nμϵ²)).

APPENDIX D: DISTRIBUTION OF THE SAMPLE GIVEN L

We use the Kronecker product notation (Hocking, 1992), defined so that A ⊗ B is the block matrix (aij B). In other words, A ⊗ B is formed by replacing each element aij of A by the matrix B multiplied by that element. Under the model, with the samples ordered so that the first n/2 come from class C1 and the last n/2 from class C2, the stacked sample vector X and the predictor L = x̄(1) − x̄(2) are jointly multivariate normal with

E[X] = s ⊗ μ,  E[L] = 2μ,  Cov(X) = In ⊗ Σ,  Cov(X, L) = (2/n)s ⊗ Σ,  Cov(L) = (4/n)Σ,

where s is the n-vector with entries +1 for C1 samples and −1 for C2 samples. Recalling that, for jointly normal vectors,

X | L ∼ N( E[X] + Cov(X, L)Cov(L)⁻¹(L − E[L]),  Cov(X) − Cov(X, L)Cov(L)⁻¹Cov(L, X) ),

leads to the conditional mean

E[X | L] = s ⊗ μ + (1/2)s ⊗ (L − 2μ) = (1/2)s ⊗ L

and the conditional covariance

Cov(X | L) = In ⊗ Σ − n⁻¹(ss′) ⊗ Σ.

References

  • Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene expression data. Proceedings of the National Academy of Sciences of the United States of America. 2002;99:6562–6566. [PMC free article] [PubMed]
  • Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America. 2001;98:13790–13795. [PMC free article] [PubMed]
  • Devroye L, Gyorfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. New York: Springer; 1991.
  • Dobbin KK, Simon RM. Sample size planning for developing predictors using high-dimensional DNA microarray data. Biostatistics. 2007;8:101–117. [PubMed]
  • Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd edition. New York: John Wiley and Sons; 2001.
  • Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87.
  • Efron B. Bootstrap confidence intervals for a class of parametric problems. Biometrika. 1985;72:45–58.
  • Efron B. Better bootstrap confidence intervals (with discussion) Journal of the American Statistical Association. 1987;82:171–200.
  • Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.
  • Efron B, Tibshirani RJ. Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association. 1997;92:548–560.
  • Fu WJ, Dougherty ER, Mallick B, Carroll RJ. How many samples are needed to build a classifier: a general sequential approach. Bioinformatics. 2005;21:63–70. [PubMed]
  • Fukunaga K. Introduction to Statistical Pattern Recognition. 2nd edition. San Diego, CA: Academic Press; 1990.
  • Hocking RR. Methods and Applications of Linear Models: Regression and the Analysis of Variance. New York: John Wiley and Sons; 1992.
  • Jiang W, Varma S, Simon R. Calculating confidence intervals for prediction error in microarray classification using resampling. Statistical Applications in Genetics and Molecular Biology. 2008;7:Article 8. [PubMed]
  • McLachlan GJ. Confidence intervals for the conditional probability of misallocation in discriminant analysis. Biometrics. 1975;32:161–167. [PubMed]
  • McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley and Sons; 1992.
  • Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet. 2005;365:488–492. [PubMed]
  • Moran MA. On the expectation of errors of allocation associated with a linear discriminant function. Biometrika. 1975;62:141–148.
  • Mukherjee S, Tamayo P, Rogers S, Rifkin R, Engle A, Campbell C, Golub TR, Mesirov JP. Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology. 2003;10:119–142. [PubMed]
  • Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. Journal of Computational Biology. 2002;9:505–511. [PubMed]
  • Rao CR. Linear Statistical Inference and its Applications. 2nd edition. New York: John Wiley and Sons; 1973.
  • Simon R, Radmacher R, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute. 2003;95:14–18. [PubMed]
  • Stuart A, Ord K, Arnold S. Kendall's Advanced Theory of Statistics, Volume 2A: Classical Inference and the Linear Model. 6th edition. London: Arnold Publishers; 1999.
  • Tian E, Zhan F, Walker R, Rasmussen E, Ma Y, Barlogie B, Shaughnessy JD., Jr. The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. The New England Journal of Medicine. 2003;349:2483–2494. [PubMed]
  • van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van Der Kooy K, Marton MJ, Witteveen AT. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. [PubMed]
  • Xu Q, Hua J, Braga-Neto U, Xiong Z, Suh E, Dougherty ER. Confidence intervals for the true classification error conditioned on the estimated error. Technology in Cancer Research and Treatment. 2006;5:579–589. [PubMed]
