NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
J Am Stat Assoc. Author manuscript; available in PMC Feb 23, 2012.
Published in final edited form as:
PMCID: PMC3285493
NIHMSID: NIHMS266707

Adaptive Confidence Intervals for the Test Error in Classification

Abstract

The estimated test error of a learned classifier is the most commonly reported measure of classifier performance. However, constructing a high quality point estimator of the test error has proved to be very difficult. Furthermore, common interval estimators (e.g. confidence intervals) are based on the point estimator of the test error and thus inherit all the difficulties associated with the point estimation problem. As a result, these confidence intervals do not reliably deliver nominal coverage. In contrast we directly construct the confidence interval by use of smooth data-dependent upper and lower bounds on the test error. We prove that for linear classifiers, the proposed confidence interval automatically adapts to the non-smoothness of the test error, is consistent under fixed and local alternatives, and does not require that the Bayes classifier be linear. Moreover, the method provides nominal coverage on a suite of test problems using a range of classification algorithms and sample sizes.

Keywords: Classification, Test Error, Pretesting, Confidence Intervals, Non-Regular Asymptotics

1 Introduction

In classification problems, we observe a training set of (feature, label) pairs, $T = \{(X_i, Y_i)\}_{i=1}^n$. The goal is to use this sample to construct a classifier, say $\hat c$, so that when presented with a new feature, $X$, $\hat c(X)$ will accurately predict the unobserved label, $Y$. Accurate prediction corresponds to small test error; recall that the test error is given by $\tau(\hat c) = P 1_{\hat c(X) \neq Y}$, where $P 1_{\hat c(X) \neq Y} = \int 1_{\hat c(x) \neq y}\, dP(x, y)$ denotes expectation over the distribution $P$ of $(X, Y)$ only, and not the distribution of the training set. The test error $\tau(\hat c)$ is a functional of $\hat c$ and thus is a random quantity. For this reason $\tau(\hat c)$ is sometimes referred to as the conditional test error (Efron 1997; Hastie et al. 2009; Chung and Han 2009). Estimation of the test error typically employs resampling. Most commonly, the leave-one-out or k-fold cross-validated test error is reported in practice. Bootstrap estimates of the test error were suggested by Efron (1983) and later refinements were given by Efron and Tibshirani (1995, 1997). There have been a number of simulation studies comparing these approaches; some references include Efron (1983), Chernick et al. (1985), Kohavi (1995), and Krzanowski and Hand (1996). A nice survey of estimators is given by Schiavo and Hand (2000). However, many have documented that estimators of the test error are plagued by bias and high variance across training sets (Zhang 1995; Isaakson 2008; Hastie et al. 2009), and consequently the test error is accepted to be a difficult quantity to estimate accurately. Two reasons for this problematic behavior are that some classification algorithms result in a $\hat c$ that is a non-smooth functional of the training set, and, even when $\hat c$ is a smooth functional of the training set, the test error is the expectation of a non-smooth function of $\hat c$.

An alternative to point estimation is interval estimation (e.g. a confidence interval). However, this approach has also been problematic likely because researchers have followed what we call the “point estimation paradigm”: as a first step a point estimator of the test error is constructed, and as a second step, the distribution of this estimator is approximated. The problem with this approach is that a problematic point estimator of the test error makes the second step very difficult. The point estimation paradigm was employed by Efron and Tibshirani (1997) where the standard error of their smoothed leave-one-out estimator was approximated using the nonparametric delta method. Efron and Tibshirani noted that this approach would not work, however, for their more refined .632 (or .632+) estimators because of non-smoothness. Yang (2006) follows this paradigm as well, using a normal approximation to the repeated split cross-validation estimator. In practice, the point estimation paradigm is often applied by simply bootstrapping the estimator of the test error (see Jiang et al., 2008; Chung and Han 2009). These methods, while intuitive, lack theoretical justification.

We consider interval estimators for linear classifiers constructed from training sets in which the number of features is less than the training set size (p << n). As will be seen, even in this simple setting, natural approaches to constructing interval estimators for the test error can perform poorly. Instead of using the point estimation paradigm, we directly construct the confidence interval by use of smooth data-dependent upper and lower bounds on the test error. These bounds are sufficiently smooth so that their bootstrap distribution can be used to construct valid confidence intervals. Moreover, these bounds are adaptive in the sense that under certain settings exact coverage is delivered.

The outline of this paper is as follows. In Section 2 we illustrate the small sample problems that motivate the use of approximations in a non-regular asymptotic framework. Section 3 introduces the Adaptive Confidence Interval (ACI). The ACI is shown to be consistent under fixed and local alternatives. Section 4 addresses the computational issues involved in constructing the ACI. A computationally efficient (polynomial time) convex relaxation of the ACI is developed and shown to provide nearly identical results to exact computation. Section 5 provides a large experimental study of the ACI and several competitors. A variety of classifiers and sample sizes are considered on a suite of ten examples. The ACI is shown to provide correct coverage while being shorter in length than competing methods. Section 6 discusses a number of generalizations and directions for future research. Most proofs are left to the online supplement.

2 Motivation

Throughout we assume that the training set is an iid sample $T = \{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown joint distribution $P$. The features $X$ are assumed to take values in $\mathbb{R}^p$ while the labels are coded $Y \in \{-1, 1\}$. To construct the linear classifier we fit a linear model $\hat f_T(x) = x^t\hat\beta_n$ by minimizing a convex criterion function. That is, we construct $\hat\beta_n \triangleq \arg\min_{\beta \in \mathbb{R}^p} \mathbb{P}_n L(X, Y, \beta)$, where $\mathbb{P}_n$ is the empirical measure and $L(X, Y, \beta)$ is a convex function of $\beta$ (e.g., hinge loss with an $L_2$ penalty in the case of linear support vector machines). The classifier is the sign of the linear fit; that is, the predicted label $y$ at input $x$ is assigned according to $\hat c(x) = \mathrm{sign}(x^t\hat\beta_n)$ (define $\mathrm{sign}(0) = 1$). Recall that the test error of the learned classifier is defined as

$$\tau(\hat c) \triangleq P 1_{\mathrm{sign}(X^t\hat\beta_n) \neq Y} = P 1_{Y X^t\hat\beta_n < 0},$$

where $P$ denotes expectation with respect to $X$ and $Y$.
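As a small illustration of these definitions, the following sketch implements the sign rule with the convention sign(0) = 1 and the empirical analogue of the test error, the fraction of points with $Y X^t\beta < 0$. The function names and the toy sample are assumptions made for illustration only; they do not come from the paper.

```python
def sign(v):
    """Sign with the convention sign(0) = 1."""
    return 1 if v >= 0 else -1


def linear_fit(x, beta):
    """Linear fit x^t beta for a single feature vector x."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def error_rate(X, y, beta):
    """Fraction of points with y * x^t beta < 0 (empirical 0-1 error)."""
    wrong = sum(1 for xi, yi in zip(X, y) if yi * linear_fit(xi, beta) < 0)
    return wrong / len(y)


# Hypothetical toy sample; the first column is an intercept.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.2]]
y = [1, -1, -1, 1]
beta = [0.0, 1.0]

predictions = [sign(linear_fit(xi, beta)) for xi in X]
print(predictions)
print(error_rate(X, y, beta))
```

Evaluating `error_rate` on an independent sample, rather than on the training set, gives a Monte Carlo approximation to $\tau(\hat c)$ for a fixed fitted $\hat\beta_n$.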

As discussed in the introduction, the test error is a non-smooth functional of the training data. To see this and to gain a clearer understanding of the test error note

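$$\tau(\hat c) = \mathrm{const} + \int_{\mathbb{R}^p} [2q(x) - 1]\, 1_{x^t\hat\beta_n < 0}\, dP_X(x), \tag{1}$$

where $q(x) \triangleq P(Y = 1 \mid X = x)$. Recall that $\mathrm{sign}(2q(x) - 1)$ is the Bayes classifier. Then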

$$\mathrm{Var}(\tau(\hat c)) = \mathbb{E}\left(\int_{\mathbb{R}^p} [2q(x) - 1]\left(1_{x^t\hat\beta_n < 0} - \mathbb{E} 1_{x^t\hat\beta_n < 0}\right) dP_X(x)\right)^2, \tag{2}$$

where $\mathbb{E}$ denotes the expectation over iid training sets of size $n$ drawn from $P$. The form of $\mathrm{Var}(\tau(\hat c))$ reveals that there are two scenarios in which $\tau(\hat c)$ is highly variable. The first occurs when $x^t\hat\beta_n$ is likely to be small relative to $\sqrt{\mathrm{Var}(x^t\hat\beta_n)}$ over a large range of $x$ where $q(x) \neq 1/2$. Notice that this might occur when the classifier does well but is subject to overfitting. The second scenario occurs when $x^t\hat\beta_n$ is likely to be small relative to $\sqrt{\mathrm{Var}(x^t\hat\beta_n)}$ over a small range of $x$ where $q(x)$ is far from $1/2$. In this scenario there may be little overfitting but the classifier may be far from the Bayes rule and hence of poor quality. Note that poor classifier performance and overfitting are hallmarks of small samples. In either case, $\tau(\hat c)$ need not concentrate around $\mathbb{E}\tau(\hat c)$.

In order to provide good intuition for the small sample case, we require an asymptotic framework wherein the test error $\tau(\hat c)$ does not concentrate about $\mathbb{E}\tau(\hat c)$, even in large samples. One way of achieving this is to permit $P(X^t\beta^* = 0)$ to be positive, where $\beta^* \triangleq \arg\min_{\beta \in \mathbb{R}^p} P L(X, Y, \beta)$. This ensures that for all $x \in \mathbb{R}^p$ that satisfy $x^t\beta^* = 0$, the indicator function $1_{x^t\hat\beta_n < 0} = 1_{x^t\sqrt{n}(\hat\beta_n - \beta^*) < 0}$ never settles down to a constant but rather converges to a non-degenerate distribution. Furthermore, if for a non-null subset of these $x$'s we have $q(x) \neq 1/2$, then $\mathrm{Var}(\tau(\hat c))$ does not converge to zero. Hereafter we refer to this as the non-regular framework. This language is consistent with that of Bickel et al. (2001). However, unlike the usual notion of non-regularity, the limiting distribution of $\sqrt{n}(\hat\tau(\hat c) - \tau(\hat c))$ depends not only on the value of $\beta^*$ but also on the marginal distribution of $X$.

To see why it is useful to consider approximations that are valid even in the non-regular asymptotic framework, we consider simulated data, which we call the quadratic example. Here the generative model satisfies $P(X^t\beta^* = 0) = 0$. Data are generated according to the following mechanism:

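$$X_1, X_2 \overset{iid}{\sim} \mathrm{Unif}[0, 5], \qquad \epsilon \sim N(0, 1/4), \qquad Y = \mathrm{sign}\left(X_2 - (4/25) X_1^2 - 1 + \epsilon\right).$$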
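A minimal sketch of this data-generating mechanism follows. It assumes the display reads $X_1, X_2$ iid Unif[0, 5], $\epsilon \sim N(0, 1/4)$ with $1/4$ interpreted as a variance, and $Y = \mathrm{sign}(X_2 - (4/25)X_1^2 - 1 + \epsilon)$; the function name and the seed are hypothetical choices.

```python
import random


def draw_quadratic_example(n, rng):
    """Draw n triples (x1, x2, y) from the assumed quadratic mechanism."""
    sample = []
    for _ in range(n):
        x1 = rng.uniform(0.0, 5.0)
        x2 = rng.uniform(0.0, 5.0)
        eps = rng.gauss(0.0, 0.5)  # standard deviation 1/2, i.e. variance 1/4
        score = x2 - (4.0 / 25.0) * x1 ** 2 - 1.0 + eps
        y = 1 if score >= 0 else -1  # sign(0) = 1 convention
        sample.append((x1, x2, y))
    return sample


rng = random.Random(0)  # arbitrary fixed seed for reproducibility
train = draw_quadratic_example(30, rng)
print(train[:3])
```

Because the decision boundary is a parabola in $(X_1, X_2)$, any linear working classifier is misspecified here, which is the point of the example.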

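The working classifier is given by $\hat c(x) = \mathrm{sign}(\hat\beta_{n0} + \hat\beta_{n1}x_1 + \hat\beta_{n2}x_2)$, where $\hat\beta_n$ is constructed using squared error loss $L(X, Y, \beta) \triangleq (1 - YX^t\beta)^2$. In this example $\beta^* \approx (-.225, -.317, .439)$, so that the continuity of $X_1$ and $X_2$ ensures that the regularity condition $P(X^t\beta^* = 0) = 0$ is satisfied. Consider two seemingly reasonable and commonly employed methods for constructing a confidence set. The first is the centered percentile bootstrap (CPB). The CPB confidence set is formed by bootstrapping the centered and scaled in-sample error $\sqrt{n}(\mathbb{P}_n - P) 1_{YX^t\hat\beta_n < 0}$. Note that $\sqrt{n}(\mathbb{P}_n - P) 1_{YX^t\hat\beta_n < 0} = \sqrt{n}(\hat\tau(\hat c) - \tau(\hat c))$, where $\hat\tau(\hat c) \triangleq \mathbb{P}_n 1_{YX^t\hat\beta_n < 0}$ is the in-sample error. More specifically, let $\hat u$ and $\hat l$ be the $1 - \gamma/2$ and $\gamma/2$ percentiles of

$$\sqrt{n}(\hat{\mathbb{P}}_n^{(b)} - \mathbb{P}_n) 1_{YX^t\hat\beta_n^{(b)} < 0}, \tag{3}$$

where $\hat{\mathbb{P}}_n^{(b)} \triangleq n^{-1}\sum_{i=1}^n M_{ni}\delta_{(x_i, y_i)}$ is the bootstrap empirical measure with weights $(M_{n1}, M_{n2}, \ldots, M_{nn}) \sim \mathrm{Multinomial}(n, 1/n, 1/n, \ldots, 1/n)$ and $\hat\beta_n^{(b)} \triangleq \arg\min_{\beta \in \mathbb{R}^p} \hat{\mathbb{P}}_n^{(b)} L(X, Y, \beta)$. Then the $1 - \gamma$ CPB interval is given by $[\hat\tau(\hat c) - \hat u/\sqrt{n},\ \hat\tau(\hat c) - \hat l/\sqrt{n}]$. The second approach is based on the asymptotic approximation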

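$$\sqrt{n}(\mathbb{P}_n - P) 1_{YX^t\hat\beta_n < 0} \rightsquigarrow N\left(0, (1 - P 1_{YX^t\beta^* < 0})\, P 1_{YX^t\beta^* < 0}\right). \tag{4}$$

Thus the normal approximation confidence set is given by $\hat\tau(\hat c) \pm z_{1-\gamma}\sqrt{\hat\tau(\hat c)(1 - \hat\tau(\hat c))/n}$ (see the binomial approximation in Chung and Han 2009). If $P(X^t\beta^* = 0) = 0$ then both methods can be shown to be consistent.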
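The two constructions can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the squared error fit is computed from the normal equations (valid because the labels are $\pm 1$), the bootstrap weights are drawn by sampling indices with replacement, the normal interval uses the two-sided quantile $z_{1-\gamma/2}$ as an assumed convention, and all function names, the seed, and the number of resamples are hypothetical.

```python
import math
import random
from statistics import NormalDist


def solve(A, b):
    """Solve a small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(X, y, w):
    """Minimise sum_i w_i (1 - y_i x_i^t beta)^2 via the normal equations.

    Because y_i is +1 or -1, this is weighted least squares of y on x.
    """
    p = len(X[0])
    A = [[sum(wi * xi[j] * xi[k] for xi, wi in zip(X, w)) for k in range(p)]
         for j in range(p)]
    b = [sum(wi * yi * xi[j] for xi, yi, wi in zip(X, y, w)) for j in range(p)]
    return solve(A, b)


def werr(X, y, w, beta):
    """n^{-1} sum_i w_i 1{y_i x_i^t beta < 0}."""
    return sum(wi for xi, yi, wi in zip(X, y, w)
               if yi * sum(a * c for a, c in zip(xi, beta)) < 0) / len(y)


def quantile(values, q):
    """Empirical quantile: the ceil(q * B)-th order statistic."""
    s = sorted(values)
    return s[min(len(s) - 1, max(0, math.ceil(q * len(s) - 1e-9) - 1))]


def cpb_interval(X, y, gamma, B, rng):
    """Centered percentile bootstrap interval for the test error."""
    n = len(y)
    ones = [1.0] * n
    tau_hat = werr(X, y, ones, fit(X, y, ones))  # in-sample error
    draws = []
    for _ in range(B):
        w = [0.0] * n
        for _ in range(n):  # Multinomial(n, 1/n, ..., 1/n) weights
            w[rng.randrange(n)] += 1.0
        beta_b = fit(X, y, w)
        diff = werr(X, y, w, beta_b) - werr(X, y, ones, beta_b)
        draws.append(math.sqrt(n) * diff)
    u_hat, l_hat = quantile(draws, 1 - gamma / 2), quantile(draws, gamma / 2)
    return tau_hat - u_hat / math.sqrt(n), tau_hat - l_hat / math.sqrt(n)


def normal_interval(tau_hat, n, gamma):
    """Normal approximation interval with a two-sided quantile (assumed)."""
    z = NormalDist().inv_cdf(1 - gamma / 2)
    half = z * math.sqrt(tau_hat * (1 - tau_hat) / n)
    return tau_hat - half, tau_hat + half


# Hypothetical usage on one draw from the quadratic example (arbitrary seed).
rng = random.Random(1)
X, y = [], []
for _ in range(30):
    x1, x2 = rng.uniform(0, 5), rng.uniform(0, 5)
    s = x2 - (4 / 25) * x1 ** 2 - 1 + rng.gauss(0, 0.5)
    X.append([1.0, x1, x2])
    y.append(1 if s >= 0 else -1)

cpb_lo, cpb_hi = cpb_interval(X, y, gamma=0.05, B=200, rng=rng)
tau_hat = werr(X, y, [1.0] * 30, fit(X, y, [1.0] * 30))
print("CPB:", (cpb_lo, cpb_hi))
print("Normal:", normal_interval(tau_hat, 30, 0.05))
```

Note that each bootstrap draw evaluates both measures at the refitted coefficient $\hat\beta_n^{(b)}$, matching the form of (3).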

The left hand side of Figure 1 shows the estimated coverage using 1000 Monte Carlo iterations of the CPB with 1000 bootstrap resamples, and the normal approximation. Both methods severely undercover in small samples. This is especially troubling since (i) the problem is low-dimensional, (ii) the linear classifier is of relatively high quality (for example, if $n = 30$ the expected test error is $\mathbb{E}\tau(\hat c) \approx .11$), and (iii) the regularity condition $P(X^t\beta^* = 0) = 0$ is satisfied. Why do these methods fail? Neither method correctly captures the additional variation in the test error across training samples due to the non-smoothness of the test error. Since the generative model satisfies the condition $P(X^t\beta^* = 0) = 0$, the variation across training sets eventually becomes negligible and the methods deliver the desired coverage for large $n$.

Figure 1
Left: Coverage of centered percentile bootstrap and normal approximations for constructing confidence sets for $\tau(\hat c)$. Right: Coverage of centered percentile bootstrap with smoothed target $\tau_{smoothed}(\hat c) \triangleq P(1 + \exp(aY\hat c(X)))$ ...

To illustrate the effect of non-smoothness on the coverage, consider the problem of finding a confidence interval for the functional $\tau_{smoothed}(\hat c) \triangleq P(1 + \exp(aY\hat c(X)))^{-1}$, where $a$ is a positive free parameter. Notice that the size of $a$ varies inversely with the smoothness of $\tau_{smoothed}(\hat c)$. A finite value of $a > 0$ gives the expectation of a sigmoid function, and $a = \infty$ corresponds to $\tau(\hat c)$. Coverage for $a = 0.1$, $1.0$, and $10$ is given in the right hand side of Figure 1. Notice that coverage increases with the smoothness of the target $\tau_{smoothed}(\hat c)$. The dramatic difference in coverage between $a = .1$ and $a = \infty$ suggests that a large component of the anti-conservatism is indeed attributable to non-smoothness.
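The role of $a$ can be seen numerically. The sketch below computes the sample average of $(1 + \exp(a\,y\,c))^{-1}$ for generic classifier outputs $c$; whether $\hat c(X)$ denotes the predicted label or the underlying linear score is not assumed here, and the function name and toy values are hypothetical.

```python
import math


def smoothed_error(scores, labels, a):
    """Sample average of 1 / (1 + exp(a * y * c))."""
    total = 0.0
    for c, yv in zip(scores, labels):
        t = a * yv * c
        if t < 0:
            total += 1.0 / (1.0 + math.exp(t))
        else:
            # Equivalent form that avoids overflow when t is large.
            total += math.exp(-t) / (1.0 + math.exp(-t))
    return total / len(labels)


# Hypothetical outputs: two of the four points have y * c < 0.
scores = [2.0, -1.0, 0.5, -0.3]
labels = [1, 1, -1, -1]

for a in (0.1, 1.0, 10.0, 1000.0):
    print(a, smoothed_error(scores, labels, a))
```

For small $a$ every term is close to $1/2$, so the target barely depends on the classifier; as $a$ grows the average approaches the 0-1 error rate of the sample, here one half.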

Operating in the regular framework there is no indication that these methods may not work well. In the non-regular framework, however, both of these methods are inconsistent. To see this in the case of the CPB, write

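$$\sqrt{n}(\hat{\mathbb{P}}_n^{(b)} - \mathbb{P}_n) 1_{YX^t\hat\beta_n^{(b)} < 0} = \sqrt{n}(\hat{\mathbb{P}}_n^{(b)} - \mathbb{P}_n)\, 1_{X^t\beta^* = 0}\, 1_{YX^t[\sqrt{n}(\hat\beta_n^{(b)} - \hat\beta_n) + \sqrt{n}(\hat\beta_n - \beta^*)] < 0} + \sqrt{n}(\hat{\mathbb{P}}_n^{(b)} - \mathbb{P}_n)\, 1_{X^t\beta^* \neq 0}\, 1_{YX^t\hat\beta_n^{(b)} < 0}. \tag{5}$$

The first term on the right hand side of (5) appears because we allow $P(X^t\beta^* = 0) > 0$ in the non-regular framework; conditioned on the data, the term $\sqrt{n}(\hat\beta_n - \beta^*)$ does not have a limit and consequently the CPB is inconsistent. A detailed proof is omitted (see for example Shao 1994). The inconsistency of the normal approximation can be seen by examining the limiting distribution of $\sqrt{n}(\mathbb{P}_n - P) 1_{YX^t\hat\beta_n < 0}$ in the non-regular framework. This limit is given in Theorem 3.1.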

3 Adaptive confidence interval

In this section we introduce our method for constructing a confidence interval for the test error. This section is organized as follows. We begin by constructing the adaptive confidence interval. Next, we establish the theoretical underpinnings of the method under fixed alternatives. Following this we provide a (heuristic) justification for our method using local alternatives. Finally, we discuss the choice of a tuning parameter required by the method.

3.1 Construction of the ACI

We propose a method of constructing a confidence interval that is consistent in the non-regular framework. We refer to this method as the Adaptive Confidence Interval (ACI) because it is adaptive in two ways. First, unlike the CPB, the ACI provides asymptotically valid confidence intervals regardless of the true parameter values; intuitively the ACI achieves this by adapting to the amount of non-smoothness in the test error. Second, in settings (see Corollary 3.4) in which the CPB is consistent, the upper and lower limits of the ACI are adaptive in that these limits have the same distribution as the upper and lower limits of the CPB.

The ACI is based on bootstrapping an upper bound of the functional $\sqrt{n}(\mathbb{P}_n - P) 1_{YX^t\hat\beta_n < 0}$. This upper bound is constructed by first partitioning the training data $T$ into two groups: (i) points that are far from the boundary $x^t\beta^* = 0$, and (ii) points that are too close to delineate from being on the boundary. The upper bound is constructed by taking the supremum over all possible classifications of the points that we cannot distinguish from lying on the boundary. More precisely, under the non-regular framework the scaled and centered test error can be decomposed as

$$\mathbb{G}_n 1_{YX^t\hat\beta_n < 0} = \mathbb{G}_n 1_{X^t\beta^* = 0}\, 1_{YX^t\hat\beta_n < 0} + \mathbb{G}_n 1_{X^t\beta^* \neq 0}\, 1_{YX^t\hat\beta_n < 0}, \tag{6}$$

where $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$. The first term on the right hand side of (6) corresponds to points on the decision boundary $x^t\beta^* = 0$, and the second term corresponds to points that are not on this boundary. That is, the domain of $X$ is partitioned into two sets. We operationalize this partitioning using a series of hypothesis tests. For each $X = x$ we test $H_0 : x^t\beta^* = 0$ against a two-sided alternative. Let $\Sigma$ denote the asymptotic covariance of $\hat\beta_n$ (see below). Then the test rejects when the statistic $(x^t\hat\beta_n)^2 / (x^t\Sigma x)$ is large. The bounds are obtained by computing the supremum (infimum) over all classifications of points for which the test fails to reject. In particular, an upper bound on $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$ is given by

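$$u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) = \sup_{b \in \mathbb{R}^p} \mathbb{G}_n 1_{\frac{(X^t\hat\beta_n)^2}{X^t\Sigma X} \le \frac{1}{a_n}}\, 1_{YX^tb < 0} + \mathbb{G}_n 1_{\frac{(X^t\hat\beta_n)^2}{X^t\Sigma X} > \frac{1}{a_n}}\, 1_{YX^t\hat\beta_n < 0}, \tag{7}$$

and a lower bound is given by

$$l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) = \inf_{b \in \mathbb{R}^p} \mathbb{G}_n 1_{\frac{(X^t\hat\beta_n)^2}{X^t\Sigma X} \le \frac{1}{a_n}}\, 1_{YX^tb < 0} + \mathbb{G}_n 1_{\frac{(X^t\hat\beta_n)^2}{X^t\Sigma X} > \frac{1}{a_n}}\, 1_{YX^t\hat\beta_n < 0}. \tag{8}$$

The choice of $a_n$ is discussed at the end of this section. Put $b = \hat\beta_n$ to see that (7) and (8) are upper and lower bounds, respectively.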
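The pretest partition and the role of the candidate vector $b$ can be sketched as follows. The code flags points whose statistic $(x^t\hat\beta_n)^2/(x^t\Sigma x)$ is at most $1/a_n$ and then evaluates a bootstrap-world analogue of the objective, $n^{-1/2}\sum_i (m_i - 1)\,\text{indicator}_i$, at a fixed $b$; it does not carry out the supremum. All names, the identity choice for $\Sigma$, and the toy values are assumptions for illustration.

```python
import math


def dot(x, b):
    """Inner product x^t b."""
    return sum(xj * bj for xj, bj in zip(x, b))


def quad_form(x, S):
    """x^t S x for a vector x and a square matrix S given as nested lists."""
    p = len(x)
    return sum(x[j] * S[j][k] * x[k] for j in range(p) for k in range(p))


def near_boundary(X, beta_hat, Sigma, a_n):
    """True where the pretest of H0: x^t beta* = 0 fails to reject."""
    return [dot(x, beta_hat) ** 2 / quad_form(x, Sigma) <= 1.0 / a_n for x in X]


def bootstrap_functional(X, y, m, beta_hat, b, flags):
    """n^{-1/2} sum_i (m_i - 1) * indicator_i.

    Flagged points are classified with the candidate b, all other points
    with beta_hat.
    """
    total = 0.0
    for x, yi, mi, near in zip(X, y, m, flags):
        coef = b if near else beta_hat
        total += (mi - 1) * (1.0 if yi * dot(x, coef) < 0 else 0.0)
    return total / math.sqrt(len(y))


# Hypothetical toy inputs; Sigma is the identity purely for illustration.
X = [[1.0, 0.1], [1.0, 2.0], [1.0, -0.05], [1.0, -3.0]]
y = [1, 1, 1, -1]
m = [2, 0, 1, 1]  # one realisation of bootstrap weights (sums to n)
beta_hat = [0.0, 1.0]
Sigma = [[1.0, 0.0], [0.0, 1.0]]

flags = near_boundary(X, beta_hat, Sigma, a_n=4.0)
at_beta_hat = bootstrap_functional(X, y, m, beta_hat, beta_hat, flags)
at_other_b = bootstrap_functional(X, y, m, beta_hat, [0.0, -1.0], flags)
print(flags, at_beta_hat, at_other_b)
```

In this toy case the alternative $b$ gives a larger value than $b = \hat\beta_n$, which illustrates why the supremum over $b$ is at least the value at $\hat\beta_n$.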

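Suppose we want to construct a $(1 - \delta) \times 100\%$ confidence interval for the test error. We have that

$$\mathbb{P}_n 1_{YX^t\hat\beta_n < 0} - \frac{1}{\sqrt{n}}\, u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) \le P 1_{YX^t\hat\beta_n < 0} \le \mathbb{P}_n 1_{YX^t\hat\beta_n < 0} - \frac{1}{\sqrt{n}}\, l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n).$$

We approximate the distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ and $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ by bootstrap. The bootstrap is shown to be consistent later in this section. Denote the $1 - \delta/2$ percentile of the bootstrap distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ by $u_{1-\delta/2}$ and the $\delta/2$ percentile of the bootstrap distribution of $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ by $l_{\delta/2}$. The $(1 - \delta) \times 100\%$ ACI is given by

$$\mathbb{P}_n 1_{YX^t\hat\beta_n < 0} - \frac{1}{\sqrt{n}}\, u_{1-\delta/2} \le P 1_{YX^t\hat\beta_n < 0} \le \mathbb{P}_n 1_{YX^t\hat\beta_n < 0} - \frac{1}{\sqrt{n}}\, l_{\delta/2}. \tag{9}$$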
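Once bootstrap draws of the upper and lower bounds are available, assembling the interval in (9) is a matter of taking two percentiles and rescaling by $1/\sqrt{n}$. The sketch below assumes the draws have already been computed elsewhere; the function names, the simple order-statistic quantile rule, and the toy draws are hypothetical.

```python
import math


def empirical_quantile(draws, q):
    """The ceil(q * B)-th order statistic of the draws."""
    s = sorted(draws)
    k = math.ceil(q * len(s) - 1e-9) - 1  # small guard against rounding
    return s[min(len(s) - 1, max(0, k))]


def aci_from_draws(tau_hat, n, u_draws, l_draws, delta):
    """Interval [tau_hat - u_q / sqrt(n), tau_hat - l_q / sqrt(n)]."""
    u_q = empirical_quantile(u_draws, 1 - delta / 2)
    l_q = empirical_quantile(l_draws, delta / 2)
    return tau_hat - u_q / math.sqrt(n), tau_hat - l_q / math.sqrt(n)


# Hypothetical bootstrap draws of the upper and lower bounds.
u_draws = [-0.2, 0.0, 0.3, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.5]
l_draws = [-2.4, -2.0, -1.5, -1.2, -1.0, -0.6, -0.4, 0.0, 0.1, 0.3]

lower, upper = aci_from_draws(tau_hat=0.3, n=100, u_draws=u_draws,
                              l_draws=l_draws, delta=0.2)
print(lower, upper)
```

The upper percentile of the upper bound sets the lower endpoint and the lower percentile of the lower bound sets the upper endpoint, mirroring the inequality chain above.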

3.2 Properties of the ACI

In the remainder of the paper we verify that the ACI is asymptotically of the correct size even if the problem is non-regular (e.g. $P(X^t\beta^* = 0) > 0$) and we evaluate the performance of the ACI in small samples. A method for efficiently approximating the ACI is given and shown to be almost identical to exact computation on a suite of examples. Most proofs are deferred to the online supplement.

First we provide the asymptotic distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ and $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$. Throughout we make the following assumptions.

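  • (A1)
    $L(X, Y, \beta)$ is convex with respect to $\beta$ for each fixed $(x, y) \in \mathbb{R}^p \times \{-1, 1\}$.
  • (A2)
    $Q(\beta) \triangleq P L(X, Y, \beta)$ exists and is finite for all $\beta \in \mathbb{R}^p$.
  • (A3)
    $\beta^* \triangleq \arg\min_{\beta \in \mathbb{R}^p} Q(\beta)$ exists and is unique.
  • (A4)
    Let $g(X, Y, \beta)$ be a sub-gradient of $L(X, Y, \beta)$. Then $P\|g(X, Y, \beta)\|^2 < \infty$ for all $\beta$ in a neighborhood of $\beta^*$.
  • (A5)
    $Q(\beta)$ is twice continuously differentiable at $\beta^*$ and $H \triangleq \nabla^2 Q(\beta^*)$ is positive definite.
  • (A6)
    $\lim_{n \to \infty} a_n = \infty$ but $a_n = o(n)$.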

These assumptions are quite mild and hold for most commonly used loss functions (e.g., exponential loss, squared error loss, hinge loss if $P$ has a smooth density at 1, logistic loss, etc.). Recall that a subgradient satisfies $L(x, y, \gamma) + (\beta - \gamma)^t g(x, y, \gamma) \le L(x, y, \beta)$ for all $(x, y) \in \mathbb{R}^p \times \{-1, 1\}$ and $\gamma, \beta \in \mathbb{R}^p$. All convex functions have a measurable subgradient. Let $\Omega$ be the covariance matrix of the sub-gradient of $L(x, y, \beta)$ at $\beta^*$. Under (A1)-(A5), Haberman (1989; see also Niemiro, 1992) proved that $\hat\beta_n$ converges with probability one to $\beta^*$ and $\sqrt{n}(\hat\beta_n - \beta^*)$ converges in distribution to $z \overset{\mathcal{L}}{=} N(0, H^{-1}\Omega H^{-1})$.

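Let $V$ be a Brownian bridge indexed by $\mathbb{R}^p$ with the variance-covariance function

$$\mathrm{Cov}(V(\phi), V(\gamma)) = P\Big(\big[1_{X^t\beta^* = 0}\, 1_{YX^t\phi < 0} - P 1_{X^t\beta^* = 0}\, 1_{YX^t\phi < 0}\big] \times \big[1_{X^t\beta^* = 0}\, 1_{YX^t\gamma < 0} - P 1_{X^t\beta^* = 0}\, 1_{YX^t\gamma < 0}\big]\Big). \tag{10}$$

Furthermore, let $B(\beta^*)$ denote a mean zero normal random variable with variance $P\big(1_{X^t\beta^* \neq 0}\, 1_{YX^t\beta^* < 0} - P 1_{X^t\beta^* \neq 0}\, 1_{YX^t\beta^* < 0}\big)^2$.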

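Theorem 3.1. Let $V$, $B(\beta^*)$ and $z$ be as above. Assume (A1)-(A6). Then

  1. $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0} \rightsquigarrow V(z) + B(\beta^*)$,
  2. $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) \rightsquigarrow \sup_{u \in \mathbb{R}^p} V(u) + B(\beta^*)$ and $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) \rightsquigarrow \inf_{u \in \mathbb{R}^p} V(u) + B(\beta^*)$.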

Note that the limiting distributions of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$, $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ and $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$ have the same regular component $B(\beta^*)$; the three limits differ only in the non-regular component. Note also that the form of the covariance function of $V$ given in (10) and the form of the limiting distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ (or $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$) show that if the margin condition $P(X^t\beta^* = 0) = 0$ holds, then $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) \rightsquigarrow B(\beta^*) \overset{\mathcal{L}}{=} \lim_{n \to \infty} \mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$, and similarly for $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$. That is, if the margin condition holds, the limiting distribution of the functional used to construct the ACI is the same as the limiting distribution of the functional $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$. From a practical point of view this means that for problems where the regular framework is applicable, for example, if the sample size is large or points are well separated from the boundary, the ACI is asymptotically exact.

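Another scenario in which the limiting distributions of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$, $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ and $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$ are the same is when the Bayes decision boundary is linear. In this case $q(x) = 1/2$ if $x^t\beta^* = 0$, where $q(x) = P(Y = 1 \mid X = x)$. (Here, we assume that the loss function is classification-calibrated (Bartlett 2005). All loss functions mentioned in this paper are classification-calibrated.) Then for any fixed $u \in \mathbb{R}^p$ we have

$$P 1_{X^t\beta^* = 0}\, 1_{YX^tu < 0} = \int_{\{x : x^t\beta^* = 0\}} \big[q(x) 1_{x^tu < 0} + (1 - q(x))(1 - 1_{x^tu < 0})\big]\, dP_X(x) = \int_{\{x : x^t\beta^* = 0\}} [2q(x) - 1]\, 1_{x^tu < 0}\, dP_X(x) + \tfrac{1}{2} P(X^t\beta^* = 0) = \tfrac{1}{2} P(X^t\beta^* = 0).$$

The form of the variance of $V$ and the above series of equalities show that if the Bayes decision boundary is linear then $V(u) \overset{\mathcal{L}}{=} N\big(0, \tfrac{1}{2}(1 - \tfrac{1}{2} P 1_{X^t\beta^* = 0})\, P 1_{X^t\beta^* = 0}\big)$ for all $u \in \mathbb{R}^p$. Therefore, if the Bayes decision boundary is linear,

$$\lim_{n \to \infty} u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) \overset{\mathcal{L}}{=} \sup_{u \in \mathbb{R}^p} V(u) + B(\beta^*) \overset{\mathcal{L}}{=} N\big(0, \tfrac{1}{2}(1 - \tfrac{1}{2} P 1_{X^t\beta^* = 0})\, P 1_{X^t\beta^* = 0}\big) + B(\beta^*) \overset{\mathcal{L}}{=} V(z) + B(\beta^*) \overset{\mathcal{L}}{=} \lim_{n \to \infty} \sqrt{n}(\mathbb{P}_n - P) 1_{YX^t\hat\beta_n < 0},$$

where the first and last equalities follow from Theorem 3.1, and the second and third equalities follow since $V$ is constant across all indices. We have proved the following result.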

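Corollary 3.2. Assuming (A1)-(A6) hold, then if either (i) the Bayes decision boundary is $\mathrm{sign}(X^t\beta^*)$ or (ii) $P(X^t\beta^* = 0) = 0$, then $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$, $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ and $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$ have the same limiting distribution.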

The implication of the above theorem and corollary is that when either of the above conditions holds, the ACI should provide the nominal coverage. When neither condition holds, the ACI may be conservative. In simulations we shall see that the degree of conservatism is small.

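The ACI in (9) utilizes a bootstrap approximation to the distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ and $l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$. The next theorem concerns the consistency of the bootstrap distributions. Let $\hat\Sigma_n$ be a weakly consistent estimator of $\Sigma$ (e.g. the plug-in estimator). Define $BL_1(\mathbb{R}^2)$ to be the space of bounded Lipschitz-1 functions on $\mathbb{R}^2$ and let $\mathbb{E}_M$ denote the expectation with respect to the bootstrap weights.

Theorem 3.3. Assume (A1)-(A6). Then $\{u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n), l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)\}$ and $\{u(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n), l(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)\}$ converge to the same limiting distribution in probability. That is,

$$\sup_{h \in BL_1(\mathbb{R}^2)} \Big| \mathbb{E}\, h\big(\{u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n), l(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)\}\big) - \mathbb{E}_M\, h\big(\{u(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n), l(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)\}\big) \Big|$$

converges in probability to zero.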

Thus the ACI provides asymptotically valid confidence intervals. Moreover we have the following.

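Corollary 3.4. Assuming (A1)-(A6) hold, then if either (i) the Bayes decision boundary is $\mathrm{sign}(X^t\beta^*)$ or (ii) $P(X^t\beta^* = 0) = 0$, then $u(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$, $l(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$ and $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$ converge to the same limiting distribution, in probability.

Thus, the ACI is also adaptive in the sense that in settings where the centered percentile bootstrap would be consistent, $u(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$, $l(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$ and $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0}$ have the same limiting distribution.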

3.3 Local Alternatives

In Section 2 we motivated the use of a non-regular asymptotic framework in order to gain intuition for small samples. An alternative strategy for developing intuition for non-regular problems is to study the limiting behavior of $\sqrt{n}(\hat\beta_n - \beta^*)$ under local alternatives. This strategy has roots in econometrics.

In econometrics, a common strategy for constructing procedures with good small sample properties in non-regular settings is to utilize alternatives local to the parameter values that cause the non-regularity (Andrews 2000; Cheng 2008; Xie 2009). To see this, recall that in small samples a non-negligible proportion of the inputs $x$ are in a $\sqrt{n}$-neighborhood of the decision boundary $x^t\beta^* = 0$, which causes the indicator function $1_{x^t\hat\beta_n < 0}$ to become unstable. In the prior sections we assumed that there was a non-null probability that an input lies exactly on the boundary in order to retain the instability of the indicator function even in large samples. Another way to maintain this instability is by considering local alternatives.

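The ACI can be seen as arising as an asymptotic approximation under local alternatives in the following way. Suppose that a training set $T_n = \{(X_{ni}, Y_{ni})\}_{i=1}^n$ is drawn iid from distribution $P_n$ for which

$$\beta_n^* \triangleq \arg\min_{\beta \in \mathbb{R}^p} P_n L(X, Y, \beta) = \beta^* + \frac{\Gamma}{\sqrt{n}} \tag{11}$$

for some $\Gamma \in \mathbb{R}^p \setminus \{0\}$. In addition, we assume that $P(X^t\beta^* = 0) > 0$ (while $P_n(X^t\beta_n^* = 0) > 0$ may or may not hold). A general tactic is to derive the limiting distribution of an estimator, which will depend on the local parameter $\Gamma$, and then take a supremum over this parameter to construct a confidence interval. As a first step in following this approach we might expect that

$$\mathbb{G}_n 1_{YX^t\hat\beta_n < 0} = \mathbb{G}_n 1_{X^t\beta^* = 0}\, 1_{YX^t[\sqrt{n}(\hat\beta_n - \beta_n^*) + \Gamma] < 0} + \mathbb{G}_n 1_{X^t\beta^* \neq 0}\, 1_{YX^t\hat\beta_n < 0} \rightsquigarrow V(z + \Gamma) + B(\beta^*)$$

under $P_n$. Note that $\sup_\Gamma \mathbb{G}_n 1_{X^t\beta^* = 0}\, 1_{YX^t[\sqrt{n}(\hat\beta_n - \beta_n^*) + \Gamma] < 0}$ is equal to the first term on the right hand side of (7). Hence, $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ is the supremum over all local alternatives of the form given in (11). Also taking the supremum over $\Gamma \in \mathbb{R}^p \setminus \{0\}$ we obtain

$$\sup_{\Gamma \in \mathbb{R}^p \setminus \{0\}} V(z + \Gamma) + B(\beta^*) \overset{\mathcal{L}}{=} \sup_{u \in \mathbb{R}^p} V(u) + B(\beta^*),$$

which is the limiting distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ (see Theorem 3.1). Thus, the ACI can be seen as arising as an asymptotic approximation under local alternatives. This result is formalized below.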

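Theorem 3.5. Assume that $T_n = \{(X_{ni}, Y_{ni})\}_{i=1}^n$ is drawn iid from distribution $P_n$ for which:

  • (B1)
    $\beta_n^* \triangleq \arg\min_{\beta \in \mathbb{R}^p} P_n L(X, Y, \beta) = \beta^* + \Gamma/\sqrt{n}$ for some $\Gamma \in \mathbb{R}^p \setminus \{0\}$,
  • (B2)
    if $\mathcal{F}$ is any uniformly bounded Donsker class and $\mathbb{G}_n \rightsquigarrow \mathbb{L}$ in $l^\infty(\mathcal{F})$ under $P$, then $\mathbb{G}_n \rightsquigarrow \mathbb{L}$ in $l^\infty(\mathcal{F})$ under $P_n$,
  • (B3)
    $\sqrt{n}(\hat\beta_n - \beta_n^*) = -H^{-1}\mathbb{G}_n g(X, Y, \beta^*) + o_{P_n}(1)$, where $\mathbb{G}_n \triangleq \sqrt{n}(\mathbb{P}_n - P_n)$.

Assume (A1)-(A6). Then:

    1. $\mathbb{G}_n 1_{YX^t\hat\beta_n < 0} \rightsquigarrow V(z + \Gamma) + B(\beta^*)$,
    2. $\lim_{n \to \infty} u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n) \overset{\mathcal{L}}{=} \sup_{\eta \in \mathbb{R}^p} V(z + \eta) + B(\beta^*) = \sup_{u \in \mathbb{R}^p} V(u) + B(\beta^*)$

under $P_n$.

Thus the limiting distribution of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ is unchanged under local alternatives and hence the ACI might be expected to perform well in small samples. A similar result can be proved for $l(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$. This result is underscored by the empirical results in Section 5.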

3.4 Choice of Tuning Parameter an

Use of the ACI requires the choice of the tuning parameter $a_n$. We use a simple heuristic for choosing the value of this parameter. The method described here performed well on all of the examples in Section 5. We begin with the presumption that undercoverage is a greater sin than conservatism. Recall that we can view the ACI as a two step procedure where at the first stage we test the null hypothesis $H_0 : x^t\beta^* = 0$ against a two-sided alternative. The test of $H_0$ used in constructing the ACI rejects when $(X^t\hat\beta_n)^2 / (X^t\Sigma X) > 1/a_n$. The form of $u(\mathbb{G}_n, \hat\beta_n, \Sigma, a_n)$ in (7) shows that $1/a_n$ too small (e.g. large Type I error) results in too few points being deemed “near the boundary.” Consequently the resulting interval may be too small since the supremum does not affect enough of the training points. Conversely, $1/a_n$ too large (e.g. large Type II error) puts too many points in the region of non-regularity, resulting in an interval that may be too wide because the supremum affects too many of the training points. Given our presumption, controlling Type I error is of primary importance. Let $\gamma \in (0, 1)$ and let $\chi^2_{1-\gamma}$ denote the $1 - \gamma$ quantile of the chi-squared distribution with one degree of freedom. Then let $\frac{1}{a_n} = \frac{1}{\sqrt{n}} \vee \frac{\chi^2_{1-\gamma}}{n}$, and we have for any $x \in \mathbb{R}^p \setminus \{0\}$ with $x^t\beta^* = 0$

$$P\left(\frac{(x^t\hat\beta_n)^2}{x^t\Sigma x} > \frac{1}{a_n} \,\Big|\, H_0\right) = P\left(\left(\frac{\sqrt{n}(\hat\beta_n - \beta^*)^t x}{\sqrt{x^t\Sigma x}}\right)^2 > \frac{n}{a_n}\right) \le \gamma.$$

Thus, the suggested $a_n$ controls the Type I error to be no more than $\gamma$. Moreover, it is clear from the above display that the Type I error decreases to zero as $n$ tends to infinity. In all of the experiments in this paper we choose, rather arbitrarily, to use $\gamma = .005$. Simulation results, given in Table 5 of the online supplement, show that the performance (measured in terms of width and coverage) of the ACI appears to be insensitive to choices of $\gamma$ in the range $.001$ to $.01$ for a sample size of around 30. For larger sample sizes, the choice of $\gamma$ is unimportant since $\sqrt{n} > \chi^2_{1-\gamma}$ except for extremely small values of $\gamma$.
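To make the heuristic concrete, the sketch below computes $1/a_n$ under the assumption that the rule reads $1/a_n = \max(1/\sqrt{n},\ \chi^2_{1-\gamma}/n)$, with $\chi^2_{1-\gamma}$ the $1-\gamma$ quantile of a chi-squared distribution with one degree of freedom. The quantile is obtained from the standard normal quantile, using the fact that the square of a standard normal variable is chi-squared with one degree of freedom. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_1_quantile(prob):
    """Quantile of the chi-squared distribution with 1 degree of freedom.

    If Z ~ N(0, 1) then Z^2 is chi-squared with 1 degree of freedom, so
    the prob-quantile equals the squared (1 + prob) / 2 normal quantile.
    """
    return NormalDist().inv_cdf((1.0 + prob) / 2.0) ** 2


def inverse_a_n(n, gamma):
    """Assumed rule: 1 / a_n = max(1 / sqrt(n), chi2_{1 - gamma} / n)."""
    return max(1.0 / math.sqrt(n), chi2_1_quantile(1.0 - gamma) / n)


for n in (30, 100, 1000):
    print(n, inverse_a_n(n, gamma=0.005))
```

With $\gamma = .005$ the chi-squared term is the larger one at $n = 30$, while at $n = 100$ the $1/\sqrt{n}$ term already dominates, which is consistent with the remark that $\gamma$ matters little for larger samples.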

Table 5
Comparison of computation time (in seconds) between ACI, Yang’s CV and Jiang’s BCCVP-BR for squared error loss. Examples where at least the nominal coverage was not attained are omitted.

4 Computation

To implement the ACI we need to calculate, for each bootstrap sample, the supremum and infimum in $u(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$ and $l(\mathbb{G}_n^{(b)}, \hat\beta_n^{(b)}, \hat\Sigma_n, a_n)$ respectively. The required optimization, as stated, is a Mixed Integer Program (MIP) because of the discrete nature of the indicator function. In this section, we develop a convex relaxation that can be solved in polynomial time. The details for the infimum are provided below; a similar approach is used to find the supremum by writing $1_{z < 0} = 1 - 1_{z \ge 0}$ and using the relationship $\sup_z g(z) = -\inf_z\{-g(z)\}$. Let $(m_{n1}, m_{n2}, \ldots, m_{nn})$ be a realization of the bootstrap weights $(M_{n1}, M_{n2}, \ldots, M_{nn}) \sim \mathrm{Multinomial}(n, 1/n, 1/n, \ldots, 1/n)$. For each such realization, construction of the infimum in the ACI requires computing

$$\inf_{u \in \mathbb{R}^p} \sum_{i \in N_n^{(b)}} (m_{ni} - 1)\, 1_{y_i x_i^t u < 0}, \tag{12}$$

where $N_n^{(b)} = \{i : (x_i^t\hat\beta_n^{(b)})^2 / (x_i^t\hat\Sigma_n x_i) \le 1/a_n\}$. In this form, the optimization is clearly seen to be an MIP. Reliably solving an MIP requires the use of specialized software (we use CPLEX) and quickly becomes computationally burdensome as the size of the problem grows. The following convex relaxation of (12) is (i) computationally efficient, requiring roughly the same amount of computation as fitting a linear SVM, and (ii) solvable without specialized software (e.g., using R or MATLAB).

As the initial step write

$$\sum_{i \in N_n^{(b)}} (m_{ni} - 1)\, 1_{y_i x_i^t u < 0} = \sum_{i \in N_n^{(b)}} m_{ni}\, 1_{y_i x_i^t u < 0} + \sum_{i \in N_n^{(b)}} \left(-1_{y_i x_i^t u < 0}\right).$$

Then replace the indicator function $1_{y_i x_i^t u < 0}$ with the convex surrogate and upper bound $(1 - y_i x_i^t u)_+$, where $(z)_+$ denotes the positive part of $z$. Similarly, replace the function $-1_{y_i x_i^t u < 0}$ with the convex surrogate and upper bound $(1 + y_i x_i^t u)_+ - 1$. The indicator functions and their respective surrogates are shown in Figure 2. The relaxed optimization problem is then

$$\inf_{u \in \mathbb{R}^p} \sum_{i \in N_n^{(b)}} \left[m_{ni}(1 - y_i x_i^t u)_+ + (1 + y_i x_i^t u)_+\right], \tag{13}$$

where the $-1$ in the relaxation of $-1_{y_i x_i^t u < 0}$ has been omitted since it does not depend on $u$. The optimization problem in (13) can be cast as a linear program and hence solved in polynomial time. See the next section for an empirical comparison of the relaxed and MIP solutions to (12).
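The surrogate bounds can be checked numerically. The sketch below evaluates the objective of (12) and the objective of (13) at randomly drawn vectors $u$ and verifies that the relaxed objective, after subtracting the omitted constant (one per point in the sum), is never below the exact objective. It does not solve either optimization problem; the toy data, the weights, and the function names are hypothetical, and every point is assumed to belong to the index set $N_n^{(b)}$.

```python
import random


def margins(X, y, u):
    """z_i = y_i x_i^t u for every point."""
    return [yi * sum(a * b for a, b in zip(xi, u)) for xi, yi in zip(X, y)]


def exact_objective(X, y, m, u):
    """Objective of (12): sum_i (m_i - 1) 1{z_i < 0}."""
    return sum((mi - 1) * (1 if z < 0 else 0)
               for mi, z in zip(m, margins(X, y, u)))


def relaxed_objective(X, y, m, u):
    """Objective of (13): sum_i [m_i (1 - z_i)_+ + (1 + z_i)_+]."""
    return sum(mi * max(0.0, 1 - z) + max(0.0, 1 + z)
               for mi, z in zip(m, margins(X, y, u)))


# Hypothetical near-boundary points with bootstrap weights m (summing to n).
X = [[1.0, 0.2], [1.0, -0.4], [1.0, 1.0]]
y = [1, -1, 1]
m = [2, 0, 1]

rng = random.Random(0)  # arbitrary seed
gaps = []
for _ in range(1000):
    u = [rng.uniform(-3, 3), rng.uniform(-3, 3)]
    # Subtract len(y) to restore the constant dropped from (13).
    gap = relaxed_objective(X, y, m, u) - len(y) - exact_objective(X, y, m, u)
    gaps.append(gap)

print(min(gaps))
```

Since (13) is a sum of positive parts of affine functions of $u$ with non-negative weights, introducing one slack variable per positive part turns it into a linear program that a general-purpose LP solver can handle.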

Figure 2
Relaxation of the indicator functions. Left panel: indicator function $1_{y x^t u < 0}$ replaced with convex surrogate $(1 - y x^t u)_+$. Right panel: negated indicator function $-1_{y x^t u < 0}$ replaced with convex surrogate $(1 + y x^t u)_+ - 1$.

5 Empirical study

In this section we compare solution quality between the relaxed and MIP solutions to (12); as will be seen, the relaxed solution to (12) can be computed much more quickly while little is lost in terms of solution quality. Next, using the relaxed solution to (12), the empirical performance of the ACI is compared with two recent methods proposed in the literature. Ten data sets are used in these comparisons; three are simulated and the remaining seven data sets are taken from the UCI machine learning repository (www.ics.uci.edu/~mlearn/MLRepository.html) and thus the true generative model is unknown. In this case, the empirical distribution function of the data set is treated as the generative model. A summary of the data sets is given in Table 2.

Table 2
Test data sets used to evaluate confidence interval performance. The last three columns record the average test error for a linear classifier trained using a training set of size n = 100 and loss function: squared error loss (SE), binomial deviance (BD), ...

To assess the difference in solution quality between the relaxed and MIP solutions to (12), we perform the following procedure for each of the 10 examples listed in Table 2. We generate 1000 training sets of size n = 30, and for each training set we compute 1000 bootstrap resamples. For each resample we compute (12) exactly using the MIP and approximately using the convex relaxation described above. Here we illustrate the results when the loss function used to construct $\hat{\beta}_n$ and $\hat{\beta}_n^{(b)}$ is chosen to be $L(X, Y, \gamma) = (1 - Y X^{\top}\gamma)^2$. Let $\theta_{\mathrm{MIP}}^{(t)(b)}$ and $\theta_{\mathrm{REL}}^{(t)(b)}$ denote the MIP and relaxed solutions to (12) for the bth bootstrap resample of the tth training set. Table 1 reports the 50th, 75th, 95th, and 99th percentiles of $\frac{1}{n}\bigl(\theta_{\mathrm{MIP}}^{(t)(b)} - \theta_{\mathrm{REL}}^{(t)(b)}\bigr)$ for each example. Notice that for each example we considered, the relaxed and MIP solutions agree exactly on more than half of the resampled pairs. Moreover, on more than 95 percent of the resampled pairs we observe that $\frac{1}{n}\bigl(\theta_{\mathrm{MIP}}^{(t)(b)} - \theta_{\mathrm{REL}}^{(t)(b)}\bigr) \le \frac{1}{n}$, implying that the two solutions differed by at most the activation of a single indicator function. Table 1 also reports the estimated coverage of confidence sets constructed using the MIP and relaxed formulations. For each of the 10 data sets, estimated coverage using the two methods is not significantly different. The final piece of information in Table 1 concerns computation time. The last two columns report the average time, in seconds, that it takes to construct a single confidence interval using the MIP and relaxed formulations. Computations were performed using a 3.06 GHz Intel processor with 4 GB of 1067 MHz DDR3 memory. It is clear that even in the n = 30 case a significant computational gain can be made by using the relaxed formulation, and this gain becomes more pronounced as sample size increases. Figure 3 compares the computation time for the ThreePt data set (this data set is described in Laber and Murphy 2009) as a function of sample size using squared error loss. As claimed, the computation time for the relaxed construction scales much more efficiently than that of the MIP formulation. In the examples presented in the next section we use the convex relaxation to compute the confidence interval.
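
The summary statistics described above (percentiles of the scaled gap and the fraction of resamples on which the two solutions agree) can be sketched as follows. This is only an illustration: the function name `summarize_gap`, its arguments, and the toy inputs are hypothetical, and the scaling factor is passed in explicitly rather than assumed.

```python
import numpy as np


def summarize_gap(theta_mip, theta_rel, scale, probs=(50, 75, 95, 99)):
    """Summarize the scaled gap between exact (MIP) and relaxed optima.

    theta_mip, theta_rel: hypothetical arrays holding one optimum per
    (training set, bootstrap resample) pair, in the same order.
    scale: multiplicative scaling applied to the gap (an assumption of
    this sketch; supply whatever scaling the analysis calls for).
    """
    gap = scale * (np.asarray(theta_mip, dtype=float)
                   - np.asarray(theta_rel, dtype=float))
    percentiles = {p: float(v) for p, v in zip(probs, np.percentile(gap, probs))}
    # Fraction of pairs on which the two formulations give the same optimum.
    exact_agreement = float(np.mean(np.isclose(gap, 0.0)))
    return {"percentiles": percentiles, "exact_agreement": exact_agreement}
```

Calling `summarize_gap(theta_mip, theta_rel, scale=1 / n)` on the pooled results for one data set would give one row of percentiles of the kind reported in Table 1.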

Figure 3
Computation time for MIP and relaxed construction of ACI using the ThreePt data set and squared error loss.
Table 1
Comparison of MIP and relaxed versions of the ACI. For each data set the table was constructed using 1000 training sets each with 1000 bootstrap iterations for a total of 1,000,000 computations of the optimization problem given in (12)

5.1 Competing methods

As competitors we consider a repeated-split normal approximation suggested by Yang (2006) and the recently proposed Bootstrap Case Cross-Validated Percentile with Bias Reduction (BCCVP-BR) method of Jiang et al. (2008). These methods represent the best we could find in terms of consistent coverage. Both methods substantially outperform standard approaches such as the bootstrap and the normal approximation, which are discussed in Section 2. To provide a baseline for comparison, the performance of the Centered Percentile Bootstrap (CPB) is included in the online supplement.

Briefly, Yang’s method repeatedly partitions the training data T into two equal halves, $T_L$ and $T_V$. A classifier is trained on $T_L$ and then evaluated on $T_V$. The mean and variance of the number of misclassified points in $T_V$ are recorded; these means and variances are then aggregated and used in a normal approximation. Jiang’s method can be roughly described as leave-one-out cross-validation with bootstrap resamples. However, since a bootstrap resample can contain multiple copies of a single training example, leave-one-out cross-validation would no longer have disjoint training and testing sets. Instead, for each unique training example $(x_i, y_i)$ the bootstrap resample is partitioned into two sets: V, which contains all copies of $(x_i, y_i)$, and L, which contains the remainder of the resample. The classifier is trained on L and evaluated on V. The average error over all sets V is recorded within each bootstrap resample, and the percentiles of these averages form the endpoints of a confidence interval. As a final step Jiang provides a bias correction. A full description of these methods can be found in the referenced works. While these methods are intuitive, they lack theoretical justification. Yang’s method was developed for use with a hold-out set; when such a hold-out set does not exist, the method is inconsistent. Jiang offers no justification other than intuition.
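
The two resampling schemes described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of Yang (2006) or Jiang et al. (2008): the classifier is an ordinary least-squares fit to labels in {-1, +1}, the way split-level means and variances are aggregated in `repeated_split_interval` is an assumed rule, the per-resample average in `bootstrap_loo_percentile_interval` is taken unweighted over the sets V, and the bias-correction step is omitted. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def fit_linear(X, y):
    """Least-squares linear classifier (squared error loss), labels in {-1, +1}."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta


def error_rate(beta, X, y):
    """Fraction of points whose predicted sign disagrees with the label."""
    Xb = np.column_stack([np.ones(len(X)), X])
    pred = np.where(Xb @ beta >= 0, 1, -1)
    return float(np.mean(pred != y))


def repeated_split_interval(X, y, n_splits=200, alpha=0.05, rng=None):
    """Repeated-split normal approximation; the aggregation rule is an assumption."""
    rng = np.random.default_rng(rng)
    n = len(y)
    half = n // 2
    n_valid = n - half
    means, variances = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        learn, valid = perm[:half], perm[half:]
        e = error_rate(fit_linear(X[learn], y[learn]), X[valid], y[valid])
        means.append(e)
        variances.append(e * (1.0 - e) / n_valid)  # assumed binomial-style variance
    center = float(np.mean(means))
    se = float(np.sqrt(np.mean(variances)))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return max(0.0, center - z * se), min(1.0, center + z * se)


def bootstrap_loo_percentile_interval(X, y, n_boot=200, alpha=0.05, rng=None):
    """Percentile interval from leave-one-example-out errors within each resample."""
    rng = np.random.default_rng(rng)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap resample (indices)
        fold_errs = []
        for i in np.unique(idx):
            in_v = idx == i                   # V: all copies of example i
            L, V = idx[~in_v], idx[in_v]      # L: remainder of the resample
            if len(L) == 0:
                continue
            fold_errs.append(error_rate(fit_linear(X[L], y[L]), X[V], y[V]))
        if fold_errs:
            stats.append(np.mean(fold_errs))  # unweighted average over sets V
    lo, hi = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)
```

Note that the second function refits the classifier once per distinct example in every resample, which is the source of its higher computational cost.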

5.2 Results

We examine the performance of the ACI and competing methods using the following three metrics: (i) coverage, (ii) interval width, and (iii) computational expense. These metrics are recorded using ten data sets, three sample sizes, and three loss functions. Three of the examples use simulated data sets and hence the test error can be computed exactly. The remaining seven data sets are taken from the UCI machine learning repository (www.ics.uci.edu/~mlearn/MLRepository.html) and thus the true generative model is unknown. In this case, the empirical distribution function of the data set is treated as the generative model. Results using squared error loss are listed here, while the results using binomial deviance and ridged hinge loss (support vector machines) are given in the online supplement. A summary of the data sets is given in Table 2.
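
One way to read the coverage metric when the empirical distribution of a data set plays the role of the generative model is sketched below. It is an illustration under assumptions rather than the authors' simulation code: training sets are drawn with replacement from the full data set, the "true" test error of each fitted classifier is its error rate on the full data set, and `interval_fn` is a hypothetical callable that returns a (lower, upper) pair from a training set. The least-squares classifier and all names are assumptions of this sketch.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear classifier (squared error loss), labels in {-1, +1}."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta


def error_rate(beta, X, y):
    """Fraction of points whose predicted sign disagrees with the label."""
    Xb = np.column_stack([np.ones(len(X)), X])
    pred = np.where(Xb @ beta >= 0, 1, -1)
    return float(np.mean(pred != y))


def estimate_coverage(X_all, y_all, interval_fn, n=30, n_train_sets=200, rng=None):
    """Monte Carlo coverage, treating (X_all, y_all) as the whole population."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(n_train_sets):
        # i.i.d. draws from the empirical distribution of the data set
        idx = rng.integers(0, len(y_all), size=n)
        X, y = X_all[idx], y_all[idx]
        beta = fit_linear(X, y)
        # Test error of this classifier under the empirical distribution
        true_err = error_rate(beta, X_all, y_all)
        lo, hi = interval_fn(X, y)
        hits += int(lo <= true_err <= hi)
    return hits / n_train_sets
```

Any interval constructor with the signature `interval_fn(X, y) -> (lo, hi)` can be plugged in, so the same loop can compare several methods on identical training sets by fixing `rng`.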

Coverage results for squared error loss are given in Table 3. The adaptive confidence interval is the only method to attain at least nominal coverage on all ten data sets. Yang’s method is either extremely conservative or anti-conservative. Jiang’s interval attains the nominal coverage on eight of ten data sets in the n = 30 case and nine of ten data sets for larger sample sizes. Table 4 shows the width of the constructed confidence intervals. When n = 30 the ACI is the narrowest for eight of the ten data sets. For larger sample sizes Jiang’s method and the ACI display comparable widths; Yang’s method is always the widest. Another important factor is computation time. Table 5 shows the average time, in seconds, required to construct a single confidence interval. All methods used 1000 resamples: 1000 bootstrap resamples for the ACI and Jiang’s method, and 1000 repeated splits for Yang’s method. Table 5 shows that Yang’s method is the most computationally efficient. However, it is also clear that Jiang’s method is significantly slower than the ACI for moderate sample sizes; for the Magic data set, Jiang’s method takes more than 30 times longer than the ACI. It is most important, however, to notice the trend in computation time across sample sizes. Computation time for Yang’s method and the ACI grows slowly with sample size, while the computational cost of Jiang’s method increases much more quickly. The reason is that Jiang’s method performs leave-one-out cross-validation for each bootstrap resample, thus increasing the computation time by a factor of n. Results for ridged hinge loss and binomial deviance loss are similar and can be found in the technical report (Laber and Murphy, 2010).
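
To make the factor-of-n growth concrete, the following short calculation uses a standard bootstrap fact (it is not taken from the paper). It assumes a leave-one-out scheme that refits the classifier once per distinct example in each resample, with B denoting the number of resamples.

```latex
\mathbb{E}\bigl[\#\{\text{distinct examples in a resample of size } n\}\bigr]
  = n\left(1 - \left(1 - \tfrac{1}{n}\right)^{n}\right)
  \approx \left(1 - e^{-1}\right) n \approx 0.632\,n .
% Example: for n = 30, 30\,(1 - (29/30)^{30}) \approx 19.15,
% so B = 1000 resamples require roughly 19{,}150 classifier fits,
% compared with B = 1000 fits for a scheme that fits once per resample.
```

The number of fits therefore grows linearly in n per resample, on top of the growth in the cost of each individual fit.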

Table 3
Coverage comparison between ACI, Yang’s CV, and Jiang’s BCCVP-BR for squared error loss; target coverage is .950. Coverage is starred if observed coverage is significantly different from .950 at the .01 level.
Table 4
Comparison of interval width between ACI, Yang’s CV, and Jiang’s BCCVP-BR for squared error loss. The smallest observed width is starred. Examples where at least the nominal coverage was not attained are omitted.

6 Discussion

Many statistical procedures in use today are justified by a combination of asymptotic approximations and high quality simulation performance. As exemplified here, the choice of asymptotic framework may be crucial in obtaining reliably good performance in small samples. In this paper, a non-regular asymptotic framework, in which the limiting distribution of the test error changes abruptly with changes in the true underlying data generating distribution, is used to develop a confidence interval. In particular, asymptotic non-regularity occurs due to the non-smooth test error in connection with particular combinations of β* values and the X distribution. It is common practice to “eliminate” this asymptotic non-regularity by assuming that these problematic combinations of β* values and the X distribution cannot occur. However, small samples are unable to discriminate precisely between settings that are close to the problematic β* values/X distribution and settings in which the β* values/X distribution are exactly problematic. As a result, asymptotic approximations that depend on assuming away these problematic settings can be of poor quality; this is the case here.

The validity of the adaptive confidence interval proposed here does not depend on assuming away problematic scenarios; instead the ACI detects and then accommodates settings that are sufficiently close to the problematic β* values/X distribution. In this sense the ACI adapts to the non-smoothness in the test error. Specifically, in settings in which standard asymptotic procedures fail, the ACI provides asymptotically valid, albeit conservative, confidence intervals. Moreover, the ACI delivers exact coverage if either (i) the model space is correct or (ii) a margin condition holds. Practically, this means that in a setting where standard asymptotic procedures (e.g. the bootstrap) are applicable, the ACI is asymptotically equivalent to these methods. Experimental performance of the ACI is also quite promising. On a suite of 10 examples, three loss functions and three classification algorithms, the ACI delivered nominal coverage. In addition, the ACI generally produced narrower intervals than competing methods. The ACI can be computed efficiently with algorithms scaling polynomially in dimension and sample size.

Two important extensions of the ACI are: first, to construct valid confidence intervals for the difference in test error between two linear classifiers and, second, to extend these ideas to the setting in which the number of features is comparable to or larger than the sample size. The former extension is straightforward and can be achieved by enlarging the set over which the supremum is taken in (7) to include the points on the classification boundaries of both classifiers. The latter is more difficult. In the estimation of classifiers in the p ≫ n setting, it is important to avoid overfitting. A typical approach to reducing overfitting is regularization, which effectively reduces the space of available classifiers. Similarly, the supremum in (7) must be taken over a restricted set of classifiers to prevent the interval from being unnecessarily wide. Extending the theory and computation to this setting is left to another paper.

Acknowledgments

This research is supported by NIH grants R01 MH080015 and P50 DA10075. The authors thank Min Qian, Zhigou Li, Diane Lambert, Kerby Shedden, and Vijay Nair for insightful comments and criticisms. In addition, the authors wish to thank the Editor and anonymous reviewers for criticisms and insights that made for a much better paper.

References

  • Anthony MM, Bartlett P. Learning in Neural Networks: Theoretical Foundations. Cambridge University Press; New York, NY, USA: 1999.
  • Bartlett P, Jordan M, McAuliffe J. Convexity, classification, and risk bounds. Journal of the American Statistical Association. 2006;101:138–156.
  • Bickel P, Klaassen A, Ritov Y, Wellner J. Efficient and adaptive inference in semi-parametric models. Johns Hopkins University Press; Baltimore: 1993.
  • Bose A, Chatterjee S. Tech. rep. Indian Statistical Institute; 2000. Generalized bootstrap for estimators of minimizers of convex functionals.
  • Bose A, Chatterjee S. Generalized bootstrap for estimators of minimizers of convex functions. Journal of Statistical Planning and Inference. 2003;117:225–239.
  • Cheng X. Robust Confidence Intervals in Nonlinear Regression Under Weak Identification. Job Market Paper. 2008
  • Chernick M, Murthy V, Nealy C. Application of Bootstrap and Other Resampling Techniques: Evaluation of Classifier Performance. Pattern Recognition Letters. 1985;3:167–178.
  • Chung H-C, Han C-P. Conditional confidence intervals for classification error rate. Computational Statistics and Data Analysis. 2009;53:4358–4369.
  • Andrews DWK. Testing when a parameter is on the Boundary of the Maintained Hypothesis. Econometrica. 2001;69:683–734.
  • Efron B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. Journal of the American Statistical Association. 1983;78:316–331.
  • Efron B, Tibshirani R. Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule. Stanford; 1995. Tech. Rep. 172.
  • Efron B, Tibshirani R. Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association. 1997;92:548–560.
  • Haberman S. Concavity and Estimation. Annals of Statistics. 1989;17:1631–1661.
  • Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, Springer Series in Statistics. Springer New York Inc.; New York, NY, USA: 2009.
  • Isaksson A, Wallman M, Göransson H, Gustafsson M. Cross-validation and bootstrapping are unreliable in small sample classification. Pattern Recognition Letters. 2008;29:1960–1965.
  • Jiang W, Varma S, Simon R. Calculating confidence intervals for prediction error in microarray classification using resampling. Statistical Applications in Genetics and Molecular Biology. 2008;7
  • Kohavi R. IJCAI. Morgan Kaufmann; 1995. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection; pp. 1137–1145.
  • Kosorok M. Introduction to empirical processes and semiparametric inference. Springer Verlag; 2008.
  • Krzanowski W, Hand D. Assessing Error Rate Estimators: The Leave-One-Out Method Reconsidered. PRL. 1985;3:167–178.
  • Laber EB, Murphy SA. Small Sample Inference for Generalization Error in Classification Using the CUD Bound; Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08); Corvallis, Oregon: AUAI Press; 2008. pp. 357–365.
  • Laber EB, Murphy SA. Adaptive Confidence Intervals for the Test Error in Classification. University of Michigan; 2009. Tech. Rep. 497.
  • Niemiro W. Asymptotics for M-Estimators defined by convex minimization. Annals of Statistics. 1992;20:1514–1533.
  • Schiavo RA, Hand D. Ten More Years of Error Rate Research. International Statistical Review. 2000;68:295–310.
  • Van der Vaart A, Wellner J. Weak convergence and empirical processes: with applications to statistics. Springer Verlag; 1996.
  • Xie M, Singh K, Zhang C-H. Confidence Intervals for Population Ranks in the Presence of Ties and Near Ties. Journal of the American Statistical Association. 2009;104:775–788.
  • Yang Y. Comparing Learning Methods for Classification. Statistica Sinica. 2006;16:635–657.
  • Zhang P. APE and Models for Categorical Panel Data. Scandinavian Journal of Statistics. 1995:83–94.