Am J Epidemiol. Aug 1, 2011; 174(3): 364–374.
Published online Jun 14, 2011. doi:  10.1093/aje/kwr086
PMCID: PMC3202159

Evaluating the Incremental Value of New Biomarkers With Integrated Discrimination Improvement

Abstract

The integrated discrimination improvement (IDI) index is a popular tool for evaluating the capacity of a marker to predict a binary outcome of interest. Recent reports have proposed that the IDI is more sensitive than other metrics for identifying useful predictive markers. In this article, the authors use simulated data sets and theoretical analysis to investigate the statistical properties of the IDI. The authors consider the common situation in which a risk model is fitted to a data set with and without the new, candidate predictor(s). Results demonstrate that the published method of estimating the standard error of an IDI estimate tends to underestimate the error. The z test proposed in the literature for IDI-based testing of a new biomarker is not valid, because the null distribution of the test statistic is not standard normal, even in large samples. If a test for the incremental value of a marker is desired, the authors recommend the test based on the model. For investigators who find the IDI to be a useful measure, bootstrap methods may offer a reasonable option for inference when evaluating new predictors, as long as the added predictive capacity is large.

Keywords: biological markers, bootstrap confidence interval, prediction, risk assessment, sampling distribution, sampling error, selection bias, type I error

Various metrics have been proposed for quantifying the predictive ability of a classification model or quantifying the incremental value of a new biomarker or predictor (1). The most common single-number summary of the ability of a classification tool to discriminate between cases and controls is the area under the receiver operating characteristic curve (AUC), also known as the c index. To quantify the incremental value of a new marker, one can use the improvement in the AUC when the marker is added to an existing classification model. However, the AUC has been widely criticized because it does not measure a clinically meaningful quantity (2, 3). There is also concern that the AUC is “insensitive” and does not demonstrate the value of new markers that are useful for prediction (2). Recently, several investigators proposed measures of incremental value that examine the extent to which a new marker reclassifies subjects (2, 4). However, such measures can be sensitive to arbitrary boundaries delineating discrete categories of risk (5).

Pencina et al. (4) proposed the integrated discrimination improvement (IDI) index as complementary to the AUC. The IDI is defined as

$$\mathrm{IDI} = (\mathrm{IS}_{\mathrm{new}} - \mathrm{IS}_{\mathrm{old}}) - (\mathrm{IP}_{\mathrm{new}} - \mathrm{IP}_{\mathrm{old}}) \qquad (1)$$

In this equation, IS is the integral of sensitivity over all possible cutoff values and IP is the corresponding integral of “1 minus specificity.” In equation 1, “new” refers to the classification model that includes the new biomarker and “old” refers to the classification model that does not. Pencina et al. (4) provide the following estimator for the IDI:

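$$\widehat{\mathrm{IDI}} = \left(\bar{\hat{p}}_{\mathrm{new,events}} - \bar{\hat{p}}_{\mathrm{old,events}}\right) - \left(\bar{\hat{p}}_{\mathrm{new,nonevents}} - \bar{\hat{p}}_{\mathrm{old,nonevents}}\right) \qquad (2)$$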

In equation 2, $\bar{\hat{p}}$ is an average of estimated probabilities of an event. One average is taken over the people in the sample who experienced events (“events”), and another is taken over those who did not experience an event (“nonevents”). In other words, events are cases and nonevents are controls. Use of the IDI can be motivated from multiple perspectives (3, 6–9). Perhaps the simplest motivation for the IDI is that a useful marker leads to increased estimated risks of disease for cases and decreased estimated risks for controls. If the new marker contributes to risk prediction, the first term of equation 2 will be large in the positive direction and the second term will be large in the negative direction; subtracting the second from the first produces a large IDI.
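As a minimal sketch of the estimator just described (not code from Pencina et al. or from this article), the function below averages the paired change in estimated risk separately over events and nonevents and subtracts the second average from the first. The function name `idi_hat`, its argument names, and the toy risks are illustrative assumptions.

```python
from statistics import mean

def idi_hat(p_old, p_new, d):
    """Estimate the IDI from old/new estimated risks and 0/1 outcomes d."""
    # Paired change in estimated risk for each subject (new minus old).
    diff = [pn - po for po, pn in zip(p_old, p_new)]
    events = [x for x, di in zip(diff, d) if di == 1]
    nonevents = [x for x, di in zip(diff, d) if di == 0]
    # Mean change among events minus mean change among nonevents.
    return mean(events) - mean(nonevents)

# Toy illustration with made-up risks (not data from the article).
d = [1, 1, 0, 0, 0]
p_old = [0.30, 0.20, 0.10, 0.20, 0.10]
p_new = [0.50, 0.30, 0.05, 0.10, 0.10]
print(idi_hat(p_old, p_new, d))
```

In the toy data the mean change is 0.15 among events and -0.05 among nonevents, so the estimate is about 0.20.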

Pencina et al. (4) give an example of using the IDI to evaluate the incremental value of a marker. Two regression models are fitted to a data set, with and without the new marker. Each regression model yields estimated risks of disease $\hat{p}$ for every individual, case and control, in the data set. The estimated risks from the 2 fitted models are averaged appropriately, and $\widehat{\mathrm{IDI}}$ is computed for the data set using equation 2. Although Pencina et al. (4) do not use logistic regression in their example, we expect this to be a common choice in practice, and we use logistic regression throughout most of this paper.

To test the null hypothesis that IDI = 0, Pencina et al. (4) provide the test statistic

$$z_{\mathrm{IDI}} = \frac{\widehat{\mathrm{IDI}}}{\sqrt{\widehat{\mathrm{SE}}_{\mathrm{events}}^{2} + \widehat{\mathrm{SE}}_{\mathrm{nonevents}}^{2}}} \qquad (3)$$

In equation 3, $\widehat{\mathrm{SE}}_{\mathrm{events}}$ is the standard error of paired differences of new and old model-based predicted probabilities among cases; $\widehat{\mathrm{SE}}_{\mathrm{nonevents}}$ is the corresponding standard error among controls. Pencina et al. (4) conjecture that $z_{\mathrm{IDI}}$ is asymptotically standard normal under the null hypothesis that the new biomarker does not contribute to prediction.
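The test statistic can be sketched as follows. This is an illustration under one assumption that the text does not spell out: the standard error of the paired differences in each group is taken to be their sample standard deviation divided by the square root of the group size. The name `z_idi` and the toy risks are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def z_idi(p_old, p_new, d):
    """z statistic for H0: IDI = 0, built from paired risk differences."""
    diff = [pn - po for po, pn in zip(p_old, p_new)]
    ev = [x for x, di in zip(diff, d) if di == 1]
    ne = [x for x, di in zip(diff, d) if di == 0]
    idi = mean(ev) - mean(ne)
    # Assumed form: SE of a mean of paired differences = SD / sqrt(group size).
    se_ev = stdev(ev) / sqrt(len(ev))
    se_ne = stdev(ne) / sqrt(len(ne))
    return idi / sqrt(se_ev ** 2 + se_ne ** 2)

# Toy illustration with made-up risks (not data from the article).
d = [1, 1, 0, 0, 0]
p_old = [0.30, 0.20, 0.10, 0.20, 0.10]
p_new = [0.50, 0.30, 0.05, 0.10, 0.10]
print(z_idi(p_old, p_new, d))
```

Note that the denominator treats the estimated risks as fixed numbers; it does not account for the fact that both risk models were fitted to the same data.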

Not all investigators agree that the IDI is a major improvement over the AUC as a measure of incremental value. Greenland (8) comments that the IDI, like the AUC, incorporates information that is irrelevant. That is, both measures summarize the entire receiver operating characteristic curve, including regions where false-positive or false-negative rates are unacceptable. Chi and Zhou (6) fault the IDI for putting equal weight on sensitivity and specificity, when the relative importance of sensitivity and specificity varies with the objective. Mihaescu et al. (10) comment that the IDI, like the AUC, is a measure of clinical validity rather than clinical utility. Without endorsing the AUC, we note that most researchers have enough experience with the AUC to interpret the measure and to know when an AUC value is “large.” It is not clear whether the same holds for the IDI. On the other hand, the IDI has become increasingly popular in predictive modeling research. In a scientific statement from the American Heart Association, Hlatky et al. noted that “the IDI test appears to be more powerful than the c index” for establishing that a new biomarker has positive incremental value (11, p. 2411). On February 17, 2011, 353 articles in the Science Citation Index referenced the article by Pencina et al. (4). Many of these authors used the IDI or the test statistic zIDI as supporting evidence in favor of a proposed biomarker.

In this article, we sidestep the debate on the inherent value of the IDI as a measure and focus instead on the statistical properties of the IDI. The popularity of the IDI warrants further investigation of its behavior, particularly in the common situation in which the “new” and “old” risk models are estimated using the same set of data. Pepe et al. (12) raised concerns that the denominator of equation 3 is an underestimate of the standard error of IDI^. We investigate this particular question, as well as the sampling distribution of IDI^. We provide empirical and theoretical evidence that IDI^ is approximately normal only for large values of the IDI. In particular, we show that the test statistic zIDI does not have a standard normal distribution under the null hypothesis that IDI = 0, and thus the test based on zIDI is not valid.

MATERIALS AND METHODS

We used both simulation and statistical theory to explore the sampling distribution of IDI^ and the null distribution of zIDI. Throughout this paper, we consider the behavior of IDI^ in the common situation where “old” and “new” nested risk models are fitted to the same data set.

Data simulation schemes

We employed multiple schemes for simulating data. We always use D to denote the binary variable indicating the outcome, that is, disease status. Y denotes established (“old”) predictors. Candidate (“new”) predictors are denoted with W, W1, or W2.

Logistic simulation models.

We simulated the log odds of disease according to a logistic risk model in which we think of age as the established predictor Y and cardiovascular disease as the outcome D. In our simplest simulation model, there is a single candidate predictor W:

An external file that holds a picture, illustration, etc.
Object name is amjepidkwr086fx5_ht.jpg
(4)

We also consider scenarios in which there are 2 candidate predictors W1 and W2:

An external file that holds a picture, illustration, etc.
Object name is amjepidkwr086fx6_ht.jpg
(5)

We simulated Y as N(65, 10) and independently simulated each of W, W1, and W2 as N(0, 1). These simulation parameters yield an event rate of approximately 5% when γ = 0 or γ1 = γ2 = 0. Using simulated data, we computed risks of disease using equation 4 or equation 5, and we simulated disease statuses from each risk independently using a Bernoulli distribution. If a γ parameter equals zero, then the corresponding W has no predictive value. If a γ parameter is not zero, then the corresponding W is predictive, although its incremental value depends, of course, on the magnitude of its coefficient.
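A simulation of this kind might look like the sketch below. The intercept and the coefficient of Y (`beta0`, `beta1`) are hypothetical values chosen here only so that the event rate is roughly 5% when γ = 0; they are not the values in equation 4. The sketch also assumes that N(65, 10) means a mean of 65 and a standard deviation of 10.

```python
import math
import random

def simulate_logistic(n, gamma, beta0=-8.0, beta1=0.075, seed=0):
    """Simulate (D, Y, W) triples from a logistic risk model.

    beta0 and beta1 are illustrative assumptions, not the article's values.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.gauss(65.0, 10.0)   # established predictor (assumed SD of 10)
        w = rng.gauss(0.0, 1.0)     # candidate predictor, independent of Y
        risk = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * y + gamma * w)))
        d = 1 if rng.random() < risk else 0   # Bernoulli draw from the risk
        data.append((d, y, w))
    return data

sim = simulate_logistic(20000, gamma=0.0)
print("event rate:", sum(row[0] for row in sim) / len(sim))
```

Setting `gamma=0.0` gives a candidate predictor with no predictive value; nonzero values make W predictive.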

Alternative logistic simulation model.

The alternative logistic simulation model was designed to mimic situations in which the established predictor is not very predictive. The simulation model is similar to equation 4:

An external file that holds a picture, illustration, etc.
Object name is amjepidkwr086fx7_ht.jpg
(6)

Y is randomly generated from a standard exponential distribution, and W is independently generated from a Poisson distribution with mean 4. As in the previous logistic simulation model, the prevalence is approximately 5% when γ = 0.

HIV simulation model.

In this simulation, we start with a real data set from a clinical trial for prevention of mother-to-child transmission of human immunodeficiency virus (HIV) (13). Table 1 gives basic descriptive statistics for this data set. Of the 1,882 deliveries by HIV-infected women recorded in this data set, 8% of the infants had a positive HIV test at birth. There is an established predictor Y, which is maternal viral load at 20–24 weeks of gestation. A higher viral load means that the mother has more copies of the virus circulating in her blood and is modestly predictive of whether she will transmit HIV to her child during pregnancy or delivery. We consider the mother's age as the candidate predictor. One would not expect the age of an HIV-infected pregnant woman to predict whether her infant will be born with HIV infection. In each simulated data set, we randomly permute mother's age, ensuring no predictive ability of the “new” predictor W.

Table 1.
Data From a Clinical Trial for Prevention of Mother-to-Child Transmission of Human Immunodeficiency Virus

Risk models

For simulated data sets, we fit logistic regression models to the data set with and without the “new” predictors. In other words, for a given simulated data set, we fit the model

$$\operatorname{logit} P(D = 1 \mid Y) = \beta_0 + \beta_1 Y \qquad (7)$$

to estimate risks of disease using only the established predictor Y. If there is a single candidate predictor, we fit the model

$$\operatorname{logit} P(D = 1 \mid Y, W) = \beta_0 + \beta_1 Y + \gamma W$$

to estimate the risks of disease using both the established and candidate predictors. We used the “new” and “old” estimated risks to compute IDI^ and zIDI.
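The fitting step can be sketched in plain Python as below. This is an illustration, not the authors' code: the Newton-Raphson fitter, the helper names (`solve`, `fit_logistic`, `risks`, `idi_hat`), and the simulation coefficients are all assumptions made for the example. In practice one would use a statistical package's logistic regression routine.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(X, d, iters=25):
    """Newton-Raphson logistic fit; each row of X starts with 1 for the intercept."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, di in zip(X, d):
            mu = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for j in range(p):
                grad[j] += (di - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def risks(X, beta):
    """Fitted risks for each row of X."""
    return [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi)))) for xi in X]

def idi_hat(p_old, p_new, d):
    """Mean risk change among events minus mean risk change among nonevents."""
    diff = [pn - po for po, pn in zip(p_old, p_new)]
    ev = [x for x, di in zip(diff, d) if di == 1]
    ne = [x for x, di in zip(diff, d) if di == 0]
    return sum(ev) / len(ev) - sum(ne) / len(ne)

# Simulated illustration with hypothetical coefficients (not those of the article).
rng = random.Random(2011)
n = 1500
y = [rng.gauss(0.0, 1.0) for _ in range(n)]   # standardized established predictor
w = [rng.gauss(0.0, 1.0) for _ in range(n)]   # candidate predictor
d = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(-3.0 + 0.7 * yi + 0.8 * wi))) else 0
     for yi, wi in zip(y, w)]

X_old = [[1.0, yi] for yi in y]                    # "old" model: Y only
X_new = [[1.0, yi, wi] for yi, wi in zip(y, w)]    # "new" model: Y and W
p_old = risks(X_old, fit_logistic(X_old, d))
p_new = risks(X_new, fit_logistic(X_new, d))
print("estimated IDI:", idi_hat(p_old, p_new, d))
```

Both models are fitted to the same data set, which is the situation studied in this paper.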

In the original IDI paper, Pencina et al. (4) proposed the IDI for comparing 2 nested models. In particular, the proposal does not limit the IDI to evaluating a single candidate marker. In fact, the IDI has been used to evaluate multiple markers as a set (14, 15). In a similar spirit, we used logistic simulations with 2 new predictors to compare equation 7 and the larger model

$$\operatorname{logit} P(D = 1 \mid Y, W_1, W_2) = \beta_0 + \beta_1 Y + \gamma_1 W_1 + \gamma_2 W_2 \qquad (8)$$

For simulations with a single candidate predictor W, we also consider the following 2 nested prediction models:

An external file that holds a picture, illustration, etc.
Object name is amjepidkwr086fx11_ht.jpg

and

An external file that holds a picture, illustration, etc.
Object name is amjepidkwr086fx12_ht.jpg
(9)

Thus, in the larger model there is a single candidate predictor but the 2 fitted models differ by 2 degrees of freedom (df). As before, we used the “new” and “old” estimated risks to compute IDI^ and zIDI.

RESULTS

The sampling distribution of IDI^: empirical results

Sampling distributions when IDI = 0.

Using the logistic simulation model and setting γ = 0, we simulated data sets with a useful predictor of disease and a candidate predictor of disease that has no predictive capacity. Similarly, we used the HIV simulation model to generate data sets in which a candidate predictor had no incremental value.

First, we investigated the accuracy of the standard error estimate used in equation 3. For a given sample size, we simulated 10,000 data sets using the logistic simulation model with γ = 0 and computed $\widehat{\mathrm{IDI}}$. The standard deviation of $\widehat{\mathrm{IDI}}$ across these 10,000 simulations estimates the standard error of $\widehat{\mathrm{IDI}}$ under the null hypothesis that IDI = 0. For each simulated data set, we also computed $\widehat{\mathrm{SE}}(\widehat{\mathrm{IDI}})$ using the formula in the denominator of equation 3. In Figure 1, we compare our empirical estimate of the standard error with the estimate used in equation 3 by dividing the latter by the former. We see that the standard error estimate in equation 3 is, on average, about half as large as it should be. The magnitude of the bias is (perhaps remarkably) stable as sample size increases. Our results confirm the suspicion of Pepe et al. (12) that the standard error estimate used in equation 3 is an underestimate of the standard error of $\widehat{\mathrm{IDI}}$.

Figure 1.
Bias in S^E(IDI^). For each sample size, we simulated 10,000 data sets using the logistic simulation model with γ = 0 and computed IDI^. The standard deviation of IDI^ across these 10,000 simulations estimates the null standard error (SE) of ...

We also investigated more fully the sampling distribution of IDI^ and zIDI when the null hypothesis, IDI = 0, is true. The top row of Figure 2 shows the results for the logistic simulation model (γ = 0). The bottom row shows the results for the HIV simulation model. Results are based on 10,000 simulations of data sets of size n = 1,500 for the logistic model and n = 1,882 for the HIV simulation model. The null distribution of IDI^ is highly nonsymmetric, with a long right tail. The strong positive skewness in the distribution results from the fact that the 2 components of IDI^ have a strong negative correlation. Pepe et al. (12) also pointed out that the IDI is equal to the proportion of explained variation, which is either always or predominantly positive, depending on the type of regression model. That is, adding a new variable to a set of predictors rarely decreases the proportion of explained variation (and never decreases the proportion of explained variation in linear regression). The null distribution of zIDI is more symmetric but is centered away from zero and is not standard normal. Other simulation models gave very similar results (data not shown).

Figure 2.
Null distribution of IDI^ and zIDI for 10,000 data sets simulated using the logistic model (top row; n = 1,500) and the human immunodeficiency virus simulation model (bottom row; n = 1,882). For zIDI, a standard normal density curve is given for reference. ...

We also studied the sampling distribution of IDI^ and zIDI for 2-df IDIs. For the logistic simulation model, the larger model is equation 8, and for the HIV simulation model, the larger model is equation 9. Results are shown in Figure 3. Compared with Figure 2, Figure 3 shows that IDI^ is more prominently skewed toward positive values and the distribution of zIDI is further shifted to the right in comparison with a standard normal curve.

Figure 3.
Null distribution of IDI^ and zIDI for 10,000 data sets simulated using the logistic model with 2 candidate predictors (top row; n = 1,500) and the human immunodeficiency virus simulation model (bottom row; n = 1,882). For zIDI, a standard normal density ...

False-positive rates.

We have seen that the null distribution of $z_{\mathrm{IDI}}$ is not standard normal (Figures 2 and 3). What is the implication for investigators attempting to use $z_{\mathrm{IDI}}$ to evaluate a new biomarker? We used the logistic simulation model with γ = 0 to investigate the type I error (false-positive) rate of the $z_{\mathrm{IDI}}$ test. Suppose an investigator uses $z_{\mathrm{IDI}}$ to conduct a 2-sided hypothesis test of H0: IDI = 0 for a single biomarker and a 1-df difference between the “new” and “old” predictive models. It turns out that the $z_{\mathrm{IDI}}$ test is slightly conservative. A nominal 5%-level test uses a cutoff of 1.96; the true size of the test is actually slightly smaller, approximately 3.9%.

The IDI is a measure of the improvement in prediction. As previously noted (14), a 2-sided hypothesis test is not appropriate when interest is in markers that improve prediction. If one uses an IDI-based hypothesis test to evaluate a new biomarker, an appropriate test is 1-sided—that is, H0: IDI = 0 vs. H1: IDI > 0. Performing the test by comparing zIDI with a standard normal distribution, the cutoff 1.96 nominally corresponds to a 2.5%-level 1-sided test. The actual type I or false-positive error rate is approximately 3.9%. An intended α level of 5% corresponds to an actual α level of approximately 9.3%.

We also considered the case in which 2 df separate the “new” and “old” predictive models. In this case, both 1-sided and 2-sided hypothesis tests are anticonservative, with higher false-positive rates than the nominal levels. Figure 4 illustrates the results described above.
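A Monte Carlo check of this kind can be sketched as below. To keep it fast and dependency-free, the sketch makes two simplifying assumptions that differ from the simulations reported here: risk scores are fitted by least squares rather than logistic regression, and the data are generated under the strong null in which neither Y nor W predicts D. The function names, sample size, prevalence, and number of replicates are illustrative choices.

```python
import random
from math import sqrt

def mean_var(v):
    """Sample mean and (n - 1)-denominator variance."""
    m = sum(v) / len(v)
    return m, sum((x - m) ** 2 for x in v) / (len(v) - 1)

def z_idi_linear(d, y, w):
    """z statistic when old (D ~ Y) and new (D ~ Y + W) scores are least-squares fits."""
    n = len(d)
    my, mw = sum(y) / n, sum(w) / n
    yc = [a - my for a in y]
    wc = [a - mw for a in w]
    slope = sum(a * b for a, b in zip(wc, yc)) / sum(b * b for b in yc)
    wt = [a - slope * b for a, b in zip(wc, yc)]     # W residualized on Y
    gamma = sum(a * di for a, di in zip(wt, d)) / sum(a * a for a in wt)
    diff = [gamma * a for a in wt]                   # new minus old fitted score
    m1, v1 = mean_var([x for x, di in zip(diff, d) if di == 1])
    m0, v0 = mean_var([x for x, di in zip(diff, d) if di == 0])
    n1 = sum(d)
    return (m1 - m0) / sqrt(v1 / n1 + v0 / (n - n1))

def null_rejection_rate(reps=1000, n=200, prevalence=0.2, cutoff=1.645, seed=0):
    """Share of null data sets in which the 1-sided z test rejects at the cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        w = [rng.gauss(0, 1) for _ in range(n)]
        d = [1 if rng.random() < prevalence else 0 for _ in range(n)]
        if z_idi_linear(d, y, w) > cutoff:
            rejections += 1
    return rejections / reps

rate = null_rejection_rate()
print("1-sided rejection rate at nominal 5%:", rate)
```

In this least-squares sketch the paired differences are all proportional to the residualized marker, so the statistic reduces to the absolute value of a two-sample (unequal-variance) statistic comparing that marker between events and nonevents. That is one way to see why its null distribution sits to the right of zero; the logistic-regression results reported in the text need not match this sketch numerically.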

Figure 4.
Estimated type I error rates of the zIDI test. (The identity line (- - -) is given for reference; points above the line represent increased false-positive rates.) The false-positive rate of the test is lower than the nominal α level for a 2-sided ...

Sampling distributions away from the null.

Using the logistic simulation model, we also simulated data sets where the new predictor W has some predictive value by choosing γ ≠ 0. We examined a range of values of γ. As before, we computed IDI^ for each simulated data set.

Figure 5 shows estimated sampling distributions of $\widehat{\mathrm{IDI}}$ for a range of γ values. (Results are shown in 2 plots because of the drastically different scales for the distributions for small and large γ.) For small values of γ, $\widehat{\mathrm{IDI}}$ has a severe right skewness, as we saw in Figure 2. For larger values of γ, $\widehat{\mathrm{IDI}}$ has a fairly symmetric distribution. To help interpret these results, Table 2 (first row of data) provides the average P value for the coefficient γ of W in the fitted logistic regression model. With γ = 0.4, W is a marginally significant predictor according to this metric.

Table 2.
Bootstrap Coverage for Logistic and Alternative Logistic Simulation Models
Figure 5.
Sampling distribution of IDI^. Data were simulated with the logistic model and γ = 0, 0.1, 0.2, 0.3 (upper panel) and γ = 0.4, 0.6, 0.8, 1, 1.2, 1.4 (lower panel). Curves are labeled with their γ values. Each density estimate was ...

The sampling distribution of IDI^: theoretical results

The extremely nonnormal empirical distribution of $\widehat{\mathrm{IDI}}$ is surprising, so we investigated the distribution analytically in a simplified scenario to help explain the simulation findings. The formulation of the IDI in equation 1 does not restrict how the risk models are to be fitted, so we examined the distribution that arose when the risk scores were fitted by linear regression. This would be an unusual choice in practice, but it is convenient here because it allows us to derive simple formulas for the risk scores. In contrast, logistic regression models are fitted to data using iterative algorithms, and there are no simple formulas for model parameters as a function of the data. However, since the computational algorithms for logistic regression use iterative weighted linear regression, we would expect the distribution of $\widehat{\mathrm{IDI}}$ based on linear regression to be a good guide to the distribution based on logistic regression (at least when prediction is weak). Our analytic results for linear regression explain both the asymmetric null distribution of $\widehat{\mathrm{IDI}}$ and the underestimation of its standard error.

Without loss of generality, the “old” model contains a single variable Y and the “new” model additionally includes a variable W that is independent of Y and has mean zero (otherwise, replace Y by Yβ and W by W − E[W|Y]). We show in the Appendix that

$$\widehat{\mathrm{IDI}} = \frac{\hat{\gamma}^{2}\,\mathrm{Var}[W]}{\rho(1-\rho)} \qquad (10)$$

where γ^ is the estimated coefficient of W in the fitted model and ρ is the prevalence of disease. Under the strong null hypothesis that both marker Y and marker W have no predictive value,

$$\widehat{\mathrm{IDI}} \sim \frac{1}{n}\,\chi^{2}_{1}$$

where n is the sample size. Under the more general null hypothesis that Y is a useful predictor, we need to know Var[D|Y], which we denote by σ². This is the scenario we are most interested in: the incremental value of W above and beyond an existing predictor Y. In this case, we have

$$\widehat{\mathrm{IDI}} \sim \frac{\sigma^{2}}{n\rho(1-\rho)}\,\chi^{2}_{1}$$
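A short worked consequence, offered as a sketch: suppose, as these results indicate, that under the null hypothesis the estimated IDI is distributed as a positive constant times a chi-squared variable with 1 df. The symbol $c$ for that scale constant is introduced here for illustration only. Standard moments of the chi-squared distribution then give:

```latex
\widehat{\mathrm{IDI}} \sim c\,\chi^{2}_{1}
\;\Longrightarrow\;
E\big[\widehat{\mathrm{IDI}}\big] = c, \qquad
\mathrm{SD}\big[\widehat{\mathrm{IDI}}\big] = \sqrt{2}\,c, \qquad
\text{skewness} = 2\sqrt{2} \approx 2.83, \qquad
P\big(\widehat{\mathrm{IDI}} > 0\big) = 1 .
```

So the estimate is positive on average even when the true IDI is zero, and its distribution is far from symmetric, in line with the long right tail seen in the simulations.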

Equation 10 allows us to use well-established results about parameter estimates in linear models to understand the distribution of $\widehat{\mathrm{IDI}}$. Under the alternative hypothesis (γ ≠ 0), $\hat{\gamma}^{2}$ has a noncentral chi-squared distribution with a noncentrality parameter increasing with n and γ. As the noncentrality parameter increases, the distribution gets closer to normal, but the normal approximation is only good in situations where the power of the test for γ = 0 is high. Since

$$\hat{\gamma}^{2} = \gamma^{2} + 2\gamma(\hat{\gamma} - \gamma) + (\hat{\gamma} - \gamma)^{2} \approx \gamma^{2} + 2\gamma(\hat{\gamma} - \gamma)$$

for large γ or n, the distribution will eventually be centered around γ² with a normal distribution and with variance proportional to γ².

Our simulation studies show that these results for the linear model hold approximately for logistic regression. In the first part of the Results section, we saw that the null distribution of IDI^ has a chi-square-shaped distribution and the sampling distribution appears approximately normal for IDI away from zero.
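The link between the estimated IDI and the squared coefficient can be checked numerically when the risk scores are fitted by least squares. The sketch below is an illustration under stated assumptions, not the authors' derivation: W is residualized on Y so that it has mean zero and is uncorrelated with Y in the sample, Var[W] is computed with divisor n, and ρ is the sample prevalence. With those conventions the directly computed estimate equals $\hat{\gamma}^{2}\,\mathrm{Var}[W] / (\rho(1-\rho))$ up to rounding.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def residualize(v, y):
    """Residuals of v after simple linear regression on y (with intercept)."""
    vc, yc = center(v), center(y)
    slope = sum(a * b for a, b in zip(vc, yc)) / sum(b * b for b in yc)
    return [a - slope * b for a, b in zip(vc, yc)]

rng = random.Random(1)
n = 500
y = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]
d = [1 if rng.random() < 0.2 else 0 for _ in range(n)]   # D unrelated to Y and W

w_t = residualize(w, y)   # mean zero and uncorrelated with Y in the sample
gamma_hat = sum(a * di for a, di in zip(w_t, d)) / sum(a * a for a in w_t)

# With least-squares fits, new minus old fitted score is gamma_hat * w_t.
diff = [gamma_hat * a for a in w_t]
n1 = sum(d)
n0 = n - n1
idi_direct = (sum(x for x, di in zip(diff, d) if di == 1) / n1
              - sum(x for x, di in zip(diff, d) if di == 0) / n0)

rho = n1 / n
var_w = sum(a * a for a in w_t) / n
idi_formula = gamma_hat ** 2 * var_w / (rho * (1 - rho))

print(idi_direct, idi_formula)
```

Because the right-hand side is a squared quantity, the least-squares estimate can never be negative, which is consistent with the strongly right-skewed null distribution described above.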

An interesting result applies to a scenario in which risk models have been estimated using a separate set of training data. If a case-control validation sample is taken to estimate the IDI using the existing (fixed) risk models, then the formula for estimating Var[IDI^] provided by Pencina et al. (4) turns out to be correct (Appendix).

Monte Carlo inference for the IDI

Bootstrapping is a popular method with which to make inferences about a parameter using an estimator whose sampling distribution is not well characterized. Unfortunately, bootstrapping is not always a reliable method for making inferences about the IDI. Figure 5 and Figure 6 show why. The sampling distribution of IDI^ changes rapidly in shape and scale as IDI approaches zero. The bootstrap estimates the sampling distribution of IDI^ under conditions as they exist in the sample. If the true IDI in the population is zero, then IDI^ in the sample will typically be positive, and the sampling distribution under conditions in the sample will be substantially different from the sampling distribution under the true, zero, IDI. The bootstrap distribution will be more symmetrical, more spread out, and shifted to the right compared with the true sampling distribution.
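A resampling-subjects bootstrap can be sketched as follows. The sketch rests on assumptions not taken from the text: risk scores are fitted by least squares (for speed) rather than logistic regression, the interval is a simple percentile interval, and the function names, sample size, and number of replicates are illustrative. Both models are refitted within every bootstrap sample.

```python
import random

def idi_linear(d, y, w):
    """IDI estimate when old (D ~ Y) and new (D ~ Y + W) scores are least-squares fits."""
    n = len(d)
    my, mw = sum(y) / n, sum(w) / n
    yc = [a - my for a in y]
    wc = [a - mw for a in w]
    slope = sum(a * b for a, b in zip(wc, yc)) / sum(b * b for b in yc)
    wt = [a - slope * b for a, b in zip(wc, yc)]     # W residualized on Y
    gamma = sum(a * di for a, di in zip(wt, d)) / sum(a * a for a in wt)
    n1 = sum(d)
    m1 = sum(a for a, di in zip(wt, d) if di == 1) / n1
    m0 = sum(a for a, di in zip(wt, d) if di == 0) / (n - n1)
    return gamma * (m1 - m0)

def bootstrap_ci(d, y, w, b=500, level=0.95, seed=0):
    """Percentile interval from resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(d)
    reps = []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(idi_linear([d[i] for i in idx],
                               [y[i] for i in idx],
                               [w[i] for i in idx]))
    reps.sort()
    lo = reps[int((1 - level) / 2 * b)]
    hi = reps[int((1 + level) / 2 * b) - 1]
    return lo, hi

# Null illustration: W has no predictive value, so the true IDI is zero.
rng = random.Random(42)
n = 400
y = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]
d = [1 if rng.random() < 0.2 else 0 for _ in range(n)]
lo, hi = bootstrap_ci(d, y, w)
print("percentile interval:", lo, hi)
```

With least-squares scores every bootstrap replicate is nonnegative, so the lower endpoint of a percentile interval cannot fall below zero. That illustrates, in the simplest setting, why bootstrap inference is unreliable when the true IDI is zero.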

Figure 6.
Illustration of the fact that the resampling-subjects bootstrap does not provide valid inference for small values of the integrated discrimination improvement (IDI) index. We simulated 1,000 data sets of size 1,500 using the logistic simulation model ...

The third and sixth rows in Table 2 show that the bootstrap has an anticonservative bias when the true incremental value of a marker is null. In particular, for the alternative logistic simulation model, the anticonservatism of the bootstrap was severe, with only 74.1% of nominal 95% bootstrap confidence intervals covering the true IDI value of zero. We obtained similar results when 2 new markers were simultaneously evaluated with the IDI, with unreliable, anticonservative inferences for small values of the IDI (Table 3).

Table 3.
Bootstrap Coverage in Logistic Simulations With 2 New Predictors

DISCUSSION

In this paper, we investigated IDI as a measure of the incremental value of a biomarker. In our simulation studies, the published formula for estimating the standard error of IDI^ tended to underestimate the true standard error by a factor of approximately 2. Moreover, the sampling distribution of IDI^ for a marker with no predictive value is strongly skewed toward positive values. We also considered testing the null hypothesis H0: IDI = 0. The null distribution of the proposed z statistic does not follow a standard normal distribution. For evaluating the incremental value of a single biomarker, 2-sided hypothesis testing using the z test is conservative. More appropriate 1-sided hypothesis testing is anticonservative, meaning that the IDI z test is prone to giving false-positive results.

Most of the empirical results we have presented involved fitting logistic regression models to data simulated under a logistic model. This is an idealized situation where the exactly correct model is fitted to the data and used to estimate risks and the IDI. The fact that the sampling distribution of $\widehat{\mathrm{IDI}}$ in such a highly idealized situation did not conform to the expectations set out in equation 3 does not bode well for its behavior with real data.

Our empirical and theoretical results indicate that a valid test of H0: IDI = 0 that is based on $\widehat{\mathrm{IDI}}$ will be very difficult to develop. However, the hypothesis H0: IDI = 0 is equivalent to H0: P(D|Y, W) = P(D|Y), where W is the candidate biomarker and Y is the set of existing predictors (12). This is fortunate, because it means that an IDI-based test is unnecessary. Therefore, if a test of positive incremental value is desired, we recommend using a test based on the model. For example, if a regression function is used for risk modeling, then the likelihood ratio test for the coefficient of W in the risk model can be used to test the null hypothesis H0: P(D|Y, W) = P(D|Y). The likelihood ratio test is implemented in all major statistical packages, can be applied to single markers or sets of markers, and is, asymptotically, the uniformly most powerful test.
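A likelihood ratio test for a single candidate marker can be sketched in plain Python as below. The fitter, helper names, and simulation coefficients are assumptions made for the example; a statistical package would normally be used instead. For 1 df, the upper tail of the chi-squared distribution equals `erfc(sqrt(x / 2))`, which avoids the need for any external library.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(X, d, iters=25):
    """Newton-Raphson logistic fit; each row of X starts with 1 for the intercept."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, di in zip(X, d):
            mu = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for j in range(p):
                grad[j] += (di - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def loglik(X, d, beta):
    """Bernoulli log-likelihood: sum of d * eta - log(1 + exp(eta))."""
    total = 0.0
    for xi, di in zip(X, d):
        eta = sum(b * x for b, x in zip(beta, xi))
        total += di * eta - math.log1p(math.exp(eta))
    return total

# Simulated illustration with hypothetical coefficients (not those of the article).
rng = random.Random(7)
n = 1500
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
d = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(-3.0 + 0.7 * yi + 0.4 * wi))) else 0
     for yi, wi in zip(y, w)]

X_old = [[1.0, yi] for yi in y]
X_new = [[1.0, yi, wi] for yi, wi in zip(y, w)]
ll_old = loglik(X_old, d, fit_logistic(X_old, d))
ll_new = loglik(X_new, d, fit_logistic(X_new, d))

lr = 2.0 * (ll_new - ll_old)                         # likelihood ratio statistic
p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))   # chi-squared upper tail, 1 df
print("LR statistic:", lr, "P value:", p_value)
```

For a set of k markers the same statistic would be referred to a chi-squared distribution with k df, which requires a general chi-squared tail function rather than the 1-df shortcut used here.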

In certain cases in practice, IDI-based tests of the predictiveness of a novel biomarker give small P values, whereas tests based on regression coefficients or the AUC are far from significant. For example, see Table III in the article by Criqui et al. (16), Table 2 in the article by Blankenberg et al. (17), or Table 3 in the article by Lin et al. (18). Since all tests evaluate the same null hypothesis, a tempting conclusion is that the IDI-based test is more powerful than the others (11). Unfortunately, the results in this paper lead to an alternate explanation, namely that IDI-based results are inconsistent with the other results because the test based on zIDI is not valid.

We remind readers that the value of hypothesis testing in evaluating new biomarkers is, at best, limited. The real challenge in biomarker research is to identify markers with a predictive capacity that is substantial enough to improve clinical practice. The motivation for the development of the IDI still stands: to find measures that quantify the incremental value in a meaningful way. For investigators who find the IDI to be a useful measure, bootstrapping to obtain confidence intervals may offer a reasonable option for inference, as long as the true IDI is well away from zero.

Acknowledgments

Author affiliations: Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington (Kathleen F. Kerr, Robyn L. McClelland, Elizabeth R. Brown, Thomas Lumley).

K. F. K. was supported by sabbatical funding from the University of Washington. E. R. B. was supported by National Institutes of Health grant R01 HL095126, and T. L. was supported by National Institutes of Health grant R01 HL080295.

Conflict of interest: none declared.

Glossary

Abbreviations

AUC: area under the receiver operating characteristic curve
HIV: human immunodeficiency virus
IDI: integrated discrimination improvement

APPENDIX

Derivation of Mathematical Results

Old and new models fitted to the same data

We consider adding a single new variable W that is fitted by linear regression. There is no loss of generality in assuming that the “old” model contains a single variable Y, so the “new” model contains W and Y. We can also assume that W has a mean value of zero and is uncorrelated with Y in the sample (otherwise, replace Y by Yβ and W by W − E[W|Y]).

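Now

$$\widehat{\mathrm{IDI}} = \frac{1}{n\rho}\sum_{i:\,D_i=1}\left(\hat{p}_{\mathrm{new},i} - \hat{p}_{\mathrm{old},i}\right) - \frac{1}{n(1-\rho)}\sum_{i:\,D_i=0}\left(\hat{p}_{\mathrm{new},i} - \hat{p}_{\mathrm{old},i}\right) = \frac{1}{n\rho(1-\rho)}\sum_{i=1}^{n} D_i\left(\hat{p}_{\mathrm{new},i} - \hat{p}_{\mathrm{old},i}\right)$$

The last equality holds because in any generalized linear model with an intercept, the residuals sum to zero.

Since

$$\hat{p}_{\mathrm{new},i} - \hat{p}_{\mathrm{old},i} = \hat{\gamma}\,W_i$$

we have

$$\widehat{\mathrm{IDI}} = \frac{\hat{\gamma}}{n\rho(1-\rho)}\sum_{i=1}^{n} D_i W_i$$

and, because $\hat{\gamma} = \sum_i D_i W_i / \sum_i W_i^{2}$ when W has mean zero and is uncorrelated with Y in the sample,

$$\widehat{\mathrm{IDI}} = \frac{\hat{\gamma}^{2}\,\mathrm{Var}[W]}{\rho(1-\rho)}$$

where $\hat{\gamma}$ is the coefficient of the proposed marker W in the “new” model.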

Under the strong null hypothesis that neither the “old” marker Y nor the “new” marker W is predictive of disease, we have, approximately, $\hat{\gamma} \sim N\left(0, \frac{\rho(1-\rho)}{n\,\mathrm{Var}[W]}\right)$.

Therefore,

$$\frac{n\,\mathrm{Var}[W]}{\rho(1-\rho)}\,\hat{\gamma}^{2} \sim \chi^{2}_{1}$$

and

$$\widehat{\mathrm{IDI}} \sim \frac{1}{n}\,\chi^{2}_{1}$$

Under the more general null hypothesis that Y is predictive but W is not, let σ² denote Var[D|Y]. We then have $\hat{\gamma} \sim N\left(0, \frac{\sigma^{2}}{n\,\mathrm{Var}[W]}\right)$ and

$$\widehat{\mathrm{IDI}} \sim \frac{\sigma^{2}}{n\rho(1-\rho)}\,\chi^{2}_{1}$$

Under the alternative hypothesis, $\hat{\gamma}^{2}$ has a noncentral chi-squared distribution with a noncentrality parameter increasing with n and γ. As the noncentrality parameter increases, the distribution gets closer to normal, as shown in Figure 5.

Bootstrap

We can explicitly demonstrate the failure of the bootstrap in the simplest case in which the models are linear and the “old” model is uninformative. The derivations above show that the estimated integrated discrimination improvement (IDI) then has a scaled noncentral chi-squared distribution with noncentrality parameter nγ²/2, that is, $\chi^{2}_{1}(\lambda = n\gamma^{2}/2)$.

A bootstrap sample is a sample from a population in which $\gamma = \hat{\gamma}$, where $\hat{\gamma}$ is the estimate in the original data sample. The distribution of statistics IDI* computed on the bootstrap samples will correctly estimate the sampling distribution of $\widehat{\mathrm{IDI}}$ when $\gamma = \hat{\gamma}$; that is, in large samples, conditional on $\hat{\gamma}$,

$$n \times \mathrm{IDI}^{*} \sim \chi^{2}_{1}(\lambda = n\hat{\gamma}^{2}/2)$$

When the new biomarker is uninformative, the sampling distribution of $n \times \widehat{\mathrm{IDI}}$ is a central chi-squared distribution, that is, $\chi^{2}_{1}(\lambda = 0)$, but the conditional sampling distribution of the bootstrap replicates IDI* is $n \times \mathrm{IDI}^{*} \sim \chi^{2}_{1}(\lambda = n\hat{\gamma}^{2}/2)$. Since $n\hat{\gamma}^{2}$ does not converge to zero, the bootstrap distribution does not converge to the sampling distribution. As Figure 6 shows, the bootstrap distribution of IDI* actually varies according to the sample value of $\widehat{\mathrm{IDI}}$.

If γ ≠ 0, however, the distribution of $\hat{\gamma}^2$ is asymptotically normal with mean $\gamma^2$ and variance proportional to 1/n and depending smoothly on γ. The sampling distribution of $\widehat{\mathrm{IDI}}$ is approximately normal with mean

[Equation image not reproduced: amjepidkwr086fx27_ht.jpg]

and variance proportional to 1/n and depending smoothly on γ.

The conditional distribution of the bootstrap replicates $\mathrm{IDI}^*$ in a sample with $\gamma = \hat{\gamma}$ will thus have mean

[Equation image not reproduced: amjepidkwr086fx28_ht.jpg]

which converges to the mean of $\widehat{\mathrm{IDI}}$, and since the variance depends smoothly on $\hat{\gamma}$, it will converge to the variance of $\widehat{\mathrm{IDI}}$. Thus, the bootstrap gives the correct sampling distribution for $\widehat{\mathrm{IDI}}$ in large samples when γ ≠ 0.

Conditioning on previously estimated risk scores

To prove the result at the end of the “Sampling Distribution of $\widehat{\mathrm{IDI}}$: Theoretical Results” section (see text), assume 2 differentiable functions $x \mapsto f_{\mathrm{old}}(x)$ and $(x, w) \mapsto f_{\mathrm{new}}(x, w)$. If we are conditioning on the test sample, these can be regarded as fixed functions. They produce a pair of random variables ($P = f_{\mathrm{old}}(Y)$, $Q = f_{\mathrm{new}}(Y, W)$), and we also have the outcome variable D. Because we are treating the 2 functions as fixed, the triples (P, Q, D) for each person are (conditionally) independent and identically distributed in the training sample.

The IDI is estimated by

[Equation image not reproduced: amjepidkwr086fx29_ht.jpg]

This is a Hadamard-differentiable function of the empirical cumulative distribution function of (P, Q, D), as long as the proportion of cases is bounded away from 0 and 1, so it is asymptotically normal and bootstrappable and is consistent for the value defined by applying the IDI() functional to the true distributions of P, Q, and D (19).

The asymptotic variance of the estimated IDI will depend only on the uncertainty in the numerators and so is the variance of

[Equation image not reproduced: amjepidkwr086fx30_ht.jpg]

Under prospective sampling, this is still larger than the formula given by Pencina et al. (4). However, under case-control sampling with prespecified numbers of cases and controls, the variance is the sum of variance contributions from the case (D = 1) and control (D = 0) strata; so under these circumstances, the asymptotic variance is

[Equation image not reproduced: amjepidkwr086fx31_ht.jpg]

Therefore, the variance formula presented by Pencina et al. (4) is correct if one develops the prediction models in a separate sample, fixes the “old” and “new” risk models to be those estimated from those samples, and then estimates the IDI in a separate case-control validation sample.
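As an illustration of this validation-sample setting, the sketch below estimates the IDI from fixed, previously estimated risk scores and computes a standard error as the square root of the summed case and control variance contributions. It assumes the usual form of the IDI estimate from Pencina et al. (4), namely the mean of Q − P among cases minus the mean of Q − P among controls; the function name `idi_fixed_scores` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def idi_fixed_scores(p_old, q_new, d):
    """IDI estimate and standard error for fixed risk scores in a validation sample.

    p_old : risks from the fixed "old" model (P)
    q_new : risks from the fixed "new" model (Q)
    d     : outcomes, 1 for cases and 0 for controls
    Assumes case-control sampling, so the variance is the sum of the
    contributions from the case stratum and the control stratum.
    Requires at least 2 cases and 2 controls.
    """
    diffs_cases = [q - p for p, q, di in zip(p_old, q_new, d) if di == 1]
    diffs_controls = [q - p for p, q, di in zip(p_old, q_new, d) if di == 0]
    idi = mean(diffs_cases) - mean(diffs_controls)
    se = sqrt(variance(diffs_cases) / len(diffs_cases)
              + variance(diffs_controls) / len(diffs_controls))
    return idi, se


if __name__ == "__main__":
    # Toy validation data (hypothetical values, for illustration only).
    p_old = [0.2, 0.4, 0.3, 0.1]
    q_new = [0.5, 0.7, 0.2, 0.1]
    d = [1, 1, 0, 0]
    idi, se = idi_fixed_scores(p_old, q_new, d)
    print("IDI estimate:", idi, "standard error:", se)
```

This calculation is appropriate only when the risk models are held fixed; when the models are refitted in the same data used to estimate the IDI, the preceding results indicate that this standard error is not valid.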

References

1. Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat. 2009;5(1):Article 27. (doi: 10.2202/1557-4679.1188)
2. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115(7):928–935.
3. Pepe MS, Janes HE. Gauging the performance of SNPs, biomarkers, and clinical factors for predicting risk of breast cancer. J Natl Cancer Inst. 2008;100(14):978–979.
4. Pencina MJ, D'Agostino RB, Sr, D'Agostino RB, Jr, et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157–172.
5. Dalton JE, Kattan MW. Recent advances in evaluating the prognostic value of a marker. Scand J Clin Lab Invest Suppl. 2010;242:59–62.
6. Chi YY, Zhou XH. The need for reorientation toward cost-effective prediction: comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):182–184.
7. Cook NR. Comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):191–195.
8. Greenland S. The need for reorientation toward cost-effective prediction: comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):199–206.
9. Van Calster B, Van Huffel S. Integrated discrimination improvement and probability-sensitive AUC variants. Stat Med. 2010;29(2):318–319.
10. Mihaescu R, van Zitteren M, van Hoek M, et al. Improvement of risk prediction by genomic profiling: reclassification measures versus the area under the receiver operating characteristic curve. Am J Epidemiol. 2010;172(3):353–361.
11. Hlatky MA, Greenland P, Arnett DK, et al. Criteria for evaluation of novel markers of cardiovascular risk: a scientific statement from the American Heart Association. American Heart Association Expert Panel on Subclinical Atherosclerotic Diseases and Emerging Risk Factors and the Stroke Council. Circulation. 2009;119(17):2408–2416.
12. Pepe MS, Feng Z, Gu JW. Comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):173–181.
13. Taha TE, Brown ER, Hoffman IF, et al. A phase III clinical trial of antibiotics to reduce chorioamnionitis-related perinatal HIV-1 transmission. AIDS. 2006;20(9):1313–1321.
14. Chao C, Song Y, Cook N, et al. The lack of utility of circulating biomarkers of inflammation and endothelial dysfunction for type 2 diabetes risk prediction among postmenopausal women: the Women's Health Initiative Observational Study. Arch Intern Med. 2010;170(17):1557–1565.
15. Sandholt CH, Sparsø T, Grarup N, et al. Combined analyses of 20 common obesity susceptibility variants. Diabetes. 2010;59(7):1667–1673.
16. Criqui MH, Ho LA, Denenberg JO, et al. Biomarkers in peripheral arterial disease patients and near- and longer-term mortality. J Vasc Surg. 2010;52(1):85–90.
17. Blankenberg S, Zeller T, Saarela O, et al. Contribution of 30 biomarkers to 10-year cardiovascular risk estimation in 2 population cohorts: the MONICA, Risk, Genetics, Archiving, and Monograph (MORGAM) Biomarker Project. Circulation. 2010;121(22):2388–2397.
18. Lin HJ, Lee BC, Ho YL, et al. Postprandial glucose improves the risk prediction of cardiovascular death beyond the metabolic syndrome in the nondiabetic population. Diabetes Care. 2009;32(9):1721–1726.
19. van der Vaart AW. Asymptotic Statistics. Cambridge, United Kingdom: Cambridge University Press; 1998. Functional delta method; pp. 291–303.

Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press