
# The Inconsistency of “Optimal” Cut-points Using Two ROC Based Criteria.

^{1}Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, National Institutes of Health, DHHS, Bethesda, MD.

^{2}Department of Mathematics and Statistics, American University, Washington, DC.

## Abstract

The use of biomarkers is of ever-increasing importance in the clinical diagnosis of disease. In practice, a cut-point is required to dichotomize naturally continuous biomarker levels, separating individuals at risk for disease from those who are not. Two methods commonly used for establishing the “optimal” cut-point are the point on the ROC curve closest to (0,1) and the Youden index, *J*. Both have sound intuitive interpretations (the point closest to perfect differentiation and the point farthest from none, respectively) and are generalizable to weighted sensitivity and specificity. Under the same weighting of sensitivity and specificity, they identify the same cut-point as “optimal” in certain situations and different cut-points in others. In this paper, we examine the situations in which the two criteria agree or disagree and show that *J* is the only “optimal” cut-point for a given weighting with respect to overall misclassification rates. A data-driven example is used to clarify and demonstrate the magnitude of the differences. We also demonstrate a slight alteration in the (0,1) criterion that retains its intuitive meaning while resulting in consistent agreement with *J*. In conclusion, we urge that great care be taken when establishing a biomarker cut-point for clinical use.

**Keywords:** Optimal cut-point, cutoff, ROC, Youden Index, Optimal Operating Point, area under the curve (AUC), partial area under the curve (pAUC), placenta growth factor (plgf), receiver operating characteristic (ROC), sensitivity (q(c)), specificity (p(c)), Youden index (J)

The proper diagnosis of disease and treatment administration is a task that requires a variety of tools. Through advancements in biology and laboratory methods, a multitude of biomarkers are available as clinical tools for such diagnosis. These biomarkers are usually measured on some continuous scale with overlapping levels for diseased and non-diseased individuals. Cut-points dichotomize biomarker levels, providing benchmarks that label individuals as diseased or not based on “positive” or “negative” test results. Biomarker levels of individuals with known disease status are used to evaluate potential cut-point choices and, ideally, to identify a cut-point that is “optimal” under some criterion.

Such a dataset would comprise biomarker levels for individuals classified as coming from the diseased ($D$) or non-diseased ($\overline{D}$) population. These levels could then be classified as a positive (+) or negative (−) test result based on whether the biomarker levels are above or below a given cut-point, respectively. In most instances, some individuals will be misclassified, thus truly belonging to a population other than the one indicated by their test results. The sensitivity (*q*(*c*)) and specificity (*p*(*c*)) of that biomarker for a given cut-point, *c*, are the probabilities of correctly identifying an individual’s disease status (i.e. true positives and true negatives),

$$q(c) = P(+ \mid D) \quad \text{and} \quad p(c) = P(- \mid \overline{D}),$$

making 1 minus these values the probabilities of incorrect classification: false negatives (1 − *q*(*c*)) and false positives (1 − *p*(*c*)).
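As a minimal sketch of these definitions, the empirical sensitivity and specificity at a cut-point can be computed from two samples of biomarker levels. The data below are invented for illustration, and treating a level equal to the cut-point as a positive test is an assumption (the text only says "above or below").

```python
def sensitivity_specificity(diseased, non_diseased, c):
    """Empirical q(c) and p(c), treating a level >= c as a positive test (assumed convention)."""
    q = sum(x >= c for x in diseased) / len(diseased)          # true positive rate
    p = sum(x < c for x in non_diseased) / len(non_diseased)   # true negative rate
    return q, p


# Hypothetical biomarker levels, invented purely for illustration.
diseased = [2.1, 3.4, 3.9, 4.6, 5.2]
non_diseased = [1.0, 1.8, 2.5, 3.1, 4.0]

q, p = sensitivity_specificity(diseased, non_diseased, c=3.0)
false_negative_rate = 1 - q   # 1 - q(c)
false_positive_rate = 1 - p   # 1 - p(c)
print(f"q(c) = {q:.2f}, p(c) = {p:.2f}")
```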

A receiver operating characteristic (ROC) curve is a mapping of this sensitivity by 1 minus specificity that has become a useful tool in comparing biomarker effectiveness (1–3). This comparison takes place through summary measures such as the area under the curve (*AUC*) and partial area under the curve (*pAUC*), with higher area values indicating higher levels of diagnostic ability (1, 2, 4). A biomarker with *AUC* = 1 differentiates perfectly between diseased (sensitivity = 1) and healthy (specificity = 1) individuals. Meanwhile, an *AUC* = 0.5 means that overall there is a 50:50 chance that the biomarker correctly identifies diseased or healthy individuals as such.
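The following sketch, under stated assumptions, traces an empirical ROC curve and computes its area with the trapezoid rule. The sample values and function names are hypothetical, and the trapezoid rule is one common choice for the empirical *AUC*, not necessarily the method used by the authors.

```python
def empirical_roc(diseased, non_diseased):
    """Return (1 - p(c), q(c)) pairs over all observed cut-points, starting at (0, 0)."""
    cuts = sorted(set(diseased) | set(non_diseased), reverse=True)
    points = [(0.0, 0.0)]
    for c in cuts:
        sens = sum(x >= c for x in diseased) / len(diseased)
        fpr = sum(x >= c for x in non_diseased) / len(non_diseased)  # 1 - specificity
        points.append((fpr, sens))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical biomarker levels, invented purely for illustration.
diseased = [2.1, 3.4, 3.9, 4.6, 5.2]
non_diseased = [1.0, 1.8, 2.5, 3.1, 4.0]

roc = empirical_roc(diseased, non_diseased)
auc = auc_trapezoid(roc)
print(f"AUC = {auc:.3f}")
```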

Though useful for biomarker evaluation, these measures do not inherently lead to benchmark “optimal” cut-points for clinicians and other healthcare professionals to differentiate between diseased and non-diseased individuals. Several methods have been proposed and applied to identify an “optimal” cut-point using sensitivity, specificity and the ROC curve (4–8). Confidence intervals and corrections for measurement error are some of the supporting statistical developments accompanying cut-point estimation (9). Applications of these techniques have been demonstrated in nuclear cardiology, epidemiology and genetics, to mention some examples (7, 10, 11). In the Criterion section, we describe two criteria for locating this cut-point that have similar intuitive justification. In describing the mathematical mechanisms behind the criteria, we demonstrate that one of the criteria retains the intended meaning, while the other inherently depends on quantities that may differ from an investigator’s intention. The Example section demonstrates how the two criteria identify different cut-points for the classification of 120 preeclampsia cases and 120 controls based on levels of plgf, a biomarker of angiogenesis, from a nested case-control study within the CPEP prospective cohort. Next, we discuss the appropriateness of the term “optimal” as it applies to each criterion. This is handled first with equally weighted sensitivity and specificity. Consideration of differing disease prevalence and costs due to misclassification is also presented as a practical generalization (5,12). We end with a brief discussion.

## CRITERION

### The closest to (0,1) criterion

If a biomarker perfectly differentiates individuals with disease from those without based on a single cut-point, *q*(*c*)=1 and *p*(*c*)=1, the ROC curve is a vertical line from (0,0) to (0,1) joined with a line from (0,1) to (1,1), with an *AUC* = 1. However, for a less than perfect biomarker, *q*(*c*)<1 and/or *p*(*c*)<1, the ROC curve does not touch the (0,1) point. Here the choice of an “optimal” cut-point is less straightforward. A criterion has been suggested and utilized where the point on the curve closest to (0,1) is identified and the corresponding cut-point is labeled “optimal” (6, 7). The rationale behind this approach is that the point on the curve closest to perfection, *q*(*c*)=1 and *p*(*c*)=1, should be the optimal cut-point from all the available cut-points, thus intuitively minimizing misclassification. Mathematically, the point $c^*$ that satisfies the equation

$$(1-q(c^*))^2 + (1-p(c^*))^2 = \min_c\left\{(1-q(c))^2 + (1-p(c))^2\right\} \qquad (1)$$

fulfills this criterion and is thus labeled the cut-point that best differentiates between diseased and non-diseased.

This criterion can be viewed as searching for the shortest radius originating at the (0,1) point and terminating on the ROC curve. Reference arcs can be used to visually compare radial distances, with the arc corresponding to $c^*$ being tangent to the ROC curve and thus the smallest of, and interior to, all possible concentric arcs. Figure 1 demonstrates this point: the dotted arc lies completely interior to, and thus closer to (0,1) than, the arc formed by the distance to an alternate point on the curve.
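The radius search described above can be sketched on sample data as follows. The data and the function name `closest_to_01` are hypothetical, and scanning only the observed levels as candidate cut-points is an assumption made for simplicity.

```python
import math


def closest_to_01(diseased, non_diseased):
    """Return (cut-point, distance) for the empirical ROC point nearest (0, 1)."""
    best_c, best_dist = None, math.inf
    for c in sorted(set(diseased) | set(non_diseased)):
        q = sum(x >= c for x in diseased) / len(diseased)          # sensitivity
        p = sum(x < c for x in non_diseased) / len(non_diseased)   # specificity
        dist = math.hypot(1 - q, 1 - p)   # radius from (0, 1) to (1 - p, q)
        if dist < best_dist:
            best_c, best_dist = c, dist
    return best_c, best_dist


# Hypothetical biomarker levels, invented purely for illustration.
diseased = [2.1, 3.4, 3.9, 4.6, 5.2]
non_diseased = [1.0, 1.8, 2.5, 3.1, 4.0]

c_star, dist_star = closest_to_01(diseased, non_diseased)
print(f"c* = {c_star}, distance = {dist_star:.3f}")
```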

### The Youden Index

Another measure for evaluating biomarker effectiveness is the Youden index (*J*), first introduced to the medical literature by Youden (13). *J* is also a function of *q*(*c*) and *p*(*c*), such that

$$J = \max_c\left\{q(c) + p(c) - 1\right\} \qquad (2)$$

over all cut-points *c*, with $c^*_J$ denoting the cut-point corresponding to *J*. On a ROC curve, *J* is the maximum vertical distance from the curve to the chance line or positive diagonal (figure 1), making $c^*_J$ the “optimal” cut-point (5,14). The intuitive interpretation of the Youden index is that *J* is the point on the curve farthest from chance. It has also been defined as the accuracy of the test in clinical epidemiology (15).
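A minimal sketch of the Youden index on sample data is given below. The data and the function name `youden_index` are hypothetical, and only observed levels are scanned as candidate cut-points.

```python
def youden_index(diseased, non_diseased):
    """Return (cut-point, J), where J is the largest value of q(c) + p(c) - 1."""
    best_c, best_j = None, float("-inf")
    for c in sorted(set(diseased) | set(non_diseased)):
        q = sum(x >= c for x in diseased) / len(diseased)          # sensitivity
        p = sum(x < c for x in non_diseased) / len(non_diseased)   # specificity
        j = q + p - 1   # vertical distance from the ROC point to the chance line
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j


# Hypothetical biomarker levels, invented purely for illustration.
diseased = [2.1, 3.4, 3.9, 4.6, 5.2]
non_diseased = [1.0, 1.8, 2.5, 3.1, 4.0]

c_j, j = youden_index(diseased, non_diseased)
print(f"c*_J = {c_j}, J = {j:.2f}")
```

In this toy sample sensitivity and specificity happen to be equal at the selected cut-point, which is the situation in which the closest-to-(0,1) criterion would select the same point.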

### Agreement/Disagreement

The criteria agree with respect to intuition: maximize the rate of individuals classified correctly and minimize the rate classified incorrectly. The question “Do they agree on the same ‘optimal’ cut-point?” now begs to be answered.

Suppose the biomarker of interest follows continuous distributions for both diseased and nondiseased populations that are known completely, leading to a true ROC curve. Our only distributional restriction is that a ROC curve is generated that is differentiable everywhere. This is intrinsic to the case where diseased and nondiseased individuals are assumed to follow any number of common continuous densities (e.g. normal, lognormal, gamma, etc.). Through differentiation, Appendix 1 shows that the two criteria only agree, $c^* = c^*_J = c$, when $q(c^*) = p(c^*)$ and $q(c^*_J) = p(c^*_J)$. When either criterion identifies a point on the curve such that $q(c^*) \neq p(c^*)$ or $q(c^*_J) \neq p(c^*_J)$, the criteria disagree on what cut-point is “optimal”, $c^* \neq c^*_J$.

An investigator with complete knowledge of a biomarker’s distributions could be faced with two different cut-points labeled “optimal” under the same intuition. Our motivation here is simply to show that they are different and address the appropriateness of the label “optimal” in a later section.
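To make the agreement and disagreement concrete, the sketch below evaluates both criteria on a fine grid for two hypothetical pairs of normal distributions. The parameters are invented, and the grid search is only a simple numerical stand-in for the differentiation in Appendix 1. With equal variances the ROC curve is symmetric and both criteria select the same cut-point; with unequal variances they do not.

```python
from statistics import NormalDist


def sensitivity(c, dist_d):
    return 1 - dist_d.cdf(c)    # q(c) = P(level >= c | diseased)


def specificity(c, dist_nd):
    return dist_nd.cdf(c)       # p(c) = P(level < c | non-diseased)


def cut_points(dist_d, dist_nd, lo=-5.0, hi=8.0, step=0.001):
    """Grid search for c* (closest to (0,1)) and c*_J (Youden)."""
    n = round((hi - lo) / step)
    grid = [lo + i * step for i in range(n + 1)]
    c_star = min(grid, key=lambda c: (1 - sensitivity(c, dist_d)) ** 2
                 + (1 - specificity(c, dist_nd)) ** 2)
    c_j = max(grid, key=lambda c: sensitivity(c, dist_d) + specificity(c, dist_nd) - 1)
    return c_star, c_j


# Hypothetical distributions: NormalDist(mean, standard deviation).
same_c_star, same_c_j = cut_points(NormalDist(1, 1), NormalDist(0, 1))
diff_c_star, diff_c_j = cut_points(NormalDist(1, 2), NormalDist(0, 1))

print(f"equal variances:   c* = {same_c_star:.3f}, c*_J = {same_c_j:.3f}")
print(f"unequal variances: c* = {diff_c_star:.3f}, c*_J = {diff_c_j:.3f}")
```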

## EXAMPLE

Preeclampsia affects about 5 percent of pregnancies, resulting in substantial maternal and neonatal morbidity and mortality (16). Although the cause remains unclear, the syndrome may be initiated by placental factors that enter the maternal circulation and cause endothelial dysfunction resulting in hypertension and proteinuria (16). Identifying women suffering from preeclampsia is a very important step in the management of the disease. Placenta growth factor (Plgf) is a promising biomarker for such classification with an *AUC* = 0.60 (95% CI 0.526–0.670); however, at what level would a woman be classified as at risk for the disease? For a nested case-control study, 120 preeclampsia cases and 120 normal controls were randomly chosen from the CPEP cohort study. Plgf levels were measured from serum specimens that were obtained before labor. The ROC curve in figure 2 was generated from the log-transformed plgf levels. After calculating the distance to (0,1) and the distance to the diagonal for each point, the cut-points $c^* = 4.64$ and $c^*_J = 4.12$ are identified, respectively. So, criteria with seemingly identical intuitive intent are close but disagree on an “optimal” cut-point. Again, here it is sufficient to demonstrate that disagreement exists. We will revisit this example after the question of “optimality” is addressed in the next section.

**...**

### “Optimality”

When attempting to classify individuals based on biomarker levels, it is always the intent to do so “optimally”. However, the event of interest may intrinsically involve constraints which must, for ethical or fiscal reasons, be considered. These constraints commonly account for the prevalence of the event in both populations and the costs of misclassification, both monetary and physiological. So, mathematical techniques of optimality must now operate within these constraints, but the idea of an “optimal” cut-point should remain: choosing a point which classifies the greatest number of individuals correctly and thus the fewest incorrectly.

First let us assume the simplest scenario absent of constraints or weighting. By definition, $c^*_J$ found by equation 2 succeeds ideologically by maximizing the overall rate of individuals classified correctly, $q(c^*_J) + p(c^*_J)$. As a result, the overall rate of misclassifications, $(1 - q(c^*_J)) + (1 - p(c^*_J))$, is minimized. So, we can say that $c^*_J$ is “optimal” with respect to the total correct and incorrect classification rate and any cut-point that deviates from it is not.

Under the same scenario, the closest to (0,1) criterion in equation 1 minimizes the total squared misclassification rates, quadratic terms for which an ideology does not seem to exist other than being geometrically intuitive. Equation 1 can be expanded and rewritten as

$$\min_c\left\{(1-q(c)) + (1-p(c)) + \frac{q(c)^2 + p(c)^2}{2}\right\} \qquad (3)$$

to show that this criterion is minimizing the total of the misclassification rates and a third term, the average of squared correct classification rates. Unless a specific justification for this third term exists, its usage results in unwarranted and thus unnecessary misclassification because it identifies a point $c^* \neq c^*_J$.
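The expansion can be checked numerically. The sketch below assumes the squared-distance form of the (0,1) criterion and verifies, over a grid of sensitivity and specificity values, that it equals twice the three-term expression minus a constant, so both are minimized at the same cut-point. The function names are illustrative only.

```python
def squared_distance(q, p):
    """Squared distance from the ROC point (1 - p, q) to (0, 1)."""
    return (1 - q) ** 2 + (1 - p) ** 2


def expanded_form(q, p):
    """Total misclassification rate plus the average of squared correct rates."""
    return (1 - q) + (1 - p) + (q ** 2 + p ** 2) / 2


# Largest discrepancy from the identity: squared_distance = 2 * expanded_form - 2
max_gap = max(
    abs(squared_distance(i / 20, k / 20) - (2 * expanded_form(i / 20, k / 20) - 2))
    for i in range(21)
    for k in range(21)
)
print(f"largest discrepancy on the grid: {max_gap:.2e}")
```

Because the two expressions differ only by a positive scale factor and an additive constant, the quadratic third term is the only thing separating this criterion from the plain misclassification total.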

Now, let us consider the circumstance where cost and prevalence are thought to be factors, as they usually are in practice. Using decision theory, a generalized *J* can be formed where these factors are represented as a weighting of sensitivity and specificity. The expected loss in classifying a subject can be written as

$$a\,\pi\,(1-q(c)) + (1-\pi)\,(1-p(c)),$$

where ‘*a*’ denotes the relative loss (cost) of a false negative as compared with a false positive and π is the proportion of diseased individuals in the population of interest (prevalence) (17, 18). It is easy to see that minimizing this expected loss over all possible threshold values is the same as

$$\max_c\left\{q(c) + r\,p(c) - 1\right\}, \qquad (4)$$

where $r = \frac{1-\pi}{a\pi}$. For *r* = 1 this is equivalent to *J*.
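As a hedged illustration of the weighting, the sketch below assumes the expected loss takes the form a·π·(1 − q(c)) + (1 − π)·(1 − p(c)), which matches the description of *a* and π above, and checks on a grid that minimizing it selects the same cut-point as maximizing q(c) + r·p(c) − 1. The distributions, cost ratio and prevalence are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical biomarker distributions: NormalDist(mean, standard deviation).
dist_d, dist_nd = NormalDist(1, 1), NormalDist(0, 1)


def q(c):
    return 1 - dist_d.cdf(c)    # sensitivity


def p(c):
    return dist_nd.cdf(c)       # specificity


def weighted_cut_points(a, prevalence, grid):
    """Return r, the generalized-J cut-point, and the minimum-expected-loss cut-point."""
    r = (1 - prevalence) / (a * prevalence)
    c_r = max(grid, key=lambda c: q(c) + r * p(c) - 1)
    c_loss = min(grid, key=lambda c: a * prevalence * (1 - q(c))
                 + (1 - prevalence) * (1 - p(c)))
    return r, c_r, c_loss


grid = [-5 + i * 0.001 for i in range(13001)]
r, c_r, c_loss = weighted_cut_points(a=2.0, prevalence=0.2, grid=grid)
print(f"r = {r:.2f}, generalized-J cut-point = {c_r:.3f}, min-loss cut-point = {c_loss:.3f}")
```

For these two unit-variance normals the density ratio at *c* is exp(*c* − 0.5), so the weighted optimum sits at 0.5 + ln *r*, which the grid search reproduces.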

Weighting of the (0,1) criterion occurs similarly,

$$\min_c\left\{(1-q(c))^2 + r\,(1-p(c))^2\right\},$$

where *r* is exactly the same weighting estimate for cost and prevalence. The issue of the quadratic term remains,

$$\min_c\left\{(1-q(c)) + r\,(1-p(c)) + \frac{q(c)^2 + r\,p(c)^2}{2}\right\},$$

only now it is weighted and unnecessary. Comparing this equation to equation 4, it is easy to see that this does not minimize the loss due to misclassification.

### Example Revisited

To demonstrate this unnecessary misclassification and its possible magnitude, we revisit the example of plgf levels used to identify preeclamptic women from those without the disease. Sensitivity and specificity at the cut-points previously identified are $q(c^*) = 0.592$, $p(c^*) = 0.558$ and $q(c^*_J) = 0.817$, $p(c^*_J) = 0.358$, respectively. The overall correct classification rate (*q* + *p*) is 1.150 for $c^*$ and 1.175 for $c^*_J$ out of a possible 2, with a difference of 0.025. Without the justification for the third term in equation 3 and without weighting, this difference can be thought of as one person out of a hundred being unnecessarily misclassified. Relative cost and disease prevalence are often difficult to assess, as discussed by Greiner et al. (18) and the references cited therein. So we will not attempt to adjust in this example.
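The arithmetic in this paragraph can be reproduced directly from the four reported rates. The short sketch below uses only the sensitivity and specificity values quoted above; the variable names are illustrative.

```python
import math

# Reported rates at the two cut-points from the plgf example.
q_star, p_star = 0.592, 0.558   # closest-to-(0,1) cut-point, c* = 4.64
q_j, p_j = 0.817, 0.358         # Youden cut-point, c*_J = 4.12

correct_star = q_star + p_star          # overall correct classification rate at c*
correct_j = q_j + p_j                   # overall correct classification rate at c*_J
difference = correct_j - correct_star

# Distance to (0,1): smaller at c*, which is why that criterion selects it.
dist_star = math.hypot(1 - q_star, 1 - p_star)
dist_j = math.hypot(1 - q_j, 1 - p_j)

print(f"q + p at c*:   {correct_star:.3f}")
print(f"q + p at c*_J: {correct_j:.3f}")
print(f"difference:    {difference:.3f}")
print(f"distance to (0,1): {dist_star:.3f} at c*, {dist_j:.3f} at c*_J")
```

The point closest to (0,1) has the shorter radius, while the Youden point has the higher total of correct classification rates, so each cut-point wins under its own criterion.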

## DISCUSSION

In this paper, we demonstrated the intuitive similarity of two criteria used to choose an “optimal” cut-point. We then showed that the criteria agree in some instances and disagree in others. Plgf levels used to classify women as preeclamptic or not were used to demonstrate this point and quantify the extent of disagreement.

We addressed both criteria in the context of what an investigator might view as “optimal”, with and without attention given to misclassification cost and prevalence. Mathematically, *J* reflects the intention of maximizing overall correct classification rates and thus minimizing misclassification rates, while choosing the point closest to (0,1) involves a quadratic term for which the clinical meaning is unknown. It is for this reason that we advocate the use of *J* to find the “optimal” cut-point.

Since the (0,1) criterion is visually intuitive, we have provided an amended (0,1) criterion in Appendix 2 that is likewise geometrically satisfying while consistently identifying the same “optimal” cut-point as *J*. This criterion relies on a ratio of radii originating at (0,1).

Additional motivation for using *J* is an ever-increasing body of supporting literature. Topics such as confidence intervals and correcting the estimate for measurement error have been considered for *J*, whereas the (0,1) criterion lacks such support.

Most importantly, cut-points chosen through less than “optimal” criteria or criteria that are “optimal” in some arbitrary sense can lead to unnecessary misclassifications, resulting in needlessly missed opportunities for disease diagnosis and intervention. We showed that *J* is “optimal” when equal weight is given to sensitivity and specificity, *r* = 1, and a generalized *J* is “optimal” when cost and prevalence lead to weighted sensitivity and specificity, *r* ≠ 1. So, when the point closest to (0,1) differs from the point resulting in *J*, using this criterion to establish an “optimal” cut-point introduces an increased rate of misclassification unnecessarily.

## Acknowledgments

This research was supported by the Intramural Research Program of the NIH, Epidemiology Branch, DESPR, NICHD.

## Appendix 1

For continuous ROC curves we make no distributional assumptions beyond that the probability density functions $f_D$ and $f_{\overline{D}}$, for biomarker levels of diseased and non-diseased individuals respectively, form a ROC curve that is differentiable everywhere. This is the case when $f_D$ and $f_{\overline{D}}$ are assumed to be any common continuous parametric distributions (e.g. normal, gamma, lognormal).

In order to locate the cut-points that minimize and maximize in equations 1 and 2, respectively, it is first necessary to locate critical values. So, differentiating equation 1,

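$$\frac{\partial}{\partial c}\left[(1-q(c))^2 + (1-p(c))^2\right] = -2\,q'(c)\,(1-q(c)) - 2\,p'(c)\,(1-p(c)). \qquad (A1.1)$$

Then set the derivative equal to zero,

$$-\frac{q'(c)}{p'(c)} = \frac{1-p(c)}{1-q(c)}. \qquad (A1.2)$$

Now, we differentiate the second criterion,

$$\frac{\partial}{\partial c}\left[q(c) + p(c) - 1\right] = q'(c) + p'(c), \qquad (A1.3)$$

and then setting equal to zero

$$-\frac{q'(c)}{p'(c)} = 1. \qquad (A1.4)$$

The forms of both A1.2 and A1.4 define the critical points of the criteria in equations 1 and 2, respectively, by the slopes of their corresponding points on the ROC curve, where $-q'(c)/p'(c)$ is the slope of the ROC curve at the point corresponding to *c*. Since these solutions are not necessarily unique, multiple solutions may exist, i.e. local maxima or minima. Therefore, all solutions and endpoints must be evaluated so that $c^*$ and $c^*_J$ are global solutions.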
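A numerical sanity check of these slope conditions is sketched below for a hypothetical pair of normal distributions (the parameters are invented). It assumes q'(c) = −f_D(c) and p'(c) = f_D̄(c), so the ROC slope at *c* is the density ratio, and it locates both cut-points by grid search rather than by solving the equations analytically.

```python
from statistics import NormalDist

# Hypothetical distributions with unequal variances: NormalDist(mean, standard deviation).
dist_d, dist_nd = NormalDist(1, 2), NormalDist(0, 1)


def q(c):
    return 1 - dist_d.cdf(c)    # sensitivity


def p(c):
    return dist_nd.cdf(c)       # specificity


def roc_slope(c):
    """Slope of the ROC curve at cut-point c: -q'(c) / p'(c) = f_D(c) / f_Dbar(c)."""
    return dist_d.pdf(c) / dist_nd.pdf(c)


grid = [-5 + i * 0.001 for i in range(13001)]
c_star = min(grid, key=lambda c: (1 - q(c)) ** 2 + (1 - p(c)) ** 2)
c_j = max(grid, key=lambda c: q(c) + p(c) - 1)

slope_at_c_j = roc_slope(c_j)                          # expected to be close to 1
slope_at_c_star = roc_slope(c_star)
target_at_c_star = (1 - p(c_star)) / (1 - q(c_star))   # expected to match the slope at c*

print(f"slope at c*_J = {slope_at_c_j:.3f}")
print(f"slope at c*   = {slope_at_c_star:.3f}, (1 - p)/(1 - q) = {target_at_c_star:.3f}")
```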

Equations A1.2 and A1.4 show us that the (0,1) and *J* methods agree, $c^* = c^*_J = c$, only when $q(c^*) = p(c^*)$ and thus $(1 - p(c^*))/(1 - q(c^*)) = 1$. When $q(c^*) \neq p(c^*)$, the criteria disagree on what point is optimal, $c^* \neq c^*_J$. We will discuss which criterion might be “optimal” later, but for now our motivation is simply to show that they are different.

## Appendix 2

Equation 1 identifies the point closest to perfection, irrespective of the possibilities of imperfection. In other words, this criterion minimizes the distance from (0,1) to the curve but fails to take into account the distance to the chance line, that is, it does not weight the distances in equation 1. If instead we minimize the ratio of the smaller radius ($r_2$) to the larger ($r_1$) displayed in figure 3, such that

$$\min_c\left\{\frac{r_2}{r_1}\right\} = \min_c\left\{1 - d\right\}, \qquad (A2.1)$$

where $d = q(c) - (1 - p(c))$ is the height of the curve above the diagonal or chance line, we obtain a weighted criterion.

The relation in equation A2.1 can be derived algebraically or by using proportionality of the triangles in figure 3 such that $\frac{{r}_{2}}{{r}_{1}}=\frac{{r}_{1}-({r}_{1}-{r}_{2})}{{r}_{1}}=1-\frac{({r}_{1}-{r}_{2})}{{r}_{1}}=1-\frac{d}{1}$. Figure 3 also provides a visual reference for the proposed weighting, as radii passing through different points on the curve have different distances to the chance line but are treated uniformly in equation 1.

It is now easily seen that the differentiation

$$\frac{\partial}{\partial c}\left(1 - d\right) = -q'(c) - p'(c) = 0 \qquad (A2.2)$$

leads to the same critical points on the ROC curve as *J* and thus to identical cut-points $c^* = c^*_J$.
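The geometric claim can be checked numerically. The sketch below assumes that the smaller radius runs from (0,1) to the ROC point and that the larger radius continues along the same ray until it meets the chance line; this reading of figure 3 is an assumption, as are the distributions and function names.

```python
import math
from statistics import NormalDist

# Hypothetical distributions: NormalDist(mean, standard deviation).
dist_d, dist_nd = NormalDist(1, 2), NormalDist(0, 1)


def roc_point(c):
    """ROC coordinates (x, y) = (1 - p(c), q(c))."""
    return 1 - dist_nd.cdf(c), 1 - dist_d.cdf(c)


def chance_line_hit(c):
    """Point where the ray from (0, 1) through the ROC point meets the line y = x."""
    x, y = roc_point(c)
    t = 1 / (1 + x - y)               # ray parameter at the intersection
    return t * x, 1 + t * (y - 1)


def radius_ratio(c):
    """Ratio of the smaller radius to the larger radius, both measured from (0, 1)."""
    x, y = roc_point(c)
    hx, hy = chance_line_hit(c)
    smaller = math.hypot(x, 1 - y)    # (0, 1) to the ROC curve
    larger = math.hypot(hx, 1 - hy)   # (0, 1) to the chance line
    return smaller / larger


grid = [-5 + i * 0.001 for i in range(13001)]
c_ratio = min(grid, key=radius_ratio)
c_j = max(grid, key=lambda c: roc_point(c)[1] - roc_point(c)[0])   # maximizes d

x, y = roc_point(1.0)
hit_x, hit_y = chance_line_hit(1.0)
print(f"ratio at c = 1: {radius_ratio(1.0):.4f}, 1 - d: {1 - (y - x):.4f}")
print(f"ratio cut-point = {c_ratio:.3f}, Youden cut-point = {c_j:.3f}")
```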
