- Journal List
- J Natl Cancer Inst
- PMC2528005

# Discriminatory Accuracy From Single-Nucleotide Polymorphisms in Models to Predict Breast Cancer Risk

**Affiliation of author:** Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD

**Correspondence to:** Mitchell H. Gail, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Blvd, Rm 8032, Bethesda, MD 20892-7244 (e-mail: gailm@mail.nih.gov).

## Abstract

One purpose for seeking common alleles that are associated with disease is to use them to improve models for projecting individualized disease risk. Two genome-wide association studies and a study of candidate genes recently identified seven common single-nucleotide polymorphisms (SNPs) that were associated with breast cancer risk in independent samples. These seven SNPs were located in *FGFR2*, TNRC9 (now known as *TOX3*), *MAP3K1*, *LSP1*, *CASP8*, chromosomal region 8q, and chromosomal region 2q35. I used estimates of relative risks and allele frequencies from these studies to estimate how much these SNPs could improve discriminatory accuracy measured as the area under the receiver operating characteristic curve (AUC). A model with these seven SNPs (AUC = 0.574) and a hypothetical model with 14 such SNPs (AUC = 0.604) have less discriminatory accuracy than a model, the National Cancer Institute’s Breast Cancer Risk Assessment Tool (BCRAT), that is based on ages at menarche and at first live birth, family history of breast cancer, and history of breast biopsy examinations (AUC = 0.607). Adding the seven SNPs to BCRAT improved discriminatory accuracy to an AUC of 0.632, which was, however, less than the improvement from adding mammographic density. Thus, these seven common alleles provide less discriminatory accuracy than BCRAT but have the potential to improve the discriminatory accuracy of BCRAT modestly. Experience to date and quantitative arguments indicate that a huge increase in the numbers of case patients with breast cancer and control subjects would be required in genome-wide association studies to find enough SNPs to achieve high discriminatory accuracy.

### CONTEXT AND CAVEATS

#### Prior knowledge

Two genome-wide association studies and a study of candidate genes recently identified seven common single-nucleotide polymorphisms (SNPs) that were associated with breast cancer risk in independent samples.

#### Study design

Estimates of relative risks and allele frequencies from these studies were used to estimate how much these SNPs could improve discriminatory accuracy measured as the area under the receiver operating characteristic curve (AUC). The discriminatory accuracy of these seven SNPs and a hypothetical model with 14 such SNPs were then compared with that of the National Cancer Institute's Breast Cancer Risk Assessment Tool (BCRAT).

#### Contribution

The seven-SNP model (AUC = 0.574) and a hypothetical model with 14 such SNPs (AUC = 0.604) have less discriminatory accuracy than the National Cancer Institute's BCRAT (AUC = 0.607). Adding the seven SNPs to BCRAT increased the AUC to 0.632.

#### Implications

Experience to date and quantitative arguments indicate that a huge increase in the numbers of case patients with breast cancer and control subjects would be required in genome-wide association studies to find enough SNPs to achieve high discriminatory accuracy.

#### Limitations

Individual-level data on case patients and control subjects are needed to investigate interactions that may improve the models. The data used to estimate SNP effects did not permit estimation of interactions among SNPs or between SNPs and risk factors in BCRAT.

From the Editors

Hopes have been raised that combinations of common genetic markers can be used to improve the discriminatory accuracy of models to project the risk of a specific disease, such as breast cancer, and thereby improve disease prevention programs (1). Recent genome-wide association studies (2,3) and an assessment of candidate single-nucleotide polymorphisms (SNPs) (4) revealed seven common SNP alleles that confer risk for breast cancer. I calculated how much discriminatory accuracy these SNPs provide and how much they can add to the discriminatory accuracy of Gail model 2 (5) in the National Cancer Institute's Breast Cancer Risk Assessment Tool (BCRAT) (http://www.cancer.gov/bcrisktool/).

From table 2 in Easton et al. (2), the allele frequencies (*s*_{i}) and per allele odds ratios (OR_{i}) for five disease-associated SNPs were, respectively, 0.38 and 1.26 for rs2981582 in *FGFR2*, 0.25 and 1.20 for rs3803662 in TNRC9 (now known as *TOX3*), 0.28 and 1.13 for rs889312 in *MAP3K1*, 0.30 and 1.07 for rs3817198 in *LSP1*, and 0.40 and 1.08 for rs13281615 in chromosomal region 8q. I used the SNP in TNRC9 with the highest association with disease, rs3803662, which was identified as the result of fine-scale mapping (2). I included SNP rs13387042 in chromosomal region 2q35, with allele frequency 0.497 and per allele odds ratio of 1.20, from data in table 1 of Stacey et al. (3). The minor allele in *CASP8* D302H (rs1045485) in chromosomal region 2q has a frequency of 0.13 and odds ratio of 0.88 per allele [from table 1 in Cox et al. (4)]. To provide a relative odds of 1.0 or more for disease-associated alleles, I took the rare homozygote as baseline and used allele frequency 0.87 with an odds ratio of 1.136 (=1/0.88) for the major allele in the modeling.

I define *X*_{i} as the number of disease-associated alleles at SNP *i* in a given subject and define **X** = (*X*_{1},…,*X*_{7}). Under Hardy–Weinberg equilibrium, the probabilities of *X*_{i} = 0, 1, and 2, namely, *p*_{i}(*X*_{i}), are (1−*s*_{i})^{2}, 2(1−*s*_{i})*s*_{i}, and *s*_{i}^{2}, respectively, where *s*_{i} is the frequency of the disease-associated allele. These seven SNPs are on six different chromosomes, and rs1045485 and rs13387042, which are both on chromosome 2, are 15.8 Mb apart. I therefore assume linkage equilibrium, which implies

$$P(\mathbf{X}) = \prod_{i=1}^{7} p_i(X_i).$$

There are 3^{7} or 2187 such probabilities, *P*(**X**). Analyses (2–4) of data on these SNPs indicate that at a given locus the odds ratio is well described by (OR_{i})^{X_{i}}. If it is assumed that SNP effects are additive on the logistic scale, the relative risk for a rare disease is

$$rr(\mathbf{X}) = \prod_{i=1}^{7} \mathrm{OR}_i^{X_i}. \tag{1}$$

The distribution of relative risks in the general population is

$$F_{rr}(t) = \sum_{\mathbf{X}:\, rr(\mathbf{X}) \le t} P(\mathbf{X}),$$

where *t* is a dummy argument representing any real number. The disease risk, *r*(**X**), is the probability that a woman with risk factors **X** will develop breast cancer over a defined time interval. For a short interval, such as 5 years, *r*(**X**) is proportional to *rr*(**X**) because competing risks of death can be ignored. Thus, *r*(**X**) = *k*[*rr*(**X**)], where *k* is the risk for a woman with relative risk 1.0, which corresponds to the lowest level of risk for all risk factors. Hence, the distribution of risk in the general population is *F*_{r}(*t*) = *F*_{rr}(*t*/*k*). As shown by Gail et al. (6), the distribution of risk in women who develop breast cancer (case patients) is

$$FD_r(t) = \frac{\int_0^t u \, dF_r(u)}{\int_0^1 u \, dF_r(u)}.$$

Likewise, the distribution of relative risks in case patients is

$$FD_{rr}(t) = \frac{\int_0^t u \, dF_{rr}(u)}{\int_0^\infty u \, dF_{rr}(u)},$$

and it follows that *FD*_{r}(*t*) = *FD*_{rr}(*t*/*k*).
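As a small illustration of how the case-patient distribution arises, the sketch below weights each genotype of a single SNP by its relative risk and renormalizes, which is the discrete analogue of weighting the population distribution by risk. It uses the *FGFR2* values quoted earlier (allele frequency 0.38, per allele odds ratio 1.26); the function names are hypothetical and are not taken from the original analysis. Under the multiplicative model, the genotype distribution in case patients is again of Hardy–Weinberg form, with the risk allele frequency shifted from *s* to *s*·OR/(1 − *s* + *s*·OR).

```python
def hwe_probs(s):
    """Population genotype probabilities for 0, 1, 2 risk alleles under Hardy-Weinberg."""
    return [(1 - s) ** 2, 2 * s * (1 - s), s ** 2]


def case_probs(s, odds_ratio):
    """Genotype probabilities among case patients: population mass weighted by rr = OR**x."""
    weighted = [p * odds_ratio ** x for x, p in enumerate(hwe_probs(s))]
    total = sum(weighted)  # mean relative risk in the population
    return [w / total for w in weighted]


# Illustrative values for rs2981582 in FGFR2, as quoted in the text.
s, odds_ratio = 0.38, 1.26
cases = case_probs(s, odds_ratio)

# Risk allele frequency among case patients under the multiplicative model.
s_case = s * odds_ratio / (1 - s + s * odds_ratio)

print("population:", [round(p, 4) for p in hwe_probs(s)])
print("cases:     ", [round(p, 4) for p in cases])
print("risk allele frequency in cases:", round(s_case, 4))
```

The shift in genotype frequencies between the two printed rows is small, which already hints at the limited discrimination a single common SNP can provide.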

The distribution *F*_{rr}(*t*) is shown in Figure 1 for the seven-SNP model. The corresponding mean of log_{e}[*rr*(**X**)] (MLRR) is 0.841, with a standard deviation (SDLRR) of 0.262. This SDLRR describes the dispersion of relative risk and risk in the population and is related to discriminatory accuracy (7). A steep slope in the midrange of a locus in Figure 1 corresponds to a small SDLRR.
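The MLRR and SDLRR quoted above can be checked by brute-force enumeration of the 2187 genotype combinations. The following is a minimal sketch, assuming Hardy–Weinberg and linkage equilibrium and multiplicative per allele odds ratios as stated in the text; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
import math
from itertools import product

# (risk allele frequency s_i, per allele odds ratio OR_i) as quoted in the text.
SNPS = {
    "rs2981582 (FGFR2)": (0.38, 1.26),
    "rs3803662 (TNRC9/TOX3)": (0.25, 1.20),
    "rs889312 (MAP3K1)": (0.28, 1.13),
    "rs3817198 (LSP1)": (0.30, 1.07),
    "rs13281615 (8q)": (0.40, 1.08),
    "rs13387042 (2q35)": (0.497, 1.20),
    "rs1045485 (CASP8, major allele)": (0.87, 1.136),
}


def genotype_table(snps):
    """Return (P(X), log rr(X)) for every genotype vector X."""
    per_snp = []
    for s, odds_ratio in snps:
        probs = [(1 - s) ** 2, 2 * s * (1 - s), s ** 2]  # Hardy-Weinberg
        per_snp.append([(p, x * math.log(odds_ratio)) for x, p in enumerate(probs)])
    table = []
    for combo in product(*per_snp):
        prob = math.prod(p for p, _ in combo)      # linkage equilibrium
        log_rr = sum(lrr for _, lrr in combo)      # multiplicative odds ratios
        table.append((prob, log_rr))
    return table


table = genotype_table(SNPS.values())
mlrr = sum(p * lrr for p, lrr in table)
sdlrr = math.sqrt(sum(p * (lrr - mlrr) ** 2 for p, lrr in table))

print(len(table))       # number of genotype combinations, 3**7
print(round(mlrr, 3))   # the text reports 0.841
print(round(sdlrr, 3))  # the text reports 0.262
```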

**Figure 1.** Distribution of log_{e} relative risk, *F*_{rr}(*t*), for the seven–single-nucleotide polymorphism (SNP) model (**thin dashed line**), a 14-SNP model with the original seven SNPs plus seven more SNPs with identical characteristics (**thin solid** ...

The curves in Figure 2 are plots of [1–*FD*_{r}(*t*)] (ie, the probability that risk exceeds a given level, *t*, in case patients) against [1–*F*_{r}(*t*)] (ie, the probability that risk exceeds a given level, *t*, in the population), as the risk level, *t*, (not shown) varies from 0 to 1.0. Each point on a locus thus gives the probability that a case patient would have a risk greater than *t* on the ordinate and the probability that a member of the general population would have a risk greater than *t* on the abscissa. If most of the risk were concentrated in a small proportion of the population, the curve would rise quickly, indicating that most case patients had higher risks than members of the general population. In the curve corresponding to the seven-SNP model in Figure 2, only a fraction [1–*FD*_{r}(*t*_{0.5})] = 0.606 of case patients have risks higher than the median risk in the general population, defined by [1–*F*_{r}(*t*_{0.5})] = 0.5, indicating poor discrimination for the seven-SNP model. Another measure of discriminatory accuracy, the area under this curve (6,8,9), is 0.574. For a rare disease, such as breast cancer in a 5-year interval, this area is very nearly equal to the area under the receiver operating characteristic curve (AUC), which is the probability that a randomly selected case patient has a projected risk greater than that of a randomly selected control (non-case) subject (6). For these discrete risk models, I allow for ties in projected risk by computing the probability that the case risk exceeds the control risk (more precisely the risk in the general population) plus half the probability that the case risk equals the control risk.
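The tie-corrected area described above can be sketched directly from the discrete risk distribution. The code below is an illustrative implementation under the same assumptions as the text (Hardy–Weinberg and linkage equilibrium, multiplicative odds ratios, case masses proportional to population mass times relative risk). It compares case patients with the general population, which the text notes is very nearly the AUC for a rare disease. The function names and the rounding used to merge numerically equal relative risks are assumptions of this sketch.

```python
import math
from itertools import product

# (risk allele frequency, per allele odds ratio) for the seven SNPs, from the text.
SNPS = [(0.38, 1.26), (0.25, 1.20), (0.28, 1.13), (0.30, 1.07),
        (0.40, 1.08), (0.497, 1.20), (0.87, 1.136)]


def risk_distribution(snps):
    """Map each distinct relative risk to its probability in the general population."""
    per_snp = [[((1 - s) ** 2, 1.0), (2 * s * (1 - s), o), (s ** 2, o * o)]
               for s, o in snps]
    dist = {}
    for combo in product(*per_snp):
        prob = math.prod(p for p, _ in combo)
        rr = round(math.prod(r for _, r in combo), 12)  # merge float-equal values
        dist[rr] = dist.get(rr, 0.0) + prob
    return dist


def auc_case_vs_population(dist):
    """P(case risk > population risk) + 0.5 * P(case risk == population risk)."""
    mean_rr = sum(rr * p for rr, p in dist.items())
    auc = 0.0
    below = 0.0  # population mass strictly below the current risk level
    for rr in sorted(dist):
        pop_mass = dist[rr]
        case_mass = rr * pop_mass / mean_rr  # risk-weighted mass among case patients
        auc += case_mass * (below + 0.5 * pop_mass)
        below += pop_mass
    return auc


dist = risk_distribution(SNPS)
auc = auc_case_vs_population(dist)
print(round(auc, 3))  # the text reports 0.574 for the seven-SNP model
```

Because the constant *k* rescales every risk equally, the ranking of women is the same whether relative risk or absolute risk is used, so the area can be computed from relative risks alone.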

**Figure 2.** Probability that a case patient has a risk greater than *t*, [1–*FD*_{r}(*t*)], plotted against the probability that a member of the general population has a risk greater than *t*, [1–*F*_{r}(*t*)], as *t* (not shown) varies from 0 to 1. Separate curves are ...

To determine whether the modest discriminatory accuracy of the seven-SNP model could be improved, I supposed that there were seven more SNPs with identical properties to the first seven SNPs and that all were in linkage equilibrium. As shown in Figure 2, some improvement in discriminatory accuracy was observed, with an AUC of 0.604. The corresponding distribution of log_{e}[*rr*(**X**)] had an MLRR of 1.682 and an SDLRR of 0.371 (Figure 1). Note that 1.682 is twice the MLRR for the seven-SNP model and 0.371 is 2^{0.5} times the SDLRR for the seven-SNP model, as follows from the addition of independent log relative risks (Equation 1).
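The doubling of the MLRR and the 2^{0.5} scaling of the SDLRR can be verified without enumerating genotypes, because means and variances of independent log relative risks add. The sketch below assumes that each SNP contributes 2*s*·log(OR) to the mean and 2*s*(1−*s*)·[log(OR)]^{2} to the variance (the mean and variance of a binomial allele count scaled by log(OR)); the function name is hypothetical.

```python
import math

# (risk allele frequency, per allele odds ratio) for the seven SNPs, from the text.
SNPS = [(0.38, 1.26), (0.25, 1.20), (0.28, 1.13), (0.30, 1.07),
        (0.40, 1.08), (0.497, 1.20), (0.87, 1.136)]


def log_rr_moments(snps):
    """Mean and standard deviation of log relative risk for independent SNPs."""
    mean = sum(2 * s * math.log(o) for s, o in snps)
    variance = sum(2 * s * (1 - s) * math.log(o) ** 2 for s, o in snps)
    return mean, math.sqrt(variance)


mlrr7, sdlrr7 = log_rr_moments(SNPS)
mlrr14, sdlrr14 = log_rr_moments(SNPS + SNPS)  # seven SNPs plus seven identical copies

print(round(mlrr7, 3), round(sdlrr7, 3))    # text: 0.841 and 0.262
print(round(mlrr14, 3), round(sdlrr14, 3))  # text: 1.682 and 0.371
print(mlrr14 / mlrr7, sdlrr14 / sdlrr7)     # ratios of 2 and sqrt(2), up to rounding
```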

BCRAT (Gail model 2) is based on age at first live birth, age at menarche, number of first-degree relatives with breast cancer, and number of previous benign breast biopsy examinations. BCRAT has been criticized for lack of discriminatory accuracy (9). I obtained unbiased (weighted) estimates (10) of the joint distribution of these risk factors, **X**, for white women aged 50 years or older from the 2000 National Health Interview Survey (http://www.cdc.gov/NCHS/nhis/htm; data accessed on July 22, 2002). From the BCRAT relative risks (11), I used the methods described above to calculate an MLRR of 0.520 and an SDLRR of 0.359, corresponding to the thick dashed curve in Figure 1; the AUC was 0.607 (Figure 2). Thus, BCRAT had greater discriminatory accuracy measured by AUC than the seven-SNP model and a slightly greater AUC than the hypothetical 14-SNP model.

By assuming that odds ratios from the seven-SNP model multiplied those from the BCRAT and that the distribution of these SNPs was independent of that of the risk factors in BCRAT, I estimated how much the discriminatory accuracy of BCRAT could be improved by adding the seven SNPs. The resulting distribution of log_{e}[*rr*(**X**)] (Figure 1) has an MLRR of 1.361 and an SDLRR of 0.445. The AUC increased to 0.632 (Figure 2). In a different population, Chen et al. (12) estimated that adding mammographic density to BCRAT increased the average age-specific AUC by 0.047, from 0.596 to 0.643. The corresponding increase in AUC from adding these seven SNPs to BCRAT was 0.025 (= 0.632 − 0.607). Thus, mammographic density adds more to the discriminatory accuracy of BCRAT than do the seven SNPs.
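Because the SNP and BCRAT contributions are assumed to be independent and to multiply, their log relative risks add, so the means add and the variances add. A quick check using the rounded values quoted in the text (the small difference in the last digit of the SDLRR is presumably due to rounding of the inputs):

$$\text{MLRR} = 0.841 + 0.520 = 1.361, \qquad \text{SDLRR} = \sqrt{0.262^{2} + 0.359^{2}} = \sqrt{0.1975} \approx 0.444 \quad (\text{reported: } 0.445).$$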

All the AUC values in these analyses describe the discriminatory power of risk factors, such as SNPs, in women of comparable age over a short interval, such as 5 years. Thus, these AUC values describe the discriminatory accuracy of risk factors apart from age. Some investigators compare case patients and control subjects over large age ranges. Because age is a strong predictor of breast cancer risk and is included in all risk models and because case patients tend to be older than control subjects, doing so increases the AUC value.

This presentation is focused on discriminatory accuracy. High discriminatory accuracy is required for some applications, such as screening for disease (6), but even risk models with modest discriminatory accuracy can be useful for some applications, such as deciding whether or not to take tamoxifen, which decreases the absolute risks of breast cancer and hip fracture but increases the absolute risks of endometrial cancer and stroke (6,13). For such decision problems, for general counseling, and for designing prevention trials, it is important that the model accurately predict the risk in women with various risk factor combinations, a feature termed “calibration” (6,9). To assess calibration, one will need to study a cohort to determine how many women develop breast cancer and then compare that number with how many cancers were predicted, overall and in groups of women with various combinations of genotypes and other risk factors. It will be of special interest to determine whether the risks for women with multiple adverse alleles are as high as predicted by the multiplicative model in Equation 1. Positive or negative interactions among such SNP effects or with other risk factors could lead to poor calibration in some subgroups. Although interactions can affect calibration, my unreported calculations indicate that they have little effect on discriminatory accuracy. The generalizability to various racial groups of a risk model that is based on SNPs might be affected by interactions between SNP effects and racial group because the magnitude and even the direction of an association of a marker allele with disease may vary by racial group (3).

The power to detect interactions between pairs of SNPs and between SNPs and other risk factors is limited. A recent study of prostate cancer risk (14) failed to detect such interactions and found that adding information from five SNPs increased the AUC for a model based on age, geographic region, and family history of prostate cancer by only 0.009, from 0.624 to 0.633. Another study of prostate cancer failed to demonstrate statistically significant interactions among disease-associated SNPs from seven different genomic regions (15). It would be of interest to search for interactions of the effects of common SNP alleles on breast cancer risk with age, as have been found for rare high-risk mutations in *BRCA1* and *BRCA2* (16).

To build a model of absolute risk, one can couple the relative risk estimates from case–control data in genome-wide association studies with cancer incidence rates from registry data, as described previously (5,11). To do so requires data on the joint distribution of all risk factors in representative case patients or in the general population. In my analysis, it was assumed that the SNP genotypes were mutually independent and also independent of the factors in BCRAT. The effect of positive correlations between these SNPs and family history of breast cancer, if any, would be to diminish the discriminatory accuracy that these SNPs add to BCRAT because family history is included in BCRAT.

Very large relative risks are needed for a single factor to achieve good discriminatory accuracy (17). Even adding a strong risk factor with a large attributable risk, such as mammographic density, only increased the AUC of a model like BCRAT from 0.596 to 0.643 (12). Thus, it is not surprising that adding seven SNPs with small relative risks would increase the AUC of BCRAT only modestly.

It is tempting to speculate on how much additional discriminatory accuracy can be achieved by identifying further common SNPs and what effort would be required to find them. Pharoah et al. (7) assumed that the natural logarithm of risk was normally distributed, which provides a good approximation if many independent SNPs satisfy Equation 1 and if risk is proportional to relative risk, as was assumed in my analysis. Based on segregation analyses (18) and considerations of the recurrence risk among siblings, Pharoah et al. (7) estimated an SDLRR of 1.2 in the general population and showed that the logarithm of risk in case patients would be normally distributed with the same variance but with the mean increased by 1.2^{2} (= 1.44). From these values, I calculated an AUC of 0.800. This result supports arguments (7) that knowing which SNPs give rise to this polygenic component of risk (which is independent of risk from *BRCA1* and *BRCA2* mutations) might have some value for screening the population. The seven-SNP model has an SDLRR of 0.262. To achieve an SDLRR of 1.2, one would need 147 [= 7(1.2/0.262)^{2}] SNPs like the seven SNPs already identified. The geometric mean of the per allele odds ratios from these seven SNPs was 1.15. The study by Easton et al. (2) used approximately 400 case patients with strong family histories of breast cancer in the SNP discovery phase, which might be equivalent in statistical power to approximately 1600 population-based case patients (19). Stacey et al. (3) used 1600 population-based case patients in the discovery phase. Calculations as in Gail et al. (20) show that approximately 65% of disease-associated SNPs with an odds ratio of 1.15 would be among the 25000 smallest *P* values in a scan of 500000 SNPs if 1600 case patients and control subjects are used in the discovery phase. Thus, increasing the number of case patients and control subjects in the discovery phase to 5000 or more (20) might increase the number of such SNPs that would eventually be confirmed in subsequent phases to 11 (= 7/0.65). Improvements in SNP chip technology might yield a few more such SNPs, but even a 50% increase would yield only 17 (= 11 × 1.5) SNPs. There are probably many other disease-associated SNPs with smaller odds ratios, but their detection will require larger numbers of case patients and control subjects both in the discovery and validation phases. For example, if remaining disease-associated SNPs have a geometric mean OR of 1.10, one would need (20) approximately 2.15 {= [log(1.15)/log(1.10)]^{2}} times as many case patients and control subjects in the discovery phase as was required for an OR of 1.15. The contribution of an SNP to the variance of the log relative risk is 2*s*_{i}(1–*s*_{i})[log(OR_{i})]^{2}. It follows that if 10 additional SNPs can be identified with properties like those of the seven SNPs found so far but the rest of the SNPs have an OR of 1.10, one will need to find about 280 [= (147 − 17) × 2.15] additional low-risk SNPs to achieve the desired SDLRR of 1.2. Although these numbers are only illustrative, they show that a huge increase in the numbers of case patients and control subjects would be required in genome-wide association studies to find enough SNPs to achieve an SDLRR of 1.2.
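The illustrative arithmetic in this paragraph can be reproduced in a few lines. The sketch below assumes, as in the text, that log risk is normally distributed with standard deviation SDLRR in the population and with the mean shifted by SDLRR^{2} in case patients. Under that assumption the difference between a case patient's and a population member's log risk is normal with mean SDLRR^{2} and variance 2·SDLRR^{2}, so the area equals Φ(SDLRR/2^{0.5}). This closed form is a derivation made for this sketch rather than a formula stated in the text, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lognormal_auc(sdlrr):
    """Area for normally distributed log risk: Phi(SDLRR / sqrt(2))."""
    return normal_cdf(sdlrr / math.sqrt(2.0))


target_sd, seven_snp_sd = 1.2, 0.262

# Number of SNPs like the first seven needed to reach the target SDLRR.
snps_needed = 7 * (target_sd / seven_snp_sd) ** 2            # text: 147

# Scaling factor when the odds ratio drops from 1.15 to 1.10.
or_scale = (math.log(1.15) / math.log(1.10)) ** 2            # text: approximately 2.15

# Additional low-risk SNPs needed if only 17 SNPs have an odds ratio near 1.15.
low_risk_snps = (round(snps_needed) - 17) * or_scale         # text: about 280

print(round(lognormal_auc(target_sd), 3))     # the text reports 0.800
print(round(lognormal_auc(seven_snp_sd), 3))  # normal approximation, seven-SNP model
print(round(snps_needed), round(or_scale, 2), round(low_risk_snps))
```

For the seven-SNP model the normal approximation lands close to the 0.574 obtained from the exact discrete calculation, which suggests the approximation is reasonable even for a modest number of SNPs.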

This study had several limitations. To investigate interactions that may improve the models, individual level data on case patients and control subjects are needed. The published data (2–4) used to estimate SNP effects did not permit estimation of interactions among SNPs or between SNPs and risk factors in BCRAT. Several assumptions were needed to speculate on prospects for finding additional common disease-associated alleles that will achieve high discriminatory accuracy. Further research may indicate the extent to which these assumptions and the resulting broad conclusions hold.

## Funding

Intramural Research Program, Division of Cancer Epidemiology and Genetics, National Cancer Institute and National Institutes of Health.

## Footnotes

I would like to thank Sir Bruce A. J. Ponder for stimulating discussions leading to this work, Dr Montserrat Garcia-Closas for discussions on an empirical study to evaluate risk prediction models including single-nucleotide polymorphisms with other risk factors, the reviewers and Dr Ruth M. Pfeiffer for helpful comments, and Mr David Pee for providing estimates of the distribution of risk factors for Breast Cancer Risk Assessment Tool from the 2000 National Health Interview Survey.


**Oxford University Press**


*JNCI Journal of the National Cancer Institute*. Jul 16, 2008; 100(14):1037.
