Matheny M, McPheeters ML, Glasser A, et al. Systematic Review of Cardiovascular Disease Risk Assessment Tools [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 May. (Evidence Syntheses/Technology Assessments, No. 85.)

The body of literature for this analysis consisted largely of studies that could not be easily pooled or combined quantitatively. Therefore, it is essential when identifying trends in the outcomes to highlight studies that reflect key issues and concepts, and we have done that in this section.

Almost all models retained good relative and absolute risk prediction in the development cohort itself, but since most were not externally validated, the utility of these models must remain in question. Among the small number of externally validated models, the strongest performance was seen in those with matched outcomes among North American and European cohorts. External validation of U.S.-developed models in other U.S. cohorts found that most retained good relative and absolute risk prediction performance among white and black populations, but absolute risk prediction was poor among minority populations, such as Hispanics and Asian Americans.23, 97, 100 A few studies that evaluated higher- or lower-risk cohorts, such as siblings of patients with early CAD or young adults, had poor absolute risk prediction performance, which is expected.42, 49 In all cases, overall model relative risk performance (risk separation) was superior for women.23, 42, 49, 97, 100 Thus, the evidence in this review would suggest that risk models are generally accurate only in patients who are representative of the source population, and for the Framingham cohort, such patients were middle-aged and white or black.

There was a paucity of CVA risk models in the literature. This was primarily due to the exclusion criterion that required the baseline population to be free of CVD at the time of cohort inception. A few of the CVA risk models were externally validated in a population with baseline CVD. While those cohorts and risk models that were developed in the absence of baseline CVD are included here, they are not representative of the overall literature.

Comparison of traditional cardiovascular risk factors among seven U.S. cohorts by D'Agostino and colleagues found that although effects seen across some cohorts were similar, those that comprised Japanese American, Native American, or Hispanic populations had significantly different relative risks associated with the risk factors identified in the Framingham cohort.23 In addition, those cohorts also demonstrated poor absolute risk prediction performance with the FRS model. While there is clearly some tolerance for changes in cohort characteristics, the degree of tolerance is not entirely known, and overall the evidence suggests that the number and magnitude of differences in relative risk between populations are correlated with poor absolute risk performance. Other studies have identified a direct relationship between the tendency to under- or overpredict and the baseline outcome incidence in the model derivation cohort.30

In studies that examined risk factors for CVD using some of the same cohorts but slightly different sets of risk variables (four in one; six in another), confounding was clearly present as variables were included or excluded from the models, suggesting that even the best risk estimates likely carry unmeasured confounding.23, 30

External validations of U.S. risk models among European cohorts in which the outcomes were matched were more mixed. A few studies with matched outcomes reported acceptable risk model performance, but the European cohorts were generally at higher risk than the source population, including cohorts of patients with diabetes or elderly patients.48, 89 Another study reported acceptable performance, but its results are questionable because the authors evaluated a cohort that included patients with diabetes using a model that was developed with patients with diabetes excluded.40

Several studies reported that the risk models underpredicted outcomes, but again, these were almost entirely conducted in high-risk patient cohorts.56, 77, 82, 85, 89 Most of the evaluations among European cohorts found that the U.S. risk models overpredicted risk, largely because underlying outcome event rates differed substantially between the model cohort and the evaluation cohort,14, 48, 56, 80, 88, 92, 94, 110 with significant differences also observed in the degree to which individual variables contributed to risk assessment.30

The UKPDS risk model was the most frequently externally validated diabetes model,38, 40, 73, 78, 108 although evaluation results were mixed. For example, application of the UKPDS model to a Chinese cohort of patients with diabetes drastically overestimated the risk of CHD among those patients,38 largely due to a significantly higher rate of cerebrovascular disease and a significantly lower rate of CVD among Chinese patients compared with U.S. or European cohorts.121 Among newly diagnosed patients in the British Poole Diabetes Study, absolute risk prediction, as determined by the Hosmer-Lemeshow goodness-of-fit test, was mildly inadequate, but the O/E ratio was acceptable.73 However, the cohort included both soft and hard CHD outcomes, while the model was developed to predict hard CHD only, suggesting that the model would overpredict outcomes in this cohort if the outcomes were appropriately matched. Analyses in the NHS Trust cohort of London patients demonstrated that the model significantly underpredicted the number of outcomes.78 However, that cohort also included both soft and hard outcomes, which left open the question of whether the model would have had an acceptable ratio for matched outcomes. Thus, although the UKPDS model clearly had improved performance over non-diabetes-specific risk models when directly compared, confirmed external validation in a matched outcome cohort of patients with diabetes has not yet taken place.73, 78

Most of the external validations of non-diabetes-specific cardiovascular risk models in diabetic cohorts found that the models significantly underpredicted the number of outcomes.73–78, 85 A few studies showed an acceptable O/E ratio but had outcome mismatches in which the cohort outcome was more restrictive than the model outcome.40, 72 This supports the conclusion that the risk of CVD among patients with diabetes is elevated compared to the general population. It also suggests that a diabetes risk variable in a general model is insufficient for capturing the variance in risk experienced by diabetic populations; that is, risk is not determined simply by whether the patient has diabetes, but also by factors such as diabetes control, duration of diabetes, and whether the patient has already experienced end-organ damage.

There were no studies in which a general risk prediction model was compared to a diabetes-excluded model for matched outcomes. The WHS, in which 2.9 percent of patients had diabetes, evaluated the FRS ATP-III and 1998 models, but the outcomes were substantially mismatched for both the ATP-III (CVD vs. hard CHD) and 1998 (total vs. hard CHD) models, and absolute risk prediction was poor in both.54 The Chicago Heart Association study evaluated young men aged 18 to 39 without diabetes for matched outcomes in the ATP-III model and unmatched outcomes in the 1998 model, but absolute risk performance was poor in both because of the young population.49 Czech patients without diabetes were evaluated with the 1998 FRS model for matched outcomes, and the model overpredicted the number of outcomes.56 The Norwegian Counties Study of patients without diabetes evaluated the SCORE risk model, which did not include a diabetes risk factor but did include patients with diabetes in its source cohort; the model overestimated the number of outcomes, and the overestimation worsened with increasing age.44 The internal validation study of the QRISK equation for CVD, which excluded patients with diabetes, also externally validated the 1991 FRS general risk model.46 The 1991 FRS model significantly overpredicted the outcome, although there was a small outcome mismatch.

A number of U.S. cohorts that engaged in recalibration or remodeling reported poor absolute risk performance for the original FRS models. However, most of these evaluations had outcome mismatches between the cohort and the model.54, 61, 101 Those studies that performed remodeling of FRS risk variables in a local cohort reported retained or improved relative risk prediction and adequate absolute risk prediction.54, 61, 101 It should be noted that remodeling results in a model whose outcome matches the cohort outcome (by definition), and it is not surprising that this would result in improved performance. One study evaluated matched outcomes between the cohort and the original model and found that minority populations were poorly predicted by the model. This study subsequently showed that remodeling resulted in adequate performance in all the cohorts, based upon the Hosmer-Lemeshow goodness-of-fit test.23 Two other studies with matched outcomes and inadequate original model performance noted adequate absolute risk prediction after remodeling.45 In contrast, recalibration methods (which adjust the baseline outcome event rate intercept in the model but do not adjust the risk variable coefficients) performed more variably, with both adequate and inadequate absolute risk prediction results.42, 45 However, one of these studies performed both recalibration and remodeling, and showed that although recalibration was sufficient for women but not men, remodeling resulted in adequate absolute risk prediction for both sexes.45
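
To make this distinction concrete, the sketch below is a minimal illustration only, not a reproduction of any method used in the reviewed studies. It contrasts recalibration, which re-estimates only the baseline risk (intercept) in a local cohort, with remodeling, which re-estimates all of the risk variable coefficients. The cohort, risk variables, coefficients, and event rates are hypothetical, and a simple logistic model stands in for the published survival-based risk equations.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 2000

    # Hypothetical local cohort: age, systolic blood pressure, current smoking
    X = np.column_stack([
        rng.normal(55, 10, n),        # age, years
        rng.normal(130, 18, n),       # systolic BP, mm Hg
        rng.binomial(1, 0.25, n),     # current smoker, 0/1
    ])
    y = rng.binomial(1, 0.08, n)      # observed events over follow-up (simulated)

    # Hypothetical "published" model: fixed coefficients and intercept
    beta = np.array([0.05, 0.02, 0.60])
    intercept = -9.0

    def risk(X, beta, intercept):
        # Logistic model: predicted probability of an event
        return 1.0 / (1.0 + np.exp(-(X @ beta + intercept)))

    original_risk = risk(X, beta, intercept)

    # Recalibration: keep the published coefficients, shift only the intercept so
    # that average predicted risk approximates the local observed event rate.
    lp = X @ beta
    new_intercept = np.log(y.mean() / (1.0 - y.mean())) - lp.mean()
    recalibrated_risk = risk(X, beta, new_intercept)

    # Remodeling: re-estimate every coefficient (and the intercept) locally.
    remodeled = LogisticRegression(max_iter=1000).fit(X, y)
    remodeled_risk = remodeled.predict_proba(X)[:, 1]

    print("Observed event rate:   ", round(y.mean(), 3))
    print("Mean original risk:    ", round(original_risk.mean(), 3))
    print("Mean recalibrated risk:", round(recalibrated_risk.mean(), 3))
    print("Mean remodeled risk:   ", round(remodeled_risk.mean(), 3))

In practice, recalibration of the Framingham functions is typically performed on a survival model, adjusting the baseline survival and mean risk factor levels rather than a logistic intercept, but the underlying idea is the same: published coefficients are reused while the model is re-anchored to local event rates.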

Remodeling efforts among diabetes and diabetes-excluded risk models followed the general trend of cardiovascular risk prediction models. Recalibration methods were successful in some cases, but were inadequate in others. However, remodeling methods were almost always successful in producing a model that performed well in the local cohort.38 Among non-diabetic cohorts and general risk models, remodeling was successful in improving performance, although it should be noted that diabetes as a risk factor was dropped from the models.56 Among a large U.S. female non-diabetic cohort, remodeling of the FRS ATP-III risk variables did not result in a well-calibrated model.61

Remodeling established risk models in other cohorts also serves to illuminate systematic differences in the relative risk associated with individual risk factors. For example, although absolute risk prediction was very poor when the UKPDS model was applied to the HKD Registry, there were no significant differences when comparing the hazard ratios of specific risk variables from the two cohorts,38 suggesting that both the baseline outcome incidence and the relative risk contribution from individual risk factors determine absolute risk performance.

There were some substantial and consistent challenges to analyzing this body of literature. For example, we observed significant heterogeneity among outcome definitions, which resulted in frequent mismatches between cohort and model outcomes. Frequently, cohort outcome data were collected to match a particular risk model, or to develop one, and other models with different outcomes were then tested to allow direct comparison. Relative risk performance relies on the weight of the risk factors in the model and is not dependent on the baseline outcome incidence. The C statistic, AUC, and sensitivity and specificity at a specified cut-off point all measure this type of performance. Relative risk performance (discrimination) can be insensitive to outcome mismatches if the relative contribution from each risk factor remains intact. Since all of the outcomes were variations of CVD, stable relative risk performance was frequently found even when outcomes were mismatched. However, in order to use these tools clinically, low- and high-risk threshold cut-off points are set using the development data (i.e., matched outcomes). Separate risk cut-off points must be established in order to appropriately use such tools to risk-stratify patients for outcomes other than those for which they were developed.
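
As a simple illustration (using simulated predictions and outcomes rather than data from any reviewed study), the sketch below computes the discrimination measures named above: the C statistic/AUC, and sensitivity and specificity at a chosen risk cut-off. Because these measures depend only on how predicted risks rank events relative to non-events, uniformly rescaling every prediction, which is a pure calibration error, leaves the AUC unchanged.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    n = 1000

    predicted = rng.uniform(0.01, 0.40, n)   # hypothetical 10-year risk estimates
    events = rng.binomial(1, predicted)      # simulated observed outcomes

    # C statistic / AUC: probability that a randomly chosen event case is
    # assigned a higher predicted risk than a randomly chosen non-case.
    print("AUC:", round(roc_auc_score(events, predicted), 3))

    # Doubling every predicted risk badly miscalibrates the model but does not
    # change the ranking, so discrimination is identical.
    print("AUC after doubling all risks:", round(roc_auc_score(events, 2 * predicted), 3))

    # Sensitivity and specificity at a 20 percent risk cut-off
    high_risk = predicted >= 0.20
    sensitivity = (high_risk & (events == 1)).sum() / (events == 1).sum()
    specificity = (~high_risk & (events == 0)).sum() / (events == 0).sum()
    print("Sensitivity at 20% cut-off:", round(sensitivity, 3))
    print("Specificity at 20% cut-off:", round(specificity, 3))

Note, however, that the same rescaling would change which patients fall above a fixed cut-off, which is why cut-off points must be re-established when a tool is applied to an outcome other than the one for which it was developed.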

In addition, many risk calculators provide a percent risk of an outcome over a set number of years; this is absolute risk prediction, which is measured by model calibration statistics. The O/E ratio is the crudest measurement of this type of performance; it can yield an acceptable ratio even when some ranges of risk are overpredicted and others are underpredicted. The Hosmer-Lemeshow goodness-of-fit test is a more granular evaluation method that sorts all patients by predicted risk, divides them into 10 categories, evaluates the O/E ratio for each category separately, and sums the chi-square values across categories to report an aggregate measurement. Absolute risk prediction performance is dependent on both the baseline outcome incidence and the contribution of risk from each risk factor in the source cohort. Evaluating absolute risk prediction with a mismatched outcome between model and cohort has severe limitations, because the baseline outcome event rates differ from the outset. Some interpretation is possible if the prediction error is in the opposite direction of what one would expect; that is, if a cohort outcome is more restrictive, one would expect the model to overpredict the outcome, but if it underpredicts the outcome, then the result can be safely interpreted as poor absolute risk prediction. However, no such assertion can be made if absolute risk prediction is determined to be adequate for mismatched outcomes.
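
The calibration measures described here can likewise be sketched in a few lines. The example below, again using simulated predictions and outcomes rather than data from any reviewed study, computes an overall O/E ratio and a decile-based Hosmer-Lemeshow chi-square by grouping patients on predicted risk exactly as described above.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(2)
    n = 2000

    predicted = rng.uniform(0.01, 0.40, n)   # model-predicted event probabilities
    observed = rng.binomial(1, predicted)    # observed events (simulated)

    # Crude overall calibration: total observed events / total expected events.
    oe_ratio = observed.sum() / predicted.sum()

    # Hosmer-Lemeshow: sort by predicted risk, split into 10 groups, compare
    # observed and expected events within each group, and sum the chi-square terms.
    order = np.argsort(predicted)
    groups = np.array_split(order, 10)
    hl_stat = 0.0
    for g in groups:
        obs = observed[g].sum()
        exp = predicted[g].sum()
        size = len(g)
        hl_stat += (obs - exp) ** 2 / (exp * (1 - exp / size))

    # Degrees of freedom conventionally taken as (number of groups - 2)
    p_value = chi2.sf(hl_stat, df=8)
    print(f"O/E ratio: {oe_ratio:.2f}  HL chi-square: {hl_stat:.1f}  p = {p_value:.3f}")

Because the single O/E ratio pools all patients, deciles that are overpredicted can offset deciles that are underpredicted, which is why the grouped Hosmer-Lemeshow statistic is the more informative of the two.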


Overall, the FRS models performed fairly well in U.S. populations, but performance suffered when they were applied to populations that were substantially different from the source cohort. Although the FRS model was developed from a predominantly white cohort and is not representative of the U.S. population as a whole, performance was reasonable in both white and black patients from the ARIC cohort. In some cases, reduced performance was due to particularly low or high baseline risk in the destination cohort, and in some cases it was due to systematic differences in risk attributable to specific risk factors. In addition, the 2001 ATP-III version demonstrated several benefits compared to the older FRS models, including a focus on hard CHD outcomes, exclusion of patients with diabetes, and incorporation of more current FRS data than the 1991 version. A 2008 CVD model was recently published but has not yet been externally validated.39

Recalibration, and to a greater extent remodeling, demonstrated effectiveness as a means of improving performance in cohorts with substantially different outcome incidence or risk factor prevalence compared to the source cohort. However, questions remain regarding the population sample size necessary to perform these methods and how frequently they should be applied.

Development of risk models for cohorts with risk profiles that are systematically divergent from the general population can also be a successful strategy. However, in many cases, studies taking this approach were more or less remodeling exercises using traditional risk variables in the most common models. Sample size requirements for developing stable risk models are even less clear for these cohorts, and some of these studies had fewer than 1,000 participants. A growing body of literature suggests that specific cohort risk models are likely to be most successful when there are risk factors unique to that population that inform cardiovascular risk.

Even among U.S. cohorts, there was evidence that some ethnically diverse or minority populations had significantly different risk factor contributions to outcomes, even when baseline risk factor prevalence was similar.23, 30 Our review did not exclude studies on the basis of geographic area, but in analyzing the data it became clear that there were systematic differences in risk factor prevalence and outcome event rates between Asian cohorts (which were mostly Chinese or Korean) and North American and European cohorts.121 This makes the use of Asian-derived models in a general U.S. population ill-advised.

Diabetes-specific process measurement variables are significantly related to cardiovascular outcome risk among patients with diabetes, and risk models that incorporate these factors outperformed general risk prediction models when applied to these patients. Analysis also suggests that models excluding patients with diabetes outperformed general risk prediction models that included these patients in their development when applied to non-diabetic cohorts. Unfortunately, external validation of diabetes-specific risk models is lacking, particularly among U.S. cohorts. No U.S. diabetes risk model has been externally validated.

Problems with absolute risk prediction were improved or resolved by recalibration and remodeling methods, supporting the need in this literature for periodic recalibration or remodeling for either general or specific populations. However, empirical evidence for determining what time interval is reasonable or for detecting when a population is “significantly” different from the reference population does not yet exist. Future research in this area should focus on carefully matching outcomes between cohorts and risk models. Additional work in recalibration and remodeling methods is needed, as well as external validation of diabetes-only risk models.

