Abstract
Background: Predictive models are increasingly used in guidelines and informed decision-making interventions. We compared predictions from 2 prominent models for diabetes: the United Kingdom Prospective Diabetes Study (UKPDS) outcomes model and the Archimedes-based Diabetes Personal Health Decisions (PHD) model.
Methods: Ours was a simulation study comparing 10-year and 20-year model predictions for risks of myocardial infarction (MI), stroke, amputation, blindness, and renal failure for representative test cases.
Results: The Diabetes PHD model predicted substantially higher risks of MI and stroke in most cases, particularly for stroke and for 20-year outcomes. In contrast, the UKPDS model predicted risks of amputation and blindness ranging from 2-fold to infinitely higher than the Diabetes PHD model. Predictions for renal failure all differed by more than 2-fold but in a complicated pattern varying by time frame and specific risk factors. Relative to their predictions for white men, the UKPDS model predicted much lower MI and stroke risks for women and Afro-Caribbean men than the Diabetes PHD model did for women and black men. A substantial majority of the Diabetes PHD point estimates fell outside of the UKPDS outcomes model's 95% CIs.
Conclusions: These models produced markedly different predictions. Patients and providers considering risk estimates from such models need to understand their substantial uncertainty and risk of misclassification.
The prevalence of diabetes among the American population has increased dramatically in the last 2 decades.1 Diabetes increases the risks of multiple adverse health outcomes and is frequently accompanied by other cardiovascular risk factors. Historically, diabetes treatment has been very glucocentric,2,3 despite the fact that premature morbidity and mortality in diabetes are more strongly affected by cardiovascular risk factors than glucose control.4–7
If patients are to make informed choices about where to focus their efforts for risk reduction, access to accurate, personal estimates of risk and effects of risk reduction activities is a logical first step. The 2 most readily available models that can be used to predict risks of multiple outcomes for persons with diabetes and model the effects of risk factor modification are the United Kingdom Prospective Diabetes Study (UKPDS) outcomes model8 and the Diabetes Personal Health Decisions (PHD) tool, available from the American Diabetes Association9 and based on the Archimedes Model.10 We report here a comparison of output from these 2 models for a set of test cases that we undertook when trying to decide which model to use in a study of communication with diabetic patients about their multiple risks and risk reduction options.
Methods
We obtained permission from the American Diabetes Association to use the Diabetes PHD website for research purposes, and we received a research license to use the UKPDS outcomes model, version 1.2, from the University of Oxford.
We constructed sets of sample cases to obtain predictions from each model, as outlined in Table 1. The cases for each model were identical to the extent possible given that each model uses some information the other does not. Starting with the male and female base cases, we created additional cases varying one factor at a time from the base case.
The UKPDS outcomes model allows the researcher to specify risk factor values for up to the subsequent 40 years, or the program can model them based on the current values (and project regression toward the mean). Here, we report models in which we forced the values to remain constant for 40 years, which tends to increase estimates of benefits from risk factor control by preventing regression to the mean. However, the differences in the estimates were modest and there was no other way to model quitting smoking without relapse. The Diabetes PHD model allows risk factors to be changed and then immediately displays changes in risk projections; how it models risk factor changes mathematically has not been publicly reported. In the UKPDS outcomes model, probability estimates are generated by incorporating a random number seed to reflect the inherent randomness in the occurrence of any event for an individual. The developers provide an option to run multiple simulations and average the results to obtain more stable estimates, a technique referred to as “Monte Carlo simulation.” We specified 10,000 Monte Carlo simulation loops because this produced stable estimates. The UKPDS outcomes model also will produce 95% CIs around its estimates using a bootstrapping approach based on sampling with replacement (taking multiple random samples from the data, allowing an observation to be sampled as many times as it is randomly chosen, and examining the distribution of predictions obtained to estimate variance around the mean).11 We estimated CIs for the UKPDS outcomes model estimates with 999 bootstraps and evaluated whether the Diabetes PHD estimates fell within these limits as one measure of agreement between the models.
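The two resampling ideas described in this paragraph, averaging many simulated runs and bootstrapping a 95% CI, can be illustrated with a minimal Python sketch. This is not the UKPDS outcomes model's implementation: the constant annual risk, the function names, and the toy 0/1 outcome data are all assumptions made purely for illustration.

```python
import random


def simulate_event(annual_risk, years, rng):
    """Return 1 if a hypothetical event occurs within `years`, else 0."""
    for _ in range(years):
        if rng.random() < annual_risk:
            return 1
    return 0


def monte_carlo_risk(annual_risk, years, loops=10_000, seed=0):
    """Average many simulated individuals to obtain a stable risk estimate."""
    rng = random.Random(seed)
    events = sum(simulate_event(annual_risk, years, rng) for _ in range(loops))
    return events / loops


def bootstrap_ci(outcomes, n_boot=999, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of hypothetical 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and recompute the estimate each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    # With 999 resamples these are the 25th and 975th ordered values.
    lo_idx = max(0, round((alpha / 2) * (n_boot + 1)) - 1)
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * (n_boot + 1)) - 1)
    return means[lo_idx], means[hi_idx]


def within_ci(point_estimate, ci):
    """Check whether another model's point estimate falls inside the CI."""
    lo, hi = ci
    return lo <= point_estimate <= hi
```

As a usage example, `monte_carlo_risk(0.02, 10)` averages 10,000 simulated 10-year histories at an assumed 2% annual risk, and `within_ci(other_estimate, bootstrap_ci(data))` mirrors the agreement check we applied to the Diabetes PHD estimates.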
We tabulated 10- and 20-year outcome probability estimates from each system for myocardial infarction (MI), stroke, limb amputation, blindness in one eye, and renal failure. The Diabetes PHD website produces graphical output, not numerical tables. However, hovering the mouse over a spot on the plots displays probabilities that change with each year when following the curve. We used this approach to obtain 10- and 20-year risk estimates.
Results
Table 2 shows the 10- and 20-year risk estimates obtained from both models. For MI risk, several patterns were apparent. Diabetes PHD produced higher risk estimates for MI than the UKPDS outcomes model, with the differences larger at 20 years than at 10 years. The changes in point estimates for changes in different risk factors were not always parallel; eg, the UKPDS model estimated similar risk reductions for lowering all risk factors other than weight, which it projected as having almost no effect, whereas the Diabetes PHD model estimated the greatest reduction from smoking cessation and estimated weight loss as yielding benefits similar to those of other risk factor reductions, except at 20 years for a white man. The UKPDS model predicted substantially lower MI risks for women than men, whereas the Diabetes PHD male-female differences were much more modest. Relative to the white male base case, the UKPDS model predicted a substantially lower MI risk for the Afro-Caribbean man, reflecting an observed finding of the UKPDS trial.12 However, for the black man, the Diabetes PHD model predicted a similar MI risk. The Diabetes PHD point estimates fell outside the UKPDS outcomes model 95% CIs for all cases except the 10-year MI risks for white men who quit smoking or lost weight. The different model predictions are perhaps more easily appreciated in Figure 1, where we plotted the 10- and 20-year MI risk data for the white female cases, including the UKPDS bootstrap 95% CIs. In all but 2 cases, the 10-year Diabetes PHD predictions exceeded the 20-year UKPDS predictions.
For stroke risk, again the Diabetes PHD model always produced higher risk estimates than the UKPDS outcomes model. For all of the female cases and the 20-year estimates for men, the Diabetes PHD estimates were more than twice the UKPDS outcomes model estimates—in some cases, 3 to 4 times larger. Only the 10-year Diabetes PHD risk estimate for a white man who lost weight fell within UKPDS 95% CIs. For stroke, both models agreed that relative risk reduction was greatest for lowering blood pressure, followed closely by smoking cessation and weight loss.
In contrast, the UKPDS outcomes model always predicted at least 2-fold higher risks of amputation and blindness than did the Diabetes PHD model. Absolute risks for amputation and blindness were much smaller than for MI and stroke, ranging from 0% to 6%. Except for the 20-year risk for the white female base case, all of the Diabetes PHD point estimates for amputation in the female cases fell below the UKPDS 95% CIs, whereas the Diabetes PHD estimates were within the UKPDS CIs for a number of the male cases. For blindness, in all cases the Diabetes PHD estimates fell below the UKPDS 95% CIs.
Prediction of renal failure presented a more complex pattern of differences. The UKPDS model predicted higher 10-year rates for men and women than did the Diabetes PHD model, but the highest predicted risk for any case was 3%, whereas the Diabetes PHD model predicted a 0% risk for many cases. Except for lowering blood pressure in the white female case, all of the Diabetes PHD estimates fell below the UKPDS 95% CIs. In contrast, the Diabetes PHD model generally predicted notably higher 20-year rates except in the cases of lowering glycosylated hemoglobin (HbA1c) to 7.0% and losing 44 lb (20 kg), for which it predicted complete or almost complete elimination of the risk of renal failure; the UKPDS outcomes model estimated almost no risk reduction. The Diabetes PHD estimates fell within the UKPDS 95% CIs for the male and female cases with lowered blood pressure and weight loss, plus the female case with lowered HbA1c, all cases for which the Diabetes PHD model predicted greatly reduced risks.
Discussion
We found that these 2 sophisticated, well-known models produced moderately to extremely divergent risk and risk reduction estimates. Qualitatively, the risk estimates were generally consistent with current evidence and guidelines; cardiovascular risks substantially exceeded other risks and received substantial benefit from better cardiovascular risk control. Modeling of some risk factors (eg, weight, HbA1c) and outcomes (renal failure) clearly differed between the 2 models, both qualitatively and quantitatively.
One might argue that qualitative consistency with existing evidence is good enough to justify presenting such predictions to patients, but the spurious precision of such numerical estimates is likely to lead to a false belief in their accuracy. The models offer conflicting suggestions about the benefits of weight reduction. Neither attempts to quantify effects of changes in diet and exercise, 2 of the factors most directly under patients' control, other than that they are reflected in changes in the measured risk factors. Some of the changes, such as the dramatic reductions in the risk of renal failure predicted by the Diabetes PHD model for lowering HbA1c and losing weight, are probably too speculative to present to patients. However, a recent study of general practitioners reported that they felt computerized models producing estimates for multiple outcomes would be helpful13 and, in some settings, use of predictive models is already a standard practice, such as in the US National Cholesterol Education Program's ATP-III guidelines14–16 and the US Preventive Services Task Force's recommendations on the use of aspirin for primary prevention.17 Both of these use the Framingham model even though it has been shown to be a poor estimator of cardiovascular risk in a variety of populations.18–22
Our comparisons of these 2 predictive models do not let us say whether one is better than the other; we have no gold standard providing “correct” probability estimates. We did not seek to map out agreement for a wide range of all factors; we merely hoped to compare the models for a limited set of plausible cases. We started with base cases having suboptimal risk factor control because, if risk factors are all at target levels, one would not use such risk calculators to discuss potential benefits of risk factor reduction with patients. The UKPDS model produces a random number seed when generating projections, and the Diabetes PHD model estimates vary modestly between runs, but varying these did not appreciably affect our findings. We do not know for certain that the 2 models defined MI and stroke exactly the same way, though minor differences in definition could not account for the magnitude of the differences in risk predictions. The creators of the UKPDS outcomes model make clear that one cannot assume it is valid for populations other than those included in the UKPDS trial, particularly different racial/ethnic groups.11 Although this is an issue for comparing predictions for black versus Afro-Caribbean persons or Asian-Indian versus Asian-American persons in the United States, it is not obvious why predictions should differ widely for UK versus US non-Hispanic white men and women, whether in terms of baseline risks or effects of risk reduction activities.
The UKPDS outcomes model is based on observational data from 3642 participants in the UKPDS and has been shown to predict the population incidence of the modeled outcomes in the UKPDS participants accurately.8 The Diabetes PHD is based on a sophisticated model that has shown an excellent ability to predict the aggregate outcomes of a number of large randomized, controlled trials,23 indicating good “calibration-in-the-large”24 for those situations. Both models have evidence of reasonable calibration-in-the-large when predicting MI and stroke and response to lipid lowering with a statin,25 but information about discrimination was not provided in that publication. However, calibration-in-the-large does not mean a model is appropriate for individual predictions. Lemeshow and colleagues26 pointed out some years ago that several models for predicting intensive care unit mortality had very similar discrimination and calibration yet produced quite divergent risk estimates for many individuals, and they suggested that the limitations of the models were too great to be useful for individual decision making. Stern27 recently summarized a number of studies comparing predictive models and found the same issues of poor agreement among equally valid models for individual-level predictions.
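The distinction between aggregate calibration and individual-level agreement can be made concrete with a small Python sketch. The two sets of predictions and the outcome data below are invented for illustration and do not come from either model; the function names are likewise hypothetical.

```python
def calibration_in_the_large(predictions, outcomes):
    """Mean predicted risk minus observed event rate (0 = perfect aggregate agreement)."""
    return sum(predictions) / len(predictions) - sum(outcomes) / len(outcomes)


def max_individual_ratio(preds_a, preds_b):
    """Largest fold difference between two models' predictions for the same person."""
    return max(max(a, b) / min(a, b) for a, b in zip(preds_a, preds_b))


# Hypothetical cohort of 20 people with 3 observed events (15% event rate).
outcomes = [1, 1, 1] + [0] * 17

# Two hypothetical models with the same mean prediction (15%) that assign
# the high and low risks to different individuals.
model_a = [0.05, 0.10, 0.20, 0.25] * 5
model_b = [0.20, 0.25, 0.05, 0.10] * 5

gap_a = calibration_in_the_large(model_a, outcomes)
gap_b = calibration_in_the_large(model_b, outcomes)
worst_ratio = max_individual_ratio(model_a, model_b)
```

In this toy cohort both models match the observed event rate in aggregate, yet their predictions for the same individual differ by up to 4-fold, which is the pattern Lemeshow and colleagues described.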
We have found no published studies reporting external validation of individual-level predictions of either of these models for any outcomes. All studies of generalizability have examined aggregate predictions. The Archimedes diabetes model was developed using data from studies including the UKPDS and specifically used UKPDS data in modeling retinopathy and nephropathy; it has been shown to model the annual proportion of persons in the UKPDS experiencing an MI very accurately, as well as accurately predicting the proportion of UKPDS participants experiencing retinopathy and nephropathy; however, this study provided no data on the accuracy of prediction at the individual level.28 The Archimedes model showed reasonable discrimination in predicting the development of diabetes in one validation study,29 though this is a different outcome than we evaluated here. The Framingham equation and a variety of other cardiovascular risk models have been shown to perform poorly for persons with diabetes,18 and, in general, the performance of the Framingham equation has been found to vary widely among different populations.30 In a prospective cohort study of patients with newly diagnosed type 2 diabetes in the United Kingdom, the UKPDS risk engine, which is less sophisticated than the outcomes model, showed modest discrimination and poor calibration when predicting coronary heart disease (CHD) events.19 This was also the case for Chinese31 and Dutch/German32 cohorts. A study among Australian diabetics found the UKPDS risk engine had good calibration and discrimination for predicting stroke but performed poorly for coronary heart disease, whereas the Framingham equation did poorly for both outcomes.33 Berger et al,34 in a review and comparison of 6 different cardiovascular risk calculators, noted that optimal decision making would require both good risk assessment and good estimates of risks of treatment side-effects, which are often unavailable. Mohan et al35 pointed out the differing cardiovascular risk estimates for 3 test cases in relation to the US Preventive Services Task Force's guideline on the use of aspirin for primary prevention.
A recent review by Sheridan et al36 found evidence for a small reduction in estimated CHD risk from repeated presentation of risk information to patients. However, whether the actual risk estimates or simply repeatedly bringing up CHD risk may lead to this surrogate outcome is not known. For example, a recent study found that presenting spirometric “lung age” to subjects increased smoking cessation, regardless of whether the test revealed impaired function or not.37 Two studies published subsequent to the Sheridan review found no improvement in risk factor control from interventions providing information about personal risks and benefits to diabetic patients.38,39
Conclusions
Our findings and those of others indicating frequently poor performance and agreement by a variety of predictive models raise the question of whether such predictions are accurate enough to justify a prominent role in treatment guidelines and decision making by patients and providers. At the population level, it may be appropriate to suggest a treatment based on some level of risk. At the individual level, such a wide range of uncertainty means that many persons whose estimated risks are near any cutoff will be misclassified. Mathematically, individual-level predictions must have a much greater degree of uncertainty than population means. Others have outlined methods for assessing predictive model performance, including discrimination, calibration, generalizability, and estimation of net benefit.40,41 Until a model has been shown to have sufficient individual-level predictive accuracy for members of the population under consideration, we believe it should be used circumspectly, with the risk projections treated as “ballpark estimates”—perhaps appropriate as one factor considered by patients and providers making informed, patient-centered treatment choices, but not appropriate for guidelines and quality measures to use with specific numerical cutoffs.
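The misclassification concern for persons near a cutoff can be sketched numerically in Python. The sketch assumes, purely for illustration, that a model's estimate of a person's risk is normally distributed around the true risk; the standard deviation, the cutoff values in the usage note, and the function name are hypothetical and are not derived from either model.

```python
from statistics import NormalDist


def misclassification_probability(true_risk, cutoff, estimate_sd):
    """Probability that a noisy risk estimate lands on the opposite side
    of the cutoff from the person's true risk (normal error assumed)."""
    estimate = NormalDist(mu=true_risk, sigma=estimate_sd)
    p_estimate_below = estimate.cdf(cutoff)
    if true_risk >= cutoff:
        # Truly at or above the cutoff but estimated below it.
        return p_estimate_below
    # Truly below the cutoff but estimated at or above it.
    return 1 - p_estimate_below
```

Under these assumptions, calling `misclassification_probability(0.19, 0.20, 0.05)` gives a far larger value than `misclassification_probability(0.30, 0.20, 0.05)`, reflecting that uncertainty matters most for persons whose risk sits close to the threshold.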
Notes
This article was externally peer reviewed.
Funding: Support for this research was provided by Robert Wood Johnson Foundation grant no. 63824.
Prior presentation: This work was presented at the 38th annual meeting of the North American Primary Care Research Group, Seattle, WA, 13–17 November 2010.
Conflict of interest: none declared.
- Received for publication January 28, 2011.
- Revision received March 13, 2011.
- Accepted for publication April 1, 2011.