Type 2 Diabetes Risk Forecasting from EMR Data using Machine Learning
Subramani Mani, MBBS PhD, Yukun Chen, MS, [...], and Joshua Denny, MD MS
Abstract
Objective:
To test the feasibility of using data collected in electronic medical records for development of effective models for diabetes risk forecasting.
Methods:
Using available demographic, clinical, and laboratory parameters for more than two thousand patients from electronic medical records, we applied different machine learning algorithms to assess the risk of developing type 2 diabetes (T2D) six months to one year later.
Results:
We achieved an AUC greater than 0.8 for predicting type 2 diabetes 365 days and 180 days prior to diagnosis of diabetes.
Conclusion:
Diabetes risk forecasting using data from EMRs is innovative and has the potential to automatically identify high-risk populations for early intervention with lifestyle modifications, such as diet and exercise, to prevent or delay the development of T2D. Our study shows that T2D risk forecasting from EMR data is feasible.
1. INTRODUCTION and BACKGROUND
The global estimate of the number of adults with diabetes (DM) is 246 million and is expected to grow to 380 million by 2025 [1]. Diabetes is the fifth leading cause of death by disease in the US [2]. An estimated 24 million US adults aged twenty years or older (11%) have diabetes. An additional 57 million people are estimated to be prediabetic, with impaired fasting glucose (IFG) and/or impaired glucose tolerance (IGT). Five to ten percent of prediabetic patients progress to a diagnosis of type 2 diabetes mellitus (T2D), the most common type of diabetes mellitus, each year [3]. Typically, the diagnosis of T2D is delayed 4–7 years after onset of the disease [4], and most patients have established vascular complications resulting from diabetes by the time their diagnosis is made [3, 5]. Early prevention and intervention strategies are needed to reduce the associated mortality and morbidity. Randomized controlled studies over the last two decades have demonstrated the effectiveness of various prevention strategies, such as dietary modifications, weight reduction, regular physical exercise, and pharmacologic agents, in reducing the development of overt T2D in high-risk individuals by 40–60 percent [3, 6–8]. Figure 1 shows the potential intervention points for diabetes prevention based on a quantification of risk for development of T2D.
Researchers have previously developed risk scores and predictive models for diabetes screening based on population studies [9–12]. However, these risk models may not be readily applicable to patients reporting to a hospital for different types of services. One Netherlands study identified people at risk for undiagnosed T2D by looking for risk factors in the electronic medical records (EMR) data of general practice clinics [13].
The objective of our study is diabetes risk forecasting using available data in electronic medical records to allow future interventions aimed at preventing development of overt T2D. With the increasing adoption of electronic medical records spurred by national incentives such as the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, such risk forecast models may become broadly applicable as an intervention strategy. The forecast models represent real-world patients and may be rapidly integrated within existing EMRs in different hospitals. The current study used EMR-derived patient data to build models for forecasting T2D risk.
2. METHODS
2.1. Dataset
Data were obtained from the Synthetic Derivative (SD), a de-identified copy of the electronic medical record of the Vanderbilt University Medical Center (VUMC), a large tertiary care hospital in Nashville, TN, USA [14]. As the data did not include personal identifiers, the Vanderbilt University Institutional Review Board did not consider this human subjects research and approved the study. We selected T2D cases and controls based on adaptations of algorithms described previously [15]. The algorithm detected patients with T2D from the SD based on diagnosis codes from the International Classification of Diseases, ninth revision (ICD-9), medication records, and laboratory values. The algorithm also performed text searches for abbreviations, acronyms, and common misspellings of the word “diabetes” and for generic and brand names of oral and injectable hypoglycemic agents. Patients with hemoglobin A1c readings above 6.5% or at least two random blood glucose readings over 200 mg/dL were also considered cases. Controls were identified by the absence of positive search findings. All controls were required to have at least two notes or completed problem lists with no positive search results, and at least one laboratory value documenting a serum glucose less than 126 mg/dL. One change made to this algorithm for this study was to allow one glucose level between 126 and 200 mg/dL, to accommodate a postprandial abnormal glucose result in an otherwise healthy patient. Previous evaluations of these algorithms have been published, with positive predictive values of 100% on dual physician review of 50 randomly selected cases and controls [15]. Additionally, an endocrinologist evaluated 15 controls with one abnormal glucose value; all of these were deemed true controls.
Demographic variables, clinical parameters, and lab values considered relevant for T2D prediction were included in the study dataset. The list of variables selected is provided in Table I. Since the study goal was T2D forecasting, only patients who had at least one lab value 365 days prior to the time of inclusion were considered for the study. We randomly divided the population into two equal groups: one for model development and the other for future validation of the models generated. There were 20,675 unique patients in the model development group who had at least one lab value based on a 365-day cutoff. Since most of the patients in our sample did not have entries for many of the selected variables, we excluded patients who had missing values for more than three variables, leaving 3,357 patients, including 228 with T2D. After randomly removing some control patients to ensure that at least 10% of the sample were T2D cases, we had a total of 2,280 patients in our study sample. Table I shows the sample distribution of the whole study dataset. Three datasets, D365, D180, and D0, were created with cutoffs of 365, 180, and 0 days, respectively. For example, D365 contains only the data on each patient that was available 365 days prior to the date the patient was classified as diabetes positive or negative (see Figure 2). The variable A1c had a large number of missing values and was assigned a value of 0 when missing. For the rest of the variables, which had a relatively small percentage of missing values, we implemented Gaussian imputation: the missing values of each variable were imputed with random numbers drawn from a Gaussian distribution parameterized by the mean and standard deviation of the non-missing values of that variable.
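The Gaussian imputation step can be sketched as follows. This is a minimal illustration, not the study's code: it assumes the data are held in a pandas DataFrame, and the column names are hypothetical stand-ins for the study variables.

```python
import numpy as np
import pandas as pd

def gaussian_impute(df, columns, seed=0):
    """Fill missing values in each column with random draws from a Gaussian
    fitted to that column's non-missing values (its mean and std)."""
    rng = np.random.default_rng(seed)
    df = df.copy()
    for col in columns:
        observed = df[col].dropna()
        missing = df[col].isna()
        df.loc[missing, col] = rng.normal(observed.mean(), observed.std(),
                                          missing.sum())
    return df

# Toy data; per the study, A1c is handled separately: missing A1c -> 0.
data = pd.DataFrame({"BMI": [24.5, np.nan, 31.2, 28.0],
                     "TG":  [150.0, 278.0, np.nan, 190.0],
                     "A1c": [5.9, np.nan, 7.1, np.nan]})
data["A1c"] = data["A1c"].fillna(0)
data = gaussian_impute(data, ["BMI", "TG"])
```

One consequence of this design is that imputed values preserve each variable's marginal distribution but ignore correlations between variables, which is a common, simple baseline for low missingness rates.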

2.2. Algorithms
We used a representative set of machine learning and feature selection algorithms for our predictive modeling task. Two linear classifiers (Gaussian Naïve Bayes [16] and Logistic Regression), one sample-based classifier (K-nearest neighbor [16]), two decision tree based classifiers (CART [17] and Random Forests [18]), and one kernel based classifier (Support Vector Machine [19]) were used.
SVMs are considered state-of-the-art machine learning algorithms for classification by the machine learning community [20]. SVMs learn linear decision functions over a high-dimensional feature space induced by a kernel (e.g., a polynomial kernel). While SVMs are a “black box” algorithm, they typically outperform other ML algorithms for classification tasks (e.g., predicting whether a patient has a disease or not). Naïve Bayes and decision tree classifiers have been used in many applications for clinical decision making. Moreover, models generated by decision trees are generally human-understandable, provided the trees are not too complex: a decision tree gives a clear description of how the method arrives at a particular prediction. Decision trees use the formalism of recursive partitioning of the data, splitting on the most predictive variables. CART is a decision tree classifier: for each training set, CART builds a classification tree whose size is chosen based on cross-validation accuracy on that training set, and the accuracy of the chosen tree is then evaluated on the test set. Random forests combine multiple tree predictors for improved prediction accuracy compared with individual trees. Naïve Bayes is a classifier that applies Bayes' rule to compute the conditional probability distribution of the class variable given the other attributes. To simplify computation and reduce time complexity, the Naïve Bayes algorithm assumes that the attributes are conditionally independent of each other given the class. It is a robust classifier and serves as a good accuracy baseline for evaluating other algorithms [21].
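As an illustration of this classifier lineup (not the authors' original implementation), the six algorithm families map onto standard scikit-learn estimators. The settings shown are library defaults, not the parameters tuned in the study:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier      # CART-style tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# The six classifier families used in the study (illustrative defaults).
classifiers = {
    "NB":   GaussianNB(),
    "LR":   LogisticRegression(max_iter=1000),
    "KNN":  KNeighborsClassifier(n_neighbors=5),
    "CART": DecisionTreeClassifier(),
    "RF":   RandomForestClassifier(n_estimators=100),
    # probability=True enables probabilistic output, needed for AUC
    # and for the model-averaging step described later.
    "SVM":  SVC(kernel="rbf", probability=True),
}
```

Each estimator exposes the same `fit`/`predict_proba` interface, which makes it straightforward to run all six through a common evaluation loop.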
A general schema for predictive model building is provided in Figure 3. To increase predictive performance, feature (attribute) selection algorithms are often used to select a subset of the features that are highly predictive of the class. By selecting a small number of the most relevant features (or by eliminating many irrelevant features), one is able to reduce the risk of overfitting the training data and often produce a better overall model. Four state-of-the-art feature selection methods were used in our experiments: HITON-PC, HITON-MB [22], Gram-Schmidt orthogonalization (GS) [23, 24], and Random Forests Feature Selection (RFFS) [18].
We implemented a 5-fold nested cross-validation (NCV) framework in which the parameters of the classification algorithm and the method of feature selection (including no feature selection) were optimized in the inner loop of the NCV. Area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value, and negative predictive value were used as performance measures for the predictive models. Folds one through four of the NCV contained 46 cases and 410 controls each, and the fifth fold consisted of 44 cases and 412 controls, all randomly selected.
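A minimal sketch of nested cross-validation with AUC scoring, using synthetic stand-in data and scikit-learn utilities; the study's actual framework also tuned the choice of feature selection method in the inner loop, which is omitted here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(456, 10))           # synthetic stand-in for the EMR features
y = (rng.random(456) < 0.1).astype(int)  # ~10% cases, mirroring the class balance

# Inner loop: tune classifier parameters by AUC; outer loop: estimate AUC.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100]},
                      scoring="roc_auc", cv=inner)
auc_per_fold = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
```

Because the parameter search happens entirely inside each outer training fold, the outer-fold AUC is an unbiased estimate of how the whole tuning procedure generalizes.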
2.3. Combining models
We also selected the best performing models using the nested cross-validation approach on the training data, combined the outputs of the top one through six models, and evaluated the combinations on test data. Combining high performing classifiers is expected to further improve performance and make risk forecasting more robust. To combine classifiers, we averaged the probabilistic outputs of the best k models (k varied from 2 through 6). We also included the best single classifier as the baseline method.
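The averaging scheme can be sketched as follows; this is a simplified illustration, and the array of model probabilities below is a hypothetical toy example rather than study output:

```python
import numpy as np

def combine_top_k(probs, k):
    """Average the predicted T2D probabilities of the k best models.

    probs: array of shape (n_models, n_patients), rows ordered from the
    best-performing model (row 0) downward, as ranked on training data.
    """
    return probs[:k].mean(axis=0)

# Toy example: three models' predicted probabilities for four patients.
probs = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.8, 0.3, 0.5, 0.2],
                  [0.7, 0.1, 0.4, 0.3]])
cm1 = combine_top_k(probs, 1)  # best single model (the baseline, CM1)
cm3 = combine_top_k(probs, 3)  # average of the top three models (CM3)
```

Averaging probabilities in this way is a simple form of ensembling: it tends to reduce the variance of any single model's estimates at the cost of a more complex overall predictor.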
3. RESULTS
Table II provides the average AUC scores, sensitivity, and specificity based on 5-fold cross-validation (CV) for the various algorithms using the 365-day, 180-day, and 0-day cutoffs. Random forests (RF) had the best AUC for all three cutoffs. RF had the highest sensitivity with the 365-day and 0-day cutoffs, while Naïve Bayes (NB) had the best sensitivity with the 180-day cutoff. CART had the best specificity with the 0-day cutoff, while RF had the best specificity with the 365-day and 180-day cutoffs.
Table III provides positive and negative predictive values for the three datasets. CART had the best positive predictive value (PPV) with the 0-day cutoff, while RF had the best PPV with the 365-day cutoff. Both NB and RF had the highest PPV with the 180-day cutoff. RF had the best negative predictive value (NPV) with the 365-day and 0-day cutoffs, while NB had the highest NPV with the 180-day cutoff.
Figure 4 shows a representative decision tree (CART tree) with two internal nodes and three leaf nodes using 180 days cutoff. The root node shows a prior risk of 0.1. It can be seen that conditioned on BMI (body mass index) being greater than or equal to 30.8 the risk for diabetes increases more than two-fold to 0.22. When the BMI is less than 30.8 but TG (triglycerides) level is greater than or equal to 278 the risk for diabetes also increases more than twofold to 0.25. The risk for diabetes is considerably reduced in patients with BMI less than 30.8 and TG levels lower than 278.

Figure 5 shows the same CART model using test data. The prior risk remains the same but the conditional risk based on BMI being greater than or equal to 30.8 increases to 0.29. However, the increase in risk of diabetes with TG values greater than or equal to 278 when BMI is less than 30.8 is smaller when compared with training data.
Figure 6 shows the effect of combining the best performing models one through six. Combination method 1 (CM1) uses only the best performing model, while CM6 denotes the combination of all six models. Using the datasets with the 365-day and 180-day cutoffs, CM2 and CM3 obtained marginal improvements over the best performing single classifier (CM1).
4. DISCUSSION
Diabetes risk forecasting using data from EMRs has the potential to automatically identify high-risk populations for early intervention. Few healthcare systems currently attempt to identify patients at risk for diabetes. Community outreach programs and current risk stratification tools do not scale to the millions of patients expected to develop diabetes in the coming decades. In comparison, risk forecasting using EMR data may identify individuals at risk without the need for resource-intensive screening procedures. Machine learning methods have previously been used to build predictive models with clinical potential from medical data [25–32]. Our study tested the feasibility of using data collected in electronic medical records to develop effective models for diabetes risk forecasting.
The results show that predictive modeling using machine learning algorithms can generate models for diabetes risk assessment from EMR data 180 days and 365 days prior to the diagnosis of T2D. Although the AUC of the models was, in general, reasonable (∼0.8), the positive predictive value (0.24) was less than desired. However, this represents an enrichment of 140% over the prior probability of 10%, and it is balanced by a high negative predictive value (0.95–0.97). The lower PPV is not surprising given the unbalanced nature of the data, with 10% cases and 90% controls, and given that ML algorithms are, in general, optimized to improve overall predictive accuracy. This study demonstrates that predicting future diabetes risk is challenging, even when incorporating known risk factors and advanced algorithms.
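The enrichment figure follows directly from the reported values: relative enrichment over the prior is (PPV − prior) / prior = (0.24 − 0.10) / 0.10 = 1.4, i.e., 140%. As a quick check:

```python
prior = 0.10  # prior probability: 10% of the sample are T2D cases
ppv = 0.24    # positive predictive value of the model

# Relative increase of the post-test probability over the prior.
enrichment = (ppv - prior) / prior
```

Equivalently, a positive prediction raises a patient's probability of T2D from 0.10 to 0.24, a 2.4-fold increase in absolute terms.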
The balance between positive predictive value and negative predictive value in a risk forecasting strategy depends on the intervention and severity of disease. For T2D, a lower positive predictive value is acceptable since the intervention proposed is lifestyle modification resulting in a healthy lifestyle leading to prevention of T2D in at-risk individuals. Since the lifestyle modification is also expected to benefit patients who do not develop T2D, the approach is tolerant to a reasonable number of false positives. This is in contrast to a disease risk prediction with an intervention strategy that can result in harm to patients unlikely to develop that particular disease.
In general, RF had the best overall performance, but because RF aggregates many different decision trees, it is difficult to obtain a human-understandable model from its output, which limits its clinical utility. A decision tree model such as the CART tree shown in Figure 5 is comprehensible to humans and may be preferred by clinicians [33] for risk forecasting. Though the CART model is based on BMI and triglycerides, parameters already known to be associated with T2D, and thus may not be particularly novel, it is likely to be quite useful in our risk forecasting setting: BMI and serum triglycerides are likely to be routinely collected and readily available in most EMRs.
Even though combinations of classifiers (CM2 and CM3 in Figure 6) improved predictive performance marginally, it is not clear how useful such ensemble classifiers are in clinical settings, as the predictive models become more complex.
Risk forecasting can be used to plan prevention and intervention strategies. Using the CART tree in Figure 5, the clinician can decide on lifestyle modification for patients who are above the baseline risk of 0.1. Such a strategy would include all the patients in the leaf nodes with risk levels of 0.286 (n=119) and 0.125 (n=24). The intervention would benefit 37 of the 46 patients who will be diagnosed with T2D after six months. However, it would also involve 106 of the 410 patients who would not have a diagnosis of T2D after six months. This would be considered beneficial and acceptable, since the intervention is lifestyle modification (diet, exercise, or both). A second strategy would be to set a higher risk threshold, in which case only the patients in the first leaf (BMI >= 30.8) would be included in the intervention group. The second strategy would benefit 34 of the 46 patients who will develop T2D but would also subject 85 patients who will not have diabetes after six months to lifestyle modification.
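The counts quoted above are internally consistent, as a quick check shows. The per-leaf case counts are taken or derived from the figures quoted in this section (34 cases in the high-BMI leaf; 37 − 34 = 3 cases in the high-triglycerides leaf):

```python
# Leaf counts from the test-data CART tree (Figure 5): (patients, cases).
leaf_high_bmi = (119, 34)  # BMI >= 30.8
leaf_high_tg = (24, 3)     # BMI < 30.8 and TG >= 278
total_cases = 46

# Strategy 1: intervene on both elevated-risk leaves.
s1_cases = leaf_high_bmi[1] + leaf_high_tg[1]
s1_controls = (leaf_high_bmi[0] - leaf_high_bmi[1]) \
            + (leaf_high_tg[0] - leaf_high_tg[1])

# Strategy 2: intervene only on the BMI >= 30.8 leaf.
s2_cases = leaf_high_bmi[1]
s2_controls = leaf_high_bmi[0] - leaf_high_bmi[1]

s1_sensitivity = s1_cases / total_cases  # ~0.80
s2_sensitivity = s2_cases / total_cases  # ~0.74
```

The first strategy's sensitivity of about 80% matches the forecasting target mentioned later in the discussion, at the cost of 21 additional false positives compared with the second strategy.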
Note that our performance estimates for the T2D forecasting models are conservative. For example, the six month T2D forecast models only consider a T2D diagnosis after six months but not beyond that period. Hence patients included in the intervention group who did not have a diagnosis of T2D after six months might actually develop T2D if followed up beyond the period of six months.
Some of the variables that went into the assignment of the gold standard labels, such as A1c readings and serum glucose, were also used for our predictive modeling. This could introduce some bias into our predictive models. However, as discussed in the methods section, we only used data that was available 180/365 days prior to the date of diagnosis to build the models. None of the values used for gold standard assignment that were obtained within 180/365 days of the date of diagnosis were available for prediction in the D180/D365 datasets. Hence the resulting bias is minimal. As T2D is highly prevalent but underrecognized in hospitalized patients, we need criteria based on glucose measurements to assign gold standard labels with confidence. Since most hospitalized patients get a basic metabolic panel that includes serum glucose, the glucose variable is typically available for prediction.
There are many known risk factors that might be useful in diabetes forecasting, such as familial history of diabetes, habitual physical inactivity, history of gestational diabetes, polycystic ovary syndrome, and genetic markers [34]. However, these variables were not readily available at the time of this study. Future inclusion of such variables for risk forecasting is likely to further improve prediction accuracy. We also did not perform temporal reasoning [35–37] in this study for modeling longitudinal data. Our study shows that T2D risk forecasting from EMR data is feasible. To the best of our knowledge, this is the first study to apply predictive modeling using machine learning for T2D risk forecasting from EMR data.
The FORECASTER system aims to forecast the risk of diabetes one year and six months prior to the outpatient/inpatient diagnosis of diabetes, with a target sensitivity of 80% and a target negative predictive value of 90%. Successful completion and clinical deployment of the FORECASTER system for diabetes would stimulate the development of risk forecasting for other chronic conditions, such as dementia, that might benefit from early risk assessment. Effective risk forecasting for diabetes will enable us to conduct randomized controlled trials based on dietary modifications, weight reduction, and regular physical exercise to develop protocols for preventing or delaying the development of diabetes for patients visiting a hospital and, by extension, for the community at large. Since diabetes is a risk factor for many cardiovascular, cerebrovascular, renal, and other systemic diseases, any successful diabetes prevention strategy will also have an impact on the manifestation of these other diseases, thereby significantly reducing mortality and morbidity.
Acknowledgments
We thank Professor Hua Xu for helpful discussions and the anonymous reviewers for insightful comments.