Machine Learning Approaches to Predict Risks of Diabetic Complications and Poor Glycemic Control in Nonadherent Type 2 Diabetes

Purpose: The objective of this study was to evaluate the efficacy of machine learning algorithms in predicting risks of complications and poor glycemic control in nonadherent type 2 diabetes (T2D). Materials and Methods: This was a real-world study of the complications and blood glucose prognosis of nonadherent T2D patients. Data of inpatients in Sichuan Provincial People’s Hospital from January 2010 to December 2015 were collected. The subjects were T2D patients who had neither been monitored for glycosylated hemoglobin A nor changed their hyperglycemia treatment regimens within the last 12 months. Seven types of machine learning algorithms were used to develop 18 prediction models. Predictive performance was mainly assessed using the area under the curve of the testing set. Results: Of 800 T2D patients, 165 (20.6%) met the inclusion criteria, of whom 129 (78.2%) had poor glycemic control (defined as glycosylated hemoglobin A ≥7%). The highest areas under the curve of the testing set for diabetic nephropathy, diabetic peripheral neuropathy, diabetic angiopathy, diabetic eye disease, and glycosylated hemoglobin A were 0.902 ± 0.040, 0.859 ± 0.050, 0.889 ± 0.059, 0.832 ± 0.086, and 0.825 ± 0.092, respectively. Conclusion: Univariate analysis and machine learning methods reached the same conclusion. The duration of T2D and the duration of unadjusted hypoglycemic treatment were the key risk factors for diabetic complications, and the number of hypoglycemic drugs was the key risk factor for glycemic control in nonadherent T2D. This was the first study to use machine learning algorithms to explore the potential adverse outcomes of nonadherent T2D. The performance of the final prediction models was acceptable, and the models outperformed those of most previous studies on most evaluation measures. These models have potential clinical applicability in improving T2D care.


INTRODUCTION
Diabetes mellitus, characterized by persistent hyperglycemia (Li et al., 2020), is a common chronic disease. The prevalence of diabetes in China increased rapidly from 0.67% in 1980 to 10.4% in 2013, which may be attributed to the aging of the population and changes in lifestyle (Jia et al., 2019). Ten percent of global health expenditure (USD 760 billion) is spent on diabetes (International Diabetes Federation, 2019). Type 2 diabetes (T2D) accounts for the majority (90-95%) of individuals with diabetes mellitus (Deshpande et al., 2008;Inaishi and Saisho, 2020). Long-term hyperglycemia may increase the risk of diabetes-related complications, including cardiovascular disease, kidney disease, retinopathy, and neuropathy (Kidanie et al., 2018). T2D and its complications severely affect the quality of life and the finances of individuals and place a heavy economic burden on national health-care systems (Hur et al., 2013;World Health Organization, 2016;Bui et al., 2019;Harding et al., 2019). The prevalence of these complications is generally proportional to the degree of glycemic control and the duration of diabetes (Kidanie et al., 2018). Intensive glucose control in the early stage of T2D can greatly reduce chronic complications of diabetes (Holman et al., 2008;prospective, 1995). Principles and guidelines have been established for glycemic control and the prevention of long-term complications of T2D (International Diabetes Federation, 2019;Jia et al., 2019;American Diabetes Association, 2020). Nevertheless, the effective treatment of T2D depends on high therapy adherence. Adherence to therapy is defined as the extent to which a person's behavior in taking medication, monitoring indicators, and/or following a diet corresponds with agreed recommendations from a health-care provider (García-Pérez et al., 2013).
Adherence to the recommended therapy is associated with better glycemic control, fewer complications, risk reduction, and lower medical costs (Egede et al., 2012;McAdam-Marx et al., 2014;Kennedy-Martin et al., 2017;Ting et al., 2021). Nonadherence to medication is reported to be common among patients (Ting et al., 2018). Adherence to long-term therapy for chronic illnesses averages 50% in developed countries, and the rates are even lower in developing countries (World Health Organisation., 2003). A certain number of patients were found neither to monitor glycemia regularly nor to receive timely treatment intensification (Aujoulat et al., 2014;Reach et al., 2017;Giugliano et al., 2019;Lu et al., 2020). Early identification of potential adverse outcomes due to patient nonadherence should be an urgent priority for individualized treatment of T2D (Zarkogianni et al., 2018;Pallarés-Carratalá et al., 2019). Therefore, it was necessary to establish a prediction model for the prognosis of nonadherent T2D.
Machine learning (ML) is a branch of artificial intelligence whose purpose is to build computer systems that can adapt and learn from their experience (Kavakiotis et al., 2017). ML algorithms are commonly used to build predictive models. They can identify specific clinical variables and learn decision rules from data (Han et al., 2015;Dagliati et al., 2018;Nagaraj et al., 2019). The implementation of ML algorithms can help identify appropriate candidates for further evaluation and avoid cumbersome routine clinical steps (Handelman et al., 2018). Several studies have shown that supervised ML can yield accurate predictions in medical fields (Al'Aref et al., 2020;Meyer et al., 2018;Weiss et al., 2015;Weiss et al., 2012). However, previous studies have only applied statistical or ML models to predict which patients may have poor adherence, and few ML models have been developed to predict the adverse outcomes of nonadherent T2D. In this study, we used data from the local health-care system to predict the potential adverse outcomes of nonadherent T2D.
Therefore, the objective of this work was to develop and evaluate prediction models of diabetic complications and poor glycemic control (defined as hemoglobin A1c (HbA1c) ≥7%) among nonadherent T2D patients based on ML algorithms and to identify the predictors of complications and HbA1c. Finally, it aimed to provide risk prediction tools for clinical practice.

Research Design and Participants
Data in this study were obtained through face-to-face investigation and the Electronic Health Medical Record System (EHRS) of Sichuan Provincial People's Hospital. All subjects were inpatients screened according to the following criteria. Patients with T2D, diagnosed according to the World Health Organization (WHO) 1999 criteria, were included. Patients were excluded if they had visited a medical institution within the last 12 months, had adjusted their treatment plan within the last 12 months, did not use chemical drugs for hypoglycemic therapy, had used traditional Chinese medicine, Chinese herbal medicine, or acupuncture to control glycemia within the last 12 months, or had liver and kidney dysfunction. Patients' private information, such as name, home address, and contact number, was hidden during the research. Informed consent was obtained before the investigation.

Univariate Analysis
Univariate analysis of continuous variables was performed using t-tests, analysis of variance, or the Wilcoxon signed-rank test. Categorical variables were analyzed using the chi-square test or Fisher's exact test. A p-value below 0.05 was considered statistically significant.
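As an illustration of the categorical comparison described above, the following minimal sketch computes a Pearson chi-square test for a 2×2 table (for example, complication present or absent against a binary risk factor). It is not the SAS procedure used in this study; the function name and the counts in the usage lines are hypothetical, and no continuity correction is applied.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts, for illustration only (not data from this study).
stat, p_value = chi_square_2x2(30, 10, 20, 40)
significant = p_value < 0.05
```

When expected cell counts are small, Fisher's exact test is the usual alternative, as noted in the text.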

Input Variables
There were 32 input variables identified for this study, including demographic information, laboratory indicators, disease-related characteristics, medication information, and economics.

Outcome Variables
The outcome variables were poor glycemic control and whether complications occur. In this study, HbA1c <7% was considered to be good glycemic control and ≥7% was considered to be poor glycemic control (American Diabetes Association, 2020). The complications analyzed in this study were common chronic complications of T2D.

Variable Screening
Variables were excluded if more than 70% of their values were missing, if a single category accounted for more than 90% of records, or if the number of categories exceeded 95% of the number of records. The minimum coefficient of variation was set to 0.1, and the minimum standard deviation was set to 0. The Pearson method was used to evaluate the correlation between input variables and outcome variables. We set the cutoff value of variable importance to 0.9 (1−α).
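The screening rules above can be sketched as a small quality filter. This is a hedged illustration, not the screening node used in this study: the function name, the use of `None` for missing values, and the reading of the category rule as "distinct categories divided by records" are assumptions, and the Pearson importance step is omitted.

```python
import statistics
from collections import Counter

def screen_variable(values, max_missing=0.70, max_single_category=0.90,
                    max_categories=0.95, min_cv=0.1):
    """Return the reason a variable is dropped, or None if it is kept.
    `values` holds one entry per patient; None marks a missing value."""
    n = len(values)
    present = [v for v in values if v is not None]
    if n == 0 or (n - len(present)) / n > max_missing:
        return "too many missing values"
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in present)
    if numeric:
        mean = statistics.fmean(present)
        sd = statistics.pstdev(present)
        # The coefficient of variation is undefined for a zero mean,
        # so such variables are kept in this sketch.
        if mean != 0 and sd / abs(mean) < min_cv:
            return "too little variation"
        return None
    counts = Counter(present)
    if counts.most_common(1)[0][1] / len(present) > max_single_category:
        return "single category dominates"
    if len(counts) / len(present) > max_categories:
        return "too many categories"
    return None
```

For example, a variable that is missing for 8 of 10 patients is dropped under the 70% rule, while an age-like variable with a coefficient of variation near 0.2 is kept.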

Data Partition
After variable screening, the data were randomly split 8:2 into a training set (80%) and an independent testing set (20%). The models were built on the training set, and the testing set was used only to evaluate performance after the modeling stages. The assignment of records to the training set and the testing set was determined by the random seed value of the partition.
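A seed-controlled 8:2 partition of this kind can be sketched as follows. The function name and the specific seed values are hypothetical; the record count of 165 is the final cohort size reported in this study.

```python
import random

def partition(n_records, seed, train_fraction=0.8):
    """Shuffle record indices with a fixed seed and split them into
    (train_indices, test_indices)."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = round(n_records * train_fraction)
    return indices[:cut], indices[cut:]

# Ten hypothetical seeds give ten reproducible 8:2 partitions of 165 records.
splits = [partition(165, seed) for seed in range(10)]
```

Reusing a seed reproduces the same grouping exactly, while changing the seed yields a different random grouping of the same records.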

Machine Learning Algorithms
End-to-end models were built to predict outcome variables from the input variables. The data were processed using the following ML algorithms: artificial neural network (ANN), Bayesian network (BN), chi-squared automatic interaction detector (CHAID), classification and regression tree (CRT), quick unbiased efficient statistical tree (QUEST), discriminant analysis (D), and ensemble (XF) models. The XF models combined the outputs of the best three models (assessed by AUC) and generated their predictions by voting.
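The ensemble step can be illustrated with a simple majority vote over the three models with the highest AUC. This is only a sketch under stated assumptions: the exact voting rule of the XF models is not specified in the text, so an unweighted majority vote on binary predictions is assumed, and the function name, AUC values, and predictions below are hypothetical.

```python
def top_three_vote(model_aucs, model_predictions):
    """Select the three models with the highest AUC and return their names
    together with the majority-vote prediction (0 or 1) for every patient."""
    if len(model_aucs) < 3:
        raise ValueError("need at least three candidate models")
    best = sorted(model_aucs, key=model_aucs.get, reverse=True)[:3]
    n_patients = len(model_predictions[best[0]])
    votes = []
    for i in range(n_patients):
        positives = sum(model_predictions[name][i] for name in best)
        votes.append(1 if positives >= 2 else 0)
    return best, votes

# Hypothetical AUCs and binary predictions for four patients.
aucs = {"ANN": 0.81, "BN": 0.85, "CHAID": 0.78, "D": 0.88}
predictions = {
    "ANN": [0, 1, 1, 0],
    "BN": [1, 0, 0, 0],
    "CHAID": [1, 1, 1, 1],
    "D": [1, 0, 1, 0],
}
best_models, ensemble_prediction = top_three_vote(aucs, predictions)
```

With three voters and binary outputs, ties cannot occur, which is one practical reason to combine an odd number of models.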

Model Evaluation
The predictive performance of the final models was assessed by the following performance metrics: area under the receiver operating characteristic curve (AUC), negative predictive value (NPV), positive predictive value (PPV), and accuracy.
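For reference, the four metrics can be computed from testing-set labels, predicted labels, and predicted scores as in the sketch below. The function names and the example values are hypothetical. AUC is computed through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def confusion_metrics(y_true, y_pred):
    """Return PPV, NPV, and accuracy for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    accuracy = (tp + tn) / len(pairs)
    return {"ppv": ppv, "npv": npv, "accuracy": accuracy}

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative cases (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical testing-set labels and scores, thresholded at 0.5.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

PPV, NPV, and accuracy depend on the chosen classification threshold, whereas AUC summarizes ranking quality across all thresholds.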

Variable Importance
We explored the variable importance of each outcome variable derived from the best predictive model among all the tested models. Variable importance reflected the contribution of input variables to the outcome variables in specific models.
IBM SPSS Modeler 18.0 (IBM Corp.) was used to build the models, and SAS 9.21 (SAS Institute Inc.) was used to conduct hypothesis testing.

Research Population
A total of 800 T2D patients were screened by the inclusion and exclusion criteria. We excluded 525 patients who had visited medical institutions in the past year, 49 patients who had hepatic and renal insufficiency, 43 patients who had adjusted their treatment plan and who did not use chemical drugs for hypoglycemic therapy in the last 12 months, and 18 patients who received hypoglycemic treatments other than chemical drugs in the last 12 months. The final cohort consisted of 165 patients (the screening process is shown in Figure 1), including 97 male patients and 68 female patients. Seven types of complications were found in 83 cases: diabetic peripheral neuropathy (DPN), diabetic angiopathy (DA), diabetic nephropathy (DN), diabetic eye disease (DED), diabetic foot (DF), diabetic ketoacidosis (KE), and diabetic skin lesions (DD). Due to the small sample size and data imbalance, KE, DD, and DF were not included in the modeling.

The Results of Univariate Analysis
Tables 1 and 2 list the results of univariate analysis of risk factors for complications and HbA1c in T2D patients, respectively. According to Table 1, the duration of T2D was a significant factor affecting DN (p < 0.0001), DPN (p = 0.0022), DA (p = 0.0015), and DED (p = 0.0082), and the duration of unadjusted hypoglycemic treatment was a risk factor for DN (p < 0.0001), DPN (p < 0.0001), DA (p < 0.0001), DED (p < 0.0001), and KE (p < 0.0284). Genetic history of diabetes was a risk factor for DPN (p = 0.037) and DO (p = 0.0189). According to Table 2, the number of hypoglycemic drugs (p < 0.0233) and the duration of T2D (p < 0.0020) were important factors affecting HbA1c. The percentage of patients with HbA1c under control declined as the duration of unadjusted hypoglycemic therapy increased.

The Results of Variable Screening
Among the total of 32 input variables, 18 were excluded due to the low correlation with the characteristics of the outcome variable, and five were excluded due to data imbalance. There were nine input variables and five outcome variables that were retained for the development of the final models. The input variables were age, duration of diabetes (≥1 year), duration of unadjusted hypoglycemic treatment (≥1 year), number of insulin species, total cost (total

The Results of Model Prediction
Sixteen best-performing algorithms with the highest AUCs were selected for modeling of the four complications, and two best-performing algorithms were selected for HbA1c. Ten independent replicate results were generated for each model by changing the data split of the dataset, which was achieved by modifying the random seed of the "partition" node. A total of 180 models were obtained. The modeling steps of DED are shown in Figure 2, and the ROC curve for the model with the highest AUC of each complication is shown in Figure 3. The PPV, NPV, accuracy, and AUC of the different ML algorithms on the testing set are shown in Table 3.
Among the 18 evaluated models, most performed well. XF performed best among all the predictive models of DN and DA, with AUCs of 0.902 ± 0.040 and 0.889 ± 0.059, respectively. D performed best among all the models of DPN and DED, with AUCs of 0.859 ± 0.050 and 0.832 ± 0.086, respectively. The best model for HbA1c was BN, with the highest AUC of 0.825 ± 0.092.
Figure 4 shows the variable importance for DN, DA, DED, DPN, and HbA1c derived from the best-performing ML algorithms. The three most important variables for complications were the duration of T2D, the duration of unadjusted hypoglycemic treatment, and types of insulin. The three most important variables for HbA1c were the number of hypoglycemic drugs, types of insulin, and total cost. The most important variables for DN, DA, DED, and DPN were age, duration of T2D, types of insulin, and duration of unadjusted hypoglycemic treatment, respectively. A novel predictive variable for T2D was identified in this study: the duration of unadjusted hypoglycemic treatment, that is, the period during which the patient's hypoglycemic treatment regimen remained unchanged and relevant follow-up monitoring was not performed. The probability of complications in T2D patients can therefore be predicted from the duration of the hypoglycemic regimen.
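The "mean ± SD" figures reported above summarize the ten replicate partitions of each model. A minimal sketch of that aggregation is shown below; the function name and the ten AUC values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def summarize_replicates(aucs):
    """Format testing-set AUCs from repeated partitions as 'mean ± SD',
    using the sample standard deviation."""
    mean = statistics.mean(aucs)
    sd = statistics.stdev(aucs)
    return f"{mean:.3f} ± {sd:.3f}"

# Hypothetical AUCs from ten random partitions of one model.
hypothetical_aucs = [0.90, 0.86, 0.94, 0.88, 0.92, 0.85, 0.95, 0.91, 0.89, 0.90]
summary = summarize_replicates(hypothetical_aucs)

# 18 model configurations, each fitted on 10 partitions, give 180 fitted models.
n_models_total = 18 * 10
```

Reporting the spread across partitions makes clear how much of a model's apparent performance depends on which patients happened to fall into the testing set, which matters for a cohort of this size.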

Results and Discussion
This study employed ML algorithms to screen for cases likely to have diabetic complications and poor glycemic control among nonadherent T2D patients and provided potential risk prediction tools for both outcomes. Eighteen models were evaluated, and the risk factors for complications and poor glycemic control were identified, with the most important risk factors being the duration of T2D, the duration of unadjusted hypoglycemic treatment, types of insulin, number of hypoglycemic drugs, and total cost of hypoglycemic therapy. The prediction models we established in this study obtained acceptable performances. According to previous reports, under-monitoring and delay of treatment are major challenges to diabetes management (Ross et al., 2011;Khunti et al., 2013;Khunti et al., 2016). The findings of this study are important because early screening may strengthen glycemic control and reduce the risk of diabetic complications through timely monitoring of glycemia and treatment.
FIGURE 2 | Modeling steps of diabetic eye disease (DED). The "variable screening" node was used for data preprocessing after the "T2D data" were imported. Since the D model can only identify continuous variables, the "variable conversion" node was used to convert categorical variables into continuous variables. The "partition" node was used to divide the dataset into a training set and a testing set randomly by 8:2. Ten partitions were generated for each dataset by modifying the random seed value. Machine learning algorithms of BN, CHAID, ANN, and D were used for modeling after partition. Finally, the ROC curve and confusion matrix of each model were output through the two nodes at the end of the data stream. AUC obtained from the confusion matrix of the testing set was used for model verification. T2D, type 2 diabetes; Part, partition; D, discriminant analysis; BN, Bayesian network; ANN, artificial neural network; CHAID, chi-squared automatic interaction detector.
ML algorithms have been widely used in medical fields recently (Cichosz et al., 2015;Kavakiotis et al., 2017;Contreras and Vehi, 2018;Lan et al., 2019). They are a key technology of big data analysis and provide new ways for clinicians to solve medical problems (Hui et al., 2016). Recent advances in ML algorithms have improved the accuracy of diagnosis and outcome prediction, in some cases even surpassing the performance of clinicians (Beam and Kohane, 2018). ML-based prediction models for classification or prediction of future health states are being developed (Emanuel and Wachter, 2019). A large number of studies have reported prediction models of diabetic complications and glycemia. Hsin-Yi Tsao et al. used data mining techniques to create prediction models of diabetic retinopathy, and their results indicated that insulin therapy and duration of diabetes are the most important risk factors for diabetic retinopathy, which is consistent with this study (Tsao et al., 2018). Compared with previous research, our study also found a new risk factor for DED, the duration of unadjusted hypoglycemic treatment. Konstantia Zarkogianni et al. developed a risk prediction model for T2D cardiovascular complications (Zarkogianni et al., 2018). As with most predictive models, the prediction results are difficult to interpret. In our study, the prediction results were interpretable due to the use of decision trees. Dennis H Murphree et al. built several ML models to predict good HbA1c control (<7.0%) among T2D patients, which showed the potential for applying ML to solve problems in medical fields (Murphree et al., 2018). Consistent with prior research studies (Murphree et al., 2018;Tsao et al., 2018;Zarkogianni et al., 2018;Aminian et al., 2020), the findings of this study showed high AUCs.
Previous studies have explored the characteristics of patients with medication nonadherence from different perspectives. A systematic review analyzed the relationship between medication nonadherence and the health outcomes in the elderly (Walsh et al., 2019) and showed that medication nonadherence may be significantly associated with all-cause hospitalization and mortality in old people (Walsh et al., 2019). Instead of the senior group, the subjects of our study were T2D patients, and the prognosis of T2D we predicted was HbA1c and diabetic complications. In a cross-sectional study, the author explored the main predictors of poor adherence among T2D patients (Demoz et al., 2020). Another study, a previous article by Dr. Wu, assessed multiple ML algorithms and predicted the medication nonadherence risks of patients with T2D. The two articles above were studies on the influencing factors of medication nonadherence, and the predicted outcome is the compliance of patients. Both articles are quite different from our work. In our study, statistical and ML methods were used to predict risk factors of HbA1c and potential complications of nonadherent T2D. The research method and outcomes were quite different. New research ideas were provided for the influencing factors and prediction models of T2D progression.
FIGURE 4 | Feature importance of DN, DA, DED, DPN, and HbA1c derived from machine learning algorithms. Part (A) was the feature importance of diabetic complications and part (B) was the feature importance of HbA1c. Feature importance describes the relative importance of input variables for a single outcome variable in the supervised models. DN, diabetic nephropathy; DPN, diabetic peripheral neuropathy; DA, diabetic angiopathy; DED, diabetic eye disease; HbA1c, glycosylated hemoglobin A.
Frontiers in Pharmacology | www.frontiersin.org | June 2021 | Volume 12 | Article 665951

Study Limitations and Strengths
This study had some strengths. Instead of a generic cohort, a highly specific one, nonadherent T2D patients, was used. This was the first study to use ML models to explore the health outcomes of nonadherent T2D. In addition, the models were internally validated as follows. The raw data were randomly grouped ten times by modifying the seed value of the "partition" node. In this way, independently repeated experiments were conducted, and the bias that may arise from a single random grouping of the dataset was reduced. This method is also better than bootstrapping (Milea et al., 2020), which resamples with replacement and may therefore increase the weight of some data. Moreover, the dataset we used for prediction contained clinical information that has not been studied before, namely the duration of unadjusted hypoglycemic treatment.
However, there were some notable limitations to this study. This was a single-center, small-sample study, and the performance of the final models was not compared with that of established clinical reference tools, which limits the reliability of the validation results. Nevertheless, the influencing factors were also analyzed through conventional statistical calculations, and the results of the univariate analysis were consistent with the prediction models. In the future, a large-scale, prospective, multicenter study is needed for external validation.

CONCLUSION
Among nonadherent T2D patients, the duration of T2D and the duration of unadjusted hypoglycemic treatment were the key risk factors for diabetic complications, and the number of hypoglycemic drugs was the key risk factor for glycemic control. Improving medication compliance and strengthening blood glucose monitoring and control in patients with T2D help to delay the occurrence and development of T2D complications and provide evidence to support the individualized management of T2D. After validation and screening, the final prediction models derived in this study may be clinically useful for patients with T2D and for health-care professionals, including general practitioners and endocrinologists. The findings may provide evidence of potential adverse outcomes based on the current health situation, help to improve the treatment adherence of T2D patients, and reduce the burden on individuals and national health-care systems.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
YF and EL contributed equally to this work and should be considered co-first authors. XW and RT contributed equally to this work and should be considered co-corresponding authors.