Prediction of Suicide-Related Events by Analyzing Electronic Medical Records from PTSD Patients with Bipolar Disorder

Around 800,000 people worldwide die from suicide every year, and suicide is the 10th leading cause of death in the US. A mathematical model that can accurately predict suicide, especially in high-risk populations, would be of great value. Several ML-based models were trained and evaluated using features obtained from electronic medical records (EMRs). The contribution of each feature was calculated to determine how it impacted the model predictions, and the best-performing model was selected for analysis and decomposition. Random forest showed the best performance, with a true positive rate (TPR) and positive predictive value (PPV) greater than 80%. The use of Sertraline, Fentanyl, Aripiprazole, Lamotrigine, and Tramadol was a strong indicator of no suicide-related events (SREs) within one year. The use of Haloperidol, Trazodone, or Citalopram, or a diagnosis of autistic disorder, schizophrenic disorder, or substance use disorder at the time of diagnosis of both PTSD and bipolar disorder, predicted the onset of SREs within one year. Additional features with potential protective or hazardous effects for SREs were identified by the model. We constructed an ML-based model that successfully identified patients in a subpopulation at high risk for SREs within one year of the diagnosis of both PTSD and bipolar disorder. The model also provides feature decompositions to guide mechanism studies. The validation of this model with additional EMR datasets will be of great value for resource allocation and clinical decision making.


Introduction
Approximately 800,000 people worldwide die from suicide every year [1]. Suicide is the 10th leading cause of death in the United States, with 48,000 deaths occurring in 2018 [2]. Because the rate

Software and Model Setup
The analysis algorithm was written in the Python programming language in a Jupyter notebook [43]. The ML-based models and calibration curves were developed by using scikit-learn 0.20.0 [44]. The key Python libraries used in this analysis were SciPy [45], NumPy [46] and Pandas [47].
Several different ML-based classifiers were tested, including logistic regression [48], random forest [49], decision tree [50], K-nearest neighbors [51], Naïve Bayes [52] and support vector machine [53]. All models were set to a random state of 42 to ensure reproducibility, while the other hyper-parameters were left at default settings; the random state seeds the random number generator used by the models. For the final random forest model, we set the number of estimators to 100 and the maximum number of features considered at each split to the square root of the number of features.
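The configuration described above can be sketched with scikit-learn as follows; the variable names here are illustrative, not the authors' code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

SEED = 42  # fixed random state for reproducibility

# Candidate classifiers; hyper-parameters left at their defaults except the seed.
candidates = {
    "logistic_regression": LogisticRegression(random_state=SEED),
    "decision_tree": DecisionTreeClassifier(random_state=SEED),
    "knn": KNeighborsClassifier(),  # deterministic, no random state needed
    "naive_bayes": GaussianNB(),    # deterministic, no random state needed
    "svm": SVC(random_state=SEED, probability=True),
}

# Final random forest configuration described in the text:
# 100 trees, max_features = sqrt(number of features).
final_model = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    random_state=SEED,
)
```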
ML-based models frequently encounter heavily imbalanced datasets (the number of samples in the different classes is distributed unevenly), which affects their learning phases and subsequent predictions. An over-sampling procedure based on the Synthetic Minority Oversampling Technique (SMOTE) [54] was therefore performed prior to conducting the analysis. SMOTE creates new samples by interpolating between existing minority-class samples [54]. The dataset was split randomly into training and test sets in a 4:1 ratio, and only the training set was oversampled with SMOTE, so that the test set contained only original subjects from the dataset.
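The core of the SMOTE idea (the paper itself uses the imbalanced-learn package) can be sketched with NumPy alone: for each synthetic sample, pick a minority-class point, pick one of its k nearest minority-class neighbours, and interpolate between the two.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style sketch: create n_new synthetic minority samples by
    interpolating between each sample and one of its k nearest minority-class
    neighbours. Illustrative only; the study used the SMOTE implementation in
    imbalanced-learn."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise distances within the minority class; exclude self-distances.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    k = min(k, n - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                  # pick a minority sample
        j = neighbours[i][rng.integers(k)]   # pick one of its neighbours
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

As in the study, such oversampling should be applied only to the training split, so the test set keeps original subjects.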
Many socioeconomic factors have been reported to play important roles in suicide prediction [55]. However, only data from the EMR were used as the predictors (variables, or features) for modeling: (a) demographic data, including gender and age at BDT; (b) number of emergency department (ED) visits and diagnoses within one year prior to the BDT; (c) medication usage within one year prior to the BDT, including medication orders, dispenses, and fills. Medication usage data were coded by whether patients had taken these medications within one year prior to their BDT.
Predictor or variable importance was calculated to assess key factors in SRE prediction. In the random forest algorithm, predictor importance was quantified by evaluating the decrease in "node impurity" at each split across all decision trees in the forest [56]. In the simplest case, node impurity can be considered a measure of how mixed the outcome classes are among the samples reaching a node. The random forest module aggregates these impurity decreases across trees; the splits that produce the largest decreases contribute most to separating the outcome classes and therefore have the greatest impact on the model's ability to predict outcomes.
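The impurity decrease at a single split can be made concrete with a short sketch (the study relied on scikit-learn's built-in computation; this pure-Python version is for illustration only).

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_decrease(parent, left, right):
    """Weighted decrease in Gini impurity produced by one split. A feature's
    random forest importance sums this quantity over every split that uses
    the feature, weighted by the fraction of samples reaching the node."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A split that perfectly separates the classes removes all impurity:
print(impurity_decrease([1, 1, 0, 0], [1, 1], [0, 0]))  # 0.5
```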
Since patients with SREs are a minority class in our dataset, model performance was assessed using the true positive rate (TPR), positive predictive value (PPV), and negative predictive value (NPV), calculated as follows (Equation (1)):

TPR = TP/(TP + FN), PPV = TP/(TP + FP), NPV = TN/(TN + FN)    (1)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. Random forest results were interpreted using the Python package TreeInterpreter 0.2.2 (https://github.com/andosa/treeinterpreter), which allowed (a) the decomposition of each prediction into the training set mean plus per-feature contribution components and (b) the identification of the features driving the difference from that mean and their contributions. In the model, every feature makes a positive or negative contribution to the prediction for an instance. If a feature's contribution was positive (toward SRE), it was scored as 1; if negative (toward no SRE), it was scored as 0.
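These three metrics follow directly from the confusion-matrix counts; a minimal helper (illustrative names, not the authors' code) makes the definitions explicit.

```python
def classification_metrics(tp, fp, tn, fn):
    """TPR, PPV and NPV from confusion-matrix counts (Equation (1))."""
    return {
        "TPR": tp / (tp + fn),  # true positive rate (sensitivity / recall)
        "PPV": tp / (tp + fp),  # positive predictive value (precision)
        "NPV": tn / (tn + fn),  # negative predictive value
    }

# Hypothetical counts for illustration:
m = classification_metrics(tp=80, fp=20, tn=90, fn=10)
print(m)  # {'TPR': 0.888..., 'PPV': 0.8, 'NPV': 0.9}
```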

Model Construction and Performance
A total of 6042 patients with PTSD and bipolar disorder were identified from the EMR system by ICD9 and ICD10 codes (Appendix A). Of this population, 4138 had no record of an SRE before BDT. Among these 4138 patients, 205 were identified as having SREs within one year after BDT, while 3933 did not have SREs in the same time period. Patients with less than one year of follow-up and no reported SRE (n = 970) were excluded from this study, leaving 2963 subjects without SREs and 205 with SREs (3168 in total). These subjects were oversampled into a balanced dataset by SMOTE as described above. After resampling and splitting, the training dataset contained 4726 subjects, 2363 marked as 1 and 2363 marked as 0. The inclusion process is described in Figure 1 and the baseline patient characteristics are shown in Table 1. Significant differences in gender, age, and ED visits between patients with and without SREs suggest that these may be contributing variables in this study.
ML-based models were trained and evaluated with the data generated by the resampling procedures. Performances of all the models are shown as means from a 5-fold stratified cross-validation process (Table 2). TPR and PPV were prioritized since the model should be able to identify the high-risk population within the precision constraints relevant to the data. Random forest was superior at retrieving positive cases with fewer false positives and an exceptionally high PPV (Table 2). Random forest achieved an accuracy of 92.4%, an area under the curve (AUC) of 95.6%, an F1 score of 0.879, and an area under the receiver operating characteristic (ROC) curve of 0.820. The random forest model was therefore chosen as the predictive model for the following analysis.
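A 5-fold stratified evaluation of this kind might look as follows in scikit-learn; the synthetic data stand in for the EMR feature matrix, which is not public.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the resampled (balanced) EMR feature matrix.
X, y = make_classification(n_samples=600, n_features=30,
                           weights=[0.5, 0.5], random_state=42)

model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean recall (TPR), precision (PPV) and ROC AUC across the five folds.
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["recall", "precision", "roc_auc"])
print({k: round(float(np.mean(v)), 3)
       for k, v in scores.items() if k.startswith("test_")})
```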

Model Decomposition and Feature Importance Analysis
A decomposition analysis of the decision trees generated by the random forest algorithm was conducted to better understand the contribution of each factor to SRE predictions. All features in the model were examined individually to determine whether they provided positive contributions. Such an approach minimizes the data volume needed to make an accurate prediction and reduces computational expense. Ninety-two features were used in the model, including disease categories 1-12, the seventy-five medications mentioned above, age, gender, and ED visits. Among them, only age and ED visits were continuous variables; all other features were categorical. To find the features that are necessary for the model and to minimize the data requirement, feature importance was calculated using the method implemented in the package: the decrease in node impurity weighted by the probability of reaching that node, where the node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature [57]. Multiple random forest tests, each including the top most important features, were performed to retrain the model and test its performance. The performance of the model improved as the number of high-importance features increased (Figure 2). The performance curves reached a plateau at approximately 30 features and then maintained a performance similar to that of the original model trained with all features. As a result, the 30 most important features (Table 3) were used to train a simplified random forest model. The ROC and precision-recall curves of the simplified model were plotted (Figure 3); it outperforms the no-skill (random) model in both graphs (Figure 3).
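The rank-then-retrain procedure can be sketched as follows; the synthetic data and variable names are illustrative, and 92 features with a top-30 cutoff mirror the counts reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 92-feature EMR matrix.
X, y = make_classification(n_samples=500, n_features=92, n_informative=30,
                           random_state=42)

# Rank features by impurity-based importance from a model on all features.
full = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
order = np.argsort(full.feature_importances_)[::-1]  # most important first

# Retrain a simplified model on the top 30 features only.
top_k = order[:30]
simplified = RandomForestClassifier(n_estimators=100, random_state=42)
simplified.fit(X[:, top_k], y)
print(simplified.score(X[:, top_k], y))
```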
The simplified model yielded an accuracy of 98.3%, an AUC of 95.9% (similar to the original model's performance), an F1 score of 0.868, and an area under the ROC curve of 0.811. The retrained model achieved a high TPR and PPV with the 30 selected features, again similar to the original model's performance (Tables 2 and 4). These results indicate that the random forest model is sensitive to patients who had SREs and can predict SREs correctly. Every feature that impacted the final prediction was processed through the decomposition algorithms from TreeInterpreter. The random forest model was used to predict how each feature could impact the probability of having an SRE within one year after BDT for all 3168 patients in the dataset. Of the 3168 patients, the model correctly predicted SRE status for 3120. Contribution values (negative and positive) of the features to correctly predicted SREs within one year were calculated. The distributions of the two continuous features, age and ED visits, were investigated (Figure 4). The age and ED-visit distributions between positive and negative scores were significantly different (p < 0.001) (Figure 4): younger age and more ED visits are associated with a higher risk of having SREs. The distributions of the 28 categorical features provided insight into how individual features impacted the SREs of individual cases (Figure 5). Generally speaking, a value of 1 tended to make a positive contribution compared to 0 across all features. Specifically, features such as Fentanyl, Aripiprazole, Disease Category 11, Disease Category 2, and Disease Category 6 showed obvious associations between contributing groups and feature values. The value distributions of features differ between positive and negative contributing groups (Figure 4), and these shifts can provide information about the impact a feature may have on SREs.
The differences in the value distributions of features were examined using a chi-square test (Table 5) and as percentages in the positive and negative contributing groups. If a feature has little or no association with the final prediction, the percentage of patients who had taken the medication or had the comorbid disease in the positive and negative contributing groups should be similar to the percentage of 1s in the whole population. If the percentage in the positive or negative contributing group differs significantly from that of the whole population and from the other group, it suggests a possible mechanistic association between this feature and the potential risk for an SRE. For example, 11.6% of the participants had taken Sertraline; they account for 0% of the positive contributing group and 45.9% of the negative contributing group. It can be concluded that taking Sertraline is predictive of no SREs within one year. High-importance features with an obvious separation pattern among the population groups were also identified (Table 3), indicating that the values of these features can greatly impact the final SRE predictions and may inform future mechanism studies. All features except Olanzapine showed a significant difference between their distributions in the positive and negative contributing groups. This is the result we expected, because all the features in Figure 5 were selected through the drop-column feature importance test and were identified as important for the model's predictions. If the value of a certain feature did not produce significant differences in the percentages among the groups, it would likely have no benefit for predicting SREs and would have been dropped in the previous step. The results in Table 5 provide additional support for our feature selection process.
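A chi-square test of this kind compares a 2x2 contingency table of feature value by contributing group; the counts below are hypothetical and merely echo the Sertraline-style separation described above.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = took medication (yes/no),
# columns = positive / negative contributing group.
table = [[  0, 120],   # took the medication
         [150, 180]]   # did not take the medication

chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), p < 0.001)
```

A very small p-value, as here, indicates that the feature's value distribution differs between the two contributing groups.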

Discussion and Conclusions
Prior studies have found that ML-based methods perform better at identifying suicide risk in large patient populations than traditional methods. The accuracies of these studies are reported to be between 0.76 and 0.79, with AUCs generally between 0.80 and 0.90 [15,17,58]. The objective of this study was to find an ML-based method that identifies patients at high risk for SREs: patients diagnosed with both PTSD and bipolar disorder. This study demonstrated an accuracy of 0.92 and an AUC of 0.956 using the random forest method. The random forest model can accurately predict which patients are at higher risk of SREs, as evaluated by TPR and PPV. Sub-populations suffering from certain disorders and taking certain medications can be distinguished from a larger population as having a higher risk for SREs.
Different features make different contributions to the prediction of SREs. These features are mental disorders and drug administration history within one year. As discussed by Sanderson et al. [17], mental health diagnoses were separated into twelve disease categories based on their ICD9 codes (Appendix B). Patients suffering from comorbid diseases at BDT are more likely to have an SRE within a year. These comorbid diseases include: Category 11, autistic disorder-current and disturbance of conduct; Category 3, mood disorders and adjustment disorders; Category 4, other psychotic disorders; and Category 5, acute stress reactions. Several studies have reported that diseases in Category 11 are more likely to trigger SREs [59-62]. Numerous studies have provided evidence that mental disorders have the potential to increase the risk of SREs [63,64]. This evidence is supported by the results of the random forest model presented here.
The exclusion of the unselected features does not mean that they are not useful predictors of SREs. The aim of the feature selection process was to use a sufficient but minimal number of features for the model to achieve optimal prediction results. Optimal results were obtained with 30 features, and the addition of further features did not change the results. The distributions of all categorical features are attached (Appendix C). Impurity-based feature importance can be misleading for high-cardinality features and continuous variables (age and ED visits) [65]. For this reason, the distributions of these two variables were examined first to ensure that their associations with SREs were not the result of algorithmic bias.
Among the medication-usage features, some showed a much higher proportion in the negative contributing group than in either the whole population or the positive contributing group (Fentanyl, Levomilnacipran, Sertraline, Aripiprazole, Tramadol, Lamotrigine, and Fluoxetine). These medications are considered to reduce the risk of SREs within one year in our model. Other investigators have shown similar beneficial effects in clinical trials [66,67]. However, some studies have found that Tramadol, Aripiprazole, and Fentanyl are not associated with risk reduction in SREs; thus, our results may provide support for further investigations. The model also identified several medications that increased the risk of SREs; such medications have been reported to increase the risk of SREs in other studies [68]. Caution must be taken in interpreting the effect of medications on the prediction of SREs, in that the model's results do not account for drugs that may be indicators of comorbidities (e.g., sleep problems) that may alter the risk of SREs.
The results of our study make it possible for clinicians to identify patients who have a higher risk of SREs and provide additional insight into how to reduce this risk through the identified risk factors. Clinicians will be able to adjust medications, replacing drugs that increase the risk of SREs with drugs of the same class that carry less risk, or focus on relieving the symptoms that may contribute most to suicide risk.
The ML-based random forest model provides a basis for clinicians to build similar models for different populations facing different disease risks. Our model is built with open source Python packages and trained on EMR data, which means other researchers can test our model on other clinical samples. Our study can also provide guidance for clinical institutions and other researchers building their own models for other populations.
Unavoidably, there are limitations to this study. (a) The data were collected from hospitals affiliated with UPMC; external validation data were not used, and without external validation the model may be overfit to this population. (b) Most clinicians prefer to treat diseases and disorders with particular combinations of drugs that differ from those used by other clinicians; if such preferences are widespread within a hospital, despite alternative drug choices, this may bias the results among institutions. (c) The high prediction performance of the model may be due to the unique characteristics of the BDT patient subpopulation, and the model may need further adjustment and optimization before being applied to other high-suicide-risk populations or other disease states. (d) Mis-diagnoses and biased prescriptions are two problems that may cause errors in the predictions of SREs. PTSD and bipolar disorder may be mis-diagnosed as other diseases in their early stages, which may bias our model, especially for younger patients; however, identifying mis-diagnoses and biased prescriptions from the EMR is beyond the capability of our model. (e) Although some medications, such as lithium, may not be indicated for SREs, clinicians prescribe them for bipolar disorder to a greater extent because of their known anti-suicidal properties. This may be the situation in many clinical practices.
The ML-based random forest model makes it possible for clinicians to identify subpopulations of patients who have a higher risk of SREs and to gain additional insight into reducing this risk by identifying individual risk factors. Medications that increase the risk of SREs can be substituted with drugs having a lower risk, or treatment can focus on relieving the symptoms that may contribute most to SREs.
Using EMR information, an ML-based random forest model was constructed that predicts, with an accuracy of around 90%, whether a patient will have an SRE within one year of the diagnosis of both bipolar disorder and PTSD. The model extracts the features that contribute to the risk of SREs, which can be further utilized in mechanism studies. The model has great potential as a clinical tool to aid clinicians in identifying high-risk individuals and to better guide patient clinical care.

Acknowledgments: The authors would like to acknowledge Robert Sweet for his valuable suggestions on experimental design and for proofreading. This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided.