Send to

Choose Destination
J Biomed Inform. 2015 Dec;58:28-36. doi: 10.1016/j.jbi.2015.09.005. Epub 2015 Sep 11.

A probabilistic topic model for clinical risk stratification from electronic health records.

Author information

College of Biomedical Engineering and Instrument Science, Zhejiang University, China. Electronic address:
Department of Cardiology, Chinese PLA General Hospital, China.
College of Biomedical Engineering and Instrument Science, Zhejiang University, China.



Risk stratification aims to provide physicians with the accurate assessment of a patient's clinical risk such that an individualized prevention or management strategy can be developed and delivered. Existing risk stratification techniques mainly focus on predicting the overall risk of an individual patient in a supervised manner, and, at the cohort level, often offer little insight beyond a flat score-based segmentation from the labeled clinical dataset. To this end, in this paper, we propose a new approach for risk stratification by exploring a large volume of electronic health records (EHRs) in an unsupervised fashion.


Along this line, this paper proposes a novel probabilistic topic modeling framework called probabilistic risk stratification model (PRSM) based on Latent Dirichlet Allocation (LDA). The proposed PRSM recognizes a patient clinical state as a probabilistic combination of latent sub-profiles, and generates sub-profile-specific risk tiers of patients from their EHRs in a fully unsupervised fashion. The achieved stratification results can be easily recognized as high-, medium- and low-risk, respectively. In addition, we present an extension of PRSM, called weakly supervised PRSM (WS-PRSM) by incorporating minimum prior information into the model, in order to improve the risk stratification accuracy, and to make our models highly portable to risk stratification tasks of various diseases.


We verify the effectiveness of the proposed approach on a clinical dataset containing 3463 coronary heart disease (CHD) patient instances. Both PRSM and WS-PRSM were compared with two established supervised risk stratification algorithms, i.e., logistic regression and support vector machine, and showed the effectiveness of our models in risk stratification of CHD in terms of the Area Under the receiver operating characteristic Curve (AUC) analysis. As well, in comparison with PRSM, WS-PRSM has over 2% performance gain, on the experimental dataset, demonstrating that incorporating risk scoring knowledge as prior information can improve the performance in risk stratification.


Experimental results reveal that our models achieve competitive performance in risk stratification in comparison with existing supervised approaches. In addition, the unsupervised nature of our models makes them highly portable to the risk stratification tasks of various diseases. Moreover, patient sub-profiles and sub-profile-specific risk tiers generated by our models are coherent and informative, and provide significant potential to be explored for the further tasks, such as patient cohort analysis. We hypothesize that the proposed framework can readily meet the demand for risk stratification from a large volume of EHRs in an open-ended fashion.


Electronic health record; Joint patient sub-profile-risk modeling; Latent Dirichlet Allocation; Patient sub-profile; Probabilistic topic model; Risk stratification

[Indexed for MEDLINE]
Free full text

Supplemental Content

Full text links

Icon for Elsevier Science
Loading ...
Support Center