Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes

Abdulrahman Khalifa; Stéphane Meystre

doi:10.1016/j.jbi.2015.08.002

J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S128-S132. doi: 10.1016/j.jbi.2015.08.002. Epub 2015 Aug 28.

Authors

Abdulrahman Khalifa¹, Stéphane Meystre²

Affiliations

¹ Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States. Electronic address: abdulrahman.aal@utah.edu.
² Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States. Electronic address: stephane.meystre@hsc.utah.edu.

Abstract

The 2014 i2b2 natural language processing shared task focused on identifying cardiovascular risk factors, such as high blood pressure, high cholesterol levels, obesity, and smoking status, among other factors, in the health records of diabetic patients. The task also involved detecting medications and the time information associated with the extracted data. This paper presents the development and evaluation of a natural language processing (NLP) application conceived for this i2b2 shared task. For increased efficiency, the application's main components were adapted from two existing NLP tools implemented in the Apache UIMA framework: Textractor (for dictionary-based lookup) and cTAKES (for preprocessing and smoking status detection). The application achieved a micro-averaged F1-measure of 87.5% on the final evaluation test set. Our approach relied mostly on existing tools adapted with minimal changes, and it achieved satisfactory performance with limited development effort.

Keywords: Cardiovascular disease; Clinical narrative; Information extraction; Machine learning; Medical records; Natural language processing; Risk factors; Text mining.
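The abstract reports a micro-averaged F1-measure. As a minimal sketch of how that kind of score is commonly computed, the Python below pools true positives, false positives, and false negatives across all categories before calculating precision and recall. This is not the official i2b2 evaluation script; the function name `micro_f1`, the category labels, and the counts are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-category (tp, fp, fn) counts.

    Counts are summed over all categories first, so frequent
    categories weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the paper.
example_counts = {
    "hypertension": (8, 2, 0),  # (tp, fp, fn)
    "smoker": (2, 0, 2),
}
score = micro_f1(example_counts)
```

A macro-averaged score would instead compute F1 per category and average the results, which gives rare categories the same weight as common ones.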

MeSH terms

Aged
Cardiovascular Diseases / diagnosis
Cardiovascular Diseases / epidemiology*
Cohort Studies
Comorbidity
Computer Security
Confidentiality
Data Mining / methods*
Diabetes Complications / diagnosis
Diabetes Complications / epidemiology*
Electronic Health Records / organization & administration*
Female
Humans
Incidence
Longitudinal Studies
Male
Middle Aged
Narration*
Natural Language Processing*
Pattern Recognition, Automated / methods
Risk Assessment / methods
Utah / epidemiology
Vocabulary, Controlled

Grants and funding

R13 LM011411/LM/NLM NIH HHS/United States