Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes

J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S128-S132. doi: 10.1016/j.jbi.2015.08.002. Epub 2015 Aug 28.

Abstract

The 2014 i2b2 natural language processing shared task focused on identifying cardiovascular risk factors, such as high blood pressure, high cholesterol levels, obesity, and smoking status, in the health records of diabetic patients. In addition, the task involved detecting medications and the time information associated with the extracted data. This paper presents the development and evaluation of a natural language processing (NLP) application designed for this i2b2 shared task. For increased efficiency, the application's main components were adapted from two existing NLP tools implemented in the Apache UIMA framework: Textractor (for dictionary-based lookup) and cTAKES (for preprocessing and smoking status detection). The application achieved a final (micro-averaged) F1-measure of 87.5% on the final evaluation test set. Our approach relied mostly on existing tools adapted with minimal changes and achieved satisfactory performance with limited development effort.
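As an illustration of the micro-averaged F1-measure mentioned above, the following minimal sketch pools true positives, false positives, and false negatives across all categories before computing precision and recall. The function name `micro_f1`, the category labels, and the counts are hypothetical and chosen only for illustration; they are not taken from the paper or from the official i2b2 evaluation script.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-category (tp, fp, fn) counts.

    Micro-averaging sums the counts over all categories first, so
    frequent categories weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (tp, fp, fn) for two risk-factor categories.
example_counts = {
    "hypertension": (8, 2, 2),
    "smoker": (2, 0, 2),
}

if __name__ == "__main__":
    print(f"micro-averaged F1 = {micro_f1(example_counts):.3f}")
```

With these hypothetical counts, the pooled totals are 10 true positives, 2 false positives, and 4 false negatives, so the score equals 2·10 / (2·10 + 2 + 4) = 20/26.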

Keywords: Cardiovascular disease; Clinical narrative; Information extraction; Machine learning; Medical records; Natural language processing; Risk factors; Text mining.

MeSH terms

  • Aged
  • Cardiovascular Diseases / diagnosis
  • Cardiovascular Diseases / epidemiology*
  • Cohort Studies
  • Comorbidity
  • Computer Security
  • Confidentiality
  • Data Mining / methods*
  • Diabetes Complications / diagnosis
  • Diabetes Complications / epidemiology*
  • Electronic Health Records / organization & administration*
  • Female
  • Humans
  • Incidence
  • Longitudinal Studies
  • Male
  • Middle Aged
  • Narration*
  • Natural Language Processing*
  • Pattern Recognition, Automated / methods
  • Risk Assessment / methods
  • Utah / epidemiology
  • Vocabulary, Controlled