J Biomed Inform. 2014 Dec;52:199-211. doi: 10.1016/j.jbi.2014.07.001. Epub 2014 Jul 16.

Limestone: high-throughput candidate phenotype generation via tensor factorization.

Author information

1. Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States. Electronic address: joyceho@utexas.edu.
2. Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States.
3. Scripps Translational Science Institute, Scripps Health, La Jolla, CA 92037, United States.
4. Sutter Health Research, Development, and Dissemination Team, Sutter Health, Walnut Creek, CA 94598, United States.
5. Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, United States; Department of Medicine, Vanderbilt University, Nashville, TN 37232, United States.
6. Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, United States; Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37232, United States.
7. School of Computational Science and Engineering at College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, United States.

Abstract

The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts; however, most of these approaches require labor-intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization. In this paper, we propose Limestone, a nonnegative tensor factorization method to derive phenotype candidates with virtually no human supervision. Limestone represents the data source interactions naturally using tensors (a generalization of matrices). In particular, we investigate the interaction of diagnoses and medications among patients. The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and medications. Using the proposed method, multiple phenotypes can be identified simultaneously from data. We demonstrate the capability of Limestone on a cohort of 31,815 patient records from the Geisinger Health System. The dataset spans 7 years of longitudinal patient records and was initially constructed for a heart failure onset prediction study. Our experiments demonstrate the robustness, stability, and conciseness of Limestone-derived phenotypes. Our results show that using only 40 phenotypes, we can outperform the original 640 features (169 diagnosis categories and 471 medication types) to achieve an area under the receiver operating characteristic curve (AUC) of 0.720 (95% CI 0.715 to 0.725). Moreover, in consultation with a medical expert, we confirmed that 82% of the top 50 candidates automatically extracted by Limestone are clinically meaningful.
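
To make the idea of factoring a patient x diagnosis x medication tensor into nonnegative components concrete, the following is a minimal sketch and not the Limestone implementation. It assumes a generic nonnegative CP (CANDECOMP/PARAFAC) model fitted with least-squares multiplicative updates, which may differ from the objective and solver the paper uses. The function names (nonneg_cp, reconstruct), the toy dimensions, and the synthetic data are all illustrative assumptions. Each column r of the three factor matrices can be read as one candidate phenotype: weights over patients, diagnoses, and medications.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(X, mode):
    """Mode-n unfolding; the remaining axes are flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the 3-way tensor from its CP factor matrices."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def nonneg_cp(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Nonnegative CP via multiplicative updates (least-squares loss).

    This is a generic textbook-style scheme chosen for brevity; it is an
    assumption of this sketch, not the algorithm described in the paper.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(X, mode) @ kr
            # (KR^T KR) equals the Hadamard product of the two Gram matrices.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            denom = factors[mode] @ gram + eps
            factors[mode] *= numer / denom  # keeps every entry nonnegative
    return factors

# Synthetic toy data (hypothetical): 30 patients, 6 diagnoses, 5 medications,
# generated from 2 hidden nonnegative components.
rng = np.random.default_rng(42)
true_factors = [rng.random((30, 2)), rng.random((6, 2)), rng.random((5, 2))]
X = reconstruct(true_factors)

factors = nonneg_cp(X, rank=2)
rel_err = np.linalg.norm(X - reconstruct(factors)) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")

# For each candidate phenotype, list the highest-weighted diagnosis indices.
for r in range(2):
    top_dx = np.argsort(factors[1][:, r])[::-1][:3]
    print(f"phenotype {r}: top diagnosis indices {top_dx.tolist()}")
```

In a setting like the one the abstract describes, an entry of X would hold something like a co-occurrence count for a given patient, diagnosis category, and medication type, and the patient-mode factor could then serve as a low-dimensional feature vector for a downstream classifier.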

KEYWORDS:

Dimensionality reduction; EHR phenotyping; Nonnegative tensor factorization

PMID: 25038555
DOI: 10.1016/j.jbi.2014.07.001
[Indexed for MEDLINE]