Am J Physiol Heart Circ Physiol. 2018 Oct 1;315(4):H910-H924. doi: 10.1152/ajpheart.00175.2018. Epub 2018 May 18.

Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease.

Author information

NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Departments of Physiology, Medicine/Cardiology, and Bioinformatics, David Geffen School of Medicine, University of California, Los Angeles, California.
NIH BD2K Program Centers of Excellence for Big Data Computing-KnowEng Center, Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois.
NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Heart Big Data to Knowledge Center, Department of Computer Science, Scalable Analytics Institute, Henry Samueli School of Engineering and Applied Science, University of California, Los Angeles, California.


Extracellular matrix (ECM) proteins have been shown to play important roles regulating multiple biological processes in an array of organ systems, including the cardiovascular system. Using a novel bioinformatics text-mining tool, we studied six categories of cardiovascular disease (CVD), namely, ischemic heart disease, cardiomyopathies, cerebrovascular accident, congenital heart disease, arrhythmias, and valve disease, anticipating novel ECM protein-disease and protein-protein relationships hidden within vast quantities of textual data. We conducted a phrase-mining analysis, delineating the relationships of 709 ECM proteins with the 6 groups of CVDs reported in 1,099,254 abstracts. The technology pipeline known as Context-Aware Semantic Online Analytical Processing was applied to semantically rank the association of proteins to each CVD and all six CVDs, performing analyses to quantify each protein-disease relationship. We performed principal component analysis and hierarchical clustering of the data, where each protein was visualized as a six-dimensional vector. We found that ECM proteins display variable degrees of association with the six CVDs; certain CVDs share groups of associated proteins, whereas others have divergent protein associations. We identified 82 ECM proteins sharing associations with all 6 CVDs. Our bioinformatics analysis ascribed distinct ECM pathways (via Reactome) from this subset of proteins, namely, insulin-like growth factor regulation and interleukin-4 and interleukin-13 signaling, suggesting their contribution to the pathogenesis of all six CVDs. Finally, we performed hierarchical clustering analysis and identified protein clusters predominantly associated with a targeted CVD; analyses of these proteins revealed unexpected insights underlying the key ECM-related molecular pathogenesis of each CVD, including virus assembly and release in arrhythmias. 
NEW & NOTEWORTHY The present study is the first application of a text-mining algorithm to characterize the relationships of 709 extracellular matrix-related proteins with 6 categories of cardiovascular disease described in 1,099,254 abstracts. Our analysis revealed unexpected extracellular matrix functions, pathways, and molecular relationships implicated in the six cardiovascular diseases.
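
The idea of treating each protein as a six-dimensional vector of disease-association scores and grouping proteins by similarity can be illustrated with a minimal sketch. This is not the authors' pipeline: the protein labels, the scores, the "nonzero score in every category" rule for shared proteins, and the choice of Euclidean distance with average linkage are all assumptions made for illustration.

```python
import math

# Hypothetical data: one six-dimensional score vector per protein, ordered by
# the six CVD categories. Labels and values are invented, not from the study.
CVDS = ["IHD", "CM", "CVA", "CHD", "ARR", "VD"]
SCORES = {
    "protein_A": [0.90, 0.80, 0.10, 0.05, 0.10, 0.20],
    "protein_B": [0.85, 0.75, 0.15, 0.00, 0.05, 0.25],
    "protein_C": [0.05, 0.10, 0.90, 0.10, 0.80, 0.05],
    "protein_D": [0.10, 0.05, 0.85, 0.15, 0.75, 0.10],
    "protein_E": [0.30, 0.30, 0.30, 0.30, 0.30, 0.30],
}
assert all(len(v) == len(CVDS) for v in SCORES.values())


def shared_by_all(scores):
    """Proteins with a nonzero score in every category (assumed criterion)."""
    return sorted(p for p, v in scores.items() if all(x > 0 for x in v))


def euclidean(a, b):
    """Euclidean distance between two score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, scores):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(scores[p], scores[q]) for p in c1 for q in c2]
    return sum(dists) / len(dists)


def agglomerate(scores, n_clusters):
    """Bottom-up clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in sorted(scores)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], scores)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(clusters)


shared = shared_by_all(SCORES)
clusters = agglomerate(SCORES, 3)
print("associated with all six:", shared)
print("clusters:", clusters)
```

With these invented scores, the two proteins weighted toward the first two categories group together, the two weighted toward the third and fifth group together, and the uniformly scored protein remains on its own.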


KEYWORDS: big data; machine learning; relationship discovery; text mining

[Indexed for MEDLINE]
Free PMC Article
