PLoS One. 2019 Feb 13;14(2):e0212112. doi: 10.1371/journal.pone.0212112. eCollection 2019.
Using topic modeling via non-negative matrix factorization to identify relationships between genetic variants and disease phenotypes: A case study of Lipoprotein(a) (LPA).
- 1 Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America.
- 2 Division of Clinical Pharmacology, Vanderbilt University Medical Center, Nashville, TN, United States of America.
- 3 Medical Scientist Training Program, Vanderbilt University School of Medicine, Nashville, TN, United States of America.
- 4 Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States of America.
Abstract
Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most studies have treated diseases as independent variables and suffered from the burden of multiple adjustment due to the large number of genetic variants and disease phenotypes. In this study, we used topic modeling via non-negative matrix factorization (NMF) for identifying associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn patterns from electronic health record data. We chose the single nucleotide polymorphism (SNP) rs10455872 in LPA as the predictor since it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data of 12,759 individuals with electronic health records (EHR) and linked DNA samples at Vanderbilt University Medical Center, we trained a topic model using NMF from 1,853 distinct phenotypes and identified six topics. We tested their associations with rs10455872 in LPA. Topics enriched for CVD and hyperlipidemia had positive correlations with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between LPA and a topic enriched for lung cancer (P < 0.001) which was not previously identified via phenome-wide scanning. We were able to replicate the top finding in a separate dataset. Our results demonstrate the applicability of topic modeling in exploring the relationship between genetic variants and clinical diseases.
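The workflow described in the abstract (factor a patient-by-phenotype matrix into topics, then relate each topic score to a genotype) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the NMF solver uses the standard multiplicative update rule for the Frobenius objective (the abstract does not say which solver was used), the name `nmf` and the array sizes are hypothetical, and Pearson correlation stands in for whatever association test the study applied.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (patients x phenotypes) as W @ H
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps  # patient-by-topic scores
    H = rng.random((k, m)) + eps  # topic-by-phenotype weights
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in data (assumption): 50 patients, 20 phenotype counts.
rng = np.random.default_rng(42)
V = rng.poisson(1.0, size=(50, 20)).astype(float)
# Hypothetical minor-allele counts (0, 1 or 2) for one SNP per patient.
genotype = rng.integers(0, 3, size=50)

n_topics = 6  # the study reports six topics
W, H = nmf(V, k=n_topics)

# Pearson correlation between each topic score and the genotype
# (a simple stand-in for the study's association test).
corr = [float(np.corrcoef(W[:, j], genotype)[0, 1]) for j in range(n_topics)]
print(corr)
```

On real data the rows of `V` would be individuals and the columns phenotype codes; with random inputs like these, the correlations carry no meaning beyond showing the shape of the computation.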
Fig 1. Illustration of topic modeling on EHRs using NMF.
Fig 2. Word clouds for six topics.
The size of the words (phecode) in each cloud indicates the weights of the phenotypes on the topic. Phenotypes with larger-sized words have greater influence on the topic compared to phenotypes with smaller-sized words. For each word cloud, we listed the top 60 words.
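Selecting the words shown in each cloud amounts to ranking phenotypes by their weight on a topic and keeping the largest ones. A minimal sketch, assuming the topic weights are available as a plain list; the function name `top_phenotypes` and the labels are hypothetical placeholders, not real phecodes:

```python
def top_phenotypes(weights, labels, k=60):
    """Return the k (label, weight) pairs with the largest weights."""
    ranked = sorted(zip(labels, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy topic with three placeholder phenotype labels.
toy_weights = [0.1, 0.9, 0.5]
toy_labels = ["phe_A", "phe_B", "phe_C"]
top_two = top_phenotypes(toy_weights, toy_labels, k=2)
print(top_two)
```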
Fig 3. Topic distribution in the cohort.
To visualize the prevalence of each topic in the cohort, we assigned an individual to the topic with the maximum score.
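The assignment rule in this caption is an argmax over each individual's topic scores, followed by a count per topic. A small sketch with made-up scores; the function name `assign_topics` is hypothetical, and ties are broken in favor of the lowest topic index, which is an assumption the caption does not address:

```python
from collections import Counter

def assign_topics(scores):
    """Map each row of topic scores to the index of its maximum score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Made-up scores for four individuals over three topics.
scores = [
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
assignments = assign_topics(scores)
prevalence = Counter(assignments)  # number of individuals per topic
print(assignments, dict(prevalence))
```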
Fig 4. t-SNE plot visualizing the patient clusters in a projected 2D map (perplexity set to 30).
Fig 5. PheWAS results of rs10455872 on 12,759 individuals, adjusted for sex and age.
The authors have declared that no competing interests exist.