Pac Symp Biocomput. 2012:422-33.

Ranking gene-drug relationships in biomedical literature using Latent Dirichlet Allocation.

Author information

Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37203, USA. yonghui.wu@Vanderbilt.Edu


Drug responses vary greatly among individuals because of human genetic variation; the study of this relationship is known as pharmacogenomics (PGx). Much PGx knowledge is embedded in the biomedical literature, and there is growing interest in developing text mining approaches to extract it. In this paper, we present a study that ranks candidate gene-drug relations using the Latent Dirichlet Allocation (LDA) model. Our approach consists of three steps: 1) recognize gene and drug entities in MEDLINE abstracts; 2) extract candidate gene-drug pairs at different levels of co-occurrence (abstract, sentence, and phrase); and 3) rank the candidate pairs using several methods: term frequency, the Chi-square test, Mutual Information (MI), a previously reported Kullback-Leibler (KL) distance based on topics derived from LDA (LDA-KL), and a newly defined probabilistic KL distance based on LDA (LDA-PKL). We systematically evaluated these methods against a gold standard data set of gene-drug relations derived from PharmGKB. The proposed LDA-PKL method achieved a better Mean Average Precision (MAP) than any of the other methods, suggesting that it is promising for ranking and detecting PGx relations.
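The co-occurrence and ranking steps can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes entity recognition has already produced, for each sentence, a set of gene names and a set of drug names, and it scores pairs with pointwise mutual information as a simple stand-in for the MI baseline. The LDA-KL and LDA-PKL scores are not reproduced, because the abstract does not define them. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences):
    """Count gene, drug, and gene-drug pair occurrences per sentence.

    `sentences` is a list of (genes, drugs) tuples, each element a set of
    entity names already recognized in one sentence (an assumption here).
    """
    pair_counts, gene_counts, drug_counts = Counter(), Counter(), Counter()
    for genes, drugs in sentences:
        gene_counts.update(genes)
        drug_counts.update(drugs)
        # Every gene paired with every drug in the same sentence.
        pair_counts.update(product(genes, drugs))
    return pair_counts, gene_counts, drug_counts


def rank_by_pmi(sentences):
    """Rank co-occurring pairs by pointwise mutual information, highest first."""
    n = len(sentences)
    pair_counts, gene_counts, drug_counts = cooccurrence_counts(sentences)
    scores = {
        (g, d): math.log((c * n) / (gene_counts[g] * drug_counts[d]))
        for (g, d), c in pair_counts.items()
    }
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda pair: (-scores[pair], pair))


def average_precision(ranked, gold):
    """Average precision of one ranked list against a set of gold pairs."""
    hits, total = 0, 0.0
    for rank, pair in enumerate(ranked, start=1):
        if pair in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


if __name__ == "__main__":
    # Toy input with made-up entity names, for illustration only.
    toy = [
        ({"G1"}, {"D1"}),
        ({"G1"}, {"D1"}),
        ({"G2"}, {"D2"}),
        ({"G2"}, {"D2"}),
        ({"G1"}, {"D2"}),
    ]
    ranking = rank_by_pmi(toy)
    print(ranking)
    print(average_precision(ranking, {("G1", "D1"), ("G2", "D2")}))
```

MAP is the mean of such average precision values over a set of ranked lists. How the study groups candidate pairs into lists is not stated in the abstract, so that step is left out of the sketch.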

[Indexed for MEDLINE]
Free PMC Article
