Genes (Basel). 2019 Jan 17;10(1). pii: E57. doi: 10.3390/genes10010057.

A Multi-Label Supervised Topic Model Conditioned on Arbitrary Features for Gene Function Prediction.

Author information

1. School of Information, Yunnan Normal University, 650500 Kunming, China. liulinrachel@163.com.
2. Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, 650500 Kunming, China. maitanweng2@163.com.
3. School of Software, Yunnan University, 650091 Kunming, China. maitanweng2@163.com.
4. School of Software, Yunnan University, 650091 Kunming, China. zwei@ynu.edu.cn.

Abstract

With the continuous accumulation of biological data, a growing number of machine learning algorithms have been introduced into gene function prediction, a field of great significance for decoding the secrets of life. Recently, a multi-label supervised topic model named labeled latent Dirichlet allocation (LLDA) was applied to gene function prediction and produced more accurate and more explainable predictions than conventional methods. However, the LLDA model can only use a bag of amino acid words as its classification feature; it does not support other features such as hydrophobicity, which has a profound impact on gene function. To model gene function probabilistically with greater accuracy, we propose a multi-label supervised topic model conditioned on arbitrary features, named Dirichlet multinomial regression LLDA (DMR-LLDA), which introduces multiple types of features into the topic modeling process. Based on the DMR framework, DMR-LLDA places an exponential prior, constructed from weighted features, on the hyperparameters of the gene-topic distribution, so that the effects of the extra features are reflected in the function probability distribution. In five-fold cross-validation experiments on a yeast dataset, DMR-LLDA significantly outperforms the compared model. These experiments demonstrate the effectiveness and potential value of DMR-LLDA for predicting gene function.
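
The exponential prior over the gene-topic hyperparameters can be illustrated with a minimal sketch. It assumes the standard DMR form, in which the Dirichlet hyperparameter for a topic is the exponential of a weighted sum of the gene's features, and it assumes that, as in LLDA, only topics matching the gene's known function labels are allowed. The function names, feature values, and weights below are hypothetical and are not taken from the paper.

```python
import math


def dmr_llda_alpha(features, weights, label_mask):
    """Per-topic Dirichlet hyperparameters for one gene (illustrative only).

    features   : feature values for the gene, e.g. a bias term and a
                 hydrophobicity score (hypothetical choice)
    weights    : one weight vector per topic (function label)
    label_mask : True where the topic is one of the gene's labels
    """
    alphas = []
    for topic_weights, allowed in zip(weights, label_mask):
        if not allowed:
            # Assumed LLDA-style restriction: topics outside the label set
            # get no prior mass.
            alphas.append(0.0)
            continue
        score = sum(x * w for x, w in zip(features, topic_weights))
        # DMR form: alpha = exp(feature vector dotted with topic weights).
        alphas.append(math.exp(score))
    return alphas


def prior_topic_proportions(alphas):
    """Mean of the Dirichlet distribution defined by the hyperparameters."""
    total = sum(alphas)
    return [a / total for a in alphas]


# Hypothetical gene: bias term 1.0 and hydrophobicity score 0.5.
features = [1.0, 0.5]
# Hypothetical weights for three function labels.
weights = [[0.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
# The gene is annotated with the first two labels only.
label_mask = [True, True, False]

alphas = dmr_llda_alpha(features, weights, label_mask)
proportions = prior_topic_proportions(alphas)
```

In this sketch a larger feature weight for a topic raises its hyperparameter and so shifts prior probability toward that function, which is one way to read the statement that extra features affect the function probability distribution.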

KEYWORDS:

Dirichlet-multinomial regression; gene function; multi-label classification; probability distribution; topic model
