Bioinformatics. 2018 Jul 15;34(14):2465-2473. doi: 10.1093/bioinformatics/bty130.

GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank.

You R (1,2), Zhang Z (1,2), Xiong Y (3), Sun F (2,4), Mamitsuka H (5,6), Zhu S (1,2).

Author information

1. School of Computer Science and Shanghai Key Lab of Intelligent Information Processing.
2. Center for Computational System Biology, ISTBI, Fudan University, Shanghai, China.
3. Department of Bioinformatics and Biostatistics, Shanghai Jiaotong University, Shanghai, China.
4. Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, USA.
5. Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto Prefecture, Japan.
6. Department of Computer Science, Aalto University, Helsinki, Finland.

Abstract

Motivation:

Gene Ontology (GO) is widely used to annotate protein functions and to understand their biological roles. Currently, fewer than 1% of the more than 70 million proteins in UniProtKB have experimental GO annotations, which makes automated function prediction (AFP) of proteins strongly necessary. AFP is a hard multilabel classification problem because a single protein can be associated with many GO terms, and the number of terms varies widely between proteins. Most of these proteins have only their sequences available as input, which underlines the importance of sequence-based AFP (SAFP: sequences are the only input). Homology-based SAFP tools are competitive in AFP competitions, but they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins that are already annotated. The vital and challenging problem is therefore how to develop an SAFP method that works well, particularly for difficult proteins.

Methods:

The key to this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs, and to integrate them into a predictor both effectively and efficiently. We propose GOLabeler, which integrates five component classifiers trained on different features, including GO term frequency, sequence alignment, amino acid trigrams, domains and motifs, and biophysical properties, within the framework of learning to rank (LTR), a machine learning paradigm that is especially powerful for multilabel classification.
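The idea of turning a sequence into features and then ordering candidate GO terms by combined component evidence can be illustrated with a minimal Python sketch. This is not the GOLabeler implementation: the function names, the component names, the scores, and the fixed weights are all hypothetical, and a plain weighted sum stands in for the ranking function that an LTR model would learn from training data.

```python
from collections import Counter


def trigram_frequencies(sequence):
    """Relative frequency of each overlapping amino acid trigram in a sequence."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def rank_go_terms(component_scores, weights, top_k=3):
    """Rank candidate GO terms by a weighted sum of component scores.

    component_scores: {component_name: {go_term: score}}
    weights: {component_name: weight}; hypothetical fixed weights, used here
             in place of a learned LTR ranking function.
    """
    combined = Counter()
    for name, scores in component_scores.items():
        for term, score in scores.items():
            combined[term] += weights.get(name, 0.0) * score
    # Sort by descending combined score, breaking ties by GO term ID.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Made-up scores from two hypothetical components for one query protein.
example_scores = {
    "frequency": {"GO:0005515": 0.9, "GO:0003677": 0.2},
    "alignment": {"GO:0003677": 0.8},
}
example_weights = {"frequency": 0.5, "alignment": 1.0}
ranking = rank_go_terms(example_scores, example_weights)
```

In this toy example the term supported by the alignment component ends up ranked first, because its combined score outweighs the frequency-only evidence for the other term. A real LTR approach would learn how to combine the component outputs rather than rely on hand-set weights.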

Results:

Extensive and thorough experiments on large-scale datasets revealed numerous favorable aspects of GOLabeler, including a significant performance advantage over state-of-the-art AFP methods.

Availability and implementation:

http://datamining-iip.fudan.edu.cn/golabeler.

Supplementary information:

Supplementary data are available at Bioinformatics online.

PMID:
29522145
DOI:
10.1093/bioinformatics/bty130
[Indexed for MEDLINE]
