Database (Oxford). 2014 Sep 4;2014. pii: bau088. doi: 10.1093/database/bau088. Print 2014.

Closing the loop: from paper to protein annotation using supervised Gene Ontology classification.

Author information

1
BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland; Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland; and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland. Julien.gobeill@hesge.ch
2
BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland; Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland; and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland.

Abstract

Curating gene function from the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and bioinformatics support is in high demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task on automatic assignment of GO concepts from full text. At that time, results were judged to be far from the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the amount of available curation data has grown massively. In 2013, the BioCreative IV GO task revisited automatic GO assignment. For this challenge, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base in order to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the output of subtask A. For this, we applied GOCat and reached leading results, with up to 65% hierarchical recall in the top 20 returned concepts. Contrary to previous competitions, machine learning this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, the system selects evidence sentences for curation and delivers highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
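The similarity-based inference described in the abstract can be illustrated with a minimal nearest-neighbour sketch. This is not GOCat's actual implementation: the knowledge base below is a handful of invented toy sentences, and the bag-of-words cosine similarity, the function names (vectorize, cosine, predict_go) and the parameters k and top_n are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

# Toy knowledge base of (curated text, GO identifiers).
# The sentences are invented for illustration; they are not real curation data.
KNOWLEDGE_BASE = [
    ("caspase activation triggers apoptosis in neurons", ["GO:0006915"]),
    ("protein promotes apoptosis after dna damage", ["GO:0006915", "GO:0006281"]),
    ("enzyme repairs double strand dna breaks", ["GO:0006281"]),
    ("ribosome subunit required for translation of mrna", ["GO:0006412"]),
]


def vectorize(text):
    """Turn a text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_go(text, k=2, top_n=20):
    """Rank GO concepts by summed similarity over the k most similar curated texts."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), gos) for doc, gos in KNOWLEDGE_BASE),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, gos in scored[:k]:
        for go in gos:
            votes[go] += similarity
    # Highest score first; ties broken by identifier for a stable order.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [go for go, score in ranked if score > 0][:top_n]


if __name__ == "__main__":
    print(predict_go("the kinase induces apoptosis in damaged neurons"))
```

Each neighbour votes for its GO concepts with a weight equal to its similarity to the input, so concepts attached to several similar curated texts rise to the top of the ranking. A real system would rely on a far larger knowledge base and a stronger retrieval model than raw term overlap.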

DATABASE URL:

http://eagl.unige.ch/GOCat4FT/.

PMID: 25190367
PMCID: PMC4154439
DOI: 10.1093/database/bau088
[Indexed for MEDLINE]
Free PMC Article
