Bioinformatics. 2012 Mar 15;28(6):867-75. doi: 10.1093/bioinformatics/bts042. Epub 2012 Jan 27.

Literature mining of host-pathogen interactions: comparing feature-based supervised learning and language-based approaches.

Author information

Department of Computer Science, University of Missouri, Columbia, MO 65211, USA.

Abstract

MOTIVATION:

In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which are often specialized to target a specific disease or host organism. An accurate and efficient method for the automated extraction of HPIs from biomedical literature is crucial for creating a unified repository of HPI data.

RESULTS:

Here, we introduce and compare two new approaches that automatically detect whether the title or abstract of a PubMed publication contains HPI data and extract information about the organisms and proteins involved in the interaction. The first approach is a feature-based supervised learning method using support vector machines (SVMs). The SVM models are trained on features derived from individual sentences. These features include the names of the host and pathogen organisms and their corresponding proteins or genes, keywords describing HPI-specific information, more general protein-protein interaction information, experimental methods and other statistical information. The second, language-based approach employs a link grammar parser combined with semantic patterns derived from the training examples. Both approaches were trained and tested on manually curated HPI data. When compared to a naïve approach based on an existing protein-protein interaction literature mining method, our approaches demonstrated higher accuracy and recall in the classification task. The feature-based approach was the more accurate of the two, achieving 66-73% accuracy, depending on the test protocol.
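The sentence-level feature extraction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabularies, the function names (`tokenize`, `sentence_features`) and the choice of simple term counts are all assumptions made for illustration, since the abstract does not specify the actual dictionaries or feature encoding. Token count stands in for the "other statistical information" category.

```python
import re

# Hypothetical vocabularies, chosen only for illustration.
# The abstract does not list the dictionaries actually used.
HOST_TERMS = {"human", "host", "mouse"}
PATHOGEN_TERMS = {"hiv-1", "salmonella", "pathogen", "virus"}
HPI_KEYWORDS = {"invades", "infects", "hijacks"}
PPI_KEYWORDS = {"binds", "interacts", "phosphorylates"}
METHOD_KEYWORDS = {"two-hybrid", "co-immunoprecipitation", "pull-down"}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens,
    keeping internal hyphens (e.g. 'hiv-1', 'two-hybrid')."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def sentence_features(sentence):
    """Map one sentence to a fixed-length numeric vector:
    counts of host, pathogen, HPI, PPI and method terms,
    followed by the token count as a simple statistical feature."""
    tokens = tokenize(sentence)

    def count(vocab):
        return sum(1 for t in tokens if t in vocab)

    return [
        count(HOST_TERMS),
        count(PATHOGEN_TERMS),
        count(HPI_KEYWORDS),
        count(PPI_KEYWORDS),
        count(METHOD_KEYWORDS),
        len(tokens),
    ]


if __name__ == "__main__":
    example = "HIV-1 Tat binds human cyclin T1 in a two-hybrid assay."
    print(sentence_features(example))
```

Vectors of this kind, paired with manually curated labels, are the sort of input a standard SVM implementation could be trained on; the real feature set is presumably richer than this sketch.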

PMID: 22285561
DOI: 10.1093/bioinformatics/bts042
[Indexed for MEDLINE]
