The value of an in-domain lexicon in genomics QA

Yutaka Sasaki; John McNaught; Sophia Ananiadou

doi:10.1142/s0219720010004513

J Bioinform Comput Biol. 2010 Feb;8(1):147-61. doi: 10.1142/s0219720010004513.

Authors

Yutaka Sasaki¹, John McNaught, Sophia Ananiadou

Affiliation

¹ National Centre for Text Mining, School of Computer Science, University of Manchester, MIB, 131 Princess Street, Manchester M1 7DN, United Kingdom. Yutaka.Sasaki@manchester.ac.uk

PMID: 20183880

Abstract

This paper demonstrates that a large-scale lexicon tailored to the biology domain is effective in improving question analysis for genomics Question Answering (QA). We use the TREC Genomics Track data to evaluate the performance of different question analysis methods. Textual information in biology, and especially in molecular biology, is hard to process because of the huge number of technical terms that rarely appear in general English documents and dictionaries. To support biological text mining, we have developed a domain-specific resource, the BioLexicon. Started from scratch in 2006, this lexicon currently includes more than four million biomedical terms, comprising newly curated terms and terms collected from existing biomedical databases. While conventional genomics QA systems provide query expansion based on thesauri and dictionaries, it is not clear to what extent a biology-oriented lexical resource is effective for question pre-processing in genomics QA. Experiments on the genomics QA data set show that question analysis using the BioLexicon performs slightly better than question analysis using n-grams and the UMLS Specialist Lexicon.

Publication types

Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology
Data Mining / statistics & numerical data
Databases, Genetic / statistics & numerical data
Genomics / statistics & numerical data*
Information Storage and Retrieval / statistics & numerical data