GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain
Authors: Chih-Hsuan Wei, Hung-Yu Kao and Zhiyong Lu (PI)
GNormPlus is an end-to-end system that handles both gene/protein name and identifier detection in biomedical literature, covering gene/protein mentions, family names and domain names. GNormPlus also integrates several advanced text-mining techniques (namely GenNorm, SR4GN, SimConcept, Ab3P and CRF++) for resolving composite gene names. On two public benchmarking datasets, we show that GNormPlus compares favorably with other state-of-the-art methods.
Our proposed approach consists of two main steps: mention recognition and concept normalization. In the mention recognition step, we developed a new module based on CRF++ and combined it with our previous species recognition system, SR4GN, to recognize gene and species names and match them accordingly. In the concept normalization step, we applied our previous system, GenNorm, combined with a composite mention simplification tool (SimConcept) and an abbreviation resolution tool (Ab3P) for optimized performance.
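The two steps can be illustrated with a toy sketch. This is not the GNormPlus implementation: a regular expression stands in for the CRF++ tagger, a hand-written table stands in for Ab3P, a simple split rule stands in for SimConcept, and a dictionary lookup stands in for GenNorm. Species assignment (SR4GN) is omitted. All names, patterns and identifiers below are hypothetical placeholders.

```python
import re

# Hypothetical lexicon: lower-cased gene name -> placeholder identifier
# (these are NOT real database IDs).
LEXICON = {"brca1": "ID_A", "brca2": "ID_B", "tumor protein p53": "ID_C"}

# Hypothetical abbreviation table; a real system would detect
# short-form/long-form pairs in the text itself.
ABBREVIATIONS = {"TP53": "tumor protein p53"}

# Toy stand-in for a trained tagger: capital letters followed by digits,
# optionally followed by "/digits" groups (e.g. "BRCA1/2").
MENTION_PATTERN = re.compile(r"\b[A-Z]{2,}\d+(?:/\d+)*\b")


def recognize_mentions(text):
    """Step 1: return candidate gene mentions found in the text."""
    return [m.group(0) for m in MENTION_PATTERN.finditer(text)]


def simplify_composite(mention):
    """Split a composite mention such as 'BRCA1/2' into 'BRCA1' and 'BRCA2'."""
    match = re.fullmatch(r"([A-Z]+)(\d+(?:/\d+)+)", mention)
    if not match:
        return [mention]
    stem, numbers = match.groups()
    return [stem + number for number in numbers.split("/")]


def normalize(mention):
    """Step 2: expand a known abbreviation, then look the name up in the lexicon."""
    expanded = ABBREVIATIONS.get(mention, mention)
    return LEXICON.get(expanded.lower())  # None if the name is unknown


def tag(text):
    """Run both steps; return (original mention, simplified mention, identifier)."""
    results = []
    for mention in recognize_mentions(text):
        for simple in simplify_composite(mention):
            results.append((mention, simple, normalize(simple)))
    return results


if __name__ == "__main__":
    for row in tag("Mutations in BRCA1/2 and TP53 were studied."):
        print(row)
```

Keeping recognition, simplification and normalization as separate functions mirrors the modular description above, so each stand-in could in principle be swapped for a stronger component without touching the others.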
The first evaluation is a species-specific experiment in which only human genes are considered: GNormPlus was evaluated on the BioCreative II GN test set and compared with several previously reported systems, including our previous system, GenNorm. In the second experiment, we evaluated GNormPlus on multi-species gene normalization using the BioCreative III GN task data set. GNormPlus achieves competitive performance in both evaluations.
| Open source tools | Precision | Recall | F-measure |

| Open source tools | TAP-5 | TAP-10 | TAP-20 | F-measure |
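For reference, precision, recall and F-measure for gene normalization can be computed from per-document sets of gene identifiers. The sketch below assumes micro-averaging over documents and a balanced F-measure; the official BioCreative scoring scripts may differ in detail, and the function name and input layout are illustrative only. TAP-k is not covered here.

```python
def precision_recall_f(gold, predicted):
    """Micro-averaged precision, recall and F-measure.

    gold and predicted map a document ID to a set of gene identifiers.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        gold_ids = gold.get(doc_id, set())
        pred_ids = predicted.get(doc_id, set())
        tp += len(gold_ids & pred_ids)   # identifiers correctly predicted
        fp += len(pred_ids - gold_ids)   # predicted but not in the gold set
        fn += len(gold_ids - pred_ids)   # in the gold set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Made-up toy data, not results from any benchmark.
    gold = {"doc1": {"1", "2"}, "doc2": {"3"}}
    predicted = {"doc1": {"1"}, "doc2": {"3", "4"}}
    print(precision_recall_f(gold, predicted))
```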
- Wei C-H, Kao H-Y, Lu Z. GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain. BioMed Research International, Text Mining for Translational Bioinformatics special issue, Article ID 918710; DOI: dx.doi.org/10.1155/2015/918710 (2015)