PLoS Comput Biol. 2014 Sep 25;10(9):e1003799. doi: 10.1371/journal.pcbi.1003799. eCollection 2014 Sep.

Quantifying the impact and extent of undocumented biomedical synonymy.

Author information

1. Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, United States of America; Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, Illinois, United States of America.
2. Computation Institute, University of Chicago, Chicago, Illinois, United States of America.
3. Computation Institute, University of Chicago, Chicago, Illinois, United States of America; Department of Sociology, University of Chicago, Chicago, Illinois, United States of America.
4. Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, United States of America; Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, Illinois, United States of America; Computation Institute, University of Chicago, Chicago, Illinois, United States of America; Departments of Medicine and Human Genetics, University of Chicago, Chicago, Illinois, United States of America.

Abstract

Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.

PMID: 25255227
PMCID: PMC4177665
DOI: 10.1371/journal.pcbi.1003799
[Indexed for MEDLINE]
Free PMC Article
