J Am Med Inform Assoc. 2019 Mar 1;26(3):211-218. doi: 10.1093/jamia/ocy171.

Spell checker for consumer language (CSpell).

Author information

1. Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.

Abstract

Objective:

Automated understanding of consumer health inquiries might be hindered by misspellings. To detect and correct various types of spelling errors in consumer health questions, we developed a distributable spell-checking tool, CSpell, that handles nonword errors, real-word errors, word boundary infractions, punctuation errors, and combinations of the above.

Methods:

We developed a novel approach of using dual embedding within Word2vec for context-dependent corrections. This technique was used in combination with dictionary-based corrections in a 2-stage ranking system. We also developed various splitters and handlers to correct word boundary infractions. All correction approaches are integrated to handle errors in consumer health questions.
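The 2-stage idea can be illustrated with a minimal sketch. It is not the CSpell implementation: it assumes that stage 1 shortlists dictionary words by edit distance and that stage 2 re-ranks the shortlist by the average dot product between the input embeddings of the context words and the output embedding of each candidate, which is one plausible reading of "dual embedding". The function names, thresholds, and toy vectors below are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def context_score(candidate: str, context: List[str],
                  in_vecs: Dict[str, Vector],
                  out_vecs: Dict[str, Vector]) -> float:
    """Average dot product of context IN vectors with the candidate OUT vector."""
    known = [w for w in context if w in in_vecs]
    if not known or candidate not in out_vecs:
        return 0.0
    return sum(dot(in_vecs[w], out_vecs[candidate]) for w in known) / len(known)


def two_stage_rank(token: str, context: List[str], dictionary: List[str],
                   in_vecs: Dict[str, Vector], out_vecs: Dict[str, Vector],
                   max_dist: int = 2, shortlist: int = 3) -> List[str]:
    """Stage 1: dictionary shortlist by edit distance. Stage 2: context re-ranking."""
    by_distance = sorted((edit_distance(token, w), w) for w in dictionary)
    stage1 = [w for d, w in by_distance if d <= max_dist][:shortlist]
    return sorted(stage1,
                  key=lambda w: context_score(w, context, in_vecs, out_vecs),
                  reverse=True)


# Toy data, invented for illustration only (not trained embeddings).
IN_VECS = {"have": [0.5, 0.5], "type": [1.0, 0.0]}
OUT_VECS = {"diabetes": [0.9, 0.1], "diabetic": [0.2, 0.3]}
DICTIONARY = ["diabetes", "diabetic", "dialysis"]

ranked = two_stage_rank("diabetis", ["have", "type"], DICTIONARY, IN_VECS, OUT_VECS)
print(ranked)
```

In this toy setup both "diabetes" and "diabetic" are one edit away from the misspelling, so the dictionary stage alone cannot separate them; the context stage breaks the tie.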

Results:

Our approach achieves F1 scores of 80.93% for spelling error detection and 69.17% for spelling error correction.
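For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The short sketch below computes it from counts of true positives, false positives, and false negatives; the counts are made up for illustration and are not taken from the CSpell evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), with P = tp/(tp+fp) and R = tp/(tp+fn)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
example_f1 = f1_score(tp=8, fp=2, fn=2)
print(f"{example_f1:.2%}")
```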

Discussion:

The dual-embedding model shows a significant improvement (9.13%) in F1 score compared with the general practice of using cosine similarity with word vectors in Word2vec for context ranking. Our 2-stage ranking system shows a 4.94% improvement in F1 score compared with the best 1-stage ranking system.

Conclusion:

CSpell improves over the state of the art and provides near real-time automatic misspelling detection and correction in consumer health questions. The software and the CSpell test set are available at https://umlslex.nlm.nih.gov/cSpell.

PMID: 30668712
PMCID: PMC6351975 [Available on 2020-01-21]
DOI: 10.1093/jamia/ocy171
