Paraphrasing to improve the performance of Electronic Health Records Question Answering

Sarvesh Soni; Kirk Roberts

AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:626-635. eCollection 2020.

Authors

Sarvesh Soni¹, Kirk Roberts¹

Affiliation

¹ School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston TX, USA.

PMID: 32477685
PMCID: PMC7233085

Abstract

This paper describes a paraphrasing approach to improve the performance of question answering (QA) for electronic health records (EHRs). QA systems for structured EHR data usually rely on semantic parsing, which aims to generate machine-understandable logical forms from free-text questions. Training semantic parsers requires large datasets of question-logical form (QL) pairs, which are labor-intensive to create. Given the scarcity of large QL datasets in the clinical domain, we propose a framework for expanding an existing dataset using paraphrasing. We experiment with different heuristics across multiple sample sizes and iterations to assess the effect of adding paraphrases on semantic parsing. We found that adding paraphrases to an existing dataset based on TERTHRESHOLD scores improves performance in the majority (74%) of the experimental runs. Hence, the proposed paraphrasing-based framework has the potential to improve the performance of QA systems using a limited set of existing QL annotations.

Grants and funding

R00 LM012104/LM/NLM NIH HHS/United States