Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses

Emma Croxford; Yanjun Gao; Brian Patterson; Daniel To; Samuel Tesch; Dmitriy Dligach; Anoop Mayampurath; Matthew M Churpek; Majid Afshar

doi:10.1101/2024.03.20.24304620

medRxiv [Preprint]. 2024 Apr 9:2024.03.20.24304620. doi: 10.1101/2024.03.20.24304620.

Authors

Emma Croxford¹, Yanjun Gao¹, Brian Patterson², Daniel To¹, Samuel Tesch¹, Dmitriy Dligach³, Anoop Mayampurath⁴, Matthew M Churpek¹, Majid Afshar¹

Affiliations

¹ Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison.
² Department of Emergency Medicine, School of Medicine and Public Health, University of Wisconsin Madison.
³ Department of Computer Science, Loyola University Chicago.
⁴ Biostatistics and Medical Informatics, School of Medicine and Public Health, University of Wisconsin Madison.

Abstract

In the evolving landscape of clinical Natural Language Generation (NLG), assessing the quality of abstractive text remains challenging, as existing methods often overlook the complexities of generative tasks. This work aimed to examine the current state of automated evaluation metrics for NLG in healthcare. To establish a robust, well-validated baseline against which to measure the alignment of these metrics, we created a comprehensive human evaluation framework. Using output generated by ChatGPT-3.5-turbo, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric grounded in the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, with a particular focus on refining the SapBERT score for improved assessments.
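The step of correlating human judgments with each automated metric can be illustrated with a minimal sketch. The abstract does not state which correlation statistic was used, so Spearman rank correlation is an assumption here, and the scores and the names `metric_a` and `metric_b` are hypothetical toy values, not results from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one human rating per generated diagnosis,
# and one score per diagnosis from each automated metric.
human_scores = [1, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.35, 0.30, 0.70, 0.90],
    "metric_b": [0.90, 0.70, 0.50, 0.30, 0.10],
}

correlations = {
    name: spearman(human_scores, scores)
    for name, scores in metric_scores.items()
}

for name, rho in correlations.items():
    print(f"{name}: rho = {rho:.2f}")
```

A rank-based statistic is a common choice when human ratings are ordinal, since it depends only on the ordering of scores and not on their scale; a different statistic may suit other rating schemes.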

Publication types

Preprint
