On the limitations of large language models in clinical diagnosis

Objective: Large language models such as GPT-4 have previously been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHRs). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: The performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3–17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even when additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.


Introduction
A recent study 1 reported that the Generative Pretrained Transformer 4 (GPT-4) model performed well in complex differential diagnostic reasoning. The authors evaluated the performance of GPT-4 on 74 case records from the New England Journal of Medicine (NEJM) published in 2021 and 2022 by sending to GPT-4 a standard prompt, followed by the part of the case report that included the case presentation up to but not including the discussant's initial response. The authors found that GPT-4 returned the correct diagnosis as a part of its response in 64% of cases, with the correct diagnosis being at rank 1 in 39% of cases. 1 We examined the influence of linguistic context on the performance of GPT-4 in differential diagnostic reasoning by developing equivalent queries that contained the phenotypic abnormalities described in the original reports without the accompanying narrative text.

Methods
We included NEJM case reports from Case 2-2021 to Case 40-2022 in our study, omitting 5 of 80 case reports that did not describe a diagnostic dilemma. The text representing the initial clinical presentation was taken from the documents by extracting the first discussant's section. We appended this text to the standardized prompt as described in the recent study, 1 and used the OntoGPT 2 tool to query GPT-4. To assess the influence of the narrative context of the case reports, we extracted the clinical abnormalities as Human Phenotype Ontology (HPO) terms. 3 Observed and excluded abnormalities were included in the prompt using a standard template, whereby the clinical features were arranged according to the time point of presentation (Figure 1).
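To make the templating step concrete, the following is a minimal sketch of how a feature-based prompt might be assembled programmatically from extracted terms and submitted to GPT-4. The feature lists, instruction text, and template wording are illustrative assumptions, not the study's exact template, and the query is shown directly against the OpenAI chat API rather than through OntoGPT as used in the study.

```python
from openai import OpenAI  # pip install openai

# Hypothetical extracted terms for one case, grouped by time of presentation.
# In the study these would be HPO concepts mined from the case text; the
# features below are placeholders for illustration only.
case_features = {
    "On presentation": {
        "observed": ["Abdominal pain", "Constipation", "Anemia"],
        "excluded": ["Fever"],
    },
    "During hospitalization": {
        "observed": ["Microcytic anemia"],
        "excluded": [],
    },
}

# Assumed instruction text; the study used the standardized prompt of ref. 1.
INSTRUCTION = (
    "I am running an experiment on a clinicopathological case conference. "
    "Please provide your top differential diagnoses for the following case."
)

def build_feature_prompt(features: dict) -> str:
    """Render observed/excluded clinical features into a templated prompt."""
    lines = [INSTRUCTION, ""]
    for period, groups in features.items():
        lines.append(f"{period}, the patient had the following symptoms:")
        lines.extend(f"- {f}" for f in groups["observed"])
        if groups["excluded"]:
            lines.append("The following symptoms were excluded:")
            lines.extend(f"- {f}" for f in groups["excluded"])
        lines.append("")
    return "\n".join(lines)

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_feature_prompt(case_features)}],
)
print(response.choices[0].message.content)
```

Because the prompt is built entirely from ontology terms rather than the source text, it contains no free-text narrative and, by construction, no protected health information.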

Results
We presented GPT-4 with the original description by the primary discussant (narrative approach), and observed that GPT-4 included the final diagnosis in its differential in 29/75 cases (38.7%; rank 1 in 17.3% of cases; mean rank of 3.4). These results are similar to but not identical to those of the above-mentioned study, perhaps because of the stochasticity of the GPT-4 algorithm or changes to the application subsequent to the original study. We then tested the feature-based approach that includes the major clinical abnormalities without additional narrative text. Here, GPT-4 included the final diagnosis in its differential in 8/75 cases (10.7%; rank 1 in 4.0% of cases; mean rank of 3.9) (Figure 2).
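For clarity about how these summary statistics relate to one another, the sketch below computes them from a list of per-case ranks, assuming the mean rank is averaged only over cases in which the correct diagnosis appeared in the differential (consistent with the figures above). The rank values shown are placeholders, not study data.

```python
# Placeholder per-case ranks: None means the correct diagnosis
# did not appear in the differential produced by GPT-4.
ranks = [1, 3, None, 7, 1, None, 2]  # one entry per case

n_cases = len(ranks)
included = [r for r in ranks if r is not None]

inclusion_rate = len(included) / n_cases                   # diagnosis anywhere in differential
top1_rate = sum(1 for r in included if r == 1) / n_cases   # diagnosis at rank 1
mean_rank = sum(included) / len(included)                  # averaged over included cases only

print(f"included: {inclusion_rate:.1%}, rank 1: {top1_rate:.1%}, mean rank: {mean_rank:.1f}")
```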

Discussion
The potential of large language models (LLMs) such as GPT has been a subject of debate, with some ascribing near-sentient abilities to the models and others claiming that LLMs merely perform "autocomplete on steroids". For the purpose of applying LLMs to the problem of clinical diagnosis, it is important to realize that LLMs generate text based on patterns learned from huge amounts of training texts. 5 LLMs such as GPT-4 do not possess an explicit model of medical domain knowledge and do not perform symbolic, human-like reasoning, but instead perform autocompletion by implicitly learning medical domain knowledge from the data.
We compared the performance of GPT-4 on the original narrative texts and on simplified versions of the cases in which only clinical features representable by HPO terms are presented to GPT-4. The performance on the feature-based queries was substantially worse than that on the narrative queries (Figure 2). We consider the feature-based queries to be a more appropriate test of the performance of GPT-4 in diagnostic tasks, since it is unlikely that the narrative approach can be used in clinical practice. NEJM-style clinical narratives are not readily available for most cases, and EHR data cannot be transmitted across the internet without violating privacy regulations. In contrast, it is straightforward to generate a feature-based list of clinical problems, symptoms, and abnormalities that can be used to generate a prompt for GPT. Currently, GPT-4 is not available for installation within medical centers, and it remains an open question as to whether smaller models, eventually embedding structured information, will demonstrate comparable performance. A possible solution could consist in coupling LLMs with a formal representation of medical knowledge, for example using biomedical knowledge graphs. 6 Future research and algorithmic development are needed to determine the optimal approach to leveraging LLMs for clinical diagnosis.

Figure 1.
Figure 1. GPT-4 prompt templates. (A) An example showing the narrative template provided to GPT-4 (case 38-2021). 4 (B) The corresponding example using the simplified, feature-based template. The features in the original text that were mined to generate the feature-based query are highlighted (observed features in blue and excluded features in red; the phrases introducing the time periods are underlined). The actual diagnosis in this case was lead poisoning. GPT-4 returned the correct diagnosis at rank 11 using the narrative query and did not return any related diagnosis with the feature-based query. In this example, the case presentation is relatively short; in many of the analyzed cases, the presentation had a length of a page or longer.

Figure 2.
Figure 2. Performance of GPT-4 in diagnostic challenges using narrative and feature-based queries. (A) A histogram of scores (0–5) denoting the accuracy of the GPT-4 diagnosis using narrative (blue bars) and extracted features (orange bars) from NEJM case reports. (B) A histogram of the ranks of the correct diagnosis in the differential diagnosis produced by GPT-4 in cases where the score was 4 (nearly the correct diagnosis) or 5 (correct diagnosis), using narrative (blue bars) and extracted features (orange bars) from NEJM case reports. The count of unranked cases is shown for comparison. The results were assigned scores using the same scale as in the previous study. 1 Results were scored independently by three coauthors (P.N.R., D.D., J.T.R.) and disagreements were resolved by consensus. Scores: 5 = the actual diagnosis was suggested in the differential; 4 = the suggestions included something very close, but not exact; 3 = the suggestions included something closely related that might have been helpful; 2 = the suggestions included something related, but unlikely to be helpful; 0 = no suggestions close to the target diagnosis.