Automatic Personalized Impression Generation for PET Reports Using Large Language Models

Purpose: To determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Materials and Methods: Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician’s identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Results: Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman’s ρ correlations (ρ=0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). Conclusion: Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.


Introduction
The radiology report serves as the official interpretation of a radiological examination and is essential for communicating relevant clinical findings amongst reading physicians, the healthcare team, and patients. Compared to other imaging modalities, reports for whole-body PET exams (e.g., skull base to thigh or skull vertex to feet) are notable for their length and complexity (1). In a typical PET report, the findings section details numerous observations about the study and the impression section summarizes the key findings. Given that referring physicians primarily rely on the impression section for clinical decision-making and management (2), it is paramount to ensure its accuracy and completeness. However, creating PET impressions that encapsulate all important findings can be time-consuming and error-prone (3). Large language models (LLMs) have the potential to accelerate this process by automatically drafting impressions based on the findings.
While LLMs have been used previously to summarize radiology findings (3)(4)(5)(6)(7)(8), impression generation for whole-body PET reports has received comparatively little attention. Unlike general chest x-ray reports that comprise 40-70 words (9) in the findings section, whole-body PET reports often have 250-500 words in the findings section and contain observations across multiple anatomical regions with cross-comparison to available anatomic imaging modalities (e.g., CT and MRI) and clinical correlation. This complexity heightens the risk of omissions in the generated impressions. Furthermore, the length of PET impressions can accentuate the unique reporting styles of individual reading physicians, underscoring the need for personalized impression generation. Consequently, adapting LLMs for PET report summarization presents distinct challenges.
Evaluating the performances of LLMs in the task of impression generation is also challenging, given that a single case can have various acceptable impressions. While expert evaluation stands as the gold standard, it is impractical for physicians to exhaustively review outputs of all LLMs to determine the leading model. In recent years, several evaluation metrics designed for general text summarization have been adapted to evaluate summaries of medical documents (10)(11)(12). However, it remains unclear how well these metrics could assess the quality of PET impressions and which of them align most closely with physician judgments.
This study aimed to determine whether LLMs fine-tuned on a large corpus of PET clinical reports can accurately summarize PET findings and generate impressions suitable for clinical use. We benchmarked 30 evaluation metrics against physician preferences and then selected the top-performing LLM. To assess its clinical utility, we conducted an expert reader study, identifying common mistakes made by the LLM. We also investigated the importance of personalizing impressions for reading physicians. Lastly, we evaluated the model's reasoning capability within the nuclear medicine (NM) domain and performed external testing of our fine-tuned LLM.

Dataset Collection
Under a protocol approved by the institutional review board and with a waiver of informed consent, we collected 37,370 retrospective PET reports, dictated by 65 physicians, from our institution between January 2010 and January 2023. Appendix S1 presents the statistics of PET reports in our dataset. All reports were anonymized using NLM-Scrubber (13). Of the 37,370 PET reports, 4,000 were randomly selected for internal testing, 2,000 were used for validation, and the remaining 31,370 reports were used for training. For external testing, we retrieved 100 whole-body PET/CT reports, dictated by 62 physicians, from the Children's Oncology Group AHOD1331 Hodgkin lymphoma clinical trial (ClinicalTrials.gov number, NCT02166463) (14). There is no overlap between physicians in the internal and external sets.

Report Preprocessing
In this work, we investigated both encoder-decoder and decoder-only language models. Considering their different architectures, we customized input templates as illustrated in Figure 1. For encoder-decoder models, the first lines describe the categories of PET scans, while the second lines encode each reading physician's identity using an identifier token (details in Appendix S2). The "Findings" section contains the clinical findings from the PET reports, whereas the "Indications" section encompasses relevant background information, including the patient's medical history and the reason for the examination. For decoder-only models, we employed the instruction-tuning method (15) and adapted the prompt from (16). Each case starts with the instruction: "Derive the impression from the given [description] report for [physician]." The PET findings and background information are concatenated to form the "Input" section. The original clinical impressions are used as the reference for model training and evaluation.
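As a concrete illustration, the two templates can be assembled with simple string operations. The exact token spellings below (e.g., a `[PHYSICIAN_k]` identifier token) are hypothetical stand-ins for the identifiers described in Appendix S2; this is a sketch of the format, not the paper's exact implementation:

```python
def build_encoder_decoder_input(exam_category: str, physician_id: int,
                                findings: str, indications: str) -> str:
    """Assemble the encoder-decoder input: category line, physician
    identifier token, then the Findings and Indications sections."""
    return (
        f"{exam_category}\n"
        f"[PHYSICIAN_{physician_id}]\n"          # identifier token (illustrative spelling)
        f"Findings: {findings}\n"
        f"Indications: {indications}"
    )

def build_decoder_only_input(description: str, physician: str,
                             findings: str, indications: str) -> str:
    """Assemble the instruction-style prompt for decoder-only models;
    the generated impression follows the 'Response:' prefix."""
    return (
        f"Derive the impression from the given {description} report "
        f"for {physician}.\n"
        f"Input: {findings} {indications}\n"
        f"Response:"
    )
```

Because the physician identity is a single token in the input, swapping that one token is all that is needed to request a different reporting style at inference time.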

Models for PET Report Summarization
We formulated our work as an abstractive summarization task since physicians typically interpret findings in the impression section, rather than merely reusing sentences from the findings section. We fine-tuned 8 encoder-decoder models and 4 decoder-only models, covering a broad range of language models for sequence generation. The encoder-decoder models comprised state-of-the-art (SOTA) transformer-based models, namely BART (17), PEGASUS (18), T5 (19), and FLAN-T5 (20). To investigate whether medical-domain adaptation could benefit our task, we fine-tuned 2 biomedical LLMs, BioBART (21) and Clinical-T5 (22). Additionally, we included 2 baseline models, the pointer-generator network (PGN) (3) and BERT2BERT (23). The decoder-only models encompassed GPT2 (24) and OPT (25), as well as the recently released LLaMA (26) and Alpaca (16). All models were trained using the standard teacher-forcing algorithm. LLaMA and Alpaca were fine-tuned with low-rank adaptation (LoRA) (27) to reduce memory usage and accelerate training, while the other models were subjected to full fine-tuning. We adopted the beam search decoding algorithm to generate impressions. More comprehensive information about these models, including their training and inference settings, can be found in Appendix S3.
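The teacher-forcing objective can be sketched in a few lines: at every decoding step the model is conditioned on the gold prefix of the reference impression, never on its own earlier predictions. Here `model_step` is a hypothetical stand-in for any of the fine-tuned LLMs, returning a next-token probability distribution:

```python
import math

def teacher_forcing_loss(model_step, source, reference):
    """Mean negative log-likelihood of the reference impression under
    teacher forcing: the model always sees the gold prefix, never its
    own (possibly wrong) earlier outputs."""
    total = 0.0
    for i, token in enumerate(reference):
        probs = model_step(source, reference[:i])    # condition on gold prefix
        total += -math.log(probs.get(token, 1e-12))  # cross-entropy at step i
    return total / len(reference)
```

In practice, frameworks compute the same quantity as a cross-entropy loss over shifted label sequences; the loop above only makes the conditioning explicit.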
Figure 1: Formatting of reports for input to encoder-decoder and decoder-only models. For encoder-decoder models, the first two lines describe the examination category and encode the reading physician's identity, respectively. "Findings" contains the clinical findings from the PET report, and "Indication" includes the patient's medical history and the reason for the examination. For decoder-only models, each case follows a specific format for the instruction: "Derive the impression from the given [description] for [physician]". "Input" accommodates the concatenation of clinical findings and indications. The output always starts with the prefix "Response:". Both model architectures utilize the cross-entropy loss to compute the difference between original clinical impressions and model-generated impressions.

Benchmarking Evaluation Metrics
To identify the evaluation metrics most correlated with physician preferences, we presented impressions generated by 4 different models (PGN, BERT2BERT, BART, PEGASUS) to two NM physicians. These models represented a wide performance spectrum. One physician (M.S.) reviewed 200 randomly sampled reports in the test set, then scored the quality of model-generated impressions on a 5-point Likert scale. The definitions of each level are provided in Appendix S4. To assess inter-observer variability, a second physician (S.Y.C.) independently scored 20 of the cases based on the same criterion.
Table 1 categorizes the evaluation metrics (detailed introductions in Appendix S4) included in this study. To address the domain gap between general-domain articles and PET reports, we fine-tuned BARTScore on our PET reports using the method described in (28) and named it BARTScore+PET. Following the same approach, we developed PEGASUSScore+PET and T5Score+PET. These three evaluators are made available at https://huggingface.co/xtie/BARTScore-PET. The Spearman's ρ correlation quantified how well evaluation metrics correlated with the physicians' judgments. Metrics with the highest correlations were used to determine the top-performing model.

Table 1: All evaluation metrics included in this study and their respective categories.
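The benchmarking step reduces to computing a rank correlation per metric and sorting. A minimal pure-Python sketch is below (no tie correction, unlike `scipy.stats.spearmanr`; the `rank_metrics` helper and its inputs are illustrative):

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (simple sketch, assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def rank_metrics(metric_table, physician_scores):
    """Sort candidate metrics by their correlation with the physician's
    per-case quality scores, highest first."""
    rows = [(name, spearman_rho(scores, physician_scores))
            for name, scores in metric_table.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

The top-ranked entries of `rank_metrics` correspond to the metrics used to select the model sent for expert evaluation.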

Expert Evaluation
To examine the clinical utility of our best LLM, we conducted a reader study involving three physicians: two board-certified in NM (N.I., S.Y.C.) and one board-certified in NM and radiology (A.P.). Blinded to the original interpreting physicians, each reader independently reviewed a total of 24 whole-body PET reports along with model-generated impressions. Of these, twelve cases were originally dictated by the reader themselves, and the rest were dictated by other physicians. The LLM impressions were always generated in the style of the interpreting physician by using their specific identifier token. The scoring system included 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Their definitions are described in Table 2. The application we designed for physician review of test cases can be accessed at https://github.com/xtie97/PET-Report-Expert-Evaluation.

Additional Analysis
To further evaluate the capability of our fine-tuned LLMs, we conducted three additional experiments. Implementation details are provided in Appendix S5:
1. Deauville score (DS) prediction: To test the reasoning capability of our models within the NM domain, we classified PET lymphoma reports into DS 1-5 based on the exam-level DSs (29) extracted from model-generated impressions. The original clinical impressions served as the reference for the DSs. The evaluation metrics included the 5-class accuracy and the linearly weighted Cohen's κ index. For context, a prior study (29) showed that a human expert predicted DSs with 66% accuracy and a Cohen's κ of 0.79 when given the redacted PET reports and maximum intensity projection images.
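A minimal sketch of the two DS metrics, assuming scores are integers 1-5: the linearly weighted κ uses |i − j| disagreement weights, so near-misses are penalized less than distant errors.

```python
from collections import Counter

def linear_weighted_kappa(ref, pred, n_classes=5):
    """Linearly weighted Cohen's kappa for ordinal labels 1..n_classes:
    1 - (observed weighted disagreement) / (chance-expected disagreement)."""
    n = len(ref)
    observed = sum(abs(r - p) for r, p in zip(ref, pred)) / n
    cr, cp = Counter(ref), Counter(pred)
    expected = sum(cr[i] * cp[j] * abs(i - j)
                   for i in range(1, n_classes + 1)
                   for j in range(1, n_classes + 1)) / (n * n)
    return 1.0 - observed / expected

def ds_metrics(ref, pred):
    """Exam-level 5-class accuracy and linearly weighted kappa."""
    accuracy = sum(r == p for r, p in zip(ref, pred)) / len(ref)
    return accuracy, linear_weighted_kappa(ref, pred)
```

Library implementations (e.g., scikit-learn's `cohen_kappa_score` with `weights="linear"`) compute the same quantity from the confusion matrix.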

2. Encoding physician-specific styles:
We compared the impressions generated in the styles of two physicians (Physician 1 and Physician 2) who had distinct reporting styles.Physician 1's impressions tended to be more detailed, whereas Physician 2's impressions were more concise.

3. External testing:
We generated the impressions for all external cases in the styles of three primary physicians (Physician 1, Physician 2, and Physician 3) from our internal dataset and compared these impressions with clinical impressions originally dictated by external physicians.
Table 2: Definitions of the six quality dimensions and the overall utility score used in our expert evaluation, along with their corresponding Likert scales.

Statistical Analysis
Using bootstrap resampling (30), the 95% confidence intervals (CI) for our results were derived from 10,000 bootstrap iterations. The difference between two data groups was considered statistically significant at the 0.05 level only when one group exceeded the other in at least 95% of trials.
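The percentile bootstrap described above can be sketched as follows; the resample count, seed, and function names are illustrative:

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score:
    resample cases with replacement, collect the resampled means, and
    take the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def bootstrap_superiority(a, b, n_boot=10_000, seed=0):
    """Fraction of paired resamples in which group `a` has the higher mean;
    a difference is called significant when this fraction reaches 0.95."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # same resampled cases for both groups
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / n_boot
```

Resampling the same case indices for both groups (a paired bootstrap) keeps per-case difficulty matched when comparing two models on the same test set.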

Benchmarking Evaluation Metrics
Figure 2 shows the Spearman's ρ correlation between evaluation metrics and quality scores assigned by the first physician (M.S.). BARTScore+PET and PEGASUSScore+PET exhibited the highest correlations with physician judgment (ρ=0.568 and 0.563, P=0.30). Therefore, both metrics were employed to determine the top-performing model for expert evaluation. However, their correlation values were still below the degree of inter-reader correlation (ρ=0.654). Similar results were observed in the correlation between evaluation metrics and the second physician's scores (Appendix S6). Without adaptation to PET reports, the original BARTScore showed a lower correlation (ρ=0.474, P<0.001) compared to BARTScore+PET, but still outperformed traditional evaluation metrics such as Recall-Oriented Understudy for Gisting Evaluation-L (ROUGE-L, ρ=0.398, P<0.001) (31).
The metrics commonly used in radiology report summarization, including ROUGE (31), BERTScore (32), and RadGraph (10), did not demonstrate strong correlations with physician preferences. Additionally, most reference-free metrics, although effective in general text summarization, showed considerably lower correlations compared to reference-dependent metrics.

Model Performance
Figure 3 illustrates the relative performance of the 12 language models assessed using all evaluation metrics considered in this study. For better visualization, metric values have been normalized to [0, 1], with the original values available in Appendix S7. The SOTA encoder-decoder models, including PEGASUS, BART, and T5, demonstrated similar performance across most evaluation metrics. Since BARTScore+PET and PEGASUSScore+PET identified PEGASUS as the top-performing model, we selected it for further expert evaluation. After being fine-tuned on our PET reports, the medical-knowledge-enriched models, BioBART (BARTScore+PET: -1.46; ROUGE-L: 38.9) and Clinical-T5 (BARTScore+PET: -1.54; ROUGE-L: 39.4), did not show superior performance compared to their base models, BART and T5 (original values in Appendix S7).

Expert Evaluation
Figure 4 summarizes the expert evaluation. For the physicians' own reports, 89% (32/36) of the PEGASUS-generated impressions were deemed clinically acceptable, with a mean utility score of 4.08 out of 5. When the physicians evaluated clinical impressions dictated by other physicians, the mean utility score (4.03; 95% CI: 3.69, 4.33) was significantly lower than the scores they assigned to their own impressions (P<0.001), suggesting a strong preference for their individual reporting styles. The primary quality dimensions contributing to this difference were "additions" (physicians' own impressions vs. other physicians' impressions: 2.94 vs. 2.75, P=0.039) and "clarity and organization" (2.92 vs. 2.50, P<0.001). On average, the physicians considered the overall utility of PEGASUS-generated impressions in their own style to be comparable to the clinical impressions dictated by other physicians (mean utility score: 4.08 vs. 4.03, P=0.41).
Figure 5 presents four PEGASUS-generated impressions (findings and background information in Appendix S8) with overall utility scores ranging from 2 to 5. For each case, PEGASUS successfully identified the salient findings, offered interpretations, and provided recommendations. However, the model showed susceptibility to factual incorrectness, including misinterpretation of findings and inconsistent statements in the impressions, as evidenced in case 4. Additionally, the model could give overly definite diagnoses, as observed in case 3.

Deauville Score Prediction
Of the 4,000 test cases, 405 PET lymphoma reports contained DSs in the impression sections. Table 3 presents the DS classification results for all evaluated models. PEGASUS achieved the highest 5-class accuracy (76.7%; 95% CI: 72.0%, 81.0%), while PGN was least effective in deriving DSs. All SOTA encoder-decoder models attained an accuracy exceeding 70%. Among decoder-only models, GPT2 demonstrated the best performance, with an accuracy of 71.3% (95% CI: 65.8%, 76.4%).

Encoding Physician-specific Styles
Figure 6 shows the PEGASUS-generated impressions given unique identifier tokens associated with two physicians, Physician 1 and Physician 2. Altering a single token in the input could lead to a drastic change in the output impressions.
For each case, both impressions managed to capture the salient findings and delivered similar diagnoses; however, their length, level of detail, and phrasing generally reflected the respective physician's style. This reveals the model's ability to tailor impressions to individual physicians. The associated findings and background information are presented in Appendix S9.
Figure 4: Expert evaluation consisting of an overall utility score and 6 specific quality dimensions. For the physician's own reports, 89% (32/36) of the PEGASUS-generated impressions were deemed clinically acceptable. The primary reasons for the discrepancy between original clinical impressions and PEGASUS-generated impressions are factual inaccuracies, inappropriate interpretations, and unsuitable recommendations. "Orig, own": original clinical impressions from the physician's own reports; "LLM, own": PEGASUS-generated impressions for the physician's own reports; "Orig, other": original clinical impressions from other physicians' reports; "LLM, other": PEGASUS-generated impressions for other physicians' reports.

External Testing
When PEGASUS was applied to the external test set, a significant drop (P<0.001) was observed in the evaluation metrics. Averaged across the reporting styles of Physicians 1, 2, and 3, BARTScore+PET in the external set was 15% worse than in the internal test set. Similarly, ROUGE-L decreased by 29% in the external set. Quantitative results are detailed in Appendix S10, along with four sample cases.

Discussion
In this work, we trained 12 language models on the task of PET impression generation. To identify the best metrics to evaluate model-generated impressions, we benchmarked 30 evaluation metrics against quality scores assigned by physicians and found that domain-adapted text-generation-based metrics, namely BARTScore+PET and PEGASUSScore+PET, exhibited the strongest correlation with physician preferences. These metrics selected PEGASUS as the top-performing LLM for our expert evaluation. A total of 72 cases were reviewed by three NM physicians, and the large majority of PEGASUS-generated impressions were rated as clinically acceptable. Moreover, by leveraging a specific token in the input to encode the reading physician's identity, we enabled LLMs to learn different reporting styles and generate personalized impressions. When physicians assessed impressions generated in their own style, they considered these impressions to be of comparable overall utility to the impressions dictated by other physicians.

Past research on text summarization has introduced numerous evaluation metrics for assessing the quality of AI-generated summaries. However, when these metrics were employed to evaluate PET impressions, the majority did not align closely with physician judgments. This observation is consistent with findings from other works that evaluated medical document (33) or clinical note summarization (12). In general, we found that model-based metrics slightly outperformed lexical-based metrics, although better evaluation metrics are needed.
Based on our comparison of 12 language models, we observed that the biomedical-domain pretrained LLMs did not outperform their base models. This could be attributed to two reasons. First, our large training set diminished the benefits of medical-domain adaptation. Second, the pretraining corpora, such as MIMIC-III and PubMed, likely had limited PET-related content, making pretraining less effective for our task. Additionally, we found that the large decoder-only models showed inferior performance in summarizing PET findings compared to the SOTA encoder-decoder models. This likely stems from their lack of an encoder mechanism that can efficiently distill the essence of input sequences. In this study, we did not test large proprietary models like GPT-4 due to data ownership concerns and the inability to fine-tune the models for personalized impressions. Recent works (7,8) explored their capability in radiology report summarization using the in-context learning technique. Whether this approach could surpass full fine-tuning of public LLMs, and its suitability for clinical use, remains to be answered.

While most PEGASUS-generated impressions were deemed clinically acceptable in expert evaluation, it is crucial to understand what mistakes are commonly committed by the LLM. First, the main problem in model-generated impressions is factual inaccuracy, which manifests as misinterpretation of findings or contradictory statements. Second, the diagnoses given by the LLM could sometimes be overly definite without adequate supporting evidence. Third, some recommendations for clinical follow-up were non-specific, offering limited guidance for patient management. It is worth mentioning that final diagnoses and recommendations are usually not included in the report findings and must be inferred by the model. These observations underscore the need for review and appropriate editing by physicians before report finalization. Of note, LLM-based impression generation can be akin to the preliminary impression drafts that radiology resident trainees provide for review by radiology faculty in an academic training setting.
This study had several limitations. First, when fine-tuning LLaMA and Alpaca, we only investigated a lightweight domain adaptation method, LoRA, constrained by computational resources. Second, we controlled the style of generated impressions by altering a specific token in the input, leaving other potential techniques unexplored. Third, during external testing, we observed a moderate decrease in the evaluation metrics. This is expected given the differences in reporting styles between our internal and external physicians. However, whether this result aligns with physician judgments remains uncertain and warrants further investigation. Lastly, our training dataset was restricted to a single institution. Future work should expand this research to a multi-center study.
To conclude, we systematically investigated the potential of LLMs to automate impression generation for whole-body PET reports. Our reader study showed that the top-performing LLM, PEGASUS, produced clinically useful and personalized impressions for the majority of cases. Given its performance, we believe our model could be integrated into clinical workflows to expedite PET reporting by automatically drafting initial impressions based on the findings.

11. LLaMA-LoRA: LLaMA (14) is a collection of decoder-only transformers, ranging from 7B to 65B parameters. LLaMA-13B showed superior performance compared to GPT-3 on most benchmarks. In this study, we chose LLaMA-7B and used LoRA (15) to accelerate training and reduce memory usage. The hyperparameters of the LoRA module are as follows: the rank of the low-rank factorization is 8, the scaling factor for the rank is 16, the dropout rate is 0.05, and the target modules for LoRA are the projection layers in query (q_proj) and value (v_proj). The model weights for LLaMA are available upon request.
12. Alpaca-LoRA: Alpaca (16) is the instruction-tuned LLaMA-7B model that behaves qualitatively similarly to some closed-source large language models (LLMs), including OpenAI's text-davinci-003. When we fine-tuned Alpaca, we retained the same hyperparameters as used in LLaMA-LoRA. The weight difference between LLaMA and Alpaca is available at huggingface.co/tatsu-lab/alpaca-7b-wdiff.
All twelve language models were trained using the standard teacher-forcing algorithm. The training objective can be written as a maximum likelihood problem.

5. CIDEr (22): It computes the term frequency-inverse document frequency (TF-IDF) vectors for both human- and machine-generated texts based on n-gram (n ranges from 1 to 4) co-occurrence, and then measures the cosine similarity of the two vectors.
6. ROUGE-WE (23): It is an extension of the ROUGE metric, designed to assess the semantic similarity between generated and reference texts using pretrained word embeddings.

7. BERTScore (24): It evaluates the cosine similarity of contextual embeddings from BERT for each token in the output and reference sequences.
8. MoverScore (25): Similar to BERTScore, it leverages BERT's contextual embeddings to measure the semantic similarity between generated and reference texts. Instead of token-level cosine similarity, MoverScore calculates the Earth Mover's Distance between the embeddings of the two texts.
11. PRISM (28): It is an evaluation metric used in multilingual machine translation. PRISM employs a sequence-to-sequence model to score the machine-generated output conditioned on the human reference.
12. S³ (29): It uses previously proposed evaluation metrics, including ROUGE and ROUGE-WE, as input features for a regression model to estimate the quality score of the generated text. S³-resp is based on a model trained with human annotations following the responsiveness scheme, while S³-pyr follows the pyramid scheme.
13. UniEval (30): It first constructs pseudo summaries by perturbing reference summaries, then defines evaluation dimensions using different prompt templates. The model is trained to differentiate pseudo data from reference data in a Boolean question-answering framework. While UniEval evaluates coherence, consistency, fluency, and relevance, we only present the overall score, which is the average of these 4 dimensions.
14. SummaQA (31): It creates questions from the source document by masking entities. The generated text is then evaluated by a question-answering BERT model, with results reported in terms of the F1 overlap score.
15. BLANC (32): It measures how well a generated summary can help improve the performance of a pretrained BERT model in understanding each sentence from the source document with masked tokens.
16. SUPERT (33): It creates pseudo-reference summaries by extracting important sentences from the source document and then measures the semantic similarity between the generated text and this pseudo reference.
17. Stats (Data Statistics) (34): Stats-compression refers to the word ratio of the source document to its summary. Stats-coverage measures the proportion of words in the generated text that also appear in the source document. Stats-density is the average length of the fragment (e.g., sentence in the source document) from which each summary word is extracted. Stats-novel trigram is the percentage of trigrams present in the summary but absent in the source document.
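For instance, two of the Stats measures can be sketched as follows; whitespace tokenization is an assumption here, and the published metric uses its own tokenizer:

```python
def stats_coverage(summary: str, source: str) -> float:
    """Proportion of summary words that also appear in the source
    document (Stats-coverage)."""
    src = set(source.lower().split())
    words = summary.lower().split()
    return sum(w in src for w in words) / len(words)

def novel_trigram_pct(summary: str, source: str) -> float:
    """Percentage of summary trigrams absent from the source document
    (Stats-novel trigram); higher values indicate more abstractive output."""
    def trigrams(text):
        t = text.lower().split()
        return {tuple(t[i:i + 3]) for i in range(len(t) - 2)}
    summ, src = trigrams(summary), trigrams(source)
    if not summ:
        return 0.0
    return 100.0 * len(summ - src) / len(summ)
```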
For the metrics that have precision, recall, and F1, we only present the F1 score, which is the harmonic mean of precision and recall. The evaluation codes are partially adapted from (35) and made available on GitHub: github.com/xtie97/PET-Report-Summarization/tree/main/evaluation_metrics.

Appendix S5: Implementation Details of Additional Analysis
1. Deauville score (DS) extraction: Whole-body PET reports that contained physician-assigned DSs in the impression sections were identified by searching for the term "Deauville" and its common misspellings. N-gram analysis was then performed to extract the score for each case. Among 405 cases with DSs in the impression section, 34 cases also had DSs in the findings section. To avoid leakage, we removed the scores in these findings. If multiple DSs were present in the impression, the highest value was used to represent the exam-level DS (36). In some cases, model-generated impressions did not contain DSs while their original clinical impressions did, or vice versa. Considering that we did not force the model to generate DSs in the impressions, we excluded these cases when computing the evaluation metrics.
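The extract-and-take-maximum step can be sketched with a regular expression. The pattern and the misspelling alternatives below are illustrative stand-ins for the term list actually used:

```python
import re

def extract_exam_level_ds(impression: str):
    """Pull Deauville scores (1-5) from an impression and return the
    exam-level score, i.e. the highest one, or None if no score is stated.
    The misspelling alternatives are examples, not the paper's full list."""
    pattern = r"(?:deauville|deauvile|deuville)\s*(?:score)?\s*(?:of\s*)?([1-5])"
    scores = [int(m) for m in re.findall(pattern, impression.lower())]
    return max(scores) if scores else None
```

Cases where this returns `None` for one impression but not its counterpart would be the ones excluded from the accuracy and κ computation.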

Figure 2: Spearman's ρ correlations between different evaluation metrics and quality scores assigned by the first physician. The top row quantifies the inter-reader correlation. Notably, domain-adapted BARTScore (BARTScore+PET) and PEGASUSScore (PEGASUSScore+PET) demonstrate the highest correlations with physician preferences.

Figure 3: Performance of 12 language models evaluated by the metrics included in this study. The X-axis displays the metrics arranged in descending order of correlation with physician preferences, with higher correlations on the left and lower correlations on the right. For each evaluation metric, values underwent min-max normalization to allow comparison within a single plot. The actual metric values are referenced in Appendix S7. The star denotes the best model for each metric, and the circle denotes the other models that do not have a statistically significant difference (P>0.05) from the best model.
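The per-metric normalization used for this plot is ordinary min-max scaling, sketched below:

```python
def min_max_normalize(values):
    """Scale a list of metric values to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```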

Figure 5: A side-by-side comparison of clinical impressions and PEGASUS-generated impressions (overall utility scores range from 2 to 5). The last column presents comments from the physicians in our expert reader study. Sentences with similar semantic meanings in the original clinical impressions and the PEGASUS-generated impressions are highlighted using identical colors. Protected health information (PHI) has been anonymized and denoted with [X], where X may represent age or examination date.

Figure 6: Examples of PEGASUS-generated impressions customized for the physician's reporting style. The first column shows the original clinical impressions: the first example from Physician 1 and the second from Physician 2. Subsequent columns present impressions generated in the style of Physician 1 and Physician 2, respectively. For each case, both impressions capture the critical findings and deliver similar diagnoses. However, their length, level of detail, and phrasing generally reflect each physician's unique style. Sentences with similar semantic meanings in the original clinical impressions and the PEGASUS-generated impressions are highlighted using identical colors.

$$
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{s} \sum_{i} \log p_{\theta}\!\left( y_i^{(s)} \,\middle|\, y_{<i}^{(s)},\, x^{(s)} \right)
$$

where $\theta$ denotes the parameters of model $p_{\theta}$, and $p_{\theta}(\cdot)$ estimates the probability of the next word $y_i^{(s)}$ given the previous sequence $y_{<i}^{(s)}$ in the reference text and the source text $x^{(s)}$. The subscript $i$ denotes the word position in the reference text, and the superscript $(s)$ denotes a single sample. The AdamW optimizer (17) was employed for training.

Table E2: Definition of the 5-point Likert scale for evaluating the quality of model-generated impressions.

We investigated a broad spectrum of evaluation metrics, comprising 17 different methods.
1. ROUGE (18): It measures the number of overlapping textual units between generated and reference texts. ROUGE-N (N=1,2,3) measures the overlap of N-grams, and ROUGE-L measures the overlap of the longest common subsequence. ROUGE-LSUM extends ROUGE-L by computing the ROUGE-L for each sentence and then summing them up.
2. BLEU (19): It computes the precision of n-gram overlap (n ranges from 1 to 4) between generated and reference texts with a brevity penalty.
3. CHRF (20): It computes the character-based n-gram overlap between the output sequence and the reference sequence. In this study, we set the n-gram length to 10.
4. METEOR (21): It computes an alignment of the generated text and the reference text based on synonymy, stemming, and exact word matching.
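As an illustration of the n-gram overlap family above, ROUGE-N's F1 can be sketched as follows (whitespace tokenization, no stemming; real implementations differ in tokenization and preprocessing):

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of n-gram multiset overlap between candidate and reference
    (a simplified ROUGE-N sketch)."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    if not c or not r:
        return 0.0
    overlap = sum((c & r).values())       # clipped n-gram matches
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
```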

9. RadGraph (26): It is a specialized evaluation metric tailored for radiology report summarization. RadGraph works by initially extracting clinical entities and their relations from the model-generated impression and the original clinical impression. Leveraging this data, it constructs knowledge graphs to compare the content coverage and structural coherence between the two impressions.
10. BARTScore (27): It leverages a pretrained BART model to compute the log probability of generating one text conditioned on another text. In this study, BARTScore is the BART model fine-tuned on the CNN/Daily Mail dataset. BARTScore+PET is the BART model fine-tuned on our internal PET report dataset. PEGASUSScore+PET is the PEGASUS model fine-tuned on our internal dataset. T5Score+PET is the FLAN-T5 model fine-tuned on our internal dataset. The training settings are the same as those in Table E1, except for different training/validation splits and random seeds.

Figure E6: The findings section and relevant background information for Case 3 in Figure 5 (in the main body).

Figure E7: The findings section and relevant background information for Case 4 in Figure 5 (in the main body).

Table 3: Performance of 12 language models on Deauville score prediction.

33. Wang LL, Otmakhova Y, DeYoung J, et al. Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations. arXiv; 2023. http://arxiv.org/abs/2305.13693. Accessed August 22, 2023.

3. BART (5): It is an encoder-decoder model built on the transformer architecture. BART introduced a denoising auto-encoder for pretraining, involving reconstructing the original texts from corrupted samples. Pretrained BART is available at huggingface.co/facebook/bart-large.
4. BioBART (6): The model shares the same architecture with BART (5) but underwent further training on the PubMed dataset. Pretrained BioBART is available at huggingface.co/GanjinZero/biobart-large.
5. PEGASUS (7): It is an encoder-decoder model built on the transformer architecture. PEGASUS introduced a novel pretraining objective (gap sentence prediction), involving masking important sentences from documents and forcing the model to recover them based on the remaining sentences. Pretrained PEGASUS is available at huggingface.co/google/pegasus-large.
6. T5 (8): It is an encoder-decoder model built on the transformer architecture. T5 established a unified framework that treats almost all natural language tasks as a text-to-text problem. Instead of the original T5, we used T5v1.1, which had multiple modifications of the architecture and was solely pretrained on unsupervised tasks. The model weights are available at huggingface.co/google/t5-v1_1-large.
7. Clinical-T5 (9): It is tailored to handle the language structures and terminologies in medical documents by further pretraining T5 on the MIMIC-III dataset (10). The model weights are available at huggingface.co/luqh/ClinicalT5-large.
8. FLAN-T5 (11): It is a variant of T5 that underwent instruction fine-tuning on a mixture of tasks. This enabled FLAN-T5 to achieve enhanced performance compared to the original T5 in various downstream applications. The model weights are available at huggingface.co/google/flan-t5-large.
9. GPT2 (12): It is a decoder-only model built on the transformer architecture. Unlike the encoder-decoder models, GPT2 is pretrained on a massive corpus of text to predict the next word in a sequence.