Semisupervised Learning with Report-guided Pseudo Labels for Deep Learning–based Prostate Cancer Detection Using Biparametric MRI

Purpose To evaluate a novel method of semisupervised learning (SSL) guided by automated sparse information from diagnostic reports to leverage additional data for deep learning–based malignancy detection in patients with clinically significant prostate cancer. Materials and Methods This retrospective study included 7756 prostate MRI examinations (6380 patients) performed between January 2014 and December 2020 for model development. An SSL method, report-guided SSL (RG-SSL), was developed for detection of clinically significant prostate cancer using biparametric MRI. RG-SSL, supervised learning (SL), and state-of-the-art SSL methods were trained using 100, 300, 1000, or 3050 manually annotated examinations. Performance on detection of clinically significant prostate cancer by RG-SSL, SL, and SSL was compared on 300 unseen examinations from an external center with a histopathologically confirmed reference standard. Performance was evaluated using receiver operating characteristic (ROC) and free-response ROC analysis. P values for performance differences were generated with a permutation test. Results At 100 manually annotated examinations, mean examination-based diagnostic area under the ROC curve (AUC) values for RG-SSL, SL, and the best SSL were 0.86 ± 0.01 (SD), 0.78 ± 0.03, and 0.81 ± 0.02, respectively. Lesion-based detection partial AUCs were 0.62 ± 0.02, 0.44 ± 0.04, and 0.48 ± 0.09, respectively. Examination-based performance of SL with 3050 examinations was matched by RG-SSL with 169 manually annotated examinations, thus requiring 14 times fewer annotations. Lesion-based performance was matched with 431 manually annotated examinations, requiring six times fewer annotations. Conclusion RG-SSL outperformed SSL in clinically significant prostate cancer detection and achieved performance similar to SL even at very low annotation budgets. 
Keywords: Annotation Efficiency, Computer-aided Detection and Diagnosis, MRI, Prostate Cancer, Semisupervised Deep Learning Supplemental material is available for this article. Published under a CC BY 4.0 license.


Introduction
Computer-aided diagnosis (CAD) systems that match or outperform clinical experts typically rely on very large training datasets. Top-performing deep learning systems used 29,541 training cases (10,306 patients) for the detection of lung cancer (Ardila et al., 2019), 121,850 training cases (121,850 women) for the detection of breast cancer (McKinney et al., 2020), and 16,114 training cases (12,399 patients) for the classification of skin diseases (Liu et al., 2020).
Annotation time and cost are major limiting factors, resulting in substantially smaller labelled training datasets in most deep learning fields. In the medical domain, popular techniques to leverage unlabelled samples include transfer learning from a distant or related task, and self-training with automatically generated labels (Cheplygina et al., 2019). Other techniques to leverage unlabelled samples include contrastive learning (Chaitanya et al., 2020; Sowrirajan et al., 2021; Azizi et al., 2021) and self-supervised representation learning (Zhou et al., 2019). These techniques either pre-train without labels, or directly use model predictions as true labels.
Leveraging clinical information, which is often available in medical reports, to improve training with unlabelled samples is under-explored. Clinical information from reports typically differs from regular training annotations, but can inform the generation of automatic annotations for self-training. One study (Bulten et al., 2019) generated pixel-level Gleason score annotations in H&E-stained prostate biopsies by leveraging pathology reports: first, precise cancer masks were generated; then, the Gleason scores extracted from the pathology reports were used to classify the cancer masks into Gleason grades. These steps allowed the generation of Gleason score annotations for thousands of prostate biopsies, which would have been infeasible to obtain manually. Incorporating clinical information to guide automatic annotations for self-training remains to be investigated for medical tasks other than biopsy grading.
We hypothesise that medical detection tasks, where the structures of interest can be counted, can leverage unlabelled cases by semi-supervised training with report-guided annotations. Specifically, we focus on lesion detection, where each case can have any number of lesions. To demonstrate feasibility of our method, we developed a semi-supervised training procedure for clinically significant prostate cancer detection in MRI.
Prostate cancer (PCa) has 1.2 million new cases each year, a high incidence-to-mortality ratio, and risks associated with treatment and biopsy, making non-invasive diagnosis of clinically significant prostate cancer (csPCa) crucial to reduce both overtreatment and unnecessary (confirmatory) biopsies (Stavrinides et al., 2019). Multiparametric MRI (mpMRI) interpreted by expert prostate radiologists provides the best non-invasive diagnosis (Eldred-Evans et al., 2021), but expert reading is a limited resource that cannot be leveraged freely. Computer-aided diagnosis (CAD) can assist radiologists in diagnosing csPCa, but present-day solutions lack stand-alone performance comparable to that of expert radiologists (Cao et al., 2021; Saha et al., 2021; Hosseinzadeh et al., 2021; Schelb et al., 2019; Seetharaman et al., 2021).
Prior work investigated the effect of training set size on prostate cancer detection performance, with radiologically estimated ground truth for training and testing. That work shows that the patient-based area under the receiver operating characteristic curve (AUROC) on their internal test set increased logarithmically between 50 and 1,586 training cases, from 79.9% to 87.5%. If this trend continues, tens of thousands of annotated cases would be required to reach expert performance, in concordance with similar applications in medical imaging.
Trained investigators supervised by an experienced radiologist annotated all PI-RADS ≥ 4 findings in more than three thousand of our institutional prostate MRI exams. Our principal annotator, I.S., requires about four minutes to annotate a single prostate cancer lesion in 3D. Difficult cases are discussed with radiologists, further increasing the overall duration. Annotating tens of thousands of cases would therefore incur a huge cost in money and time.
Our report-guided semi-supervised training procedure aims to leverage unlabelled exams paired with clinical reports, to improve detection performance without any additional manual effort. We investigate the efficacy of our training procedure by comparing semi-supervised training of csPCa detection models against supervised training. First, we investigate the setting with 3,050 manually annotated exams and 4,706 unlabelled exams. Secondly, we investigate several labelling budgets: 100, 300, 1,000 or 3,050 manually annotated exams, paired with the remaining 7,656, 7,456, 6,756 or 4,706 unlabelled exams, respectively. Finally, to determine annotation-efficiency, we compare supervised training performance with 3,050 manually annotated exams against semi-supervised training with reduced manual annotation budgets.
In this study, the training procedure with report-guided automatic annotations is presented for csPCa detection in bpMRI using radiology reports. However, the underlying method is neither limited to csPCa, MRI, nor radiology reports, and can be applied universally. Any detection task with countable structures of interest, and clinical information reflecting these findings, can use our training method to leverage unlabelled exams with clinical reports.
To train and tune our models, 7,756 studies (6,380 patients) out of 9,275 consecutive studies (7,430 patients) from Radboud University Medical Centre (RUMC) were included. 1,519 studies were excluded due to incomplete examinations, preprocessing errors, prior treatment, poor scan quality, or a prior positive biopsy (Gleason grade group ≥ 2). All scans were obtained as part of clinical routine and evaluated by at least one of six experienced radiologists (4-25 years of experience with prostate MRI). All 1,315 csPCa lesions (PI-RADS ≥ 4) in 3,050 studies between January 2016 and August 2018 were manually delineated by trained investigators (at least 1 year of experience), supervised by an experienced radiologist (M.R., 7 years of experience with prostate MRI).
To test our models, an external dataset of 300 exams (300 patients) from Ziekenhuisgroep Twente (ZGT) was used, acquired between March 2015 and January 2017. All patients in the test set received TRUS biopsies and patients with suspicious findings on MR also received MR-guided biopsies. For 61 exams (20.3%) the ground truth was derived from radical prostatectomy, which superseded the biopsy findings. All examinations in the test set have histopathology-confirmed ground truth, while retaining the patient cohort observed in clinical practice.
Further details on patient demographics, study inclusion/exclusion criteria and acquisition parameters can be found in the Supplementary Materials.

Report-guided Automatic Annotation
Radiology reports were used to automatically create voxel-level annotations for csPCa. At a high level, our labelling procedure consists of two steps: (1) count the number of clinically significant findings in each radiology report, and (2) localise these findings in their corresponding bpMRI scans with a prostate cancer segmentation model.
A rule-based natural language processing script was developed to automatically extract the PI-RADS scores from radiology reports. The number of clinically significant findings, n_sig, is then defined as the number of PI-RADS ≥ 4 findings in an exam. The clinically significant findings are localised by keeping the n_sig most confident candidates from a csPCa segmentation model, as depicted in Figure 1. These automatically generated voxel-level masks can be used to augment the training dataset and produce new csPCa segmentation models.

Extraction of Report Findings
Most of the radiology reports in our dataset were generated from a template, and modified to provide additional information. Although multiple templates were used over the years, this resulted in structured reports for most exams. This makes a rule-based natural language processing script a reliable and transparent way to extract PI-RADS scores from our radiology reports.
Simply counting the occurrences of 'PI-RADS 4/5' in the report body is reasonably effective, but has some pitfalls. For example, prior PI-RADS scores are often referenced during follow-up exams, resulting in false positive matches. Findings can also be grouped and described jointly, resulting in false negatives. To improve the reliability of the PI-RADS extraction from radiology reports, we extracted the scores in two steps.
First, we tried to split the radiology reports into sections for individual findings. Secondly, we extracted the PI-RADS scores for each section individually. When the report could not be split into sections per lesion, we applied strict pattern matching on the full report. See the Supplementary Materials for more details. Example report sections are shown in Figure 2.
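The two-step extraction above can be sketched as follows. This is a minimal illustration, not the institutional script: the section marker (`Lesion 1:`-style headers) and the separator patterns are simplified assumptions.

```python
import re

def count_significant_findings(report: str) -> int:
    """Count PI-RADS >= 4 findings in a structured report (simplified sketch).

    Step 1: split the report into per-lesion sections on an illustrative
    lesion header. Step 2: match one PI-RADS score per section, allowing
    optional separators such as 'v2 category' or ':' and an optional dash.
    """
    # Split on headers such as "Lesion 1:" (illustrative section marker).
    sections = re.split(r'(?im)^lesion\s+\d+\b', report)
    n_sig = 0
    for section in sections[1:]:  # sections[0] precedes the first lesion
        m = re.search(r'PI-?RADS(?:\s*(?:v2\s*category|:))?\s*([1-5])',
                      section, flags=re.IGNORECASE)
        if m and int(m.group(1)) >= 4:
            n_sig += 1
    return n_sig
```

Restricting the search to one match per section avoids double-counting references to prior scores, which is one of the pitfalls the full-report matching would face.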

Localisation of Report Findings
To localise csPCa findings in unlabelled bpMRI scans, we employed an ensemble of csPCa segmentation models. From the resulting voxel-level confidence maps we created distinct lesion candidates, as illustrated in Figure 3. Specifically, we created a lesion candidate by starting at the most confident voxel and including all connected voxels (in 3D) with at least 40% of the peak's confidence. The candidate lesion is then removed from the confidence map, and the process is repeated until no candidates remain or a maximum of 5 lesions has been extracted. Tiny candidates of 10 or fewer voxels (≤ 0.009 cm³) are discarded.
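The candidate-extraction loop above can be sketched with NumPy and SciPy. This is an illustrative implementation under our own naming; the study's exact connectivity settings are not specified beyond "connected voxels in 3D".

```python
import numpy as np
from scipy import ndimage

def extract_lesion_candidates(confidence_map, threshold_frac=0.4,
                              max_lesions=5, min_voxels=11):
    """Iteratively extract lesion candidates from a voxel-level confidence map.

    Start at the most confident voxel, grow the candidate to all 3D-connected
    voxels with at least `threshold_frac` (40%) of the peak confidence, remove
    it from the map, and repeat. Candidates of 10 or fewer voxels are
    discarded. Returns a list of (peak_confidence, binary_mask) pairs.
    """
    remaining = np.asarray(confidence_map, dtype=float).copy()
    candidates = []
    while len(candidates) < max_lesions:
        peak = remaining.max()
        if peak <= 0:
            break
        # Voxels above 40% of the current peak; keep only the 3D-connected
        # component that contains the peak voxel.
        mask = remaining >= threshold_frac * peak
        labels, _ = ndimage.label(mask)
        peak_idx = np.unravel_index(remaining.argmax(), remaining.shape)
        candidate = labels == labels[peak_idx]
        remaining[candidate] = 0  # remove candidate before next iteration
        if candidate.sum() >= min_voxels:  # discard candidates of <= 10 voxels
            candidates.append((peak, candidate))
    return candidates
```

Because the threshold is relative to the current peak, a weak but distinct second lesion is still recovered after the dominant lesion has been removed from the map.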
Automatic voxel-level csPCa annotations were generated by keeping the n_sig most confident lesion candidates, with n_sig the number of clinically significant report findings as described above. If there were fewer lesion candidates than clinically significant report findings, the automatic label was excluded.
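The report-guided selection rule reduces to a few lines. The `(confidence, mask)` candidate interface below is a hypothetical convention of ours, matching the kind of output a candidate extractor would produce.

```python
def report_guided_annotation(candidates, n_sig):
    """Keep the n_sig most confident lesion candidates as the automatic
    annotation. Returns None (exclude the exam) when fewer candidates exist
    than clinically significant report findings; returns an empty list for
    negative exams (n_sig == 0). `candidates` is a list of
    (confidence, mask) pairs (hypothetical interface)."""
    if n_sig == 0:
        return []  # negative exam: empty annotation
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    if len(ranked) < n_sig:
        return None  # cannot localise all report findings; exclude exam
    return ranked[:n_sig]
```

Excluding exams with too few candidates (the `None` branch) is what removes the automatic labels that are guaranteed to miss a reported lesion.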

Models, Preprocessing and Data Augmentation
We posed the prostate cancer detection task as a voxel-level segmentation task and employed two independent architectures, nnU-Net and Dual-Attention U-Net (DA-UNet). nnU-Net is a self-configuring framework that follows a set of rules to select the appropriate architecture based on the input dataset. The DA-UNet architecture was developed specifically for csPCa detection, where it performed best among state-of-the-art architectures. Both are derived from the U-Net architecture (Ronneberger et al., 2015), extended to 3D (Çiçek et al., 2016), and use anisotropic pooling or convolutional strides to account for the difference between through-plane and in-plane resolution of prostate MRI. See the Supplementary Materials for more details.
The nnU-Net framework typically uses the sum of cross-entropy and soft Dice loss, and applies the loss at multiple resolutions (deep supervision). Motivated by Baumgartner et al. (2021) and exploratory experiments, we trained nnU-Net using cross-entropy only. Based on prior experience, we trained the DA-UNet model with Focal Loss (α = 0.75) (Lin et al., 2017).
The acquisition protocol of bpMRI ensures negligible movement between imaging sequences, and little deviation of the prostate from the centre of the scan. Therefore, neither registration between sequences nor centring of the prostate was deemed necessary. In previous work, we observed that a centre crop size of 72.0 mm × 72.0 mm × 64.8 mm at a resampled resolution of 0.5 mm × 0.5 mm × 3.6 mm/voxel works well for our dataset. To prevent the nnU-Net framework from zero-padding these scans, we extended the field of view slightly for this model, to 80.0 mm × 80.0 mm × 72.0 mm, which corresponds to a matrix size of 160 × 160 × 20.

Fig. 2. Example lesion report sections. The rule-based score extraction matched the T2W, DWI and DCE scores coloured orange, green and red, respectively. The resulting PI-RADS score is coloured purple. The reports were split into sections by matching the lesion identifier in blue. All reports were originally Dutch.

nnU-Net comes with a predefined data preprocessing and augmentation pipeline, detailed in Supplementary Notes 2.2, 3.2 and 4 of the nnU-Net publication. In short, T2W and DWI scans undergo instance-wise z-score normalisation, while ADC maps undergo robust, global z-score normalisation with respect to the complete training dataset. For our anisotropic dataset, the nnU-Net framework applies affine transformations in 2D and applies a wide range of intensity and structure augmentations (Gaussian noise, Gaussian blur, brightness, contrast, simulation of low resolution and gamma augmentation).
For the DA-UNet model we used our institutional augmentation pipeline. We perform instance-wise min/max normalisation for T2W and DWI scans. For ADC maps, we divide each scan by 3000 (the 97.4th percentile) to retain their diagnostically relevant absolute values. Then, we apply Rician noise (Gudbjartsson and Patz, 1995) with σ = 0.01 to each scan at original resolution, with a probability of 75%. Subsequently, we resample all scans to a uniform resolution of 0.5 mm × 0.5 mm × 3.6 mm/voxel with bicubic interpolation. Finally, with a probability of 50%, we apply 2D affine data augmentations: horizontal mirroring, rotation with θ ∼ 7.5 · N(0, 1), horizontal translation with h_x ∼ 0.05 · N(0, 1), vertical translation with h_y ∼ 0.05 · N(0, 1) and zoom with s_xy ∼ 1.05 · N(0, 1), where N(0, 1) is a Gaussian distribution with zero mean and unit variance.

Fig. 3. Depiction of the dynamic lesion candidate selection from a voxel-level prediction. The selected slice shows two high-confidence and one low-confidence lesion candidate extracted from the initial voxel-level prediction. All steps are performed in 3D.
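The per-case intensity normalisation for the DA-UNet input can be sketched as follows (a minimal NumPy illustration; the function name is ours, and the noise/affine augmentations are omitted):

```python
import numpy as np

def normalise_case(t2w, dwi, adc):
    """Normalise one bpMRI exam for the DA-UNet pipeline (sketch):
    instance-wise min/max scaling to [0, 1] for T2W and DWI, and division
    of the ADC map by 3000 (its ~97.4th percentile) so that its
    diagnostically relevant absolute scale is retained."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo)
    return minmax(t2w), minmax(dwi), np.asarray(adc, dtype=float) / 3000.0
```

Dividing the ADC map by a fixed constant, rather than min/max scaling it, preserves the clinically meaningful absolute ADC values across patients.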

Extraction of Report Findings
Accuracy of automatically counting the number of PI-RADS ≥ 4 lesions in a report (n_sig) was determined by comparison against the number of PI-RADS ≥ 4 lesions in the manually annotated RUMC dataset. To account for multifocal lesions (which can be annotated as two distinct regions or a single larger one) and human error in the ground truth annotations, we manually checked the radiology report and verified the number of lesions whenever the ground truth and the automatic estimate disagreed.

Localisation of Report Findings
Localisation of clinically significant report findings is evaluated with the sensitivity and average number of false positives per case. Evaluation is performed with 5-fold cross-validation on the labelled RUMC dataset, for which PI-RADS ≥ 4 lesions were manually annotated in the MRI scan.

Segmentation of Report Findings
Quality of the correctly localised report findings is evaluated with the Dice similarity coefficient (DSC). This evaluation is performed with 5-fold cross-validation on the labelled RUMC dataset, which has manual PI-RADS ≥ 4 lesion annotations.
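For reference, the DSC between two binary masks can be computed as follows (the standard definition, not code from the study):

```python
import numpy as np

def dice_similarity(a, b):
    """Dice similarity coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|). Returns 1.0 for two empty masks by
    convention (a design choice; conventions vary)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom
```

A missed lesion (no overlapping candidate) can then be scored as a DSC of zero, as done in the results below.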

Prostate Cancer Detection and Statistical Test
Prostate cancer detection models are evaluated on 300 external exams from ZGT, with histopathology-confirmed ground truth for all patients. Studies are considered positive if they have at least one Gleason grade group ≥ 2 lesion (csPCa) (Epstein et al., 2016). Patient-based diagnostic performance was evaluated using the receiver operating characteristic (ROC) curve and summarised by the area under the ROC curve (AUROC). Lesion-based diagnostic performance was evaluated using free-response receiver operating characteristic (FROC) analysis and summarised by the partial area under the FROC curve (pAUC) between 0 and 1 false positive per case, similar to Saha et al. (2021). We trained our models with 5-fold cross-validation and 3 restarts for nnU-Net and 5 restarts for DA-UNet, resulting in 15 or 25 independent AUROCs and pAUCs on the test set for each model configuration. To determine the probability of one configuration outperforming another, we performed a permutation test with 1,000,000 iterations, using a statistical significance threshold of 0.01. 95% confidence intervals (CIs) for the radiologists were determined by bootstrapping with 1,000,000 iterations, with each iteration drawing N samples uniformly at random with replacement and calculating the target metric. Iterations that sampled only one class were rejected.
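A permutation test over the per-run metrics can be sketched as below. The paper does not specify the test statistic beyond comparing configurations; this sketch assumes the difference in mean AUROC/pAUC between the two groups of independent runs, with labels permuted between groups.

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_iter=100_000, seed=0):
    """Two-sample permutation test on the difference in mean performance
    between two groups of per-run metrics (e.g. 15 or 25 AUROCs per
    configuration). Returns the two-sided p-value: the fraction of label
    permutations whose mean difference is at least as extreme as observed."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    observed = scores_a.mean() - scores_b.mean()
    pooled = np.concatenate([scores_a, scores_b])
    n_a = len(scores_a)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # random reassignment of runs to the two groups
        diff = pooled[:n_a].mean() - pooled[n_a:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_iter
```

With only 15-25 runs per configuration, the permutation distribution is cheap to sample, which is why very large iteration counts (the paper uses 1,000,000) are feasible.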

Annotation-efficiency
Annotation-efficiency of semi-supervised training is defined as the fraction of manual annotations required to reach the same performance as supervised training. With N_supervised the number of manually annotated exams used for supervised training and N_semi-supervised the number of manually annotated exams required to reach the same performance, we obtain the annotation-efficiency ratio:

    annotation-efficiency = N_supervised / N_semi-supervised

To construct a continuous curve for semi-supervised performance as a function of the number of manually annotated exams, the performance of the manual annotation budgets is piecewise logarithmically interpolated, as illustrated by the coloured dashed lines in the bottom row of Figure 6 (which appear linear with a logarithmic x-axis). The required number of manual annotations is then derived from the intersections, as illustrated by the black dashed lines in the bottom row of Figure 6. This evaluates to:

    N_semi-supervised = N_a · (N_b / N_a)^((perf_supervised − perf_a) / (perf_b − perf_a))

where N_a is the number of manually annotated exams for the budget with performance just below supervised training, N_b is the number of manually annotated exams for the budget with performance just above supervised training, perf_supervised is the performance from supervised training, perf_a is the semi-supervised performance just below supervised training, and perf_b is the semi-supervised performance just above supervised training.
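The log-linear interpolation and the efficiency ratio reduce to a few lines (a sketch with our own function names; the formula interpolates linearly in log N between the two bracketing budgets):

```python
import math

def annotations_required(n_a, n_b, perf_a, perf_b, perf_supervised):
    """Number of manually annotated exams at which the piecewise-logarithmic
    interpolation of semi-supervised performance crosses the supervised
    performance. n_a/n_b bracket the crossing; perf_a/perf_b are the
    semi-supervised performances at those budgets."""
    frac = (perf_supervised - perf_a) / (perf_b - perf_a)
    return n_a * (n_b / n_a) ** frac  # linear in log(N)

def annotation_efficiency(n_supervised, n_semi_supervised):
    """Annotation-efficiency ratio: fraction of supervised annotations
    needed by semi-supervised training to match performance."""
    return n_supervised / n_semi_supervised
```

As a sanity check, the formula returns n_a when perf_supervised equals perf_a and n_b when it equals perf_b; the reported 5.7× lesion-based efficiency is consistent with comparing 2,440 per-run supervised training samples against 431 manually annotated exams, although the paper's exact choice of numerator is our reading of the text.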
Manual annotation budgets of 100, 300, 1,000 or 3,050 manually annotated exams, paired with the remaining 7,656, 7,456, 6,756 or 4,706 unlabelled exams, were investigated. For supervised training with 5-fold cross-validation, this corresponds to 80, 240, 800 or 2,440 manually annotated training samples per run. The supervised models are ensembled to generate automatic annotations for semi-supervised training, so semi-supervised training requires the full manual annotation budgets.

Extraction of Report Findings
Our score extraction script correctly identified the number of clinically significant lesions for 3,024 of the 3,044 (99.3%) radiology reports in our manually labelled RUMC dataset. We excluded reports and their studies when no PI-RADS scores could be extracted from the report: 8 cases (0.3%) from the labelled RUMC dataset and 121 cases (2.6%) from the unlabelled RUMC dataset. A full breakdown of automatically extracted versus manually determined numbers of significant lesions is given in Figure 4. Typing mistakes and scores changed in an addendum were the main sources of the 20 (0.7%) incorrect extractions, an error rate similar to what we observed for our annotators.

Localisation of Report Findings
Both prostate cancer segmentation architectures, nnU-Net and DA-UNet, can achieve high detection sensitivity. At this high-sensitivity operating point, the models also propose a large number of false positive lesion candidates, as indicated by the free-response receiver operating characteristic (FROC) curve shown in Figure 5. Masking the models' lesion candidates with the number of clinically significant report findings, n_sig, greatly reduces the number of false positive lesion candidates. At the sensitivity of the unfiltered automatic annotations, masking with radiology reports reduced the average number of false positives per case from 0.39 ± 0.14 to 0.064 ± 0.008 for nnU-Net and from 0.88 ± 0.29 to 0.097 ± 0.011 for DA-UNet (trained with 5-fold cross-validation on 3,050 manually annotated exams). This more than five-fold reduction in false positives greatly increases the usability of the automatic annotations.
Studies where we could extract fewer than n_sig lesion candidates were excluded. This removes studies where we are certain to miss lesions, and thus increases sensitivity. From the automatic annotations from nnU-Net we excluded 119 studies, resulting in a sensitivity of 83.8 ± 1.1% at 0.063 ± 0.008 false positives per study. From the automatic annotations from DA-UNet we excluded 4 studies, resulting in a sensitivity of 78.5 ± 3.3% at 0.096 ± 0.012 false positives per study.

Segmentation of Report Findings
Spatial similarity between the automatic and manual annotations is good. Trained with 5-fold cross-validation on 3,050 manual annotations, nnU-Net achieved a Dice similarity coefficient (DSC) of 0.67 ± 0.19, and DA-UNet achieved a DSC of 0.57 ± 0.17. Including the missed manual annotations as a DSC of zero reduces this to 0.51 ± 0.33 for nnU-Net and 0.45 ± 0.28 for DA-UNet. The full distribution of DSC against lesion volume is given in the Supplementary Materials. Figure 1 shows automatic annotations from nnU-Net, with a DSC of 0.70 (≈ mean) for the upper lesion of Patient 1, a DSC of 0.87 (≈ mean + SD) for Patient 2 and a DSC of 0.55 (≈ mean − SD) for Patient 3.

Annotation-efficiency of Training with Report-guided Automatic Annotations
Semi-supervised training significantly increased model performance for all investigated manual annotation budgets, compared to supervised training with the same number of manually annotated exams (P < 10⁻⁴ for each budget). Iteration 2 of semi-supervised training, where automatic annotations were generated by ensembling the models from iteration 1 of semi-supervised training, generally performed better than iteration 1. However, only the AUROC improvements for manual annotation budgets of 100 and 300 exams were statistically significant (P = 3.9 × 10⁻³ and P = 3.1 × 10⁻³, respectively). ROC and FROC performance, summarised by AUROC and pAUC, is shown in the bottom row of Figure 6.
Semi-supervised training (iteration 2) with 1,000 manually annotated exams exceeded the lesion-based pAUC of supervised training with 2,440 manual annotations (P < 10⁻⁴). Performance of semi-supervised training with 300 manually annotated exams came close (P = 0.032). Interpolation suggests the supervised performance is matched with 431 manually annotated exams (5.7× more annotation-efficient).

Discussion and Conclusion
Semi-supervised training significantly improved patient-based risk stratification and lesion-based detection performance of our prostate cancer detection models for all investigated manual annotation budgets, compared to supervised training with the same number of manually annotated exams. This improvement demonstrates the feasibility of our semi-supervised training method with report-guided automatic annotations. Furthermore, the automatic annotations are of sufficient quality to speed up the manual annotation process, by identifying negative cases that do not need to be inspected and by providing high-quality segmentation masks for the majority of the positive cases. Our semi-supervised training method enabled us to utilise thousands of additional prostate MRI exams with radiology reports from clinical routine, without manually annotating each finding in the MRI scan.
Semi-supervised training with automatically labelled prostate MRI exams consistently improved model performance, with groups of baseline and augmented models truly reflecting the difference in performance due to the automatically labelled exams. Comparing groups of models ensures that results are not due to the variation in performance inherent to deep learning's stochastic nature.
Automatically annotating training samples enables us to improve our models, but also speeds up the manual labelling process for new cases. Accurate PI-RADS score extraction from radiology reports allows us to automatically identify all negative cases (with ≤ 1% error rate), saving the repetitive operation of reading a radiology report only to designate a study as negative. As this applies to approximately 60% of the studies, this alone amounts to a large time saving. Furthermore, the segmentation masks from nnU-Net are often of sufficient quality to require only verification of the location, saving significant amounts of time for positive studies as well.
Direct applicability of the automatic PI-RADS extraction from radiology reports is, however, limited. The rule-based score extraction was developed with the report templates from RUMC in mind, and is likely to fail for reports with a different structure. For institutes that also have structured reports, the rule-based score extraction can be adapted to match their findings. For unstructured (free text) reports, the task of counting the number of clinically significant findings can be performed by a deep learning-based natural language processing model. This model can be trained on the reports of the manually labelled subset, using the number of findings found in the manual annotations as labels.
Another limitation is that the prostate cancer detection models were trained with prostate MRI scans from a single vendor (Siemens Healthineers, Erlangen; Magnetom Trio/Skyra/Prisma/Avanto). These models are therefore likely to perform worse on scans from different scanner models.
Fig. 6. (top row) Model performance for (semi-)supervised training. Supervised models are trained with 5-fold cross-validation on 3,050 manually annotated exams, and self-training also includes automatically annotated exams from 4,706 unlabelled exams. Automatic annotations for self-training are generated using the supervised models with the same architecture. (bottom row) nnU-Net performance for 100, 300, 1,000 or 3,050 manually annotated exams, combined with 7,656, 7,456, 6,756 or 4,706 unlabelled cases, respectively. Automatic annotations for iteration 1 are generated using the supervised models; automatic annotations for iteration 2 are generated using the semi-supervised models from iteration 1. The left panels show ROC performance for patient-based diagnosis of exams with at least one Gleason grade group (GGG) ≥ 2 lesion, and the right panels show FROC performance for lesion-based diagnosis of GGG ≥ 2 lesions. All models are trained on radiology-based PI-RADS ≥ 4 annotations and evaluated on the external test set with histopathology-confirmed ground truth. Shaded areas indicate the 95% confidence intervals from 15 or 25 independent training runs.

In conclusion, semi-supervised training with report-guided automatic annotations significantly improved csPCa detection performance, allowing unlabelled samples to be leveraged without additional manual effort. Furthermore, automatic annotations can speed up the manual annotation process. Our proposed method is widely applicable, paving the way towards larger datasets with equal or reduced annotation time.

B. Extraction of Report Findings
First, we tried to split the radiology reports into sections for individual findings, by searching for text that matches the following structure:

[PI-RADS] (separators) [number 1-5]
Where the optional separators include 'v2 category' and ':'. The dash between 'PI' and 'RADS' is optional. The T2W, DWI and DCE scores, which define the PI-RADS score, are extracted analogously to the PI-RADS score, while also allowing joint extraction:
When the report could not be split into sections per lesion, we applied strict pattern matching on the full report. During strict pattern matching we only extract T2W, DWI and DCE scores jointly, to ensure the scores belong to the same lesion. The resulting PI-RADS scores were extracted from the full report and matched to the individual scores.

C. Detection Model Architecture
For our prostate cancer segmentation task, nnU-Net [1] configured itself to use a 3D U-Net with five down-sampling steps, as shown in Figure B. This figure also shows the specific choice of 2D/3D convolutional blocks, max-pooling layers and transposed convolutions. No cascade of U-Nets or 2D U-Net was triggered for our dataset. The implementation of the DA-UNet architecture is the same as in [2], with the exception of LeakyReLU [3] activation functions throughout the decoder and a decreased L2 kernel regularisation of 10⁻⁴. See [2] for implementation details.
[2] A. Saha

Spatial congruence between automatic and manual csPCa annotations, as measured by the Dice similarity coefficient, for automatic annotations derived from (top) nnU-Net or (bottom) DA-UNet. Both methods were evaluated on the labelled RUMC dataset with 5-fold cross-validation, excluding studies with empty PI-RADS extraction from the radiology report and studies with insufficient lesion candidates. All metrics were computed in 3D.