Spectral decoupling for training transferable neural networks in medical imaging

Summary
Many neural networks for medical imaging generalize poorly to data unseen during training. Such behavior can be caused by overfitting easy-to-learn features while disregarding other potentially informative features. A recent implicit bias mitigation technique called spectral decoupling provably encourages neural networks to learn more features by regularizing the networks' unnormalized prediction scores with an L2 penalty. We show that spectral decoupling increases the networks' robustness to data distribution shifts and prevents overfitting on easy-to-learn features in medical images. To validate our findings, we train networks with and without spectral decoupling to detect prostate cancer on tissue slides and COVID-19 in chest radiographs. Networks trained with spectral decoupling achieve up to 9.5 percentage points higher performance on external datasets. Spectral decoupling alleviates generalization issues associated with neural networks and can be used to complement or replace computationally expensive explicit bias mitigation methods, such as stain normalization in histological images.


Introduction
Neural networks have been adapted to many medical imaging tasks with impressive results, often surpassing human counterparts in consistency, speed and accuracy [1]. However, these networks are prone to overfit easy-to-learn, or statistically dominant, features while disregarding other potentially informative features. This leads to poor generalisation to data generated by different medical centres, reliance on the dominant features, and lack of robustness [2,3]. For example, a neural network classifier for skin cancer, approved to be used as a medical device in Europe, had overfit the correlation between surgical margins and malignant melanoma [4]. Due to this, the false positive rate of the network was increased by 40 percentage points during external validation. Furthermore, three out of five neural networks for pneumonia detection showed significantly worse performance during external validation [5] and recent neural networks for COVID-19 detection rely on confounding factors rather than actual medical pathology [6]. Even small differences in the sharpness of images from two different scanners can degrade the performance of neural networks significantly (see Section 3.2).
Although generalisation issues need to be solved before any neural networks can be applied in clinical practice, the phenomenon is still poorly understood [7]. This may be because the detection of generalisation issues is hard and often requires state-of-the-art methods of explainable AI [6]. Evaluation on an external dataset is one of the few ways to test generalisation performance, although it will uncover generalisation issues only when the neural network fails to generalise to that dataset. If a neural network achieves high overall accuracy on the external dataset, it may still consistently fail for some subset of samples. Any particular external dataset may also contain the same sources of bias as the training data.
Explicit methods have been proposed to address specific sources of bias, like using augmentation to address staining differences in tissue section slides [8] or normalising each image with a common standard [9,10]. The obvious problem with explicit methods is that they only control for selected biases, and more subtle sources of bias, like small differences between patient populations, may go unaddressed. Implicit methods of bias control are required before neural networks can be safely applied to clinical practice.

arXiv:2103.17171v4 [eess.IV] 17 Dec 2021
Learning dominant features at the cost of other potentially informative features, also known as shortcut-learning, is a common problem in all neural networks and one of the main reasons behind the generalisation issues [3]. Shortcut-learning occurs mainly because of gradient starvation, where gradient descent updates the parameters of a neural network in directions capturing only dominant features, thus starving the gradient from other features [11]. The gradient descent algorithm finds a local optimum by taking small steps in the direction opposite to the gradient, the direction of the steepest descent [12]. The recently proposed method of spectral decoupling [2] provably decouples the learning dynamics leading to gradient starvation when using cross-entropy loss, thus encouraging the network to learn more features. The effect is achieved by simply adding an L2 penalty on the unnormalised prediction scores (logits) of the network.
We evaluate the utility of spectral decoupling as an implicit bias mitigation method in the context of medical imaging. We use simulation experiments to show that spectral decoupling increases networks' robustness to data distribution shifts and can be used to train generalisable networks on datasets with a strong superficial correlation. The findings are then evaluated by training prostate cancer and COVID-19 classifiers, where the networks trained with spectral decoupling achieve significantly higher performance on all evaluation datasets.

Spectral decoupling
In spectral decoupling, the network is regularised by imposing an L2 penalty on the unnormalised outputs of the last layer of the network, or logits ŷ, which is then added to the cross-entropy loss, L_CE. This penalty provably [2] avoids the conditions leading to gradient starvation in networks trained with cross-entropy loss. Two variants of the penalty are defined as

$$\mathcal{L} = \mathcal{L}_{CE} + \frac{\lambda}{2} \lVert \hat{y} \rVert_2^2 \tag{1}$$

$$\mathcal{L} = \mathcal{L}_{CE} + \frac{\lambda}{2} \lVert \hat{y} - \gamma \rVert_2^2 \tag{2}$$

For Equation 1, there is a single tunable hyper-parameter λ.
For Equation 2, hyper-parameters λ and γ are tuned separately for each class, a total of four hyper-parameters for the binary classification task in our study. Pseudo-code for implementing Equation 1 is presented in Algorithm 1.
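The per-class variant in Equation 2 is not covered by Algorithm 1, so one possible reading of it is sketched below in NumPy. The sketch assumes a penalty of the form (λ/2)‖ŷ − γ‖², that each sample uses the λ and γ of its target class, and that the penalty is averaged over the batch. The function names, array layout and example values are illustrative and are not taken from the original implementation.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy of integer targets given unnormalised logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def sd_loss_per_class(logits, targets, lam, gamma):
    """Cross-entropy plus a per-class L2 penalty on the logits.

    Assumption: sample i uses lam[targets[i]] and gamma[targets[i]].
    """
    lam = np.asarray(lam, dtype=float)[targets]      # shape (batch,)
    gamma = np.asarray(gamma, dtype=float)[targets]  # shape (batch,)
    sq_dist = ((logits - gamma[:, None]) ** 2).sum(axis=1)
    penalty = (lam / 2.0 * sq_dist).mean()
    return cross_entropy(logits, targets) + penalty


logits = np.array([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.3]])
targets = np.array([0, 1, 1])

# With gamma = 0 and a shared lambda this reduces to the single-parameter form.
loss_eq1 = sd_loss_per_class(logits, targets, lam=[0.1, 0.1], gamma=[0.0, 0.0])
# Four hyper-parameters for the binary task: one (lambda, gamma) pair per class.
loss_eq2 = sd_loss_per_class(logits, targets, lam=[0.1, 0.05], gamma=[0.5, 1.0])
```

Setting every λ to zero recovers plain cross-entropy, which gives a simple sanity check when tuning the four hyper-parameters.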

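    for (images, targets) in loader:
        # Pass images through the network.
        logits = net(images)
        # Assumed completion of the truncated listing: cross-entropy loss plus
        # the L2 penalty of Equation 1 (averaging over the batch is an assumption).
        loss = cross_entropy(logits, targets) + lam / 2 * (logits ** 2).mean()

All digital slide images are cut and processed with HistoPrep [15]. A summary of the prostate datasets is presented in Table 1.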

COVID-19
For COVID-19 detection, we use large open-access repositories of chest radiographs. The COVIDx8 dataset is compiled from five different open-source repositories and contains radiographs from over 15,000 patient cases from at least 51 countries, with over 1500 COVID-19 positive patient cases [16,17,18,19,20]. The BIMCV± dataset (iteration 2) contains 3033 positive and 2743 negative COVID-19 patient cases, and 9171 radiographs after exclusions, collected from the same group of medical centres during the same time period [21]. Only PA and upright AP radiographs [16] with windowing information were selected from the BIMCV± dataset. The PadChest dataset contains over 67,000 COVID-19 negative patient cases and 114,227 radiographs from a single medical centre in Valencia, Spain [22]. Nineteen corrupted images were excluded from the PadChest dataset.
The COVIDx8 dataset is reserved as an external dataset, and two training datasets are compiled: one using only the BIMCV± dataset and one combining the PadChest and BIMCV± datasets. Five percent of each training dataset is set aside for validation.

Simulation datasets
Two simulation experiments are used to more closely investigate the utility of spectral decoupling as an implicit bias mitigation method. For both experiments, the dataset from Helsinki University Hospital described in Section 2.2 is modified in specific ways.

Cutout dataset
A dominant feature present in a real-world dataset could be, for example, a biological marker, a certain cancer type or a scanner artefact. To represent these kinds of features, 16 cutouts of 8 × 8 pixels are added to the images (Figure 1).
For the experiment, 200,000 images are selected for the training set with an equal number of samples with cancerous and benign annotations. For the training set, cutouts are added to 25% and 2.5% of the benign and cancerous samples, respectively. This makes the presence of cutouts in the image spuriously correlated with a benign annotation. If the network overfits this correlation, cancerous samples with cutouts may be classified as benign. Thus, for the test set, cutouts are added to all cancerous samples and none of the benign samples. For a control training set, cutouts are added to all images. Networks trained with this dataset provide a reference point of the performance with cutouts but without the spurious correlation.
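The construction of the training set above can be sketched as follows. This is a minimal illustration under several assumptions: the cutouts are taken to be zero-valued squares at uniformly random positions (overlaps allowed), labels are encoded as 0 for benign and 1 for cancerous, and small random 32 × 32 arrays stand in for the tissue images. None of these details or function names come from the original implementation.

```python
import numpy as np


def add_cutouts(image, rng, n_cutouts=16, size=8):
    """Return a copy of an H x W x C image with square patches set to zero."""
    out = image.copy()
    height, width = out.shape[:2]
    for _ in range(n_cutouts):
        top = rng.integers(0, height - size + 1)
        left = rng.integers(0, width - size + 1)
        out[top:top + size, left:left + size] = 0
    return out


def build_training_set(images, labels, rng, p_benign=0.25, p_cancer=0.025):
    """Add cutouts to a class-dependent fraction of the samples.

    labels: 0 = benign, 1 = cancerous (encoding assumed for this sketch).
    Returns the modified images and a boolean mask of samples with cutouts.
    """
    probs = np.where(labels == 0, p_benign, p_cancer)
    has_cutout = rng.random(len(labels)) < probs
    modified = np.stack([
        add_cutouts(img, rng) if flag else img
        for img, flag in zip(images, has_cutout)
    ])
    return modified, has_cutout


rng = np.random.default_rng(0)
# Stand-in images with pixel values >= 1, so zeros can only come from cutouts.
images = rng.integers(1, 256, size=(2000, 32, 32, 3), dtype=np.uint8)
labels = np.repeat([0, 1], 1000)  # balanced classes
train_images, has_cutout = build_training_set(images, labels, rng)

benign_rate = has_cutout[labels == 0].mean()
cancer_rate = has_cutout[labels == 1].mean()
```

Under the same assumptions, the test set would call `add_cutouts` on every cancerous sample and no benign sample, and the control training set would call it on every image.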

Robustness dataset
Shifts from the training data distribution are common when evaluating the neural network with datasets from different medical centres. Small changes in the images due to differences in, for example, sample preparation or imaging equipment can cause shifts from the training data distribution. We assess the networks' robustness to these data distribution shifts by applying transformations with increasing magnitudes to the images in the test set. Image sharpness and stain intensity were selected to represent possible dataset shifts caused by differences in the used scanner and sample preparation, respectively.
The UniformAugment augmentation strategy consists of applying random transformations with a uniformly sampled magnitude to the images before feeding them to the network [23]. Sharpening the image is included in the set of possible transformations [24], meaning that the network sees sharpened images during training. Thus, the data distribution shift caused by sharpening images is being explicitly mitigated, which should help the network to predict correct labels for evaluation images with higher sharpness. Blurring the image is not included in the set of possible transformations [24], meaning that the network will not see randomly blurred images during training. Thus, the data distribution shift caused by blurring the images will not be explicitly mitigated and the use of UniformAugment should not directly help the network with blurry evaluation images.
By evaluating the network with increasingly sharpened or blurred images, it is possible to assess whether spectral decoupling can improve upon situations where the data distribution shift is, and is not, explicitly addressed. Additionally, there are large differences in the sharpness values of real-world datasets from different medical centres and scanners (Figure 2).
Step-wise blurring is achieved by simple averaging with an n × n kernel, where n ∈ {2, ..., 20}. A sharpened version of the image, x_sharp, is created by applying a sharpening kernel to the original image x_original. Sharpness is then gradually increased by creating a new image x_blend with

$$x_{blend} = (1 - \alpha)\, x_{original} + \alpha\, x_{sharp} \tag{3}$$

where α ∈ {0, 0.1, ..., 1} defines the amount of sharpness increase.
To assess the data distribution shifts caused by differences in sample preparation, the intensities of the haematoxylin and eosin stains are computationally modified. Haematoxylin highlights cell nuclei, and eosin highlights cytoplasm, connective tissue and muscle. The stain intensities depend on multiple steps in the staining process, and thus the final colour distribution of the slide images varies a lot [8]. The stain intensity modification is achieved by first separating the haematoxylin and eosin stains with the Macenko method [25]. The concentrations of each stain can then be reduced by multiplication with a value between 0 and 1 before the stains are combined back together. An example of the method is shown in Figure 3.
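The blurring and sharpening transformations described above can be sketched for a single greyscale image as follows. The sharpening kernel used here is a common 3 × 3 stand-in and is an assumption, as is the linear blend between the original and sharpened image; edge padding and the helper names are likewise illustrative choices rather than details of the original pipeline.

```python
import numpy as np


def filter2d(image, kernel):
    """Correlate a 2-D greyscale image with a kernel, using edge padding."""
    kh, kw = kernel.shape
    top, left = (kh - 1) // 2, (kw - 1) // 2
    padded = np.pad(image.astype(float),
                    ((top, kh - 1 - top), (left, kw - 1 - left)), mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0],
                                         j:j + image.shape[1]]
    return out


def blur(image, n):
    """Average each pixel over an n x n neighbourhood."""
    return filter2d(image, np.full((n, n), 1.0 / n ** 2))


# Stand-in sharpening kernel (an assumption, not the kernel used in the study).
SHARPEN_KERNEL = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)


def sharpen(image, alpha):
    """Linear blend between the original (alpha=0) and sharpened (alpha=1) image."""
    x_sharp = np.clip(filter2d(image, SHARPEN_KERNEL), 0, 255)
    return (1 - alpha) * image + alpha * x_sharp


rng = np.random.default_rng(0)
x_original = rng.integers(0, 256, size=(64, 64)).astype(float)

# One transformed copy per magnitude, mirroring the step-wise evaluation.
blurred = {n: blur(x_original, n) for n in range(2, 21)}
sharpened = {a: sharpen(x_original, a) for a in np.linspace(0, 1, 11).round(1)}
```

For colour images, the same functions could be applied to each channel separately.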

Training details
An EfficientNet-b0 network [26], with dropout [27] and stochastic depth [28] of 20% and an input size of 224 × 224, is used as a prostate cancer classifier for all experiments. For augmentation, the input images are randomly cropped and flipped, resized, and then transformed with UniformAugment [23], using a maximum of two transformations. Each network is trained for 90 epochs, with a learning rate of 0.005, a batch size of 512 and cosine scheduling. Weight decay of 0.0001 is used for networks trained without spectral decoupling. When training neural networks with spectral decoupling, weight decay is disabled.
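As an illustration of the cosine scheduling mentioned above, the learning rate per epoch could be computed as in the sketch below. It assumes one update per epoch, no warm-up and a decay towards zero; these details are not specified in the text, and the function name is hypothetical.

```python
import math


def cosine_lr(epoch, base_lr=0.005, total_epochs=90, min_lr=0.0):
    """Cosine-annealed learning rate at the start of a given epoch (0-indexed)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate for each of the 90 training epochs.
schedule = [cosine_lr(epoch) for epoch in range(90)]
```

The schedule starts at the base learning rate of 0.005, passes through half of it at the midpoint of training and approaches the minimum value at the end.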
For COVID-19 detection, we replicate the training regimen from [6], where a DenseNet-121 network [29] is pre-trained with the ImageNet dataset and then fine-tuned for 30 epochs as a binary COVID-19 classifier. All hyper-parameters, other than spectral decoupling, are set to values reported in the paper.
For spectral decoupling, Equation 2 is used for the first simulation experiment on dominant features (Section 3.1) and COVID-19 detection (Section 3.4). Equation 1 is used for all other experiments (Sections 3.2 and 3.3).
Each experiment is repeated five times and the summary metrics for these runs are reported. All reported performance metrics are balanced between the classes when necessary, and a cut-off value of 0.5 is used to obtain a binary label from the normalised predictions of the network. To compare paired receiver operating characteristic (ROC) curves, we use a one-tailed DeLong's test and report the Z-values and p-values [30].
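The thresholding and class balancing described above can be illustrated with a short sketch. It assumes that "balanced" refers to averaging the per-class recalls (balanced accuracy) and that a prediction equal to the cut-off counts as positive; both are assumptions, and the function name and toy data are illustrative only.

```python
import numpy as np


def binary_metrics(probabilities, targets, cutoff=0.5):
    """Accuracy, recall and balanced accuracy for a binary classifier.

    probabilities: predicted probability of the positive class.
    targets: 1 for positive, 0 for negative samples.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    targets = np.asarray(targets, dtype=int)
    predictions = (probabilities >= cutoff).astype(int)

    accuracy = (predictions == targets).mean()
    recall = predictions[targets == 1].mean()             # true positive rate
    specificity = 1.0 - predictions[targets == 0].mean()  # true negative rate
    balanced_accuracy = (recall + specificity) / 2.0
    return {"accuracy": accuracy, "recall": recall,
            "balanced_accuracy": balanced_accuracy}


# Imbalanced toy example: 2 positive and 8 negative samples.
probs = [0.9, 0.4, 0.1, 0.2, 0.3, 0.6, 0.1, 0.2, 0.05, 0.45]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(probs, labels)
```

In this toy example the plain accuracy (0.8) is noticeably higher than the balanced accuracy (0.6875), because the majority class dominates the unbalanced metric.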

Experiments
In this section, the utility of using spectral decoupling as an implicit bias mitigation method is explored with both simulation and real-world experiments.

Dominant features
To assess the utility of spectral decoupling in situations where the training dataset contains a strong dominant feature, the cutout dataset defined in Section 2.3.1 is used. Five networks are trained with either spectral decoupling or weight decay on the training set. Additionally, five networks are trained on the control dataset with weight decay to provide a reference point of the performance under no spurious correlation caused by the dominant feature. The results are presented in Table 2. Accuracy is defined as the fraction of all instances that were correctly identified, and recall as the fraction of positive instances that were correctly identified.
The use of spectral decoupling increases the accuracy by 8.5 percentage points over weight decay and almost reaches the performance of the network trained on the control dataset. The networks trained without spectral decoupling appear to make false predictions based on the dominant feature, although the class activation maps [34] of the trained neural networks do not significantly differ between weight decay and spectral decoupling. As hyper-parameters were tuned on the test set, the results should be interpreted only as a demonstration that spectral decoupling can offer an important level of control over the features that are learned.
The simpler variant of spectral decoupling in Equation 1 did not increase the networks' performance in any way, and Equation 2 produced the reported results only after extensive hyper-parameter tuning. The hyper-parameter tuning was sensitive to the selected parameters, and even small changes to the final values significantly reduced the accuracy of the neural network. Similar results were also reported with the real-world example in the original paper [2]. As extensive hyper-parameter tuning can deter researchers from applying the method, we limit hyper-parameter tuning to a simple grid search over limited search spaces for all other experiments, as described in Section 2.1.

Robustness
To assess whether spectral decoupling increases neural networks' robustness to data distribution shifts, five networks are trained with either spectral decoupling or weight decay and evaluated on the robustness dataset described in Section 2.3.2. Additionally, five networks are trained with weight decay but without UniformAugment to assess how much the augmentation strategy improves robustness. The robustness to data distribution shifts caused by sharpening, blurring and reducing the intensity of either haematoxylin or eosin stain is presented in Figure 4.
The performance of all networks trained with weight decay and without the augmentation strategy degrades to roughly 50% accuracy. Training the networks again with UniformAugment significantly increases robustness to all data distribution shifts except haematoxylin stain intensity reduction (Figure 4C). When the data distribution shift is included as a possible augmentation (Figure 4A), the increase in accuracy is almost 40 percentage points with the most severe distribution shift. When the data distribution shift is not included as a possible transformation (Figure 4B-D) [...] without augmentation. This result demonstrates the importance of using augmentation as an explicit bias mitigation method.
Although the use of augmentation already increased the accuracy by almost 40 percentage points, the use of spectral decoupling is able to improve the accuracy by a further 4.6 percentage points with the most severe data distribution shift (Figure 4A). The increase in accuracy is more pronounced with blurring, 12.4 percentage points with n = 19 (Figure 4B), and eosin stain intensity reduction, where networks trained with spectral decoupling achieve 1.2 to 8.5 percentage points higher accuracy with a 0.9 to 0.0 multiplier (Figure 4D). These data distribution shifts are not included as possible transformations in UniformAugment, and are thus not explicitly controlled. With haematoxylin stain intensity reduction, all networks degrade similarly in performance (Figure 4C). These results show that spectral decoupling is able to significantly complement and improve upon augmentation, as well as improve robustness to data distribution shifts that are not explicitly controlled by augmentation.

Prostate cancer detection
To assess whether the results of the simulation experiments translate into improvements in real-world datasets, we train networks with and without spectral decoupling to detect prostate cancer on haematoxylin and eosin stained whole slide images of the prostate. These networks are then evaluated on four different datasets described in Section 2.2.
The results are presented in Figure 5. Networks trained with spectral decoupling show higher performance on all evaluation datasets. The difference between weight decay and spectral decoupling gets more pronounced as we move further away from the training dataset distribution. Finally, there is a 9.5 percentage point increase in accuracy over weight decay on the dataset from a different medical centre. The reported performances are not comparable between evaluation datasets, as each dataset has been annotated with a different strategy and thus contains a different amount of label noise.
To further explore why networks trained without spectral decoupling fail to generalise to the dataset from Radboud University Medical Center (Figure 5D), the robustness to haematoxylin and eosin stain intensities is explored in Figures 6A-B. Networks trained with spectral decoupling are less sensitive to both haematoxylin and eosin stain intensity reduction and, interestingly, networks trained with weight decay actually increase in accuracy when the eosin stain intensity is reduced. This indicates that the difference between spectral decoupling and weight decay performance in Figure 5D may be partly due to differences in the stain intensities between the two medical centres. To explore this possibility, the stain intensities of the external dataset are normalized with the Macenko method [25] to match the training data stain intensities, and the resulting performance increases are reported in Figure 6C. Networks trained with either spectral decoupling or weight decay both benefit from stain normalization. Stain normalization is especially beneficial for networks trained with weight decay, where the mean network accuracy is increased by 7.5 percentage points. Networks trained with spectral decoupling still perform better than networks trained with weight decay coupled with stain normalization. These results demonstrate that spectral decoupling can complement or even replace normalization methods, with negligible computational requirements (Figure 6D).

COVID-19 detection
To assess whether spectral decoupling can help in real-world situations with strong dominant features and spurious correlations, we train five networks with and without spectral decoupling to detect COVID-19 positive patients in chest radiographs. Two different training datasets are used to train the networks and all networks are evaluated on the same external validation set, described in Section 2.2.2. We first train neural networks with the BIMCV± dataset, which represents an ideal situation where both the positive and negative samples originate from similar sources. Second, we train networks with the combined PadChest and BIMCV± dataset. This dataset represents a situation where the network can easily achieve high performance by only learning to detect where a sample originates, as most of the negative samples come from a single medical centre.
After training all networks, the predictions from each network are averaged to obtain ensemble predictions for both weight decay and spectral decoupling. ROC curves for ensemble predictions are presented in Figure 7, with bootstrapped (n = 1000) 95% confidence intervals (CI) for each area under the ROC curve (AUROC) value. When training networks with the combined PadChest and BIMCV± dataset, AUROC values of networks trained with either method decrease, although the number of training samples is increased over tenfold. The decrease in AUROC is similar for both weight decay and spectral decoupling, 0.065 and 0.067, respectively. This indicates that spectral decoupling is unable to mitigate bias in the combined dataset. As most of the negative samples originate from a single medical centre, shortcut learning seems to happen even though spectral decoupling encourages the network to learn more features. Detecting where a sample originates is especially easy with radiographs due to systematic differences between data repositories and medical centres, which could be exploited by a neural network [6]. Thus, the higher AUROC value of spectral decoupling is more likely due to increased robustness to data distribution shifts than avoidance of shortcut learning.
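The ensemble averaging and bootstrapped confidence intervals described above could be computed along the following lines. This is a sketch under stated assumptions: the AUROC is computed from pairwise comparisons of positive and negative scores, the interval is a percentile bootstrap over samples, and synthetic scores replace the actual network predictions. The function names and the toy data are illustrative only.

```python
import numpy as np


def auroc(scores, targets):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[targets == 1]
    neg = scores[targets == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties


def bootstrap_auroc_ci(scores, targets, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    rng = np.random.default_rng(seed)
    values = []
    while len(values) < n_boot:
        idx = rng.integers(0, len(targets), size=len(targets))
        if targets[idx].min() == targets[idx].max():
            continue  # a resample must contain both classes
        values.append(auroc(scores[idx], targets[idx]))
    tail = (1.0 - level) / 2.0
    return tuple(np.quantile(values, [tail, 1.0 - tail]))


# Toy data: five networks, 200 samples, scores loosely correlated with the label.
rng = np.random.default_rng(1)
targets = rng.integers(0, 2, size=200)
member_scores = rng.normal(loc=targets, scale=1.0, size=(5, 200))

ensemble_scores = member_scores.mean(axis=0)  # average the five predictions
point_estimate = auroc(ensemble_scores, targets)
ci_low, ci_high = bootstrap_auroc_ci(ensemble_scores, targets)
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large evaluation sets.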

Discussion
Generalisation performance is regarded as the main challenge standing in the way of true clinical adoption of a neural network [7]. Van der Laak et al. [7] argue that there is a need for public datasets which are truly representative of clinical practice. Although this is indeed important, we argue that training datasets, no matter how large, will never account for all possible variations caused by differences in imaging equipment, sample preparation and patient populations. Thus, it is crucial to couple extensive multi-source datasets with explicit and implicit bias mitigation methods to train neural networks which are robust to unseen variations.
Two explicit methods of bias mitigation have been proposed for medical imaging. Augmentation of the training samples is crucial as it substantially increases robustness to distribution shifts from the training data caused by differences in imaging equipment or sample preparation (Figure 4, [8]). Thus, it is strongly recommended to use extensive augmentation strategies for training neural networks intended for clinical practice. Normalization of all images to a common standard would substantially reduce the distribution shifts [9,10,35], but comes with a considerable computational cost (Figure 6D). Both methods address important problems and should be complementary to any implicit methods of bias control.
Spectral decoupling is, to our knowledge, the first implicit bias mitigation method for addressing the generalisation issues in neural networks. The method is complementary to augmentation, increasing the robustness to distribution shifts already addressed with augmentation (Figure 4A). Above all, spectral decoupling significantly increases the robustness to distribution shifts not addressed by augmentation (Figure 4B) and could be used to replace computationally expensive stain normalisation methods (Figure 6C).
By encouraging the neural network to learn more features, spectral decoupling can also help in situations where the training dataset contains strong dominant features or spurious correlations (Table 2). This is crucial as the dominant features can also be inherent to the data, such as different cancer types. For example, with prostate cancer, different Gleason grades [36] are often unbalanced in the training set. Due to gradient starvation [11], the features of the underrepresented Gleason grades may not be learned by the neural network. Balancing the dataset, so that all Gleason grades are represented equally, is not easy or even desired as the grading is based on a continuous range of histological patterns.
In COVID-19 detection, the networks' performance decreased similarly for both weight decay and spectral decoupling (Figure 7) when training the networks on the combined BIMCV± and PadChest dataset. Radiographs contain systematic differences between data repositories and medical centres, such as laterality tokens and differences in the radiopacity of the image borders, which could arise from variations in patient position, radiographic projection or image processing [6]. These differences can be easily leveraged by neural networks to detect where a single radiograph originates. We speculate that spectral decoupling was unable to prevent shortcut-learning due to the ease of shortcut learning in the combined PadChest and BIMCV± dataset. In addition, our results showing the ability to prevent shortcut learning (Table 2) were obtained after considerable hyper-parameter optimization and no significant differences could be seen in the class activation maps between networks trained with either weight decay or spectral decoupling. Thus, removal of any obvious superficial correlations from the training dataset is crucial, as there seems to be a limit to how much spectral decoupling can help with dominating features and spurious correlations.
The advantages of spectral decoupling can be clearly seen when the network is evaluated with out-of-distribution samples (Figures 4, 5 and 7). Neural networks trained with spectral decoupling retain their performance with samples further from the training data distribution, which is exactly what is required from neural networks intended for clinical practice [7]. Although using an external dataset may not reveal all generalization problems, it is clear that without spectral decoupling the neural networks fail to generalize to this particular external dataset from Radboud University Medical Center (Figures 5D and 6). Even in COVID-19 detection, where spectral decoupling seems to fail in preventing shortcut learning, the performance of the network is significantly increased over the state-of-the-art.

Conclusions
Spectral decoupling is the first implicit bias mitigation method for training neural networks to be used across multiple medical centres. The method adds no computational costs, is easy to implement, and complements and improves upon explicit bias mitigation methods. Our results support the use of spectral decoupling in all neural networks intended for clinical use.

Figure 2: Kernel density estimation of the variance of the images after a Laplace transformation. A higher variance indicates a sharper image. The image is generated from the pre-processing metrics calculated by HistoPrep [15].

Figure 3: Separation of the haematoxylin and eosin stains with the Macenko method.

Figure 4: Robustness to data distribution shifts from the training data. The lines show the mean accuracy and the shaded regions represent one standard deviation around the mean.

Figure 5: Neural network performance on evaluation datasets. Each consecutive evaluation dataset moves further from the training data distribution. Networks trained with spectral decoupling improve accuracy by 0.35 (A), 1.0 (B), 3.6 (C) and 9.5 (D) percentage points over weight decay. All models are trained with UniformAugment.

Figure 6: Spectral decoupling can complement or even replace computationally heavy stain normalization methods. Robustness to data distribution shifts, on the external dataset, caused by haematoxylin (A) or eosin (B) stain intensity reduction. (C) Network accuracy increases when normalizing haematoxylin and eosin stain intensities with the Macenko method. (D) Comparison of the computational requirements between spectral decoupling and the Macenko method. The images-per-second estimate for spectral decoupling is calculated with Equation 1, where ŷ is a 512 × 1 matrix, and Macenko stain normalisation is performed on resized images of size 224 × 224.

Table 2: Results of the simulation study with the cutout dataset on dominant features. The mean and standard deviation (SD) values are reported for each set of five trained networks.