Fusion of Deep Convolutional Neural Networks for No-Reference Magnetic Resonance Image Quality Assessment

The quality of magnetic resonance images may influence the diagnosis and subsequent treatment. Therefore, in this paper, a novel no-reference (NR) magnetic resonance image quality assessment (MRIQA) method is proposed. In the approach, deep convolutional neural network architectures are fused and jointly trained to better capture the characteristics of MR images. Then, to improve the quality prediction performance, the support vector machine regression (SVR) technique is employed on the features generated by fused networks. In the paper, several promising network architectures are introduced, investigated, and experimentally compared with state-of-the-art NR-IQA methods on two representative MRIQA benchmark datasets. One of the datasets is introduced in this work. As the experimental validation reveals, the proposed fusion of networks outperforms related approaches in terms of correlation with subjective opinions of a large number of experienced radiologists.


Introduction
Image quality assessment (IQA) of magnetic resonance (MR) images plays a vital part in the diagnosis and successful treatment [1][2][3]. The IQA methods aim to provide an automatic, repeatable, and accurate evaluation of images that would replace tests with human subjects. Such tests are often time-consuming, difficult to organize, and their output may depend on the considered group of participants. Therefore, the progress in the development of IQA techniques depends on the availability of assessed image databases. This is particularly important for MRIQA methods that require MR image databases with opinions of a representative number of radiologists, i.e., the databases are used for the comparison of methods and stimulate the emergence of new approaches in the field. The IQA approaches are divided into three groups, depending on whether distortion-free images are used: full-reference (FR), reduced-reference (RR), and no-reference (NR). The availability of unaltered, distortion-free reference images is the basis for this differentiation. Nevertheless, such pristine images, or their partial characteristics, are unavailable for MR images, limiting the practical application of FR and RR methods. Therefore, NR MRIQA measures are highly desired, while FR and RR approaches are mostly employed for artificially distorted, high-quality MR images.
There are several approaches to the FR medical IQA [4]. Among them, the Peak Signal-to-Noise Ratio (PSNR) is the most popular. However, it might not be accurate enough to give a proper measure between distorted and reference images, concerning

Related Work
The introduced method belongs to the category of NR techniques that use a deep learning approach to predict the quality of assessed images. However, before such approaches were possible for the IQA of natural images, many handcrafted IQA measures were proposed. For example, the method of Moorthy and Bovik [13] employs a two-stage framework in which the distortion type is predicted and used for the quality evaluation. A framework in which a probabilistic model with DCT-based natural scene statistics (NSS) is trained was proposed by Saad et al. [14]. Then, the popular BRISQUE technique [15] was introduced, which fits a Generalized Gaussian Distribution (GGD) to Mean Subtracted Contrast Normalization (MSCN) coefficients. The Gabor features and soft-assignment coding with max-pooling are employed in the work of Ye et al. [16]. In the High Order Statistics Aggregation (HOSA) [17] method, low- and high-order statistics describe normalized image patches assigned to a codebook using soft assignment; the codebook is obtained with the k-means approach. As gradient-based features can effectively describe distorted images, many approaches use them for quality prediction. They employ global distributions of gradient magnitude maps [18], relative gradient orientations or magnitudes [19], and local gradient orientations captured by the Histogram of Oriented Gradients (HOG) technique for variously defined neighborhoods [20]. A histogram of local binary patterns (LBP) characterizing a gradient map of an image is used in the GWHGLBP approach [21]. In the NOREQI [22] measure, an image is filtered with gradient operators and described using the speeded-up robust features (SURF) descriptor; the descriptors are then characterized by typical statistics. The joint statistics of the gradient magnitude map and the Laplacian of Gaussian (LOG) response are used to characterize images in the GM-LOG technique [18].
Most learning-based NR-IQA techniques devoted to natural images employ the SVR method to create a quality model. However, some methods do not require training. For example, in the Integrated Local Natural Image Quality Evaluator (IL-NIQE) [23], natural image statistics derived from multiple cues are modeled by a multivariate Gaussian model and used for the quality prediction without an additional training step. In the BPRI [24], a pseudo-reference image is created and compared with the assessed image using quality metrics that measure blockiness, sharpness, and noisiness.
In recent years, more complex IQA approaches have been introduced that use deep neural network architectures. They do not contain separate image description and prediction stages. However, their training requires a large amount of data or an adaptation of architectures developed for computer vision tasks not related to the IQA. Some of the early models address those challenges by using image patches [25,26], training with scores generated by FR measures [26,27], or fine-tuning popular networks [28].
Considering the quality assessment of MR images, the number of approaches is much less diverse. Here, only several works have been published, revealing the lack of successful techniques and the scarcity of IQA MRI benchmarks that could be used to stimulate their development. Furthermore, most clinical applications use the SNR and CNR [1] measures to assess images or calibrate scanners. However, they require an indication of disjoint image regions with noise and tissue, while providing an inferior quality evaluation of images in comparison with modern methods. Some NR IQA measures designed for the assessment of MR images adapt solutions devoted to natural images. For example, Chow and Rajagopal [5] trained BRISQUE on MR images, Yu et al. [2] used SNR scores to train BRISQUE and three other IQA methods, while Jang et al. [6] used multidirectionally filtered MSCN coefficients. In the work of Esteban et al. [9], image quality was not predicted but binarily classified based on a set of simple measures. Taking into account the inclusion of neural network architectures for processing MR images, Kustner et al. [11] and Sujit et al. [12] detected motion artifacts and performed binary classification of structural brain images, respectively. Volumetric and artifact-specific features were used by Pizarro et al. [10] to train an SVM classifier. In the authors' previous works on the MRIQA, the entropy of local intensity extrema was used for direct quality prediction [7], or high-boost filtering followed by the detection and description of local features [8] was used with an SVR-based quality model.
Taking into account the lack of deep learning architectures for the MRIQA, it can be stated that their effectiveness remains largely uninvestigated, and the introduction of their effective fusion can be seen as a promising area of research.

Proposed Method
In the proposed approach, a fusion of deep network architectures is considered. Such architectures are mostly devoted to image recognition tasks and have propagated to other areas of computer vision [29]. Among popular deep learning networks, the approach of Simonyan and Zisserman [30] (VGG) uses 3 × 3 convolutional filters and achieved outstanding performance at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014. Another solution, ResNet [31], introduces a residual learning framework with shortcut connections between layers to address the degradation problem experienced by very deep networks such as VGG. With a similar purpose, Szegedy et al. [32,33] introduced the Inception module in the GoogLeNet model. In other works, Howard et al. [34] created MobileNet aiming to reduce computational costs, and Huang et al. [35], in the DenseNet, used network layers with inputs from all preceding layers.
Deep learning models were also used for the IQA of natural images [26,36-39]. Most of such adaptations employ transfer learning, making the networks aware of domain-specific characteristics. Therefore, in this study, a similar approach was applied at the beginning of the research. Thus, adapted single models can be seen as counterparts of the first deep learning approaches to the IQA of natural images. However, the performance of a single network turned out to be insufficient to provide superior performance in the MRIQA task (see Section 4.6). Therefore, the approach introduced in this paper considers an internal fusion of networks, assuming that the fusion of different network types can better capture the characteristics of MR images, leading to outstanding IQA accuracy across the benchmark datasets.
Consider an image I_n that belongs to a set of N training images, n = 1, . . . , N, each of which is associated with a subjective score q^s_n obtained in tests with human subjects. Note that the subjective scores are denoted as Mean Opinion Scores (MOS) or Difference MOS (DMOS). The input image I ∈ R^(h×w×c), where h, w, and c correspond to the height, width, and the number of channels, respectively, is processed by the network. The network often consists of stacked modules of convolutional and pooling layers, followed by several fully connected layers. The convolutional layers extract features from earlier layers. This can be written as O_k = f(W_k ∗ I), where W_k denotes the k-th filter or weights, ∗ is the convolution operator, and f is the nonlinear activation function, often represented by rectified linear units (ReLUs). The pooling layers reduce feature maps, passing average or maximum values of inputs to the next layers. For example, max pooling yields O_k(a, b) = max I_k(i, j) over (i, j) ∈ P(a, b), where P(a, b) denotes the pooled region. The fully connected layers are used to interpret features and provide high-level problem representations. Their outputs are further processed by a softmax or regression layer for classification or regression problems, respectively.
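The basic layer operations described above can be illustrated with a minimal NumPy sketch. This is an illustration only, not the paper's implementation; real architectures use optimized, multi-channel layers, and, as in most deep learning frameworks, the "convolution" is implemented as correlation (without filter flipping):

```python
import numpy as np

def conv2d_valid(img, w):
    # Valid-mode 2-D "convolution" (correlation) of a single-channel
    # image with one filter w, i.e., O = f-input to O_k = f(W_k * I).
    kh, kw = w.shape
    h, wd = img.shape
    out = np.zeros((h - kh + 1, wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * w)
    return out

def relu(x):
    # Rectified linear unit: the activation f mentioned in the text.
    return np.maximum(x, 0.0)

def max_pool(x, p=2):
    # Non-overlapping p x p max pooling: O_k(a, b) = max over P(a, b).
    h, w = x.shape
    h2, w2 = h // p, w // p
    return x[:h2 * p, :w2 * p].reshape(h2, p, w2, p).max(axis=(1, 3))
```

For a 4 × 4 input and a 3 × 3 all-ones filter, `conv2d_valid` produces a 2 × 2 map of local sums, and `max_pool` halves each spatial dimension.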

Network Fusion
As the number of images in MRIQA benchmarks is not large enough for efficient training of the considered network architectures or their fusions, in this study, transfer learning [40] is applied instead of learning from scratch. The considered networks are pretrained on the ImageNet dataset and classify images into 1000 categories. However, the IQA task requires solving a regression problem, which forces a modification of the network architecture towards the quality prediction purpose. Therefore, in this study, the last three layers of each used network, configured for 1000 classes, are replaced with a fully connected layer and a regression layer. Note that the replacement is performed for all network architectures, either single or fused. In a fusion, the networks share the input and are connected to each other with a feature concatenation layer that combines the outputs of the individual networks. If needed, the input image is resized to match the input size of a network. As the layers responsible for global average pooling remain in each network architecture after the removal of the part associated with image classification, their outputs are used as inputs to the concatenation (addition) layer. An exemplary fusion of ResNet-18, ResNet-50, and GoogLeNet is presented in Figure 1. In the network graph, each network is represented by connected sets of convolution layers (yellow blocks). Among the last elements in each network are the global average pooling layer, the addition layer that fuses their outputs, the fully connected layer, and the regression layer.
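The fusion step itself amounts to global average pooling of each network's final feature maps followed by concatenation of the pooled vectors. A minimal sketch of this step follows; the per-network feature maps here are random stand-ins (the actual backbones are not reproduced), and the map sizes are only illustrative:

```python
import numpy as np

def global_average_pool(feature_maps):
    # Collapse an (h, w, k) stack of feature maps into a length-k vector.
    return feature_maps.mean(axis=(0, 1))

def fuse_networks(per_network_maps):
    # Concatenate the pooled descriptors of M networks into one vector.
    return np.concatenate([global_average_pool(m) for m in per_network_maps])

# Stand-ins for the last feature maps of two hypothetical backbones.
rng = np.random.default_rng(0)
maps_a = rng.normal(size=(7, 7, 512))    # e.g., a ResNet-18-like output
maps_b = rng.normal(size=(7, 7, 1024))   # e.g., a GoogLeNet-like output
fused = fuse_networks([maps_a, maps_b])  # length 512 + 1024 = 1536
```

The fused vector is what the fully connected and regression layers (and later the SVR) operate on.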
For the training of the resulting network architecture, N images are used. However, as MR images are often 2-dimensional 16-bit matrices, they are replicated and concatenated to form three channels (c = 3) to facilitate processing by the pretrained networks. To estimate the network parameters, the half Mean Squared Error (MSE) loss function L is applied.
Given that Q_o = (q^o_1, . . . , q^o_N) is the vector of objective scores and Q_s = (q^s_1, . . . , q^s_N) represents the subjective scores, L is calculated as L = (1/(2N)) ∑_{n=1}^{N} (q^o_n − q^s_n)^2. Typically, transfer learning of a network assumes freezing the original layers to prevent back-propagation of errors. However, as in this study MR images are processed and have different characteristics from natural images, the weights of the fused networks were modified using a small learning rate. Once the training of the network is finished, the second step of the approach is executed, in which concatenated feature vectors (see the addition layer in Figure 1) are used as an input x to the SVR module to obtain a quality prediction model, x = x_net_1 ⊕ x_net_2 ⊕ · · · ⊕ x_net_M, where ⊕ is the concatenation operator and M is the number of fused deep learning architectures.
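The half-MSE training loss can be written compactly as follows (a sketch for clarity; the actual training uses the deep learning framework's built-in regression loss):

```python
import numpy as np

def half_mse_loss(q_obj, q_subj):
    # L = 1/(2N) * sum_n (q_o_n - q_s_n)^2
    q_obj = np.asarray(q_obj, dtype=float)
    q_subj = np.asarray(q_subj, dtype=float)
    return 0.5 * np.mean((q_obj - q_subj) ** 2)
```

The factor 1/2 is conventional: it cancels when the squared term is differentiated, simplifying the gradient used in back-propagation.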
The SVR technique is commonly used to map perceptual features to MOS. In this paper, the ε-SVR is employed for training the regression model. Given training data {(x_n, q^s_n)}, n = 1, . . . , N, where x_n is the feature vector and q^s_n is its MOS, a function f(x) = ⟨ω, x⟩ + b is determined, in which ⟨·,·⟩ denotes the inner product, ω is the weight vector, and b is a bias parameter. Introducing the slack variables ξ_n and ξ*_n, ω and b can be computed by solving the following optimization problem: minimize (1/2)‖ω‖^2 + C ∑_{n=1}^{N} (ξ_n + ξ*_n), subject to q^s_n − f(x_n) ≤ ε + ξ_n, f(x_n) − q^s_n ≤ ε + ξ*_n, and ξ_n, ξ*_n ≥ 0, where C is a constant parameter balancing ‖ω‖ and the slack variables. The solution has the form ω = ∑_{n=1}^{N} t_n x_n, where t_n is a combination coefficient. Usually, the input feature vector is first mapped into a high-dimensional feature space via Φ(x), and the regression function becomes f(x) = ∑_{n=1}^{N} t_n ⟨Φ(x_n), Φ(x)⟩ + b. The inner product ⟨Φ(x_n), Φ(x)⟩ can be written as a kernel function c(x_n, x). The radial basis function (RBF) is often used as c, c(x_n, x) = exp(−γ‖x_n − x‖^2), where γ is the precision parameter [18].
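In practice, an ε-SVR with the RBF kernel can be fitted with an off-the-shelf implementation. The sketch below uses scikit-learn with synthetic stand-ins for the fused feature vectors and MOS; the parameter values (C, ε, γ) are illustrative, not the tuned values from the paper:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))     # stand-in fused deep feature vectors
mos = 3.0 + 2.0 * X[:, 0]        # synthetic "subjective scores"

# epsilon-SVR with the RBF kernel c(x_n, x) = exp(-gamma * ||x_n - x||^2)
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
model.fit(X, mos)
pred = model.predict(X)          # objective quality predictions
```

In the actual pipeline, `X` would hold the concatenated feature vectors x = x_net_1 ⊕ · · · ⊕ x_net_M extracted by the jointly trained fusion, and `mos` the radiologists' averaged scores.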
To further justify the need for the network fusion proposed in this paper and show its sensitivity to distortions, a visualization of exemplary features processed by single and fused networks using the DeepDream (https://www.tensorflow.org/tutorials/generative/deepdream (accessed on 23 December 2020)) technique is provided in Figure 3. The technique is often used to show what a given network has learned at a given stage. In the experiment, the best and worst quality images of the same body part were used. As presented, the features in fused networks distinctively respond to distortions that are propagated through the architecture. Furthermore, their features seem affected by the presence of another network in the training, which modifies their response compared to single architectures. Interestingly, the features of GoogLeNet or ResNet in the GoogLeNet+ResNet-18 fusion are more similar to each other than to features from single network architectures. This can be attributed to joint transfer learning. The fusion exhibits a different response to different distortion severity. Consequently, it can be assumed that the quality prediction model based on the fusion would be able to reliably evaluate MR images of different quality.

Experimental Data
The proposed approach is evaluated on two MRIQA benchmark datasets. The first dataset, denoted for convenience as DB1, contains 70 MR images [8], while the second one (DB2) has been created for the needs of this study and contains 240 MR images.
The DB1 benchmark contains images selected from 1.5T MR T2-weighted sagittal sequences: the spine (14 images), knee (14), shoulder (16), brain (8), wrist (6), hip (4), pelvis (4), elbow (2), and ankle (2). The collection consists of images captured under different conditions affecting the image quality (iPAT software enabling GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA); GRAPPA3 [41]). Apart from the images, the benchmark contains the MOS, ranging from 1 to 5, which was obtained in tests with a group of radiologists [8]. The resolution of images in the dataset is between 192 × 320 and 512 × 512. Exemplary images from the DB1 are shown in Figure 4.
As the DB1 collection is relatively small, and the databases that can be found in the literature were created for different purposes, are not available, or were assessed by a small number of radiologists, a novel dataset has been introduced in this study. The DB2 collection contains T2-weighted MR images acquired during routine diagnostic exams of the shoulders, knees, and cervical and lumbar spine. Patients aged 29-51 years participated in the study. A Siemens Essenza 1.5 Tesla scanner equipped with table coils and 6-channel knee and 4-channel shoulder coils was used to obtain two-dimensional images in axial, coronal, and sagittal planes. The gradient strength and slew rate of the scanner were set to 30 mT/m and 100 T/m/s, respectively. The following parameters were also used: the repetition time TR ranged from 3060 up to 6040 ms, the echo time TE from 77 to 101 ms, with phase oversampling of 20, a distance factor of 30, and a flip angle of 150°. Images were acquired with matrix sizes from 192 × 320 to 512 × 512 and a nonisotropic voxel resolution of 0.8 mm × 0.8 mm × 3 mm. To obtain images of different quality in a controlled way, the parallel imaging technique was applied (Siemens iPAT software) [41], reducing the number of echoes. Parallel imaging shortens the acquisition time [42] and is commonly employed to increase patient comfort. However, in this study, it was applied to obtain degraded images during the routine imaging process. The T2 sequences were obtained using the GRAPPA approach, repeated in four modes with gradually increased severity of echo reduction. Here, GRAPPA1, GRAPPA2, GRAPPA3, and GRAPPA4 were consecutively applied, resulting in up to a 4-minute increase of the total patient examination time. Finally, 30%, 40%, 60%, and 80% of the signals were lost with GRAPPA 1-4, respectively [43,44]. Obtained images were anonymized and saved ensuring the highest standards of personal data safety. Then, the DB2 was created with images of different patients.
A subset of 30 exams was further investigated: knee (images of eight patients), shoulder (10 patients), cervical spine (three patients), and lumbar spine (nine patients). Then, from the better- and worse-quality sequences associated with a given patient, two images per sequence were automatically selected. The selected scans were located at 1/3 and 2/3 of the length of each sequence. Once the 240 images were selected for subjective tests, a large group of 24 radiologists with more than 6 years of experience in MR study reading was invited to assess their quality. The group was gathered in a room with limited access to daylight, reflecting typical conditions of such an examination. Images of different quality were displayed in pairs and presented to the radiologists for 30 s (each pair) on EIZO RadiForce high-resolution monitors (2600 × 1800) connected to dedicated graphics cards. Each radiologist was introduced to the assessment procedure and assigned scores from 1 to 5 to each evaluated image on paper evaluation cards. In the assessment, a higher grade reflects a better quality of an image. Then, the subjective scores were processed and averaged to obtain the MOS. Exemplary images from the DB2 are presented in Figure 5.

Evaluation Methodology
Image quality assessment techniques are evaluated and compared on benchmark datasets using four performance criteria [45]: Spearman rank-order correlation coefficient (SRCC), Kendall rank-order correlation coefficient (KRCC), Pearson linear correlation coefficient (PLCC), and Root Mean Square Error (RMSE). The higher the SRCC, KRCC, and PLCC, and the lower the RMSE, the better the output of an objective IQA approach. The PLCC is calculated as PLCC = (Q̂_o · Q̂_s) / (‖Q̂_o‖ ‖Q̂_s‖), where Q̂_o and Q̂_s are mean-removed vectors of objective and subjective scores. The RMSE is calculated as RMSE = sqrt((1/m) ∑_{i=1}^{m} (q^o_i − q^s_i)^2), where m is the total number of images. The SRCC is defined as SRCC = 1 − (6 ∑_{i=1}^{m} d_i^2) / (m(m^2 − 1)), where d_i is the difference between the ranks of the i-th image in Q_o and Q_s, i = 1, 2, . . . , m. Consequently, the KRCC is calculated as KRCC = (m_c − m_d) / (0.5 m(m − 1)), where m_c is the number of concordant pairs in the dataset and m_d denotes the number of discordant pairs. As the proposed approach must be trained to obtain a quality model for the prediction, a widely accepted protocol for the evaluation of related methods is used, in which 80% of randomly selected images of the dataset are used for training and the remaining 20% of images test the approach [16,24]. Both image subsets are disjoint with respect to the experiment that led to the acquisition of images of a given body part. Then, to avoid bias, the performance of an NR method is reported in terms of the median values of SRCC, KRCC, PLCC, and RMSE over 10 training-testing iterations [27,46].
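The four criteria can be computed directly with SciPy; a short sketch (SciPy's tie-aware estimators are used in place of the closed-form rank formulas, which coincide when there are no tied scores):

```python
import numpy as np
from scipy import stats

def iqa_criteria(q_obj, q_subj):
    # SRCC, KRCC, PLCC, and RMSE between objective and subjective scores.
    q_obj = np.asarray(q_obj, dtype=float)
    q_subj = np.asarray(q_subj, dtype=float)
    srcc = stats.spearmanr(q_obj, q_subj).correlation
    krcc = stats.kendalltau(q_obj, q_subj).correlation
    plcc = stats.pearsonr(q_obj, q_subj)[0]
    rmse = float(np.sqrt(np.mean((q_obj - q_subj) ** 2)))
    return srcc, krcc, plcc, rmse
```

For a near-perfect monotone prediction, SRCC and KRCC approach 1 while RMSE approaches 0.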
For a fair comparison, all methods were run in Matlab with their default parameters, while the SVR parameters [52] were determined aiming at the best quality prediction performance. NR techniques that process color images were assessing MR images concatenated to form three channels. For the training of the proposed fusion architectures, stochastic gradient descent with momentum (SGDM) was used with a learning rate of 10^-4, a mini-batch size of 32, and 5 epochs. Furthermore, as the number of images in the first dataset is relatively low (70), data augmentation was employed in which each distorted image was rotated up to 360° with a step of 3°. The approaches were run in Matlab R2020b on a Windows 10 PC with an i7-7700k CPU, 32 GB RAM, and a GTX 1080 Ti graphics card.
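The rotation-based augmentation can be sketched as follows. The in-paper experiments used Matlab, so this scipy.ndimage version is an equivalent illustration rather than the original code; interpolation order and boundary handling are assumptions:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_rotations(image, step_deg=3):
    # Yield rotated copies of the image every step_deg degrees
    # (exclusive of 0 and 360, which reproduce the original).
    for angle in range(step_deg, 360, step_deg):
        yield rotate(image, angle, reshape=False, order=1, mode="nearest")

slice_2d = np.random.default_rng(0).normal(size=(32, 32))
augmented = list(augment_rotations(slice_2d, step_deg=90))  # 3 rotated copies
```

With the 3° step used in the paper, each image yields 119 additional training samples.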
The results for both databases are presented in Table 1. Their analysis indicates that the proposed network fusion techniques outperform the state-of-the-art approaches. Specifically, the R50GR18 (composed of three network architectures) and MR50 (fusing two networks) obtained the best SRCC and KRCC performances for the first database, outperforming the recently introduced NOMRIQA. This is also confirmed by the results for the remaining criteria, i.e., PLCC and RMSE. It is worth noticing that the NR MRIQA methods that do not require training, ENMIQA and SNRTOI, produce poorly correlated quality opinions when compared with the outputs of learning-based approaches, as they are not equipped with various perceptual features and powerful machine learning algorithms that efficiently map them to subjective scores. For the database introduced in this paper, DB2, the R18GR50M and R50GR18 present superior performance, followed by the SEER and MR50. Here, more techniques obtain better results in comparison to those obtained for the DB1 due to the larger representation of distorted images in the dataset. The methods designed for MR images, ENMIQA and NOMRIQA, are outperformed by well-established learning methods devoted to natural images (SEER, HOSA, GM-LOG, or NFERM). The deep learning models, MEON and DEEPIQ, were pretrained by their authors and do not capture the characteristics of MR images, leading to inferior prediction accuracy. The overall results, averaging the criteria over both databases, indicate the superiority of the introduced fusion approaches. In this case, the NOMRIQA is fourth in terms of the SRCC values, followed by NFERM and GWHGLBP after a large performance gap.
To compare relative differences between methods and determine whether they are statistically significant, the Wilcoxon rank-sum test is used. The test measures the equivalence of the median values of independent samples at a 5% significance level [20]. Here, the SRCC values are taken into consideration. In the experiment, a method with a significantly greater SRCC median obtained the score of "1"; a significantly worse method obtained "−1", and statistically indistinguishable methods "0". Finally, the scores were added and displayed in cells in Figure 6 to characterize the methods in rows. The figure also contains sums of scores to indicate the globally best approaches. As reported, all three introduced fusion architectures offer promising and stable performance across both datasets, outperforming the remaining approaches. Other learning-based methods designed for the MRIQA, i.e., ENMIQA or NOMRIQA, exhibit inferior relative performance in comparison to the proposed models and to methods with rich image representations (GM-LOG, SEER, or NFERM).
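The pairwise scoring based on the Wilcoxon rank-sum test can be sketched with scipy.stats.ranksums; the SRCC samples below are illustrative stand-ins for the per-split results of two methods:

```python
import numpy as np
from scipy.stats import ranksums

def pairwise_score(srcc_a, srcc_b, alpha=0.05):
    # +1 if method A has a significantly greater median SRCC than B,
    # -1 if significantly lower, 0 if statistically indistinguishable.
    _, p = ranksums(srcc_a, srcc_b)
    if p >= alpha:
        return 0
    return 1 if np.median(srcc_a) > np.median(srcc_b) else -1

srcc_a = np.linspace(0.85, 0.95, 10)   # SRCC over 10 train-test splits
srcc_b = np.linspace(0.40, 0.50, 10)
```

Summing such scores over all opponents yields the per-method totals shown in Figure 6.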

Computational Complexity
The computational complexity of the methods, reflected by the average time taken to assess an image from DB2, was also investigated (Table 2). The time spent on extracting features and predicting the quality by a method based on the fusion of networks depends on the number of networks. However, the proposed models are of moderate complexity, on par with the fastest of the more reliable approaches. A further reduction of the computation time could be achieved by parallelizing the feature extraction process or providing a native implementation (e.g., C++).

Cross-Database Experiments
The performances of R18GR50M, R50GR18, and MR50 are compared with those of related IQA methods in a cross-database experiment. In the experiment, learning-based methods are trained on one database and tested on the other. The methods that do not require training are only tested on the second database. The obtained results are shown in Table 3. As reported, the introduced fusion models outperform the other techniques and exhibit stable prediction accuracy. Here, the method for IQA of MR images, NOMRIQA, is close to the proposed architectures. The values of the performance indices of the fusion models trained on the small DB1 and tested on the much larger DB2 are only several percent lower than the values reported for the DB2 in the first experiment (see Table 1). This confirms their capability of successfully extracting the MR image characteristics needed for quality prediction. Overall, all three introduced architectures provide superior performance, followed by NOMRIQA and SEER.

Ablation Tests
As many different network architectures have been introduced in the literature, in this section, several proposed fusion approaches are reported and discussed. Furthermore, the inclusion of the SVR method is also supported experimentally to show that it improves the results of the networks. In the experiments, single deep learning networks or their fusions were considered. The following fusions took part in the study: ResNet-50_GoogLeNet_MobileNet-V2 ( The results presented in Figure 7 reveal that the usage of the SVR module improves the results of half of the networks for the DB1 and of all networks for the DB2. Interestingly, most network architectures outperform other state-of-the-art IQA methods on both databases (see Table 1), showing that they can be successfully used for the quality prediction of MR images. However, network architectures based on the proposed fusion of single models offer better performance than single networks. Among single architectures, ResNet-50, MobileNet-V2, and DenseNet-201 yield promising results. Therefore, two of them, ResNet-50 and MobileNet-V2, were fused together with ResNet-18 and GoogLeNet, obtaining the best-performing fusion architectures: R18GR50M, R50GR18, and MR50. Here, the fusion with the worst-performing GoogLeNet seems beneficial, as its features turned out to be complementary with those of the other networks.

Conclusions
In this study, a novel no-reference image quality assessment approach for the automatic quality prediction of MR images has been presented. In the approach, deep learning architectures are fused, adapted to the regression problem, and, after joint transfer learning, their concatenated feature maps are used for quality prediction with the SVR technique. The usage of two or more network architectures, the way they are fused, and their application to the no-reference IQA of MR images are among the contributions of this work. Furthermore, several promising fusion models are proposed and investigated, as well as a novel dataset for the development and evaluation of NR methods. The dataset contains 240 distorted images assessed by a large number of experienced radiologists. The comprehensive experimental evaluation of the fusion models against 17 state-of-the-art NR techniques, including methods designed for the NR IQA of MR images, reveals the superiority of the presented approach in terms of typical performance criteria.
Future work will focus on the investigation of alternative network fusion approaches or developing NR measures for IQA of medical images with different specificity, e.g., CT or RTG.