Limited View Tomographic Reconstruction Using a Cascaded Residual Dense Spatial-Channel Attention Network With Projection Data Fidelity Layer

Limited view tomographic reconstruction aims to reconstruct a tomographic image from a limited number of projection views arising from sparse view or limited angle acquisitions that reduce radiation dose or shorten scanning time. However, such a reconstruction suffers from severe artifacts due to the incompleteness of the sinogram. To derive quality reconstruction, previous methods use UNet-like neural architectures to directly predict the full view reconstruction from limited view data; but these methods leave the deep network architecture issue largely intact and cannot guarantee the consistency between the sinogram of the reconstructed image and the acquired sinogram, leading to a non-ideal reconstruction. In this work, we propose a cascaded residual dense spatial-channel attention network consisting of residual dense spatial-channel attention networks and projection data fidelity layers. We evaluate our method on two datasets. Our experimental results on the AAPM Low Dose CT Grand Challenge dataset demonstrate that our algorithm achieves a consistent and substantial improvement over the existing neural network methods on both limited angle reconstruction and sparse view reconstruction. In addition, our experimental results on the DeepLesion dataset demonstrate that our method is able to generate high-quality reconstructions for 8 major lesion types.


I. Introduction
Tomography imaging is a non-invasive projection-based imaging technique that visualizes an object's internal structures and hence finds wide applications in healthcare, security, and industrial settings [1]-[3]. In healthcare, tomography imaging techniques such as medical Computed Tomography (CT) based on x-ray projections, and Positron Emission Tomography (PET) and Single-photon Emission Computed Tomography (SPECT) based on gamma-ray projections, are indispensable imaging modalities for disease diagnosis and treatment planning. In the traditional CT setting, one assumes access to measurements collected from a full range of view angles of an object. To reduce radiation dose and speed up acquisition, there is increasing interest in methods that can recover images when a portion of the projection views is missing, namely limited view tomographic reconstruction. There are two notable sub-problems: limited angle (LA) reconstruction, i.e., when α ∈ [0, α_max] with α_max < 180° for equivalent parallel beam geometry, and sparse view (SV) reconstruction with a view interval larger than normal. Both LA and SV acquisitions can efficiently reduce radiation dose. Using LA acquisition, the scan time can also be drastically reduced by restricting the physical movement of the scan arc. Note that fast acquisition or high temporal resolution is paramount; even a slightly longer scan time can lead to appreciable motion blur and artifacts in the image [4], [5].
There are two major factors, namely reconstruction quality and speed, that need to be properly considered in designing a tomographic reconstruction algorithm. Currently, Filtered Back Projection (FBP) is widely used as the standard algorithm as it can reconstruct a high-quality image at a fast speed, following an analytical solution. However, FBP assumes access to measurements collected from a full range of views of an object. Reconstruction using FBP in both LA and SV conditions is highly ill-posed, yielding non-ideal image quality with severe artifacts and high noise. Previous algorithms for tomographic reconstruction under limited view conditions can be classified into two general categories: model-based iterative reconstruction (MBIR) and deep learning based reconstruction (DLR). MBIR can generate images with high quality by minimizing predefined image domain regularizers and the sampled sinogram inconsistency in an iterative fashion. Common choices of the regularizer include total variation [6], dictionary learning [7], and nonlocal patches [8]. However, MBIR methods are computationally heavy and time-consuming since they rely on repetitive forward- and back-projections. Moreover, using regularization solely based on prior assumptions requires careful hyper-parameter tuning and tends to bias the reconstruction results, especially when the under-sampling rate is high.
performance [9]. Combining MBIR with deep learning, Gupta et al. [10] and Wu et al. [11] first proposed to model the regularizer in MBIR frameworks with CNNs and autoencoders. Adler et al. [12] unfolded the optimization procedure of MBIR into an N-stage network to balance the tradeoff between reconstruction quality and speed. Although improved over traditional MBIR methods, they still suffer from high computational cost due to their iterative procedures. As an alternative, DLR is often formulated as image post-processing. Jin et al. [13] and Chen et al. [14] proposed to use UNet [15] and Residual UNet to remove the noise and artifacts in sparse-view CT. In [16] and [17], adversarial loss and perceptual loss were used to reinforce the network's learning. Later, Zhang et al. [18] and Han et al. [19] proposed to incorporate dense blocks and wavelet decomposition into UNet for more robust feature learning for reconstruction. Direct sinogram inversion and sinogram completion strategies were also proposed. Lee et al. [20] found that synthesizing a complete sinogram from a sparse view sinogram and then applying FBP can also reconstruct a high-quality image. Although these methods can be easily applied to raw sinograms or the corresponding FBP reconstructed images with relatively low computational cost and low design complexity, they either operate only in the image domain to remove artifacts from an already reconstructed image, or synthesize a complete sinogram from a sparse one, and they cannot guarantee that the sampled sinogram data are preserved. Note that the sampled sinogram data are the original sources that should be kept as identical as possible before and after reconstruction to ensure the high fidelity of the reconstructed content. There are also recent ideas of inserting the already-sampled sinogram into the predicted sinogram during the test stage. Anirudh et al. [1] proposed to first use a sinogram-to-image autoencoder to predict an initial reconstruction.
Then, during the test stage, the reconstruction's sinogram is partly replaced by the already-sampled sinogram to generate a final reconstruction. However, their method does not guarantee the continuity between the already-sampled sinogram and the predicted sinogram, which may further degrade the final reconstruction, and their method is limited to parallel-beam geometry. Similarly, Huang et al. [21] proposed to first use UNet [15] to predict an initial reconstruction. Then, during the test stage, the initial reconstruction is utilized in a TV reconstruction to help the projection data fidelity constraint of unmeasured projection data. However, the final reconstruction quality relies on a high-quality initial reconstruction from UNet's prediction. In addition, the projection data fidelity constraint of unmeasured projection data is not incorporated in the network design and is used only in a separate test stage. On a different note, the network design issue is highly under-explored as a research topic and is still limited to UNet-based or auto-encoder architectures [13], [14], [16], [17], [19], [20], [22]. In addition, none of the previous works has evaluated performance under both LA and SV scenarios, and reconstruction evaluation on CT scans with pathological findings is rarely performed. While a k-space data consistency layer for fast MRI reconstruction is proposed in [23], [24], a projection data consistency layer has not been systematically studied in tomographic reconstruction.
To tackle these limitations, we propose a Cascaded Residual Dense Spatial-Channel Attention Network (CasRedSCAN) for tomographic reconstruction under limited view conditions. Our CasRedSCAN, consisting of Residual Dense Spatial-Channel Attention Networks (RedSCAN) and Projection Data Fidelity Layers (PDFL), closely resembles the iterative process in MBIR methods while allowing end-to-end optimization of the reconstruction. Specifically, RedSCAN is the backbone network that is used in each cascade block for de-aliasing the input image. PDFL is concatenated to the RedSCAN output to ensure the prediction's projection data fidelity while allowing gradient back-propagation. Experiments on limited angle and sparse view scans using the AAPM Low Dose CT Grand Challenge dataset [25] and the DeepLesion dataset [26] demonstrate that our CasRedSCAN can provide high-quality limited view tomographic reconstructions.

II. Problem Formulation
Let $I \in \mathbb{C}^{N}$ represent a 2D tomography image with a size of $N = N_x N_y$, and $Q \in \mathbb{C}^{M}$ represent its full-view sinogram with $M$ projection views. Our problem is to reconstruct $I$ from $Q_u \in \mathbb{C}^{M_u}$ ($M_u \ll M$), where $Q_u$ is the undersampled sinogram of limited views. Here, sinogram data is only measured for lines corresponding to a subset $\Omega \subset A \triangleq \{1, \cdots, M\}$, where $A$ is the full projection set. Denoting $G$ and $G_u$ as the full-view and limited-view discretized forward projection operators, the full-view sinogram $Q$ and limited-view sinogram $Q_u$ are obtained via $Q = GI$ and $Q_u = G_u I$, respectively. While FBP provides a stable numerical implementation of the pseudo-inverse for $Q$, applying FBP to $Q_u$ in the limited view conditions yields a reconstruction $I_u$ with severe artifacts.
Previous works of MBIR propose to solve $I$ by
$$I = \arg\min_{I} \; T(I) + \|G_u I - Q_u\|_n^n \tag{1}$$
where $T$ is the regularizer and $\|\cdot\|_n^n$ is the projection data fidelity constraint [6], [27].
Previous deep learning-based post-processing methods utilize deep networks, denoted as $P$ with parameters $\theta$, to estimate the full-view reconstructed image $P(I_u; \theta)$ by training $P$ on $(I_u, I_{gt})$ pairs, where $I_{gt}$ is the full-view reconstruction ground truth. However, these methods only consider a subsequent regularization of the initial solution $I_u$, similar to the functionality of $T(\cdot)$ in MBIR, and omit the projection data fidelity constraint $\|G_u I - Q_u\|_n^n$. One should force the reconstruction $I$ to be well-approximated by the CNN reconstruction and ensure the consistency of acquired data in the projection domain by:
$$\arg\min_{I} \; \lambda \|I - P(I_u; \theta)\|_2^2 + \|G_u I - Q_u\|_2^2 \tag{2}$$
where $\lambda$ is a noise level parameter that balances the two terms. However, it is not feasible to directly optimize the above equation since the deep network reconstruction and the projection data fidelity terms are independent. Specifically, as the deep network $P$ only operates in the image domain, $P$ is trained to reconstruct the full-view image without prior knowledge of the already acquired data in the projection domain. Similar to the MRI k-space data fidelity [23], given a portion of already acquired projection data from limited-view acquisitions, the deep network should be discouraged from changing the already acquired projection data beyond the level of acquisition noise. Incorporating the projection data fidelity in the network design could potentially better preserve the image content and lead to a better reconstruction. In this work, we propose a projection data fidelity layer (PDFL) embedded in a cascade network for full-view reconstruction. With PDFL in our cascade network, the reconstruction output from our network is now conditioned on both the network parameters $\theta$ and the limited-view projection data indexed by $\Omega$:
$$I = P(I_u; \theta, \Omega) \tag{3}$$
Then, given the training data pairs $(I_u, I_{gt})$, we can train our network by minimizing the L2 loss function:
$$\min_{\theta} \; \|P(I_u; \theta, \Omega) - I_{gt}\|_2^2 \tag{4}$$
Details of our PDFL and cascade network are explained in Section III and Section IV, respectively.

III. Projection Data Fidelity Layer
Let $G$ and $G_{fbp}$ be the forward projection (FP) layer and the filtered back-projection (FBP) layer, respectively. The projection data of the image reconstructed by a deep network can be formulated as $S_{cnn} = G I_{cnn} = G P(I_u; \theta)$, where $S_{cnn}(i)$ is the i-th projection data entry.
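Similarly, we denote the already acquired projection data as $S_u$, where $S_u$ has identical size to $S_{cnn}$ and the i-th projection data entry $S_u(i)$ is all zeros when $i \notin \Omega$. Then, we can write a closed-form solution for the second term in Eq.(2) as:
$$S_{rec}(i) = \begin{cases} S_{cnn}(i), & \text{if } i \notin \Omega \\ \dfrac{\lambda S_{cnn}(i) + S_u(i)}{1 + \lambda}, & \text{if } i \in \Omega \end{cases} \tag{5}$$
where $S_{rec}$ is the reconstructed sinogram, which is updated by the projection data fidelity.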
Then, the image can be reconstructed via filtered back projection, that is, $I_{rec} = G_{fbp} S_{rec}$. To elaborate, when the i-th projection data is not acquired, we directly estimate the i-th projection data from the projection data of the deep network's output. Otherwise, the i-th projection data is a linear combination of the acquired projection data and the projection data of the deep network's output, regularized by the noise level parameter $\lambda$. Assuming noiseless sinogram acquisition, i.e., $\lambda = 0$, we simply replace the i-th predicted projection data by the acquired projection data.
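The entry-wise rule described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `fidelity_update`, the flat list layout, and the specific weighting `(lam * predicted + measured) / (1 + lam)` are assumptions, chosen so that `lam = 0` reduces to copying the acquired data as stated in the text.

```python
def fidelity_update(s_cnn, s_u, acquired, lam=0.001):
    """Blend predicted and acquired sinogram entries (illustrative sketch).

    s_cnn    : predicted projection values, one per sinogram entry
    s_u      : acquired projection values (zero where nothing was acquired)
    acquired : booleans, True where the entry index belongs to the sampled set
    lam      : noise level weight; lam = 0 copies the acquired data exactly
    """
    s_rec = []
    for predicted, measured, is_acquired in zip(s_cnn, s_u, acquired):
        if is_acquired:
            # Assumed weighting: a convex combination that tends to the
            # measurement as lam goes to zero.
            s_rec.append((lam * predicted + measured) / (1.0 + lam))
        else:
            # Unacquired entries keep the network's prediction.
            s_rec.append(predicted)
    return s_rec
```

With `lam = 0` the acquired entries are returned unchanged, while larger values of `lam` move them toward the network prediction.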

A. Forward Projection Layer
Our FP layer $G$ is a differentiable layer implemented with fan-beam geometry, allowing gradient back-propagation while projecting the image into a sinogram. In this work, we consider fan-beam geometry with an arc detector [28]. Assuming the distance between the x-ray source and the gantry rotation center is $D$, the forward pass of the FP layer can be written as: − ycos(β − γ)]dxdy (6) where a fan-beam sinogram $S_{fan}(\gamma, \beta)$ is generated. $\beta$ denotes the detector rotation angle, and $\gamma$ denotes the angle between the central projection line and the detector projection line. In the backward pass of $G$, the loss in the sinogram domain should be aggregated and back-projected to the image domain. Thus, we define the derivative of $G$ with respect to the input image $I$ as the filtered back-projection operation $G_{fbp}$ (discussed in Section III-B).

B. Filtered Back-Projection Layer
Our FBP layer $G_{fbp}$ is also a differentiable layer implemented with fan-beam geometry, allowing gradient back-propagation while reconstructing the image from a sinogram. Similar to above, assuming the distance between the x-ray source and the gantry rotation center is $D$, we have a fan-beam sinogram $S_{fan}(\gamma, \beta)$, where $\beta$ is the detector rotation angle and $\gamma$ is the angle between the central projection line and the detector projection line. Our FBP layer consists of three modules: i) parallel-beam conversion module, ii) filtering module, and iii) back-projection module.
Parallel-beam conversion module converts the fan-beam sinogram $S_{fan}(\gamma, \beta)$ to a parallel-beam sinogram $S_{para}(\rho, \alpha)$ by a change of variables, which is implemented by grid sampling in $(\rho, \alpha)$ and allows gradient back-propagation.
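To illustrate the kind of change of variables involved in such a rebinning step, the sketch below uses the common textbook relation for equi-angular fan-beam geometry, $\rho = D \sin\gamma$ and $\alpha = \beta + \gamma$. This convention, and the helper names `parallel_to_fan` and `fan_to_parallel`, are assumptions for illustration; the sign and angle conventions of a particular implementation may differ. In a grid-sampling implementation, each target $(\rho, \alpha)$ would be mapped to $(\gamma, \beta)$ and the fan-beam sinogram interpolated at that location.

```python
import math


def parallel_to_fan(rho, alpha, source_distance):
    """Map a parallel-beam ray (rho, alpha) to fan-beam coordinates (gamma, beta).

    Assumed convention: rho = D * sin(gamma), alpha = beta + gamma.
    """
    gamma = math.asin(rho / source_distance)
    beta = alpha - gamma
    return gamma, beta


def fan_to_parallel(gamma, beta, source_distance):
    """Inverse mapping under the same assumed convention."""
    rho = source_distance * math.sin(gamma)
    alpha = beta + gamma
    return rho, alpha
```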
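Filtering module applies the filtering to the converted sinogram $S_{para}$ in the Fourier domain:
$$S(\rho, \alpha) = \mathcal{F}_{\rho}^{-1}\left\{ \omega \cdot \mathcal{F}_{\rho}\left\{ S_{para}(\rho, \alpha) \right\} \right\} \tag{8}$$
where $\mathcal{F}_{\rho}$ and $\mathcal{F}_{\rho}^{-1}$ are the discrete Fourier transform and inverse discrete Fourier transform along the detector dimension $\rho$, respectively. $\omega$ is the window function and we used Ram-Lak in this work.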
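As a minimal sketch of Fourier-domain filtering along the detector dimension, the following plain-Python example multiplies the spectrum of one sinogram row by a ramp (Ram-Lak style) response. The naive $O(n^2)$ transforms, the absence of zero-padding, and the function names are simplifying assumptions made for readability; a practical implementation would use an FFT routine.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n
            for k in range(n)]


def ramp_filter_row(row):
    """Filter one detector row with a ramp response |frequency|."""
    n = len(row)
    # Absolute frequency in cycles per sample; bins above n/2 are negative frequencies.
    ramp = [min(f, n - f) / n for f in range(n)]
    filtered = idft([r * s for r, s in zip(ramp, dft(row))])
    # The input is real and the ramp is symmetric, so the result is real.
    return [value.real for value in filtered]
```

Because the ramp is zero at zero frequency, a constant row is mapped to zero, which is the expected behaviour of the filtering step in FBP.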
Back-projection module back-projects the filtered parallel-beam sinogram $S$ to the image domain for every projection angle $\alpha$ via:
$$I(x, y) = \int_{0}^{2\pi} S(x\cos\alpha + y\sin\alpha, \alpha)\, d\alpha \approx \Delta\alpha \sum_{i} S(x\cos\alpha_i + y\sin\alpha_i, \alpha_i) \tag{9}$$
where we parallelize the back-projection operation such that the reconstruction can be efficiently computed. In the backward pass of $G_{fbp}$, the loss in the image domain should be aggregated and projected to the sinogram domain. Thus, we define the derivative of $G_{fbp}$ with respect to the input sinogram $S_{fan}$ as the forward projection operation $G$ (discussed in Section III-A).
Here, we use a pixel-driven algorithm for our implementation of forward projection and back-projection [29].
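The discretized back-projection sum can be illustrated with a small pixel-driven loop. This is a sketch under stated assumptions: a parallel-beam sinogram stored as nested lists, unit pixel size, detector bins centred on the rotation axis, and linear interpolation between neighbouring bins. The function name `backproject` and its parameters are hypothetical, and the loop is written for clarity rather than speed.

```python
import math


def backproject(sino, angles, d_alpha, n_pix, d_rho=1.0):
    """Pixel-driven back-projection of a filtered parallel-beam sinogram.

    sino[i][k] is the filtered projection at angle angles[i] and detector bin k.
    Returns an n_pix x n_pix image as a list of rows.
    """
    n_det = len(sino[0])
    centre_det = (n_det - 1) / 2.0
    centre_pix = (n_pix - 1) / 2.0
    image = [[0.0] * n_pix for _ in range(n_pix)]
    for row, alpha in zip(sino, angles):
        cos_a, sin_a = math.cos(alpha), math.sin(alpha)
        for iy in range(n_pix):
            y = iy - centre_pix
            for ix in range(n_pix):
                x = ix - centre_pix
                # Detector coordinate hit by the ray through pixel (x, y).
                pos = (x * cos_a + y * sin_a) / d_rho + centre_det
                k0 = math.floor(pos)
                if k0 < 0 or k0 + 1 >= n_det:
                    continue  # ray falls outside the detector
                w = pos - k0
                image[iy][ix] += d_alpha * ((1.0 - w) * row[k0] + w * row[k0 + 1])
    return image
```

For a sinogram that is constant and equal to one over views covering $[0, 2\pi)$, every pixel inside the detector coverage accumulates $\Delta\alpha$ once per view and therefore sums to $2\pi$.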

C. Forward and Backward Pass
Our Projection Data Fidelity Layer (PDFL) consists of three operations: i) the forward projection layer $G$, ii) the projection data fidelity of Eq.(5), and iii) the FBP layer $G_{fbp}$. The projection data fidelity of Eq.(5) can be formulated in matrix form as:
$$S_{rec} = D S_{cnn} + \frac{1}{1 + \lambda} S_u \tag{10}$$
where $D = \mathrm{diag}(e_1, e_2, \cdots, e_M)$ with:
$$e_i = \begin{cases} \dfrac{\lambda}{1 + \lambda}, & \text{when } i \in \Omega \\ 1, & \text{when } i \notin \Omega \end{cases} \tag{11}$$
Then, our PDFL combines the three operations discussed above. Specifically, the forward pass of PDFL can be written as:
$$I_{rec} = G_{fbp}\left( D\, G\, I_{cnn} + \frac{1}{1 + \lambda} S_u \right) \tag{12}$$
where $I_{cnn}$ is the image predicted from an image-domain deep network and is the input of our PDFL. The output of PDFL is an image with projection data fidelity from the limited-view projection data $S_u$. Assuming a low noise level, we set $\lambda = 0.001$ (analyzed in Section V-C.4). Given the forward pass of Eq.(12), the gradient of the PDFL with respect to the input $I_{cnn}$ can thus be written as:
$$\frac{\partial I_{rec}}{\partial I_{cnn}} = G_{fbp}\, D\, G \tag{13}$$
which is defined for our PDFL's backward pass. There is no learnable parameter in our PDFL.
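The structure of such a layer can be checked numerically on a toy problem. In the sketch below, small dense matrices stand in for the projection and back-projection operators, and the diagonal weights `lam / (1 + lam)` on acquired entries are an assumed form consistent with `lam = 0` meaning replacement. All names are hypothetical. Because the layer is affine in its input, its Jacobian is the product of the back-projection matrix, the diagonal weights, and the projection matrix, which the usage below confirms by a finite difference.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def pdfl_forward(i_cnn, proj, backproj, s_u, acquired, lam=0.001):
    """Toy fidelity layer: project, blend with acquired data, back-project."""
    s_cnn = matvec(proj, i_cnn)
    s_rec = [(lam * p + m) / (1.0 + lam) if acq else p
             for p, m, acq in zip(s_cnn, s_u, acquired)]
    return matvec(backproj, s_rec)


def pdfl_jacobian(proj, backproj, acquired, lam=0.001):
    """Jacobian of pdfl_forward with respect to i_cnn: backproj @ diag(d) @ proj."""
    d = [lam / (1.0 + lam) if acq else 1.0 for acq in acquired]
    n_out, n_in = len(backproj), len(proj[0])
    return [[sum(backproj[r][k] * d[k] * proj[k][c] for k in range(len(d)))
             for c in range(n_in)]
            for r in range(n_out)]
```

For example, with a 3 x 2 projection matrix and a 2 x 3 back-projection matrix, perturbing the first input by one changes the output by exactly the first column of the Jacobian, independent of the acquired data.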

IV. Cascaded Residual Dense Spatial-Channel Attention Network
Previous MBIR methods solve the optimization problem in Eq.(1) for CT reconstruction by alternating between the de-aliasing step and the projection data fidelity step until convergence. However, many previous deep learning-based reconstruction methods [13], [18], [19] use single-step deep networks for de-aliasing and reconstruction. Unfortunately, a trained single-step network cannot be used for iterative de-aliasing, since iteratively applying single-step network de-aliasing is not guaranteed to converge to a reasonable reconstruction. Moreover, single-step deep networks with limited de-aliasing capability are prone to issues such as over-fitting. Therefore, it is desirable to have a network structure that is able to iteratively de-alias the image using a deep network with sufficient de-aliasing capability, while preserving the projection data fidelity. Here, we propose a cascaded network structure, called CasRedSCAN, with basic units of Residual Dense Spatial-Channel Attention Network (RedSCAN) and PDFL.
Similar to the process of MBIR that alternates between the de-aliasing step and the projection data fidelity step, our CasRedSCAN also alternates between the RedSCAN and PDFL, as illustrated in Figure 1. With the initial FBP reconstruction image input into the first RedSCAN, the de-aliasing output is fed into the first PDFL. Then, the PDFL output is fed into the second RedSCAN+PDFL block. The same procedure is iterated a fixed number of times to produce the final reconstruction output $I_z$. The loss function can thus be formulated as:
$$\mathcal{L}(\theta) = \left\| I_z(I_u, S_u; \theta) - I_{gt} \right\|_2^2 \tag{14}$$
where $I_u$ is the initial FBP reconstruction, $\theta$ denotes the RedSCAN network parameters, $S_u$ is the limited-view sinogram data, and $I_{gt}$ is the ground truth reconstruction from full-view sinogram data. The algorithm is summarized in Algorithm 1. In our implementation, all the RedSCANs in CasRedSCAN share the same network parameters, thus maintaining nearly the same model size as the single-step RedSCAN.
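The alternation described above amounts to a short loop. The sketch below is illustrative only: `denoise` and `fidelity` are hypothetical callables standing in for the backbone network and the fidelity layer, and the same `denoise` callable is reused in every block to mirror parameter sharing.

```python
def cascade(i_u, s_u, acquired, denoise, fidelity, n_blocks=4):
    """Alternate a shared de-aliasing step with a data fidelity step.

    i_u      : initial reconstruction
    s_u      : acquired projection data
    acquired : mask of acquired entries
    denoise  : callable image -> image (stand-in for the backbone network)
    fidelity : callable (image, s_u, acquired) -> image (stand-in for the layer)
    """
    image = i_u
    for _ in range(n_blocks):
        image = denoise(image)                    # same callable, i.e. shared weights
        image = fidelity(image, s_u, acquired)    # re-impose the acquired data
    return image
```

A toy usage, in which the image and projection domains are deliberately identified so that the fidelity step is a simple overwrite:

```python
halve = lambda img: [0.5 * v for v in img]
overwrite = lambda img, s_u, acq: [m if a else v for v, m, a in zip(img, s_u, acq)]
result = cascade([8.0, 8.0], [0.0, 3.0], [False, True], halve, overwrite, n_blocks=3)
```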

A. Residual Dense Spatial-Channel Attention Network
Our RedSCAN consists of three key components: initial feature extraction (IFE) using two 3×3 convolution layers, multiple Residual Dense Spatial-Channel Attention Blocks (RedSCAB) followed by global feature fusion, and global residual learning. The network architecture is illustrated in Figure 2.
Let $P_{IFE1}$ and $P_{IFE2}$ be the first and second convolutional operations in IFE. We first extract $F_{-1} = P_{IFE1}(I_u)$ for global residual learning, and $F_0 = P_{IFE2}(F_{-1})$ for feeding into RedSCAB. Assuming we have $n$ RedSCABs, the n-th output $F_n$ can thus be written as:
$$F_n = P_{RedSCAB_n}(F_{n-1}) = P_{RedSCAB_n}\left(P_{RedSCAB_{n-1}}\left(\cdots P_{RedSCAB_1}(F_0) \cdots\right)\right) \tag{15}$$
where $P_{RedSCAB_n}$ represents the n-th RedSCAB operation ($n \geq 1$). Given the extracted local features from a set of RedSCABs, we apply our global feature fusion (GFF) to extract the global feature:
$$F_{GF} = P_{GFF}\left(\{F_1, F_2, \cdots, F_n\}\right) \tag{16}$$
where $\{\}$ means concatenation along the feature channel, and our global feature fusion function $P_{GFF}$ consists of a 1×1 and a 3×3 convolution layer to fuse the extracted local features from different levels of RedSCAB. The GFF output is used as input for our global residual learning:
$$F_{GRL} = F_{GF} + F_{-1} \tag{17}$$
The element-wise addition of the global feature and the initial feature is fed into our final 3×3 convolution layer for the unregularized output. In our experiment, we set the size of the IFE feature channel to 32.

Residual Dense Spatial-Channel Attention Block contains four densely connected convolution layers, local feature fusion, local residual connection, and spatial-channel attention. In the n-th RedSCAB, the t-th convolution output is:
$$F_n^t = \mathcal{H}_n^t\left\{F_{n-1}, F_n^1, \cdots, F_n^{t-1}\right\} \tag{18}$$
where $\mathcal{H}_n^t$ denotes the t-th convolution followed by Leaky-ReLU in the n-th RedSCAB, $\{\}$ means concatenation along the feature channel, and the convolution index satisfies $t \leq 4$. Then, we apply our local feature fusion (LFF), a 1×1 convolution layer, to fuse the output from the previous RedSCAB and all convolution layers in the current RedSCAB. Thus, the LFF output can be expressed as:
$$F_{LF,n} = P_{LFF,n}\left(\left\{F_{n-1}, F_n^1, F_n^2, F_n^3, F_n^4\right\}\right) \tag{19}$$
where $P_{LFF,n}$ denotes the LFF operation. Then, it is fed into our Spatial-Channel Attention (SCA) module with two branches to re-weigh channel-wise features and spatial-wise features, as illustrated in Figure 2. The channel attention output $F_{CA,n}$ and spatial attention output $F_{SA,n}$ are fused together via $F_{SCA,n} = F_{CA,n} + F_{SA,n}$. Finally, we apply the local residual learning to the SCA output by adding the residual connection from the RedSCAB input, generating the n-th RedSCAB output:
$$F_n = F_{n-1} + F_{SCA,n} \tag{20}$$
In our experiment, we set the number of RedSCABs to 5.
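Dense connectivity means that each convolution sees the block input concatenated with the outputs of all earlier convolutions, so the input width grows layer by layer. The short sketch below does this channel bookkeeping. The growth rate of 16 channels per convolution is a hypothetical value used purely for illustration, as is the function name; only the block input width of 32 and the four convolutions come from the text.

```python
def dense_block_channels(in_channels=32, growth=16, n_convs=4):
    """Channel counts inside a densely connected block (illustrative).

    Returns the number of input channels seen by each convolution and the
    number of channels entering the 1x1 local feature fusion layer.
    """
    # Convolution t (0-based) sees the block input plus t earlier outputs.
    conv_inputs = [in_channels + t * growth for t in range(n_convs)]
    # Local feature fusion sees the block input plus all convolution outputs.
    fusion_input = in_channels + n_convs * growth
    return conv_inputs, fusion_input
```

Under these assumed values the four convolutions receive 32, 48, 64, and 80 channels, and the fusion layer reduces 96 channels back to the block width so that the local residual addition is well defined.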

Spatial-Channel Attention contains two Squeeze-and-Excitation branches for Channel Attention (CA) and Spatial Attention (SA), respectively [30], [31]. Traditional CNNs treat channel-wise features and spatial-wise features equally. However, in an image reconstruction task, it is desirable to have the network focus more on informative features by acknowledging both the channel-wise feature interdependence and the spatial-wise contextual interdependence. The CA and SA structures are illustrated in orange and blue boxes in Figure 2, respectively.
For CA, similar to [30], we spatial-wise squeeze the input feature map using global average pooling, where the feature map is formulated as $F = [f_1, f_2, \ldots, f_C]$ with $f_z \in \mathbb{R}^{H \times W}$ denoting the individual feature channel. We flatten the global average pooling output, generating $v \in \mathbb{R}^{C}$ with its z-th element:
$$v_z = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_z(i, j) \tag{21}$$
where the vector $v$ embeds the spatial-wise global information. Then, $v$ is fed into two fully connected layers with weights $w_1 \in \mathbb{R}^{\frac{C}{2} \times C}$ and $w_2 \in \mathbb{R}^{C \times \frac{C}{2}}$, producing the channel-wise calibration vector:
$$\hat{v} = \sigma\left(w_2\, \eta\left(w_1 v\right)\right) \tag{22}$$
where $\eta$ and $\sigma$ are the ReLU and Sigmoid activation functions, respectively. The calibration vector is applied to the input feature map using channel-wise multiplication:
$$F_{CA} = \left[f_1 \hat{v}_1, f_2 \hat{v}_2, \ldots, f_C \hat{v}_C\right] \tag{23}$$
where $\hat{v}_i$ indicates the importance of the i-th feature channel and lies in [0, 1]. With CA embedded into our network, the calibration vector adaptively learns to emphasize the important feature channels while playing down the others.
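The squeeze, excitation, and rescaling steps of channel attention can be sketched on nested lists as follows. The function name, the list-based tensor layout (channels, then rows, then columns), and the omission of bias terms are assumptions made for a compact illustration; a real implementation would use tensor operations.

```python
import math


def channel_attention(feature, w1, w2):
    """Channel attention on a C x H x W feature map stored as nested lists.

    w1 has shape (C/2) x C and w2 has shape C x (C/2); biases are omitted.
    Returns the recalibrated feature map and the per-channel gate.
    """
    # Squeeze: global average pooling over each channel.
    v = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: fully connected layer with ReLU, then one with sigmoid.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    gate = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]
    # Rescale every pixel of channel c by its gate value.
    scaled = [[[g * px for px in row] for row in ch] for ch, g in zip(feature, gate)]
    return scaled, gate
```

With all-zero first-layer weights the hidden activation is zero, the sigmoid returns 0.5 for every channel, and the feature map is simply halved.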
In SA, we formulate our feature map as $F = [f_{1,1}, \ldots, f_{i,j}, \ldots, f_{H,W}]$, where $f_{i,j} \in \mathbb{R}^{C}$ indicates the feature at spatial location $(i, j)$ with $i \in \{1, \ldots, H\}$ and $j \in \{1, \ldots, W\}$. We channel-wise squeeze the input feature map using a convolutional kernel with weights $w_3 \in \mathbb{R}^{1 \times 1 \times C \times 1}$, generating a volume tensor $m = w_3 \circledast F$ with $m \in \mathbb{R}^{H \times W}$. Each $m_{i,j}$ is a linear combination of all feature channels at spatial location $(i, j)$. Then, the spatial-wise calibration volume that lies in [0, 1] can be written as:
$$\hat{m} = \sigma(m) = \sigma\left(w_3 \circledast F\right) \tag{24}$$
where $\sigma$ is the sigmoid activation function. Applying the calibration volume to the input feature map, we have:
$$F_{SA} = \left[f_{1,1} \hat{m}_{1,1}, \ldots, f_{i,j} \hat{m}_{i,j}, \ldots, f_{H,W} \hat{m}_{H,W}\right] \tag{25}$$
where the calibration parameter $\hat{m}_{i,j}$ provides the relative importance of the spatial information at location $(i, j)$ of a given feature map. Similarly, with SA embedded into our network, the calibration volume learns to stress the most important spatial locations while ignoring the irrelevant ones.
Finally, channel-wise calibration and spatial-wise calibration are combined via element-wise addition operation F SCA = F SA + F CA . With the two branch fusion, features at (i, j, c) possess high activation only when they receive high activation from both SA and CA. Our SCA encourages the networks to re-calibrate the feature map such that more accurate and relevant feature maps can be learned.
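The spatial branch and the additive fusion of the two branches can be sketched in the same nested-list style. The names `spatial_attention` and `fuse`, the list layout (channels, then rows, then columns), and the absence of a bias term in the 1×1 convolution are assumptions for illustration.

```python
import math


def spatial_attention(feature, w3):
    """Spatial attention on a C x H x W feature map stored as nested lists.

    w3 holds one weight per channel, i.e. a 1x1 convolution with one output.
    """
    n_ch, height, width = len(feature), len(feature[0]), len(feature[0][0])
    # Squeeze across channels, then apply a sigmoid at every location.
    gate = [[1.0 / (1.0 + math.exp(-sum(w3[c] * feature[c][i][j] for c in range(n_ch))))
             for j in range(width)]
            for i in range(height)]
    # Rescale every channel at location (i, j) by the same gate value.
    return [[[feature[c][i][j] * gate[i][j] for j in range(width)]
             for i in range(height)]
            for c in range(n_ch)]


def fuse(f_ca, f_sa):
    """Element-wise addition of the two attention branch outputs."""
    return [[[a + b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(f_ca, f_sa)]
```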

A. Data Preparation and Training
We used two large-scale datasets for our experiments. In our first dataset, we collected 10 whole body CT scans from the AAPM Low Dose CT Grand Challenge [25]. Similar to the CT projection simulation in [33], we assume an equi-angular fan-beam projection geometry. A 120 kVp polyenergetic x-ray source is simulated. To simulate Poisson noise in the sinogram, we assume the incident x-ray contains 2 × 10^7 photons. The distance between the x-ray source and the rotation center is set to 39.7 cm. There are 439 detector bins in a row and each image consists of 256 × 256 pixels. For each image, the fully sampled sinogram data S was generated via 360 projection views uniformly spaced between 0 and 360 degrees. In sparse view experiments, we uniformly sampled 180, 90, and 60 projection views from the 360 projection views to form S_u, mimicking 2-, 4-, and 6-fold radiation dose reduction. In limited angle experiments, we sampled 90, 120, and 150 (out of the 360 total) projection views that lie within 0-90, 0-120, and 0-150 degrees for our S_u. The reconstructed images I and I_u were obtained by applying FBP to S and S_u, respectively.
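The two sampling patterns can be expressed as index sets over the 360 full-scan views. The sketch below assumes views are indexed 0 to 359 at one-degree spacing and that sparse sampling starts at view 0; the function names are hypothetical.

```python
def sparse_view_indices(n_full=360, n_keep=90):
    """Indices of uniformly spaced views kept in a sparse view acquisition."""
    step = n_full // n_keep
    return list(range(0, n_full, step))


def limited_angle_indices(n_full=360, max_angle_deg=90, full_range_deg=360):
    """Indices of the contiguous views inside [0, max_angle_deg) degrees."""
    n_keep = n_full * max_angle_deg // full_range_deg
    return list(range(n_keep))
```

For example, `sparse_view_indices(360, 60)` keeps every sixth view, and `limited_angle_indices(360, 120)` keeps the first 120 consecutive views.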
We implemented our CasRedSCAN in PyTorch and trained it on an NVIDIA Quadro RTX 8000 GPU with 48 GB memory. The Adam solver [34] was used to optimize our models with a momentum of 0.99 and a learning rate of 0.0005. We used a batch size of 4 during training.

B. Experimental Results
For quantitative evaluation, both SV and LA results were evaluated using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Root Mean Square Error (RMSE) by comparing the synthetic SV and LA reconstructions to the ground truth reconstruction from FBP of the fully sampled sinogram. For the comparative study, we compared our results on both SV and LA tasks against: 1) image-to-image translation-based methods, including the combination of DenseNet and Deconvolution (DDNet) [18], Framing UNet (FUNet) [19], and FBPNet [13], and 2) deep learning-based methods with projection data fidelity used in the test stage, including DCAR [21] and CTNet [1].
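For reference, RMSE and PSNR can be computed as in the sketch below, which operates on flattened pixel lists. The choice of `data_range` (the peak value used in PSNR) is an assumption that must be supplied by the user; SSIM is omitted because it requires windowed local statistics.

```python
import math


def rmse(pred, ref):
    """Root mean square error between two equally sized pixel lists."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))


def psnr(pred, ref, data_range):
    """Peak signal-to-noise ratio in dB for a given peak value."""
    err = rmse(pred, ref)
    if err == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(data_range / err)
```

As a quick check, two images that differ by one intensity unit at every pixel have an RMSE of 1, and with a peak value of 10 the PSNR is 20 dB.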
The qualitative comparison of different limited angle reconstruction methods on the AAPM dataset is shown in Figure 3. As we can observe in the chest region, previous methods have difficulties in reconstructing small anatomical structures, e.g., arteries. Similarly, with crowded organs in the abdominal region, the organ boundaries are challenging for previous methods to recover, and additional patient boundary artifacts appear. Our CasRedSCAN with advanced network design and projection data fidelity constraint can provide superior limited angle reconstruction in terms of organ boundary recovery, small structure recovery, and boundary artifact elimination. The qualitative comparison of different sparse view reconstruction methods on the AAPM dataset is also shown in Figure 3. Similar to the observations from the limited angle experiments above, our CasRedSCAN yields high-quality reconstruction in crowded soft tissue areas with fine details. Figure 4 shows the limited angle reconstructions and sparse view reconstructions from our CasRedSCAN at different settings.
As CT is often used for disease diagnosis, we also evaluated the reconstruction performance on CT images with 8 different lesion types. Figure 5 illustrates the qualitative comparison of various limited angle and sparse view reconstruction methods on 4 major lesion types. As we can observe, the liver lesion and kidney lesion are hard for previous methods to recover because these lesions have low contrast to the soft-tissue background, and their visualization is further degraded by the limited angle artifacts. Similarly, lung lesions are also challenging for previous methods to recover due to their complex lesion texture. However, our CasRedSCAN can provide superior recovery of the shape and texture of the lesion even under these difficult conditions. For example, our liver and kidney reconstructions in the last column provide clear lesion boundaries, which are critical for lesion progression assessment. The lung bronchi that are diminished in the FBP reconstruction can also be better recovered by our CasRedSCAN. As shown in Figure 6, our CasRedSCAN is able to keep the RMSE below 20 for limited angle reconstructions (150°) and sparse view reconstructions (1/2) with different tumor types. However, the RMSE increases as the limited angle range reduces or the sparse view undersampling rate increases.

1) Number of Cascade:
The number of cascade blocks can be flexibly adjusted in our CasRedSCAN. We analyzed the effect of increasing the number of cascade blocks in our CasRedSCAN. The result is summarized in Figure 7 and evaluated using the AAPM dataset. As we can observe, using more cascade blocks boosts the reconstruction performance, while the rate of improvement starts to converge after the number of blocks reaches 3. In LA, increasing the number of cascade blocks from 4 to 5 increases SSIM by less than 0.002 and reduces RMSE by less than 2 on average. A similar observation can be made in SV.
2) Attention Mechanism: Two attention mechanisms are used and combined in our CasRedSCAN. We analyzed the effect of these two attention mechanisms in our CasRedSCAN. The result is illustrated in Table III and evaluated using AAPM dataset. We compared our CasRedSCAN's performance with or without channel attention or spatial attention. As we can observe, both channel attention and spatial attention can improve the reconstruction performance, and the combination of both attentions provides the best performance with the least variation, and significantly outperforms the baseline CasRedSCAN without both channel and spatial attentions.

3) Sinogram Evolution:
With the number of cascade blocks set to 4 in our CasRedSCAN, we further analyzed how the generated sinogram evolves over the cascaded network. We computed the mean RMSE between each cascade block's sinogram output and the ground truth full view sinogram. The results for both LA and SV are plotted in Figure 8. As we can see, the sinogram errors gradually reduce as the generated data passes through each subsequent cascade block, while the rate of sinogram error reduction starts to converge after the first cascade block.

4) PDFL Parameter:
In PDFL, λ is the noise level parameter that controls the linear combination of the acquired projection data and the projection data of RedSCAN's output. Assuming low noise x-ray acquisition as in our experiments, λ should be a small value as the impact of noise is minimal. We analyzed the impact of λ under both LA and SV conditions. The results are summarized in Figure 9. As we can observe, reconstruction without considering the noise, i.e., λ = 0, leads to degradation in reconstruction performance. Setting λ = 0.001 leads to the best reconstruction performance in our search range, while the RMSE difference between λ = 0.001 and λ = 0.005 is less than 1.

5) Embedded Networks:
We embedded different previous image-to-image reconstruction networks [13], [18], [19] into our cascaded network and compared the performance with or without cascade. The qualitative results are visualized in Figure 10. The quantitative results are summarized in Table IV. The number of cascade blocks is set to 4 in this study. As we can observe, embedding different previous image-to-image networks into our cascade design improves the reconstruction performance, while RedSCAN embedded into our cascade network achieves the best reconstruction performance.

VI. Discussion
In this paper, a novel reconstruction framework, named CasRedSCAN, is proposed. Inspired by the recent advances in image super-resolution network designs and the projection data constraint in MBIR, we designed a customized RedSCAN as our backbone image reconstruction network, and we built a projection data fidelity layer that can be embedded in deep networks. First of all, our RedSCAN is developed based on image super-resolution network [35] with an addition of spatial-channel attention, which allows our RedSCAN to re-calibrate the channel attention and gives different levels of attention on recovering texture details at different spatial locations, as artifact distribution is not uniform in the image. In fact, Hu et al. [36] recently also demonstrated that spatial-channel attention can boost the image super-resolution performance. Then, we develop PDFL that can be concatenated to the RedSCAN's cascade outputs to ensure the projection data fidelity at the sampled projection views. Our PDFL based on the analytical FBP solution with fan-beam geometry allows it to be embedded in a deep network and used during training and inference.
We demonstrate the feasibility of our CasRedSCAN on both LA and SV tomographic reconstruction tasks, as shown in the results section. LA acquisition is more difficult to reconstruct than SV acquisition because a range of projection angles is not covered in LA acquisition. Severe image artifacts at these projection angles can be observed when using conventional FBP. As a result, the overall performance of LA reconstruction is inferior to that of SV reconstruction. For example, in 120° LA reconstruction, while previous methods can mitigate the artifacts and recover PSNR up to 37.94 and SSIM up to 0.970, they still have difficulty recovering the organ boundaries that are critical for clinical diagnosis and treatment planning. Our CasRedSCAN provides superior reconstructions with clear organ boundaries and improves the PSNR to 41.48 and the SSIM to 0.983. In 1/4 SV reconstruction, while previous methods can generate visually plausible image content, reconstruction without projection data fidelity can produce artificial texture, which is undesirable in clinical tasks. Our CasRedSCAN with PDFL better preserves image fidelity by incorporating the already-sampled projection data, resulting in the best performance in terms of PSNR, SSIM, and RMSE.
Furthermore, we demonstrate the feasibility of our CasRedSCAN on CT lesion imaging under LA and SV conditions. Lesions are highly heterogeneous, and CT is one of the primary tools for diagnosis. Obtaining high-quality lesion region reconstruction under LA and SV is essential for disease diagnosis, staging, and the planning and evaluation of treatment. While previous methods can reduce the reconstruction artifacts from the whole-image perspective, their reconstruction of highly heterogeneous lesion regions is still unsatisfying: the lesion boundary and texture are highly distorted, which will negatively impact subsequent treatment options. In contrast, our CasRedSCAN can better preserve the lesion reconstruction even when the lesions are highly heterogeneous. For example, the supplying vessels of the LA lung lesion in Figure 5 are totally missed by previous methods, while our CasRedSCAN can better recover them. The complex interior texture of the SV lung lesion in Figure 5 is highly distorted by previous methods, but our CasRedSCAN can still preserve the structure. In Figure 5, liver and kidney lesions embedded in a low-contrast soft-tissue background are prone to being smoothed out in SV and distorted in LA by previous methods, whereas our CasRedSCAN can better recover the boundary and the contrast of the lesions.
We believe several factors potentially contribute to the superior performance of using RedSCAN in CasRedSCAN. First, our RedSCAN has no image downsampling for abstraction, thus keeping the image restoration at the original resolution. Second, convolutional layers at different depths have receptive fields of different sizes, resulting in hierarchical features. Image restoration should utilize all the hierarchical features instead of only the last layer's output, and our RedSCAN, which concatenates all the hierarchical features, can potentially learn the restoration better. Third, the hierarchical features are generated by our residual dense channel-spatial block, which allows better feature learning at each hierarchical level. Moreover, the residual connection in each block allows the gradient to be passed more effectively to earlier layers, thus helping the training of our wide network design. As shown in Table , the design of our RedSCAN also yields a relatively small number of network parameters (0.51M) compared to previous methods. Specifically, the RedSCANs in CasRedSCAN share the same network parameters and there are no learnable parameters in PDFL, so the parameter size of CasRedSCAN remains the same as that of RedSCAN regardless of the number of cascades. Consequently, our CasRedSCAN achieves the best limited view reconstruction performance with the fewest parameters.
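The parameter-sharing argument can be illustrated with a toy sketch in which one parameter set is reused by every cascade block and the fidelity step has no learnable parameters. Everything here is a stand-in: `backbone` is a tiny residual map rather than RedSCAN, and `fidelity` simply overwrites the measured entries rather than implementing PDFL. All names are hypothetical.

```python
import numpy as np

def make_backbone_params(rng, channels=4):
    """Create ONE parameter set that is shared by all cascade blocks."""
    return {
        "weight": rng.standard_normal((channels, channels)) * 0.1,
        "bias": np.zeros(channels),
    }

def backbone(params, x):
    """Toy residual mapping standing in for the restoration network."""
    return x + np.tanh(x @ params["weight"] + params["bias"])

def fidelity(x, measured, mask):
    """Parameter-free consistency step: restore entries that were measured."""
    return np.where(mask, measured, x)

def cascade(params, x, measured, mask, n_cascades):
    """Apply the same backbone n_cascades times with fidelity interleaved."""
    for _ in range(n_cascades):
        x = backbone(params, x)
        x = fidelity(x, measured, mask)
    return x

def count_params(params):
    return sum(p.size for p in params.values())
```

Because `cascade` only loops over the same `params`, `count_params(params)` is independent of `n_cascades`; only computation time and activation memory grow with the number of blocks.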
The presented work also has potential limitations. First, the inference time is longer than that of previous deep learning based methods, as illustrated in Table VI. This is caused by the cascaded design with PDFL interleaved. On one hand, the iterative reconstruction prediction increases the computation time. On the other hand, even though FBP is a fast analytic solution, the forward projection and FBP operations in PDFL still consume computation time. The combination of these two factors results in longer training and inference times. However, the inference time is about 150 ms, which is acceptable and much faster than previous MBIR methods. Moreover, in our PDFL, we assume a 360-degree fan-beam projection combined from the already sampled sinogram and the predicted sinogram. Using the minimal complete sinogram with a reduced number of projections could reduce the computation time of PDFL, in which case an additional step of sinogram weighting, such as Parker weighting [37], could be incorporated to address the data redundancy issue. Second, while increasing the number of cascade blocks in CasRedSCAN improves the performance, the memory consumption increases along with the training and inference times. As illustrated in Figure 7, the increase in performance starts to converge after n = 3. Thus, in this work, we set n = 4 to balance the memory consumption and inference time of our CasRedSCAN.
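For reference, the classical Parker weighting mentioned above can be sketched as follows for a fan-beam short scan covering source angles β in [0, π + 2γ_max], where γ is the ray angle within the fan and γ_max is the half fan angle. This is the textbook form and is not part of the presented method; sign conventions for γ vary between references, and the function name and array layout are hypothetical.

```python
import numpy as np

def parker_weights(betas, gammas, gamma_max):
    """Parker weights w[i, j] for source angle betas[i] and fan angle gammas[j].

    Assumes betas lie in [0, pi + 2 * gamma_max] and |gammas| <= gamma_max (radians).
    """
    beta = np.asarray(betas, dtype=float)[:, None]
    gamma = np.asarray(gammas, dtype=float)[None, :]
    eps = 1e-12  # guards the divisions at gamma = +/- gamma_max

    # Ramp-up region at the start of the scan: 0 <= beta < 2 * (gamma_max - gamma).
    up = beta < 2.0 * (gamma_max - gamma)
    ramp_up = np.sin(np.pi / 4.0 * beta / np.maximum(gamma_max - gamma, eps)) ** 2

    # Ramp-down region at the end of the scan: beta > pi - 2 * gamma.
    down = beta > np.pi - 2.0 * gamma
    ramp_down = np.sin(
        np.pi / 4.0 * (np.pi + 2.0 * gamma_max - beta) / np.maximum(gamma_max + gamma, eps)
    ) ** 2

    # Rays measured only once keep weight 1.
    return np.where(up, ramp_up, np.where(down, ramp_down, 1.0))
```

With this convention, the weights of a ray (β, γ) and its conjugate (β + π + 2γ, −γ) sum to one, so redundantly measured rays are not double counted in the subsequent FBP.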
The architecture of our CasRedSCAN also suggests several interesting topics for future studies. The first is combining the projection data fidelity layer with deep learning based Radon inversion techniques [38]. The cascaded framework with projection data fidelity can provide a projection domain constraint during Radon inversion via deep learning. It can potentially improve the inversion stability, yielding reconstruction with better data fidelity. Second, given the superior lesion region reconstruction performance demonstrated in the results section, our framework could also potentially improve projection data based Computer-Aided Diagnosis (CAD). Recently, there has been increasing interest in combining limited-view reconstruction and CAD in a joint reconstruction-CAD network structure, and improved CAD performance is expected with such an end-to-end training strategy [39], [40]. We believe that our CasRedSCAN with high-quality lesion region reconstruction would provide new opportunities for these kinds of studies. Third, CT metal artifact reduction (MAR) under limited-view acquisition is an important research direction. Current MAR techniques are mostly limited to full-view acquisition [41], [42]. State-of-the-art metal artifact reduction algorithms, such as DuDoNet [41], utilize projection space and image space simultaneously, which is similar to our CasRedSCAN design. Our CasRedSCAN could potentially be integrated with current MAR networks for MAR under limited view conditions. Fourth, low-dose CT combined with limited-view acquisition may further reduce the radiation dose. Indeed, Shan et al. [43] and Wu et al. [44] proposed cascaded network structures with a basic network of UNet [15] or sequential CNN layers and demonstrated their efficiency in low-dose CT. As cascaded networks are also potentially efficient in low-dose CT, our CasRedSCAN could be adapted to limited-view low-dose CT, which may further reduce the radiation dose and acquisition time. Lastly, we believe our CasRedSCAN could be adapted to other tomography imaging modalities with similar applications, such as SPECT, PET, and Cryo-ET [45]- [47].

VII. Conclusion
In this work, we proposed a cascaded network with RedSCAN and PDFL, a novel framework for limited view tomographic reconstruction. The proposed PDFL is interleaved in our cascaded network to ensure that the network's cascaded output is consistent with the sampled sinogram in the sinogram domain. A customized image restoration network is used as the backbone in the cascaded network. Comprehensive evaluation demonstrates that our CasRedSCAN can provide high-quality limited angle and sparse view tomographic reconstruction while reducing radiation dose and shortening scanning time.

Fig. 1.
The architecture of our CasRedSCAN. Each block consists of a RedSCAN (blue) and a PDFL (gray).

Fig. 2.
The architecture of our residual dense spatial-channel attention network (RedSCAN), which is used in the recurrent image reconstruction blocks in Figure 1.