Learned Shrinkage Approach for Low-Dose Reconstruction in Computed Tomography

We propose a direct nonlinear reconstruction algorithm for Computed Tomography (CT), designed to handle low-dose measurements. It involves the filtered back-projection and adaptive nonlinear filtering in both the projection and the image domains. The filter is an extension of the learned shrinkage method by Hel-Or and Shaked to the case of indirect observations. The shrinkage functions are learned using a training set of reference CT images. The optimization is performed with respect to an error functional in the image domain that combines the mean square error with a gradient-based penalty, promoting image sharpness. Our numerical simulations indicate that the proposed algorithm can manage well with noisy measurements, allowing a dose reduction by a factor of 4, while reducing noise and streak artifacts in the FBP reconstruction, comparable to the performance of a statistically based iterative algorithm.


Problem Statement.
Computed tomography (CT) imaging produces a 3D map of the scanned object, where the different materials are distinguished by their X-ray attenuation properties. In medicine, such a map has a great diagnostic value, making the CT scan one of the most frequent noninvasive exploration procedures practiced in almost every hospital. The attenuation of biological tissues is measured by comparing the intensity of the X-rays entering and leaving the body. The main problem precluding pervasive use of the CT scan for diagnostics and monitoring is the damage caused to the tissues by the X-ray radiation. CT manufacturers make great efforts to reduce the X-ray dose required for images of diagnostic quality. In this work we propose an algorithm that enables a high-quality reconstruction from low-dose (and thus noisy) measurements.
In ideal conditions, the information obtained in the scan suffices to build an exact attenuation map, called the CT image. In practice, the measurements are degraded by a number of physical phenomena. The main factors are offfocal radiation, afterglow and crosstalk in the detectors, beam hardening, and Compton scattering (see [1] for a detailed overview). These introduce a structured error into the measurements, mostly the type that is modeled by a convolution with some kernel. Another source of deterioration, dominant in the low-dose scenario, is the stochastic noise. One type of such noise stems from the low photon counts, which occur when the X-rays pass through high-attenuation areas. This phenomenon is similar to the shot noise, encountered in photo cameras in poor lighting conditions. Statistically, the photon counts are modeled as instances of Poisson random variables. Another type of the stochastic noise originates from dark currents in the detectors, interference noise from interconnecting cables, and other hardware sources. This electronic noise is modeled in the measurements as an additive Gaussian random variable.
In this work we aim to reduce the influence of the stochastic noise on the image quality, with the assumption that the structured error components, mentioned above, are treated by the existing methods. Explicitly, we use the wellaccepted compound Poisson-Gaussian statistical model and propose a new noniterative method for image reconstruction, based on the concept of sparse representations [2] and involving machine learning concepts. 2 International Journal of Biomedical Imaging

Present Reconstruction Algorithms.
The basic linear reconstruction method, filtered back projection (FBP) [3], makes a very limited account of the noise structure in the data: it employs a low-pass 1D convolution filter in the Radon domain, whose parameters are preset for specific anatomical regions and standard scan protocols. Errors in the photon counts manifest in the output CT image in the form of streak artifacts, which corrupt its content and jeopardize its diagnostic value. Accordingly, each measured line integral is effectively smeared back over that line by the back projection; an incorrect measurement results in a line of wrong intensity in the image. Typically, the streaks radiate from bone regions or metal implants, which corrupt its content and jeopardize its diagnostic value. Images of better qualitywith reduced artifacts and increased spatial resolutionare obtained with a statistically-based approach, where the maximum a posteriori (MAP) is optimized with respect to the sought image. This problem is converted to a minimization of the penalized likelihood (PL) objective function [1], (1). The likelihood expression models the physical process of CT scan, which allows interpreting the measurements more correctly. The likelihood component expresses the expected statistical behavior of the data. The penalty component models the expected properties of the CT images (i.e., contains a prior information about the image to be reconstructed). The PL objective can be designed to restore the measurements from noisy observations [1,4,5] or to reconstruct directly the CT image [6]. In most cases, the optimization problem is difficult to solve, so it is sometimes replaced by a secondorder approximation, the penalized weighted least squares (PWLS) [6,7]. A drawback of a reconstruction based on explicit statistical modeling is the computationally heavy iterative solution.
To improve the performance of the fast FBP reconstruction, adaptive signal processing techniques, implicitly modeling the noise statistics, have been proposed. These are applied to the measurements data in a noniterative fashion and have computational complexity comparable to that of the FBP. Hsieh employs a trimmed mean filter, adaptive to the noise variance [8]. For each detector reading , the algorithm adaptively chooses a number of its neighbors participating in the filtering operation. The value of is replaced with the average of these neighbors after a portion of their highest and lowest values is discarded. To some extent, the aforementioned statistical model of the scan is used: the noisy samples get a stronger filtering than the more reliable ones. The experimental results in this work are very impressive. A similar concept, with a different kind of filter, is adopted by the work of Kachelrieß et al. who apply adaptive convolution-based filtering in the sinogram domain [9]. The filter width is data dependent and also it is applied only where the data intensity is below a threshold, so the algorithm processes only the regions where the noise is substantial. In [10], two signal processing steps are introduced to improve the performance of FBP. The measurements' data is processed by the penalized weighted least squares filter in a Karhunen-Loeve domain (more familiar under the name of principal component analysis, PCA). An additional step of image postprocessing is performed by edge-preserving smoothing with locally adaptive parameters.
Beyond the use of general-purpose tools, there are algorithms applying machine learning methods for adaptive processing of the tomographic data. Close in its spirit to our work is the algorithm described in [11]. Here, the measured projections are locally filtered according to a preliminary classification of its regions. The classes and the corresponding filters are derived automatically, via an offline examplebased training process. Thus, the standard smoothing by the low-pass convolution filter is replaced with locally adaptive filtering, optimized for the minimal mean square error in the training images.

Our Work.
Our method employs an adaptive local processing of the measurements and a matched postprocessing stage in the CT image domain. Those steps are intended to enable the FBP to deal with the noisy measurements. The technique employed in both stages is a learned shrinkage in the transform domain, following the ideas outlined in the work by Hel-Or and Shaked in [12].
The learned shrinkage filter was originally designed for noise reduction and is employed in our work to reduce an error measure in the CT image domain, while acting on the raw measurements (at the first stage) and on the reconstructed image (at the second). In a nutshell, the learned shrinkage operator is a nonlinear adaptive filter, applied locally to the signal data. It requires an example-based training of the filter's defining parameters, which minimizes a desired reconstruction error with respect to reference CT images.
Due to the fact that the noise in the measurements is data dependent, we introduce a scalar transformation which normalizes the noise variance according to its statistical model. The aforementioned filter is applied after this transformation. The error measure in the image domain, used in the training objective, is a function of the reconstructed CT image and its ground truth-a high quality reference image. The measure consists of two components, the standard mean square error and a gradient-based expression capturing the amount of blur at fine edges of the image. Filters, optimized with respect to this error measure, are shown to produce CT images with low spatial noise and artifacts, while producing sharp edges.
On one hand, our approach accounts for the statistical model of the noise, as used by the iterative algorithms; on the other, the computationally heavy learning procedure is performed once offline, at the calibration stage, while the processing of the new data is done very fast, on par with the FBP algorithm. This gives a hope to bridge the gap between the slow high-quality statistical algorithms and the fast linear reconstruction. We also mention that in the learning process our algorithm has the potential to adjust to additional unknown degradation factors and hardware specifications.
One possible application of the proposed method is to exploit the adaptive nature of its filtering stages to taylor the filter parameters to an individual patient, in a specific scan setup. Such step can be made in the scenario where repeated International Journal of Biomedical Imaging 3 CT scans are performed, for reasons of monitoring. Thus after the first full-dose scan, the X-ray exposure in the following procedures can be reduced by using filters trained on the image data very similar to that which is expected in the next scan. With the correct training protocol, there is no danger of overfitting the filters to the "healthy" images and thus to jeopardise detection of anomalies (an experiment, suggesting this fact, is reported later in this paper). In this way, a patient can avoid a substantial amount of X-ray exposure.
In our numerical experiments we compare the proposed algorithm against three existing ones. First is the optimally tuned FBP, which serves as a baseline; another is the nonlinear ATM filter for CT reconstruction, proposed by Hsieh in [8], and the third is the iterative, statistically-based PWLS reconstruction [6]. Our method is shown to outperform both the FBP and the ATM in the sense of image quality and robustness to changes of the anatomical region, and it is comparable to the statistically-based reconstruction. The comparison includes a visual display, a number of quantitative measures and evaluation of the local impulse response for each algorithm.
The paper is organized as follows. The mathematical description of the CT scan is given in Section 2. The learned shrinkage method is presented in Section 3. The new error measure is described in Section 4, laying the ground for our method for CT reconstruction, described in Section 5. A numerical study is given in Section 6. Section 7 concludes the paper.

Mathematical Model of the CT Scan
Our algorithm is designed in the setup of two-dimensional, parallel-beam scan geometry. An example of a scanned object is an axial slice (axial plane is the one parallel to the floor when the patient is standing) of the patient's body. The main part of a CT scanner is a rotating gantry, which has the X-ray source mounted against an array of detectors. During the scan, the gantry sweeps the angular range of [0, ], equally divided into a large number of projections or "views". For each angle , the one-dimensional array of detectors produces photon counts from rays that arrive in a fan-shaped beam from the source. Via a rebinning step, the rays are rearranged so that there is a comb of parallel rays for each projection. The acquired data is arranged into a 2D matrix, whose columns correspond to different angles, and the rows are assigned to different bins in each projection.
To describe the nature of measured data we use the compound Poisson-Gaussian statistical model that is assumed in [1,13] and is also empirically verified in [8]. Each measured photon count ℓ is viewed as an instance of the random variable ℓ given by where R is the Radon transform of the scanned image and the constant 0 is the photon count at the X-ray source. The Radon transform is defined on the collection of all straight lines ℓ through the object. For each ℓ, its value is the linear integral: The log-transformed photon counts ℓ = − log( ℓ / 0 ) are the approximate line integrals; the corresponding data matrix is called a sinogram since every point in the image space traces a sine curve in this domain. The filtered back-projection algorithm is based on the Radon inversion formula. First, the measurements are transformed to the Radon domain by the logarithm function: = − log( / 0 ). Then, a linear operator implementing the inverse of Radon transform is applied as follows: where R * is the adjoint of the Radon transform, also known as the back-projection operator: The filter F RL uses the Ram-Lak kernel [14], defined in the Fourier domain bŷ( In practice, an additional, low-pass filtering is performed in the Radon domain to reduce the high-frequency noise amplified by the Ram-Lak kernel. It is usually implemented by applying a Butterworth or a Shepp-Logan window in the frequency domain. The optimal parameters of the low-pass filter differ from one anatomical region to another. Their preset values are kept fixed in the clinical CT scanners, and the radiologist selects an appropriate one for each clinical study. As a study in [15] shows, the image properties depend nonnegligibly on this parameter.

Learned Shrinkage in a Transform Domain
We describe the learned shrinkage algorithm proposed by Hel-Or and Shaked in [12] for signal denoising. A popular Bayesian approach for recovery of a signal from measurements = + , contaminated with i.i.d. Gaussian noise, consists in solving the penalized least squares optimization problem,̂= and computing the signal estimate bŷ= D̂ [16]. Effectively, the sought signal is encoded in terms of the dictionary D (a linear transform, e.g., wavelets), whose properties encourage noise reduction. The left summand in this expression is a data fidelity term, and the right one expresses expected properties of the signal's coefficients. The classical assumption of a rapid decay of their magnitudes corresponds to the following penalty expression [17, Section 1.3]: 4 International Journal of Biomedical Imaging For a unitary transform D, the problem (6) is separable and admits a simple closed-form solution, which is described by a scalar shrinkage function S applied elementwise to the vector of coefficients D −1 . The formula for the shrinkage function is derived analytically from the expression for ( ) [18]. Thus, the signal estimatêhas the formula: This technique was pioneered by Donoho and Johnston [19], who developed it for the wavelet transform, and it is now widely applied in the broader context of nonunitary operators and even redundant dictionaries which are tight frames (see [12,16] for an overview). In those cases, any shrinkage operation can provide only an approximate solution to (6). For image denoising, the data dimensions are too large to process the entire image if D is not a computationally efficient structured transformation and also, while images vary wildly, small image patches fall into a well-structured statistical pattern. Therefore, the shrinkage idea is applied to an image denoising by extracting overlapping square patches and processing each of them separately. The overlaps help avoiding block artifacts and stabilize the filtering action. Each pixel is altered differently in each patch it belongs to; strong differences are tamed by averaging over all those patches.
Technically, a patch of size × , corresponding to a location in the signal matrix , is extracted by the linear operator E and is reinstalled (after a processing) into a signal-sized empty matrix by its transpose E ⊤ . Thus, the patchwise denoising action for a 2D signal is described bŷ Here M E = ∑ E ⊤ E is compensating for the overlapping by dividing each pixel by the number of patches containing it. When the dictionary D is a nonunitary full-rank matrix, the vector of coefficients is computed using the pseudoinverse D + of the dictionary. As mentioned before, in this case no exact solution for the shape of the shrinkage function is available. A practical solution for denoising in this setting is proposed in [12]: the shape of the shrinkage function in (9) is learned in an example-based process (rather than being defined descriptively), by optimizing an objective function. Also, for better results, it is preferable to use an array of shrinkage functions, corresponding to the structure of the transform D, rather than a single one. For instance, when an -levels wavelet transform is used, separate functions are dedicated to each level. The vector in this case is partitioned into subsets, each processed with an individual shrinkage function.
In [12], the shrinkage functions are modeled as linear combinations of splines of order 1. In other words, these are piecewise linear functions whose joints are configurable. The shrinkage functions S 1 , . . . , S are defined by two sets of vectors, q = [q 1 , . . . , q ] and p = [p 1 , . . . , p ]. The vector q is an evenly spaced sequence of numbers, covering the dynamic range of the th subset in ; each S is the antisymmetric piecewise linear function determined by the following equations: S (q ( )) = p ( ) , S (−q ( )) = −p ( ) , ∀ The antisymmetry assumption comes from the belief that only the absolute values of the coefficients should affect the amount of shrinkage applied. It was verified experimentally both in the work of Hel-Or and Shaked and in ours. The shrinkage operator now has a parameter set p, assuming a fixed set of domains q. We define an estimator G p,D for the signal , based on (9) with this addition: Let us denote by = (D + E ) the representation of the th patch in the noisy signal.
The objective function for tuning the parameter set p is the mean square error (MSE) of the signal estimatẽ= G p,D ( ), with respect to the true signal available at the training stage: Hel-Or and Shaked define the slice transform (SLT) which is applied to in order to reformulate the shrinkage operation as a linear function in p. Explicitly, a large sparse matrix q, encoding this data is designed ([12, Section IV]) to perform the shrinkage via a matrix-vector product: Using this approach, the optimization of the objective function (12) turns into a simple least squares problem, It is easily solved for p using the pseudo-inverse operator. An application of this method to image denoising, demonstrated in [12], shows very promising results. Moreover, the use of custom-built functions makes the shrinkage operation more robust and suitable for signal processing problems other than noise reduction. For instance, an algorithm for single image super resolution, proposed in [20] and based on the same principles, exhibits a state-of-the-art performance.
International Journal of Biomedical Imaging

Constructing the Error Measure.
Before considering an example-based training for CT reconstruction, one must establish a viable error measure in the image domain, minimization of which would lead to radiological images of a good quality. We consider a quantitative error measure for the deteriorated CT imagẽ(reconstructed with some algorithm), which uses the ground truth image . The basic choice for such measure is the mean square error (MSE): Pursuing a low MSE value is an accepted goal in the general field of image processing; however, we observe in our experiments that algorithms optimized for minimal MSE produce images with reduced spatial resolution. The reason is that the true image often contains large nearly constant regions, which calls for extensive smoothing. Fine details, washed off by this smoothing, do not increase the MSE notably, so its net value over the entire image is low. We therefore introduce an additional component to the error measure, which encourages the preservation of fine edges in the image.
Consider an edge between two homogeneous regions in the image, where the change in intensity is small comparing to the global dynamic range. If the filter applied blurs this edge, the MSE value is increased by a small amount; however, the gradient norm at the edge is much smaller than in the original image. Thus, it makes sense to penalize not only for difference in intensity values between the reference image 0 and the reconstructioñbut for the difference in gradient norms: However, the stated error measure is still reduced by oversmoothing the image: wherever the value of ‖∇̃( )‖ 2 2 is larger than ‖∇ 0 ( )‖ 2 2 (which happens a lot in the noisy imagẽ ), smoothing the imagẽwill reduce the error. Therefore, we restrict our error measure only to the regions where the original gradient norm is larger than the reconstructed one; those regions are most problematic in the sense of lost details. Thus, the following formula for a penalty component is proposed: The weight matrix , introduced here, is designed to remove all the locations where the reference gradient norm is above 2% of its maximal value. This (empirically chosen) threshold is applied in order to focus on the low-contrast edges and not to waste the learning capacity of the filter on the strong flesh-bone transitions. For a practical optimization we replace the function (max(0, ) has a noncontinuous derivative in 0) max(0, ) with a smoothed version ( , ), based on the Huber penalty function [21]: . The obtained error measure is named MSEg (mean square error, augmented with the g gradient), and its final expression is The reference image 0 will be omitted from the notation from now on. Using this function for training of data filters produces CT images, which combine low MSE value with high spatial resolution measure. In the following principle experiment, it is compared to the basic MSE penalty in order to demonstrate the effect of the proposed gradient-based term.

Empirical Evaluation of the Proposed Measure.
The following experiment shows the effect of the gradient-based term in (19) on the visual impression from the "optimized" image. The FBP algorithm is employed to perform CT reconstruction from noisy observations. The cutoff frequency 0 of the low-pass filter in the Radon domain is gradually increased to alter the variance-resolution tradeoff in the reconstructed image. In Figure 1 we display a graph of MSE values of the obtained image as a function of 0 and a graph of Q( 0 ,̃) values. Both are unimodal graphs with a single minimum. For Q( 0 ,̃), the minimum is obtained at a higher frequency than for the MSE measure. In Figure 2 we display reconstructions corresponding to these two optimal frequencies. A visual inspection led us to the conclusion that the minimal-MSE version (upper row, on the right) is a blurred image, where spatial resolution is sacrificed for noise reduction. The version minimizing the Q( 0 ,̃) penalty (lower row, on the left) is visually more appealing since it has a higher spatial resolution at the expense of stronger noise that is managed well by the human eye. For comparison, we also display the reference image (upper left) and two extreme cases corresponding to low/high cutoff frequency (middle column). The two conclusions drawn from this experiment are (1) if we add the gradient-based penalty Q( 0 ,̃) to the error measure and train the reconstruction chain to minimize its value, the obtained images are more informative to the human eye, with higher spatial resolution, and the cost is a higher noise level and (2) the values of Q( 0 ,̃) are lower by two orders of magnitude than the MSE; therefore, in a balanced total error term (19), the value of the weight should be around 100 in order for Q( 0 ,̃) to have an effect.

The Algorithm for CT Reconstruction
The algorithm consists of the sequence of steps, producing a CT image from the measured photon counts. They are stated here and detailed in the sequel.
(i) Data adjustment for signal processing: the photon counts are altered so as to enable an approximate modeling with only the Poisson random variable; then, they undergo the Anscombe transform [22] to normalize the noise variance.
(ii) Learned Shrinkage in the measurements domain: the 2D matrix of the adjusted measurements data is processed patchwise using the learned shrinkage algorithm (we modify the original method of Hel-Or and Shaked, adjusting it to indirect measurements).
(iii) Standard FBP: the FBP transform with no low-pass filter reconstructs the CT image from the restored measurements data.
(iv) Learned Shrinkage in the image domain: a different instance of the learned shrinkage algorithm is applied on the obtained image, producing the final outcome.
We now extend the discussion on each of these steps.

Adjusted
Measurements. The first step is to remove the Gaussian component from the measurements model stated in (1). Following [1], we compute the adjusted variableŝ: where the [ ] + = max{ , 0}. It is easily seen that the expectation and the variance of this distribution are equal to those of the single Poisson variable with the parameter̂ℓ = ℓ + 2 , and so for the purposes of noise normalization, we will assume that this is the distribution modeling for the adjusted measurementŝℓ. Also, we assume a zero-mean electronic noise ( = 0), and therefore the positivity correction is not relevant.
The variance of the Poisson random variable equals its expectation̂ℓ = ℓ + 2 and can be approximated by the measured valuêℓ. We assume that the noise reduction by the learned shrinkage works best when the noise is homogeneous (constant variance in all points). The reason is that the representations of all the patches in the data matrix are processed by the same scalar function, and if the noise energy in each patch was different, it would require a spatially varying shrinkage. In order to achieve unit variance in all the measurements, we use the Anscombe transform [22], intended to normalize the Poisson variable: To summarize, the overall data adjustment is performed elementwise by the following scalar function ( ):

The Objective Function for
Training. Let = ( ) denote the matrix of adjusted measurements. We state the expression for the imagẽreconstructed with our algorithm, before the postprocessing stage. The noise reduction in is done by the nonlinear filter G p = G p,D , defined in (11). Here the dictionary D is a fixed linear transformation (we use the unitary discrete cosine transform), and therefore it is omitted from the notation. After the filtering, the data is transformed to the Radon domain by applying the −1 ( ), followed by the − log function. Then the FBP operator T is applied to produce the CT imagẽ. To summarize, the image is computed as follows:̃p The objective function for the training of the parameter set p is the proposed error measure from (19), regularized with an additional factor: Here ‖p − q‖ 2 is a regularization term penalizing the deviation of each shrinkage function from the identity. Its purpose is to make the shape of the shrinkage functions more robust to outlier examples.
In order to minimize the function Γ p ( ) with respect to the argument p, we use the memory-efficient ℓ-BFGS convex optimization method [23], which requires computing the value and the gradient of the function being minimized. The implementation of the algorithm is by the courtesy of Mark Schmidt (see http://www.di.ens.fr/mschmidt/Software/minFunc.html). Since our objective function is not International Journal of Biomedical Imaging  convex, there is a theoretical question regarding the convergence of this numerical scheme. In practice, we have observed that the method converges in ∼100 iterations, and the obtained parameter set enables producing CT images of a good quality (in comparison to other reconstruction methods). The gradient of the function G ( ) (defined in (11)) with respect to p can be expressed with the help of the slice transform proposed in [12]. Recall that = D + E is a representation of the th patch in . The shrinkage operation S p is replaced by an equivalent matrix-vector multiplication q, p. Then G ( ) has the form which is linear in p. Now, if we consider the estimator̃p( ) as a function of̃= G ( ), we see that this is a composition of elementwise scalar function − log(1/ 0 ) −1 and the linear operator T. Thus, the gradient here is also easily computed. Finally, the gradient of the functional MSEg with respect toc onsists of 2 norms and derivative operators, so its expression is derived using standard methods. We conclude that the gradient ∇ p Γ p ( ) has a closed-form analytical expression and can be readily computed.
The expressions presented above involve only one image , just for a better readability; in practice, we may sum the errors over a training set of many images .
Another important remark is this: when this training objective is used, the parameters of the shrinkage operator S p are tuned to not to reduce the photon count noise in the measured data (via the processing of the adjusted data ) but to prepare the data in the best way for the specific reconstruction operator T. Here lies the key difference between this algorithm and existing methods for measurements denoising, which target high signal quality in the raw data but do not consider the final CT image.

Image Postprocessing.
At this point, after the preprocessing filter is tuned, we have a working reconstruction chain that produces CT images. The stated objective is pursued by its components somewhat indirectly, by changing the raw measurements. A further reduction of the reconstruction error can be achieved by administering another filter, acting on the obtained CT images themselves. We use again the method of learned shrinkage. The training data comprises of the set of CT images̃p( ), reconstructed from the original noisy measurements, as described before. The corresponding reference images serve again as the ground truth. Overlapping patches are extracted from an image, processed by the shrinkage functions in the transform domain and are reinstalled back. The noise statistics in the obtained CT images are difficult to estimate because of the preprocessing 8 International Journal of Biomedical Imaging stage. Therefore, we do not attempt to normalize the noise in the image patches.
We formulate the training objective for the parameter set p of the image domain shrinkage similarl to the case of measurements domain: Here the upper script denotes the image domain. The input imagẽis computed via the formula in (23) using the vector p of shrinkage parameters, learned earlier. This way the postprocessing is tuned for the very same kind of images it will be getting in the operational mode. As previously stated, the training stage consists of minimizing the value of Γ p (̃) with respect to p . This task is simpler than the optimization in the measurements' domain since no data adjustment or reconstruction is required. Using an expression for the gradient ∇ p Γ , we invoke again the ℓ-BFGS method to solve the optimization problem. The convergence here is faster than in the measurements domain, and it takes about 30 iterations to the convergence. We remark that the postprocessing method could be evaluated in its own right, when the corrupted input images may come from the standard FBP or from any other reconstruction method. This evaluation is left for a future study.

The Training
Set. The example-based training approach requires a collection of high quality reference images, each accompanied with a degraded set of measurements. It is preferable to compose the training set from clinical images obtained from a CT scan of a human body, rather than from images of synthetic phantoms. In any case, the example object has to be scanned twice: one time with a very high Xray dose to compute a high-quality reference image (using standard FBP, for instance), and another time with the low dose desired for the practical scan performed on patients. This configuration is feasible with human cadavers: there is no restriction on the X-ray dosage, and there is no problem of registration between the two consecutive scans for a still object (in our experience, with a clinical scanner of General Electric). Another approach for producing the training pairs is to start with given high-quality CT slices and simulate the low-dose measurements by reproducing the machine's X-ray operation as faithfully as possible. This is the approach we took in our work.
We suggest that the training set should be composed of CT images representing a specific anatomical region. This is in light of the observation that the characteristics of CT images vary substantially between different such regions. This leads to building a collection of learned parameter sets, specialized for head, lungs, abdomen, arms, and so forth. During the reconstruction, the operator of the CT scanner should choose the relevant version of the parameter set, just like it is done today with the different smoothing filters for the standard FBP. Nevertheless, in our numerical experiments we also show that a filter specialized on one anatomic region also makes a good performance on other ones. A deeper study of dependence of the learned filter on the training set should be carried out by professional radiologists.
A training set we build for a specific anatomic region consists of a sequence of axial slices from that region, uniformly distributed in -axis. The size of the training set is also a parameter to be investigated; we found that 9-12 images suffice for a stable optimization and a consecutive robust reconstruction of the thighs or the abdomen regions. However, in regions rich with small details where there are important but subtle differences between nearby slices (the brain, for instance) a larger collection of images is possibly required.

Computational
Complexity. The number of operations required for the × image reconstruction with our algorithm is O( 3 ), which is the same complexity as required by the regular FBP alone.
The measurements' matrix consists of 2 × √ 2 = 2.82 2 elements. The learned shrinkage applied to the measurements consists in applying the analysis operator D + at each patch of size × , then applying the scalar shrinkage function on each of the 2 coefficients and recomputing the patch by a multiplication with D. The dictionary D we use is the unitary 2D discrete cosine transform (DCT), which requires O(2 2 log( )) operations. We use patches of size = 11; thus, computing a representation of each patch requires about 2 2 log( ) = 580 operations and the same amount of work to convert the representation back to the signal. Further, applying a shrinkage function takes O( ⋅ 2 ) operations, where the constant is about 20, governed by the length of the vector q in (10). The number of patches, extracted from an × image, depends on the chosen amount of overlapping; up to 2 patches can be processed. To summarize, the application of the learned shrinkage filter has the computational complexity of O(4 ⋅ 2 2 (log( ) + 1)). The standard size of clinical CT images is = 512, so the number 2 (log( ) + 1) is of the same order of magnitude as (for = 11, this number equals 822). Therefore, it takes about 80 3 operations, which is comparable to the O( 3 ) complexity of the FBP transform.
We notice that the processing of individual patches can be naturally parallelized on multiple core or GPU, reducing the computation time by a nonnegligible factor (depending on the available hardware).  correspond to Hounsfield units (HU), given with the accuracy of 12 bits per pixel. Representatives of these sets we have assembled are displayed in Figure 3. We wish to point out that the reference images used for the training stage are not perfect since they were obtained using a standard X-ray dosage. If very-high-quality images were available for the training, we would expect our algorithm to perform better.

Empirical Study
In absence of raw measurements' data from a CT scanner we simulate the scan process by computing projections of given CT images (considered to be the ground truth) as follows. First, the intensity values in the image are converted from the Hounsfield units to the units of reciprocal length, corresponding to the linear attenuation coefficient . The relation between the two scales is (http://www.medcyclopaedia.com/library/topics/volume i/ h/hounsfield unit.aspx/) where (water) = 0.19 cm −1 , (air) = 0 cm −1 . The original 512 × 512 images are cropped to dimensions 461 × 461 by removing the empty background (to save computation time). Then, noiseless sinogram = R is simulated by applying to the reference image a pixel-driven implementation of the discrete 2D Radon transform. The algorithm for forward-and back projection uses linear interpolation in the locations of bins/pixels. Explicitly, each bin in a projection is a weighted sum of a few (temporary) finer bins, which are computed by integrating image intensities over a narrow (quarter of a pixel) ray in the image domain. The weights are linear in the distance between the center of the coarse bin and the centers of the fine bins. For × images, we have used views (projections), evenly distributed over the angle range [0, ]. Each projection consists of √ 2 bins. Ideal photon counts are computed from the sinogram entries via the relation ℓ = 0 − ℓ . The measured photon counts ℓ are produced by generating random Poisson variables with expectations ℓ and zeromean Gaussian variables with a chosen standard deviation . The X-ray dose is controlled by the maximal photon count 0 and the value of .
The design parameters of the proposed algorithm are set as follows. The filter, based on the learned shrinkage, is implemented using the 2D unitary discrete cosine transform (DCT) (see Figure 5). It is implemented using 2 elements composing a linear basis, which are matrices of dimensions × . The representation of a × signal is computed as the set of inner products between the signal and each basis element. This approach allows computing the representation of the entire collection of image patches in a batch, by convolving the image with each basis element. In our work we set = 11.
Each of the 121 corresponding shrinkage functions of the operator S p consists of 2 × 20 linear pieces (the factor of 2 is for the negative and the positive parts; recall that the shrinkage functions are antisymmetric by design, so there is just 20 degrees of freedom); this number was established empirically and is similar to the one used in [12]. Graphs of shrinkage functions, obtained in one of the training sessions, are displayed in Figure 4. The regularization parameters in (24), (26) were set to = 10 −4 , = 250. Use of the regularization has helped to constrain the shapes of the pre-and postprocessing shrinkage functions, avoiding jumps resulting from outlier samples. We discuss the process of their tuning in the sequel.
We mention that numerical experiments were also carried out with the nondecimated 3-level Haar Wavelet frame, but they are not presented here due to slightly inferior results (comparing to those obtained with the DCT). However, our impression is that a particular choice of the transform is not of a crucial importance. Another remark we shall make concerns the redundancy of the dictionary. We have compared the performance of the algorithm using the unitary DCT against its version with an overcomplete DCT. In our experiments, no improvement was induced by this change.

Implementation of the Existing Algorithms.
We implement three existing reconstruction algorithms. First is the standard FBP, with the classical noise reduction done by a sinogram smoothing. Second is an iterative, statisticallybased algorithm, approximating the penalized likelihood solution; specifically, it is the penalized weighted least squares (PWLS), proposed in [6]. The third method is the adaptive truncated median (ATM) filter [8]. It is close in its spirit to our work-a fast nonlinear preprocessing method, designed to improve the performance of FBP in a low-dose scan scenario.
6.2.1. Implementation of the FBP Algorithm. The FBP was realized according to the following formula: The low-pass filter F low is implemented by composing the Ram-Lak filter with the Butterworth window [25] in the Fourier domain. An expression for this window is The two defining parameters are the cutoff frequency 0 , which controls the frequency response of the filter and the order , which affects the steepness of its roll-off. In the experiments, these two parameters are tuned manually for the best visual impression on the training set (this issue is discussed in the sequel).

Implementation of the PWLS Algorithm.
We used the objective function stated in Equation (14) of [6], which represents an approximation to the log-likelihood function of the CT imagẽ, with an addition of a penalty component (̃). In our notation it is stated as follows: The expression for the regularization is Here ( ) is the convex edge-preserving Huber penalty and N( ) is the set of the four nearest neighbors of . Parameters , of the penalty component were tuned manually for best the visual impression on training images.

Implementation of the ATM Filter.
The ATM filter was briefly described in the introduction. Technically, its action on a location in a signal is defined by two parameters: the number of data values in the neighborhood of , involved in the filtering and the fraction of outliers assumed in this data. The filtered value at is computed as the average of this neighborhood, taken after removing the highest and the lowest values. This filter is made adaptive by using data-dependent parameters and , computed for each location of the measurements matrix. The formulas for ATM parameters, given in [8], involve the signal value (photon count) = ( ): In [8] it is not specified what is the shape of the neighborhood of the pixel , involved in its filtering. In our implementation, we assume it is a discrete disc. Also, no method for computing the inner parameters , , , is proposed; a set of prescribed values is given instead. In our implementation, these parameters are tuned to minimize the net MSE on the training set, by sweeping two-dimensional grids, built for different pairs of parameters. This is done in iterations, each time a different pair out of the four parameters is changed.

Visual Evaluation and Comparison of the Algorithms.
The existing and proposed methods are compared on the reconstruction of a test image of thighs' section. The noise on the projection data was generated by setting 0 = 1.5 ⋅ 10 5 , = 5. The visual comparison of is given in Figure 7, where the corresponding reconstructed images are presented. We display an enlarged region of the image, for better observability. The displayed dynamic range is set to [−220, 350] Hounsfield units (HU), chosen for best visualization of the particular axial slice. In general, there is no predefined HUwindow radiologists use to look at the CT images; the window is tuned manually in an interactive fashion, and it depends on the anatomical region, diagnostic purposes, and personal preferences. The reference image is depicted in Figure 6 in few different windows to display the effect of such tuning.
The FBP image (Figure 7, lower right) suffers from streak artifacts, which corrupt its content-especially, the fine details. The strength of the artifacts can be reduced at the cost of blurring the image; here the strength of the low-pass filter in FBP was tuned manually for visually plausible images. The ATM algorithm (middle right) displays a significant reduction in the strength of the streaks, which implies that most outliers in the Radon domain were removed. However, some residual noise and detailed corruption are still present.
The slow, iterative PWLS algorithm (upper row, right) provides a better version of the image, with improved sharpness and complete lack of streaks. The latter property is expected since the image is built as a MAP optimizer and there is no smear of the raw-data errors by the back projection. Still, there is a noise texture in the image. The PWLS performance can be tuned by varying the penalty weight, Huber parameter, and the number of iterations. These parameters were set manually to produce a high spatial resolution, at the cost of manageable noise; specifically, we used 90 iterations, and the value = 8 ⋅ 10 −5 in (31). Different, MSE-optimized reconstructions by the compared methods are given in Figure 16.
Finally, we refer to the two images produced by our method. The stage-I image (lower left) was obtained by applying the FBP to the preprocessed data. The postprocessing of this image results in the stage-II version (middle left). The FBP streaks are significantly reduced in stage-I image, similarl to the ATM method; however, some of the streaks are sharply visible around the bone area. The reason for their appearance is that the filter is not removing all of the problematic streak effects; since our method is based on FBP, some of its artifacts remain. Moreover, they are better visible on resulting the image due to reduced background noise and lack of the rest of the streaks. We should note that the general noise level is visibly lower than that appearing in the rest of the images. The noise is further reduced in stage-II image, without introducing additional blur.
The error images are displayed in Figure 8. With our method (stage II), the error image has less of the uniform noise than any of the three compared methods, and the edges appearing in the error image (they point to the loss of spatial resolution) are as weak as those observed in the PWLS reconstruction.
The structured similarity (SSIM) measure was introduced in [26] as an alternative to MSE which is more relevant to human perception of the images. The explicit formula involves first and second moments of the local image statistics and the correlation between the two compared images. In our numerical experiments, we have used the Matlab code provided by the authors of [26], which is available at http://www.cns.nyu.edu/∼lcv/ssim/. Finally, the MSEg measure is introduced in this paper, (19).
The SNR and MSEg values are measured in the range of [−220, 350] (HU). The MSEg value is detailed as a sum of the MSE component and the gradient-based component in order to show the balance between the two factors in the different cases. All three of the computed measure consistently point to the gradual quality improvement as one passes from FBP to ATM and to PWLS and finally to the proposed method.
To appreciate the effect of the postprocessing (stage II of our method), we display the absolute-valued difference between the two stages in Figure 9. Almost no structure can be observed in this error image, and this implies that very little of the image content is lost during the postprocessing.

Behavior in Different Anatomical
Regions. An important issue of the example-based training approach is the dependence of the reconstruction quality on the training set and the anatomical region. Recall that the reference images for training were taken from the thighs' region; we now use the trained shrinkage functions to restore axial head and abdomen sections. The compared reconstruction methods are also applied (without changing their parameters) to the new regions. The results are displayed in Figures 10-13. The test image from the head region, along with some other examples of head sections, is given in Figure 10. In the FBP reconstruction ( Figure 11, lower right), the quality of the fine details is reduced due to the streak noise; both PWLS (upper right) and our method (middle and lower left) exhibit better restoration of these details (see, for instance, the thin veinlike lines on the left and right sides of the image, as well as    the small bright spots in the upper central region). With our method the noise texture still exhibits streaks (since this is an FBP-based method), but the noise energy is lower than that in the PWLS, which is reflected in higher SNR value. The parameters of the ATM method should apparently be different for the head scan data since the resulting image is almost the same as the FBP without preprocessing. We conclude that ATM is sensitive to the choice of an anatomical region and has to be tuned using relevant training sets. Quantitative measures, presented in Table 1, also show the similarity between FBP and ATM images and point to improved quality of images produced by the shrinkage method. An exception is the MSEg measurement in the thighs' image, where the PWLS achieves a lower value of the gradient-based penalty.
In the abdomen image reconstruction ( Figure 13) the ATM performs better than the FBP but both ATM and PWLS do not succeed to reduce the noise like the shrinkage method does. Figures 12 and 14 depict the reconstruction error for the PWLS and our algorithm (other methods are omitted here out of space economy) and support the observations above.   Figure 15 with FBP, PWLS, and our method. Table 2 contains the standard quality measures-MSE and SSIM values for each noise level. The X-ray dose is increased exponentially from 0 = 3 ⋅ 10 4 to 0 = 3 ⋅ 10 6 (first to last rows in the figure), which results in linear improvement of the visual perception. The parameters of FBP were adjusted for each noise level; the parameters of the PWLS and our method are tuned for the exposure level of 0 = 1.5 ⋅ 10 5 . This is the noise level used to simulate the training set in our algorithm.
In the strongest noise (row 1), our approach exhibits more streaks than the PWLS version since it is an FBP-based method (because our methods was not trained for this level of noise). In other cases the visual impression is similar for both algorithms. As the X-ray intensity increased, the quality of images produced by our method rises promptly to attain a nearly perfect image at the highest exposure. This testifies for the robustness of the training procedure with respect to noise level since the reconstruction results are adequate for X-ray doses which are either significantly higher or lower than the dose in the training set. Notice that the parameters of the PWLS algorithm are not optimal for the lowest noise level, where an oversmoothing is observed. The SNR values of PWLS images lead the charts at very low and very high exposure levels, while in rows 2, 3 of the table our method shows superior results. The SSIM measure implies that the performance of PWLS is very close to that of our method. FBP is comparable to these two algorithms at high X-ray intensity and is way below in the low-dose scenario.
6.6. Aiming for Lowest MSE. Now we return to the question of whether MSE is an error measure relevant to radiological images. The images, presented earlier in Figure 7, were obtained by tuning the algorithms for high spatial resolution, at the price of noise level and the SNR value. Here we display three algorithms-FBP, PWLS, and our methodwith parameters optimized for maximal SNR (we were not able to control the ATM filter behavior that way). With FBP, it is achieved by tuning the cut-off frequency of its sinogram filter; for PWLS, we have increased the weight of the Huber penalty component. Our method is manipulated by tuning the weight of the gradient-based component in (24), (26).
In the upper row of Figure 16, reconstruction of the test image by those three methods with SNR-optimized parameters is displayed. The lower row contains the image versions from  ; they were produced with parameters optimized for visual perception. Specifically, the spatial resolution was improved at expense of a tolerable additional noise. This display is given to show the tradeoff between the noise reduction and spatial resolution and visualize our stimulus in pursuing the latter virtue rather than the former.

Detecting Lesions in Noisy Images.
A lesion detectability experiment is designed as follows: we add a small diskshaped blob in the homogeneous region of test image. The average intensity of the lesion is 105 HU on the background tissue of average 54 HU. The experiment is conducted in conditions of strong noise, concealing the lesion spot in the FBP reconstruction. Explicitly, it corresponds to 0 = 7 ⋅ 10 4 photons. In Figure 17 the reconstruction of a region, containing the lesion, is displayed. We compare our algorithm with the PWLS, ATM, and the FBP. The parameters of the learned shrinkage were trained offline on the training data with the same noise energy; the PWLS and ATM were used with the same parameters as earlier for the following reasons. For PWLS, a manual tuning of the smoothing weight did not result in any improvement of visibility. For ATM, there is no intuitive way to change the four parameters for a higher noise level. The FBP was used with same cut-off frequency but higher order of the Butterworth window, which made the lesion more observable. One can observe that the synthetic lesion (no similar structure was present in the training data) is recovered correctly and, in contrast with the FBP image, it can be clearly observed. PWLS produces an image which is more sharp and noisy than our result, and the ATM image is of a lower quality. The error images in Figure 18 (difference between the reconstruction and the reference) imply that the noise in the image produced by our method is lower than that in the PWLS reconstruction (this experiment is also the chance to compare the algorithms at a stronger noise). The lesion is not observed in any error image which means it is not lost in reconstruction; the problem with FBP is not that it fails to recover the lesion but the high streak-shaped noise obscuring it.

Design Parameters of the Proposed Method.
We now study the impact of various parameters appearing in the reconstruction chain. First we consider the objective function in (24). The weight controls the influence of the gradientbased component; when set to zero, the training leads to best MSE reduction. The influence of this component on the visual appearance of the image is observed in Figure 19. The presented sequence of images corresponds to values (these values are a subset of an exponential sequence of the range, swept in a numerical experiment. We chose these four values because they provide visibly different reconstructions.) = [14.3, 105.6, 205.5, 400]. As grows, the blur, introduced by the reconstruction, is reduced; however, new artifacts arise. They result from strong influence of the gradientbased component. After a finer tuning we chose to use = 100, which produces the most visually appealing image. Its sharpness is near best, and artifacts are on the level of background noise. For real-life application, a more elaborate study by a radiologist may be needed to tune this parameter for clinical needs.
Another aspect of the training process is the regularization weights in (24) and in (26), which restrict the deviation of the shrinkage functions from identity. In general, using such regularization is a good practice to increase the robustness of the method by preventing the overfitting. Also, this helps reducing the impact of outlier examples on the shape of shrinkage functions. In our experiments, the influence of those regularization terms was not significant: when the weights , are decreased, the effect of learned shrinkage becomes stronger but no negative phenomena appear. This is observed in Figure 20, where a number of versions of a test image, associated with different values of , are shown.
6.9. Local Impulse Response. The evaluation of spatial resolution in the CT images is carried out by computing the local impulse responses (LIR) of the projection-reconstruction operator in the image domain. The reference image is projected twice, one time in its original form and another with a random set of 212 single-pixel implants, scattered randomly in the image. The intensity of the implants is set to the maximal value present in the image. In Figure 21 we display an example region in a test image with added spikes and the corresponding maps of LIRs obtained by subtracting the two reconstructions-with and without spikes-by the compared methods. All the parameters of the compared methods are set as in the very first experiment. For each method, the fullwidth half-maximum (FWHM) measure is computed in all the locations and averaged. It is defined as follows: first, a 2-D image patch containing the response spot is resized into a ×16 larger image in order to reduce the discretization effect. Then, the number of pixels, which intensity is higher than half of the maximal value in the patch, is counted and divided by the refinement factor of 16.
The computation was done for FBP, PLWS, and our algorithm. An average FWHM value produced by stage I of our method is 2.11 pixels; the resolution is slightly improved to 2.06 pixels by the stage II. Notice that this postprocessing step simultaneously reduces the noise and increases the image sharpness; this virtue is attributed to the design of our error measure. The FBP exhibits the same average FWHM value-2.04 pixels. In Figure 21 both FBP and our method are seen to produce disk-shaped LIRs without distortions everywhere in the image. The situation is different with the data-adaptive PWLS, which achieves lower FWHM values-1.61 pixels on   while keeping the level of image quality. We estimate the reduction factor by comparing the SNR and MSEg values of the reconstructed image, acquired with different X-ray doses (controlled by the source intensity 0 ). The comparison is conducted between the standard FBP and the proposed reconstruction method. For each dose level, we tune the FBP parameters to choose a reconstruction with minimal MSEg value. In a second experiment, for each noise level, the FBP is tuned to maximize the SNR value. In Figure 23 we present the resulting comparison of these two measures for a low dose and a regular dose scan. The -axis is scaled to show the dose reduction factor, while the MSEg or SNR values change along the -axis. In the SNR graph we plot the values achieved by SNR-optimized FBP, and, in MSEg graph, the performance of MSEg-optimized FBP is displayed. Those are compared to the single SNR/MSEg value, achieved by our algorithm in the noise level it was trained for; the -axis of the graphs was scaled to display the ratio between the X-ray dose levels for FBP and our method. All the four graphs point to the effective dose savings of factor ∼4 when switching from the optimally tuned FBP to our method.

Discussion and Conclusions
We have introduced a practical CT reconstruction algorithm which performs a non linear processing of the measurements and the reconstructed image. Both actions are aimed at high-quality reconstruction from data corrupted with shot noise and electronic noise. The defining parameters of the learned shrinkage are trained in an offline session on a set of available reference images. When applied to deteriorated measurements of new images, the algorithm produces a reconstruction which improves upon the standard FBP output and the nonlinear ATM filter and is comparable to the iterative PWLS reconstruction.
The learned shrinkage in the transform domain is a nonlinear two-dimensional filter applied in the domain of the noisy measurements. It is shown to be capable of substantially reducing the streak artifacts caused by the measurements' noise. Further postprocessing action is essentially a classical image denoising task, which is carried out without a comprehensive noise model and is aimed to minimize a specific error measure. It is also performed with the learned shrinkage. Our observations, supported by quantitative measures, point to the quality improvement this technique brings to the reconstructed images.
We remark that a potentially greater quality improvement would be achieved by exploiting data correlation in three dimensions (instead of processing 2D slices), similarl to three-dimensional adaptive filtering presented in [9]. Our algorithm naturally generalizes to the 3D setup-the extension would involve 3D discrete cosine transform and a quality measure computed in some volumes of the training data.
As with any algorithm, based on supervised learning, there is a concern of whether the medical anomalies and special objects will be faithfully recovered. In our simulations, the visual comparison of the images reconstructed with our method to the reference images confirms that the content is faithfully recovered. Also, we point to the fact that the processing is performed locally (11 × 11 squares) and in a transform domain; the action of the shrinkage operation is of statistical rather than geometric nature; thus it is improbable and it will be biased by specific anatomical structures. Still, only the practical realization of the algorithm and verification by clinical radiologists can resolve this concern.
The proposed algorithm requires no hardware changes in a working CT scanner and can be easily incorporated into the reconstruction software of one; thus, in practice it can be implemented in existing clinical machinery with small effort.