A Medical Image Fusion Method Based on SIFT and Deep Convolutional Neural Network in the SIST Domain

Traditional medical image fusion methods, such as the well-known multi-scale decomposition-based methods, usually suffer from poor sparse representations of the salient features and from fusion rules that transfer the captured feature information poorly. To deal with this problem, a medical image fusion method based on the scale-invariant feature transform (SIFT) descriptor and the deep convolutional neural network (CNN) in the shift-invariant shearlet transform (SIST) domain is proposed. First, the images to be fused are decomposed into high-pass and low-pass coefficients. Then, the high-pass components are fused under a rule based on the pre-trained CNN model, which mainly consists of four steps: feature detection, initial segmentation, consistency verification, and the final fusion; the low-pass subbands are fused according to the matching degree computed by the SIFT descriptor, which captures the features of the low-frequency components. Finally, the fusion results are obtained by the inverse SIST. Taking the standard deviation, $Q^{AB/F}$, entropy, and mutual information as the objective measurements, the experimental results demonstrate that the proposed method preserves detailed information without artifacts and distortions and also achieves better quantitative performance.


Introduction
The pathology information displayed by medical imaging of different modalities plays a key role in modern medical diagnosis. Unfortunately, because of their different imaging principles, no single imaging device can capture the full information at the same time [1]. Therefore, doctors have to spend more time and energy reading the medical information they need from different devices. A common way to deal with this problem is to fuse the multi-modal images of the same body location into one image, which is called medical image fusion and has been widely used in medical image analysis, precision radiotherapy surgery, and computer-aided medical diagnosis [2].
Various medical image fusion methods have been proposed, and they can be roughly classified into two categories: methods in the spatial domain and methods in the transform domain. Unlike the former, which directly apply algebraic operations or filtering, the latter capture more features at different scales and in different directions and are the current research hotspot. Such a scheme usually contains three steps: decomposition, combination, and reconstruction [3].
From the fusion procedure, it is clear that the fusion performance is largely determined by the decomposition tools and the fusion rules. The tools provide the sparse representations of the features, and the fusion rules transfer the features into the final fusion results. For the decomposition tools, the Laplacian pyramid transform cannot provide directional information; the typical wavelet transform can only decompose the images into three high-pass subbands in each level, so it is limited by the number of directions. The contourlet can get more directional subbands in each level, but its lack of shift-invariance easily results in the pseudo-Gibbs phenomenon [4]. For the fusion rules, the activity level measurement-based rule [5] is popular; however, it easily produces artifacts. Though some other fusion rules have been proposed, such as those based on SVM [6], PCA [7], and ICA [8], the fusion results are still unsatisfactory. It is important to consider the feature information during the implementation of the fusion rules [9].
Recently, there has been some good work on improving the fusion performance from these aspects. For example, the work in [10] proposes a multi-modality image fusion method in the non-subsampled contourlet transform (NSCT) domain, in which the high-pass subbands are integrated by a phase congruency-based rule and the low-pass subbands are combined by a local Laplacian energy-based rule. The work in [11] proposes an image fusion framework that integrates the NSCT into sparse representation and applies principal component analysis (PCA) in dictionary training to reduce the dimension of the learned dictionary. The low-pass and high-pass coefficients are fused by the sparse representation and the Sum Modified-Laplacian, respectively. In [12], the source multi-modality images are decomposed into cartoon and texture components. The cartoon components are combined by an energy-based fusion rule for morphological structure preservation, and the texture components are combined by dictionary training. Some similar fusion schemes can be found in [13,14]. Such schemes provide good fusion results because they make full use of the good mathematical properties of the NSCT and the learning ability of dictionary learning to capture the important features. The main disadvantage, however, is the time cost. The NSCT, the shift-invariant version of the contourlet transform, is too time-consuming because it has to use non-subsampled band-pass filters to achieve shift-invariance. Dictionary training, in turn, is sensitive to the size and the dimension of the dictionary, which easily leads to the curse of dimensionality in the fusion.
Another research hotspot is neural network-based medical image fusion. Some good results have been reported, such as the artificial neural network-based method and the pulse coupled neural network (PCNN)-based method [15]. However, the fusion performance of traditional neural network models is limited by the difficulty of tuning the parameters and choosing the number of layers. Very recently, deep learning technologies such as the deep convolutional neural network have achieved great success in image classification and target recognition, as well as in image fusion. For example, the work in [16] proposes a novel image fusion algorithm based on a deep support value convolutional neural network, the work in [17] proposes medical image fusion with an all convolutional neural network, and the work in [18] proposes a general image fusion framework based on a convolutional neural network, which is called IF-CNN. The surveys in [19,20] review the recent advances and future prospects of deep learning for pixel-level image fusion. These methods obtain good results because their learning ability is better than that of the traditional neural network models. However, they learn directly at the pixel level and lose important feature information.
To deal with the above problems, a medical image fusion method based on the SIFT and the CNN in the SIST domain is proposed. Unlike other transformation tools such as the wavelet and the contourlet, the SIST can decompose the images into high-pass and low-pass subbands to extract more useful features at different scales and in different directions. Besides, the SIST offers the same shift-invariance as the NSCT at a higher computational efficiency. To make full use of the features in the source images, the fusion rule for the low-pass subbands is based on the matching degree of the SIFT descriptor. The SIFT feature is based on local points of interest on the object and is independent of the size and rotation of the image, so its tolerance to noise and small viewpoint changes is quite high [21]. From this point of view, it is more suitable than structure features for medical image fusion. The fusion of the high-pass subbands uses a CNN-based scheme to exploit the good learning ability of the CNN model. The rest of the paper is organized as follows. The details of the proposed method are given in Section 2. Experimental results and discussion are presented in Section 3. Finally, the conclusion is given in Section 4.

Methodology
The whole procedure of the proposed medical image fusion method is described in Figure 1. After the decomposition and directional partition at different scales, the coefficients of the source medical images are obtained. Then, the high-pass and low-pass coefficients of the fused image are produced by the corresponding fusion rules. Finally, the fusion results are obtained. The rationale of the proposed method can be explained from three aspects: first, as a sparse representation tool for medical image fusion, the SIST has better mathematical properties and provides good representations of the important features; second, the traditional fusion rules easily lose the captured feature information when transferring it into the final results; third, the transferred feature information is only low-level and is not abstract enough for feature fusion. Therefore, considering the above needs, a CNN model is pre-trained to get the deep and abstract features in the SIST domain, and a SIFT-based fusion rule is developed.

The Shift-Invariance Shearlet Transform.
The discretization of the SIST mainly consists of two steps [21]: the multi-scale partition and the directional localization. To provide the shift-invariance, the former step is done by the non-subsampled pyramid filters, and the latter step is implemented by the shearing filters. Let $j$ be the scale of image decomposition, $j = 1, 2, \ldots, M$; the whole process can be summarized as the following steps.
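(1) The image $f_j$ is decomposed into the low-pass image $f_{j+1}$ and the high-pass image $g_{j+1}$ using the non-subsampled pyramid filters.
(2) The shearing filters are applied to the high-pass image $g_{j+1}$ in the frequency domain to localize it in different directions.
(3) The Cartesian sampled values are directly re-assembled, and the inverse 2-D FFT is applied to produce the SIST coefficients.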
The inverse transform is the reverse of the forward transform. Since, unlike the NSCT, there is no need to use directional band-pass filter banks to get the different directions, the SIST is more efficient. More details about the implementation can be found in [22,23].
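The multi-scale partition step can be illustrated with a minimal sketch. It is not the SIST implementation: it assumes a simple separable [1, 2, 1]/4 low-pass filter with circular borders in place of the non-subsampled pyramid ("maxflat") filters, and it omits the shearing filters entirely. The function names `smooth`, `nonsubsampled_pyramid`, and `reconstruct` are hypothetical. The point it shows is that dilating the filter instead of downsampling the image keeps every subband at the source size and makes the decomposition shift-invariant.

```python
import numpy as np

def smooth(img, step):
    """Separable [1, 2, 1]/4 low-pass filter with taps `step` pixels apart.

    Assumption: circular borders (np.roll) stand in for proper border handling.
    """
    out = img
    for axis in (0, 1):
        out = (0.25 * np.roll(out, step, axis)
               + 0.5 * out
               + 0.25 * np.roll(out, -step, axis))
    return out

def nonsubsampled_pyramid(img, levels):
    """Return (low_pass, [high_pass_1, ..., high_pass_levels]), all of the input size."""
    low = img.astype(float)
    highs = []
    for j in range(levels):
        next_low = smooth(low, 2 ** j)  # dilate the filter instead of downsampling
        highs.append(low - next_low)    # detail removed between scale j and j + 1
        low = next_low
    return low, highs

def reconstruct(low, highs):
    """Inverse of the partition: add the high-pass images back onto the low-pass one."""
    return low + sum(highs)

rng = np.random.default_rng(0)
image = rng.random((64, 64))
low, highs = nonsubsampled_pyramid(image, levels=4)
restored = reconstruct(low, highs)

# Shifting the input only shifts the subbands (circularly, in this sketch).
shifted_low, _ = nonsubsampled_pyramid(np.roll(image, 3, axis=1), levels=4)
```

Because nothing is decimated, the reconstruction is a plain sum and the subbands of a shifted image are the shifted subbands of the original, which is the property that suppresses pseudo-Gibbs artifacts.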

The Fusion of the High-Pass Subbands.
The procedure of the high-pass fusion is shown in Figure 2. Before the fusion, a CNN model is trained on the pre-fused images. The whole fusion process mainly consists of four steps: feature detection, initial segmentation, consistency verification [24][25][26], and the final fusion. In the first step, the high-pass subbands are input into the CNN model to output the score map, which contains the feature information of each high-pass subband. Each coefficient in the score map represents the feature attribute of a pair of corresponding blocks from the two high-pass subbands.
Then, by averaging the overlapping regions, a feature map of the same size as the subbands is obtained from the score map. In the second step, the feature map is segmented into a binary map with a threshold. In the third step, the consistency verification refines the binary segmentation map to generate the decision map. Finally, the fused image is obtained by applying the pixel-weighted scheme with the decision map.

Train the CNN Model.
For a pair of medical image patches {A, B} from the pre-fused images, the goal is to learn a CNN model whose output is a scalar ranging from 0 to 1. Specifically, when the feature comes almost entirely from A rather than B, the output value should be close to 1; otherwise, it should be close to 0. In other words, the output represents the feature degree of the pair of image patches. Therefore, a large number of example pairs are used as training examples.
Figure 3 shows the structure of the trained CNN model. It has two branches of identical architecture, each of which takes a medical image block as the input. According to [27], it is suitable to set the size of the image block to 16 × 16. There are three convolution layers and a max-pooling layer in each branch of the network. The receptive field of a neuron is determined by the kernel size of the convolution layer. In this paper, the kernel size is set to 3 × 3, the convolution stride is set to 1, the pooling window is set to 2 × 2, and the pooling stride is set to 2.

The Feature Detection.
Let $A_H$ and $B_H$ be the two high-pass subbands; a score map is obtained once $A_H$ and $B_H$ are input into the constructed CNN model. The value of each coefficient in the score map ranges from 0 to 1 and indicates the feature degree of a pair of 16 × 16 blocks. The closer the value is to 1, the more the features of the block pair come from $A_H$, and vice versa. To generate a feature map (denoted by $M$ here) of the same size as the subbands, the value of each coefficient in the score map is assigned to all the coefficients of the corresponding block in $M$, and the overlapping pixels are averaged.
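The step from score map to feature map can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `score_map_to_feature_map` is hypothetical, and the block stride of 2 pixels is assumed (it would follow from a single pooling layer with stride 2, but the text does not state it). Each score is spread over its 16 × 16 block, and pixels covered by several blocks receive the average of the scores.

```python
import numpy as np

def score_map_to_feature_map(score_map, image_shape, block=16, stride=2):
    """Spread each score over its block and average where blocks overlap.

    Assumption: score_map[i, j] belongs to the block whose top-left corner
    is at (i * stride, j * stride) in the subband.
    """
    acc = np.zeros(image_shape, dtype=float)    # running sum of scores per pixel
    count = np.zeros(image_shape, dtype=float)  # number of blocks covering each pixel
    rows, cols = score_map.shape
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            acc[y:y + block, x:x + block] += score_map[i, j]
            count[y:y + block, x:x + block] += 1
    return acc / np.maximum(count, 1)           # guard against uncovered pixels

# A 32 x 32 subband holds (32 - 16) / 2 + 1 = 9 block positions per axis.
score_map = np.full((9, 9), 0.8)
feature_map = score_map_to_feature_map(score_map, (32, 32))
```

With a constant score map the averaging leaves the value unchanged, which is a quick sanity check that the overlap counts are handled correctly.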

Initial Segmentation.
To retain as much useful information as possible, the choose-max strategy is applied to the feature map. Based on experience, a fixed threshold of 0.5 is applied to the feature map to generate the binary map; that is, the focus map is segmented by the following formula:

$$T(x, y) = \begin{cases} 1, & M(x, y) > \tau, \\ 0, & \text{otherwise}, \end{cases} \quad (1)$$

Journal of Healthcare Engineering
where $M$ is the focus map, $T$ is the binary map, and $\tau$ is the threshold. In our experiments, a threshold of 0.5 was found to be good enough for medical image fusion, and a value in the range from 0.4 to 0.7 is suggested for other multi-modal image fusion tasks.
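The segmentation of the focus map into the binary map is a single comparison. The sketch below assumes a strict inequality at the threshold, which the text does not specify; the function name `binarize` is hypothetical.

```python
import numpy as np

def binarize(feature_map, tau=0.5):
    """Binary map T: 1 where the focus map M exceeds tau, 0 elsewhere.

    Assumption: values exactly equal to tau are mapped to 0.
    """
    return (feature_map > tau).astype(np.uint8)

feature_map = np.array([[0.9, 0.5],
                        [0.2, 0.7]])
binary_map = binarize(feature_map)        # tau = 0.5, the value used for medical images
looser_map = binarize(feature_map, 0.4)   # lower end of the suggested 0.4 to 0.7 range
```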

The Consistency Verification and Fusion.
There may be some misclassified pixels in the above binary map, so it is necessary to remove these mistakes. The traditional way to deal with this problem is a threshold scheme, but it easily produces unexpected artifacts around the boundary between the focused and defocused regions. Therefore, the guided filter [28], an effective edge-preserving filter that retains the structural information, is employed.
There are two free parameters in the guided filtering algorithm: the local window radius $r$ and the regularization parameter $\varepsilon$. In this paper, $r$ is set to 8 and $\varepsilon$ is set to 0.1. More details about its implementation can be found in [29]. Finally, the fused high-pass subbands can be obtained by the following weighted formula:

$$F_H(x, y) = D(x, y)\,A_H(x, y) + \bigl(1 - D(x, y)\bigr)\,B_H(x, y), \quad (2)$$

where $F_H$ is the high-pass subband of the fused image, $D$ is the decision map, and $A_H$ and $B_H$ are the corresponding high-pass subbands of the images to be fused, respectively.
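The refinement and the weighted fusion can be sketched together. The guided filter below follows the standard box-filter formulation with the stated r = 8 and ε = 0.1. Two choices are assumptions not given in the text: the guidance image is taken as the mean of the two subbands, and the filtered map is clipped to [0, 1] before weighting. All function names are hypothetical.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r + 1) x (2r + 1) window, edges replicated, via an integral image."""
    k = 2 * r + 1
    padded = np.pad(img, r, mode="edge")
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    window_sum = (integral[k:, k:] - integral[:-k, k:]
                  - integral[k:, :-k] + integral[:-k, :-k])
    return window_sum / (k * k)

def guided_filter(guide, src, r=8, eps=0.1):
    """Edge-preserving smoothing of `src`, steered by the structure of `guide`."""
    mean_g = box_mean(guide, r)
    mean_s = box_mean(src, r)
    var_g = box_mean(guide * guide, r) - mean_g ** 2
    cov_gs = box_mean(guide * src, r) - mean_g * mean_s
    a = cov_gs / (var_g + eps)   # local linear coefficient
    b = mean_s - a * mean_g      # local offset
    return box_mean(a, r) * guide + box_mean(b, r)

def fuse_high_pass(a_h, b_h, binary_map, r=8, eps=0.1):
    """Refine the binary map into a decision map D, then blend the two subbands."""
    guide = 0.5 * (a_h + b_h)    # assumption: the text does not name the guidance image
    decision = np.clip(guided_filter(guide, binary_map.astype(float), r, eps), 0.0, 1.0)
    return decision * a_h + (1.0 - decision) * b_h

rng = np.random.default_rng(1)
a_h = rng.standard_normal((32, 32))
b_h = rng.standard_normal((32, 32))
fused_from_a = fuse_high_pass(a_h, b_h, np.ones((32, 32)))
fused_from_b = fuse_high_pass(a_h, b_h, np.zeros((32, 32)))
```

A uniform binary map passes through the filter unchanged, so the fused subband then equals one of the sources; mixed maps are smoothed only where the guidance image has no edge.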

The Fusion of the Low-Pass Subband.
The fusion of the low-pass subband is based on the matching degree of the SIFT descriptors [29,30]. Suppose $\mathrm{fdesc}_1(i)$ and $\mathrm{fdesc}_2(j)$ are the SIFT descriptors from the low-pass subbands of the two images to be fused, where $i \in \{1, \ldots, m\}$, $j \in \{1, \ldots, n\}$, and $m$ and $n$ are the total numbers of SIFT descriptors in the two subbands, respectively. Then, compute the distance $\mathrm{dist}(i, j)$ between $\mathrm{fdesc}_1(i)$ and $\mathrm{fdesc}_2(j)$, and sort all the distances. Let $\mathrm{2ndBigDist}(i, j)$ be the second largest value; if $\mathrm{dist}(i, j) < \mathrm{2ndBigDist}(i, j)$, the two SIFT descriptors are said to be matched.
If $\mathrm{fdesc}_1(i)$ and $\mathrm{fdesc}_2(j)$ are matched, record their respective locations. If the locations are also the same, both the content and the location of the regions from which the SIFT descriptors were computed are the same [20]. Finally, the SIFT descriptors that meet the above conditions are recorded to generate a matching degree map $\mathrm{match\_map}(i)$, where $1 \le i \le n$. The low-pass subband of the fusion result can be obtained by the following formula:

$$F_L(x, y) = \mathrm{match\_map}(i)\,A_L(x, y) + {\sim}\mathrm{match\_map}(i)\,B_L(x, y), \quad (3)$$

where "∼" means the negation, $F_L$ is the low-pass subband of the fused image, and $A_L$ and $B_L$ are the corresponding low-pass subbands of the images to be fused, respectively.
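The matching step can be sketched on synthetic descriptors. This is an illustration under assumptions, not the authors' procedure: it swaps in the widely used nearest-neighbour ratio test (best distance below a fraction of the second-best distance) for the sorted-distance criterion described above, the ratio of 0.8 is a hypothetical value, and the random 128-dimensional vectors merely stand in for real SIFT descriptors. A pair counts as matched only if the descriptors agree and the keypoint locations coincide.

```python
import numpy as np

def pairwise_distances(desc1, desc2):
    """Euclidean distance dist(i, j) between every descriptor pair."""
    diff = desc1[:, None, :] - desc2[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def match_descriptors(desc1, loc1, desc2, loc2, ratio=0.8):
    """Return (i, j) pairs that pass the ratio test and share the same location.

    Assumption: desc2 holds at least two descriptors, so a second-best exists.
    """
    dist = pairwise_distances(desc1, desc2)
    matches = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])
        best, second = order[0], order[1]
        same_content = dist[i, best] < ratio * dist[i, second]
        same_location = np.array_equal(loc1[i], loc2[best])
        if same_content and same_location:
            matches.append((i, int(best)))
    return matches

rng = np.random.default_rng(2)
desc1 = rng.random((5, 128))                     # stand-ins for SIFT descriptors
loc1 = np.array([[3, 4], [10, 20], [30, 5], [40, 40], [60, 12]])
perm = [2, 0, 4, 1, 3]                           # second image lists the same keypoints reordered
desc2 = desc1[perm] + 0.001 * rng.standard_normal((5, 128))
loc2 = loc1[perm]

matches = match_descriptors(desc1, loc1, desc2, loc2)

# Mark the matched keypoint locations in a map of the subband size.
match_map = np.zeros((64, 64), dtype=bool)
for i, _ in matches:
    match_map[loc1[i, 0], loc1[i, 1]] = True
```

Moving a keypoint in one image breaks the location condition, and that pair is then dropped even though its descriptors still agree.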

Results and Discussion
In this section, experiments in six groups are carried out to show the performance of the proposed fusion method. Before the experiments, a CNN model is first trained on the public medical data sets LIDC and Whole Brain Atlas and on the natural image data set ImageNet. All the images are downloaded and pre-processed to the same size of 256 × 256. To get the parameters of the CNN model, 2000 medical images from LIDC, 3000 natural images from ImageNet, and 200 medical images from the Whole Brain Atlas are used to produce one sub-model each, and the final model is integrated from the three sub-models. The experimental platform is the INSPUR big data processing server NF5280M5 with an Intel Xeon CPU and 128 GB RAM.
Four well-known medical image fusion methods, i.e., the Pulse Coupled Neural Network-based method (denoted as PCNN) [31], the convolutional sparse representation-based method (denoted as CSR) [32], the Shearlet-based method (denoted as Shearlet) [33], and the Deep Convolutional Neural Network-based method (denoted as DCNN) [34], are compared with the proposed fusion method (Proposed for short). All their parameters are set as reported in the corresponding literature. The decomposition level of the SIST is set to 4, and the filters are all "maxflat." After the decomposition, 32, 32, 16, and 16 high-pass subbands are obtained at the four levels, respectively.
There is no gold standard for evaluating image fusion at present. The usual approach is to combine subjective visual comparisons with objective quantitative comparisons. This convention is also followed in our paper. Standard deviation (SD for short), entropy (En for short), mutual information (MI for short), and $Q^{AB/F}$ are used as the objective evaluation measurements. SD measures the dispersion of the pixel values around the mean value. En shows how much information the image itself contains. MI shows how much information the fused image captures from the source images; $Q^{AB/F}$ measures the edge information transferred from the source images to the fused image. The greater the value of these measurements, the better the fusion results [35].
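Three of the four measurements are short enough to sketch directly; $Q^{AB/F}$ is omitted because it needs an edge-strength and orientation model. The sketch assumes 8-bit gray levels and 256 histogram bins, and it assumes the common convention that the MI of a fusion result is the sum of the MI between the fused image and each source. The function names are hypothetical.

```python
import numpy as np

def standard_deviation(img):
    """SD: spread of the pixel values around their mean."""
    return float(np.std(img))

def entropy(img, bins=256):
    """En: Shannon entropy (in bits) of the gray-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y, bins=256):
    """MI (in bits) between two images, from their joint gray-level histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def fusion_mi(src_a, src_b, fused):
    """Assumed convention: MI of a fusion result is MI(A, F) + MI(B, F)."""
    return mutual_information(src_a, fused) + mutual_information(src_b, fused)

rng = np.random.default_rng(3)
src_a = rng.integers(0, 256, size=(64, 64))
src_b = rng.integers(0, 256, size=(64, 64))
fused = (src_a + src_b) // 2              # placeholder fusion, for the metrics only

scores = {
    "SD": standard_deviation(fused),
    "En": entropy(fused),
    "MI": fusion_mi(src_a, src_b, fused),
}
```

Two sanity checks follow from the definitions: a constant image has zero SD and zero entropy, and the MI of an image with itself equals its entropy.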
In Figure 4, the three groups in the first row are the gray CT and MRI images, and the three groups in the second row are the color CT and PET images from patients with anaplastic astrocytoma and mild Alzheimer's disease, respectively. The data sets contain 20, 15, and 21 slices, respectively. To save space, only part of the fusion results are shown in Figures 5 and 6, and the averages of the objective evaluation are shown in Tables 1 and 2, respectively.
Though the fused images express more information than the source images, the fusion results differ. Comparing the regions marked by the arrows of different colors in Figures 5 and 6 shows that the edges produced by the PCNN method are obviously blurred, and individual character details (labeled by the blue arrows) and contour features (labeled by the yellow arrows) have been lost. For the Shearlet and DCNN methods, the results are clear enough because of their good learning ability, but the detail and texture are not well preserved (labeled by the red and green arrows). The main reason is that the learning is done directly at the pixel level. In contrast, the detail and texture structures in the fusion results obtained by the proposed method are much clearer, and the ghosting phenomenon is effectively eliminated. It can be seen that the proposed method is better than the CSR and Shearlet methods in detail processing. In particular, comparing the regions marked by the yellow arrow in Figure 6 for the DCNN and the proposed method shows that the proposed method keeps the information as it appears in the source data.
In addition, the objective evaluation in Tables 1 and 2 shows that the proposed method achieves higher values than the other methods under all four indicators, which further verifies that more feature information is captured and transferred into the fusion results by the proposed method, giving a better visual impression. All the results indicate that more detail features from the source images are captured and transferred well into the final results by the proposed method.

Conclusion
Based on the SIST, the SIFT, and the CNN, this paper proposes a medical image fusion method that makes full use of the multi-resolution and multi-directional characteristics of the SIST and also exploits the self-learning advantages of the CNN. The objective analysis and the subjective comparison in the experiments show that the target information and contour features are well displayed in the final results, and that artifacts and distortions are effectively suppressed. Compared with other well-known fusion methods, such as the PCNN-based, DCNN-based, and sparse representation-based methods, the proposed method gets better fusion results.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.