Video Denoising Based on a Spatiotemporal Kalman-Bilateral Mixture Model

We propose a video denoising method based on a spatiotemporal Kalman-bilateral mixture model to reduce the noise in video sequences captured in low light. To take full advantage of the strong spatiotemporal correlations between neighboring frames, motion estimation is first performed between previously denoised frames and the current noisy frame using a block-matching method. Then, the current noisy frame is processed in the temporal domain with a Kalman filter and in the spatial domain with a bilateral filter. Finally, the denoised frames from Kalman filtering and bilateral filtering are combined by weighted averaging to obtain the final result. Experimental results show that the performance of the proposed method is competitive with state-of-the-art video denoising algorithms in terms of both peak signal-to-noise ratio and structural similarity.


Introduction
Recently, with the rapid development of digital imaging technology, digital imaging devices have been widely applied in many fields, including computational photography, security monitoring, robot navigation, and military reconnaissance. However, video signals are often contaminated by various kinds of noise during acquisition and transmission, such as optical noise, component noise, sensor noise, and circuit noise. The noise in video signals not only damages the original information and degrades visual quality, but also reduces the effectiveness of further coding or processing such as feature extraction, object detection, motion tracking, and pattern recognition. Noise reduction is therefore necessary for contaminated video sequences.
Many video denoising methods have been proposed in the past decade, most of which operate in the spatial domain, the temporal domain, or a combination of both [1][2][3][4][5][6]. Methods in the spatial domain often produce limited results because they do not take advantage of the spatiotemporal correlations between neighboring frames. Methods in the temporal domain consider the correlations between neighboring frames, but they are only appropriate for still video; when objects move, the results show artifacts or smearing. Combining the spatial domain with the temporal domain can produce impressive results, but these methods generally require a huge amount of computation. With the emergence of multiresolution tools such as the wavelet transform [7,8], video denoising methods operating in the transform domain have been proposed continually [9][10][11][12]. Transform-domain techniques in general, and wavelet-based video denoising methods in particular, have been shown to outperform these spatiotemporal video denoising methods. Moreover, methods that combine the spatiotemporal domain and the transform domain have also been proposed [13][14][15][16], and they can produce very good denoising results. However, this kind of method also requires a huge amount of computation.
Although video denoising technology has made great progress, most of these methods are unable to obtain ideal results for heavily noisy video sequences captured in low light, which are common in many fields, especially security monitoring. In this field, the monitoring devices are generally fixed in place, so the captured video sequences have a fixed background. In practical applications, the characteristics of both still and moving objects in the video sequences often need to be seen clearly. This requirement can be met easily in the daytime. At night, however, because of the low light condition, captured video sequences are badly contaminated by noise. To some extent, existing video denoising methods can reduce the noise of contaminated video sequences, but this is far from enough to meet the requirement.
In this paper, a novel video denoising method based on a spatiotemporal Kalman-bilateral mixture model is proposed. Firstly, we perform an appropriate average filtering on the current noisy frame to reduce the influence of noise, which we call prefiltering. This step does not contribute directly to the final denoising result; it only prepares the frame for motion estimation. Then, taking advantage of the strong spatiotemporal correlations between neighboring frames, block-matching based motion estimation is performed by comparing the current prefiltered frame with previously denoised frames. Based on the motion estimation results, the current noisy frame is processed in the temporal domain by using a Kalman filter [17]. It is noteworthy that different blocks of the noisy frame receive different filtering strengths according to their block-matching results. In the Kalman filtering, motion blocks receive quite weak filtering to keep their motion characteristics, while still blocks receive strong filtering to reduce the noise. The current noisy frame is also processed in the spatial domain by using a bilateral filter [18], which aims at reducing the noise globally. Finally, by weighting the two denoised frames from Kalman filtering and bilateral filtering, we obtain a satisfactory result, in which the still region comes largely from the Kalman filtered result and the motion region comes almost entirely from the bilateral filtered result. Experimental results show that the performance of our proposed method is competitive with current video denoising methods.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 describes our proposed spatiotemporal Kalman-bilateral mixture model. Section 4 provides quantitative quality evaluations of the denoising results. In Section 5, experiments are implemented and the experimental results are shown. Finally, Section 6 concludes this article.

Related Work
Buades et al. [2] first proposed the Non-Local Means (NLM) method. This method replaces a noisy pixel by the weighted average of pixels with similar surrounding neighborhoods and can produce quite satisfactory denoising results. However, its high computational complexity makes it impractical. Later, Karnati et al. [3] improved the NLM algorithm. They replaced the window similarity by a modified multiresolution based approach that needs far fewer comparisons than comparing all pixels. In their method, mean values of the variable sized windows were computed efficiently using the summed image (SI) concept, which requires only 3 additions. As a result, the computational speed was increased by 80 times. Based on the NLM algorithm, many methods were proposed for video denoising [4][5][6][13]. Mahmoudi and Sapiro [4] introduced filters that eliminate unrelated neighborhoods from the weighted average to accelerate the original NLM algorithm and applied it to video denoising. Yin et al. [5] proposed a novel scheme that uses the mean absolute difference (MAD) of the current pixel block and the candidate blocks in both the spatial and temporal domains as a preselecting criterion. Rather than a single pixel, this scheme reconstructs a block with a different number of pixels according to the statistical properties of the current pixel block, which dramatically lowers the computational burden while keeping good denoising performance. Dabov et al. [13] proposed an effective video denoising method based on highly sparse signal representation in a local 3D transform domain. They developed a two-step video denoising algorithm in which predictive search block-matching is combined with collaborative hard-thresholding in the first step and with collaborative Wiener filtering in the second step, achieving state-of-the-art denoising results. Moreover, Guo et al. [19] proposed a recursive temporal denoising filter named the multihypothesis motion compensated filter (MHMCF). This filter fully exploits temporal correlation and uses a number of reference frames to estimate the current pixel. As a purely temporal filter, it preserves spatial details well and achieves satisfactory visual quality.
In addition, there are many video denoising methods that operate in the transform domain [9][10][11][12][14][15][16]. Zlokolica et al. [9] introduced new wavelet-based motion reliability measures and performed motion estimation and adaptive recursive temporal filtering in a closed loop, followed by an intra-frame spatially adaptive filter. Mahbubur Rahman et al. [10] proposed a joint probability density function to model the video wavelet coefficients of any two neighboring frames and then applied this statistical model to denoising. Jovanov et al. [11] reused motion estimation resources from the video coding module for video denoising. They proposed a novel motion field filtering step and a novel recursive temporal filter with an appropriately defined reliability of the estimated motion field. Luisier et al. [12] proposed an efficient orthonormal wavelet-domain video denoising algorithm. This method takes full advantage of the strong spatiotemporal correlations between neighboring frames and can outperform most state-of-the-art wavelet-based techniques. Yu et al. [14] integrated both spatial filtering and recursive temporal filtering into the 3-D wavelet domain and effectively exploited both the spatial and temporal redundancies. Varghese and Wang [15] applied motion estimation to enhance the correlations between temporally neighboring wavelet coefficients and proposed a spatiotemporal Gaussian scale mixture model for natural video signals. Maggioni et al. [16] separately exploited the temporal and nonlocal correlation of the video and constructed 3-D spatiotemporal volumes by tracking blocks along trajectories defined by the motion vectors. Other video denoising methods, such as the method based on low-rank matrix completion [20], were also proposed recently and achieved good results.
However, most existing video denoising methods cannot achieve satisfactory results when the video sequences are badly contaminated in low light. In this paper, we propose a spatiotemporal Kalman-bilateral mixture model, which can reduce the noise in heavily noisy video sequences captured in low light.

Proposed Spatiotemporal Kalman-Bilateral Mixture Model
Figure 1 illustrates the diagram of our proposed spatiotemporal Kalman-bilateral mixture (ST-KBM) model. The denoising of the current noisy frame involves not only the frame itself, but also a series of past denoised frames. Firstly, prefiltering is performed on the current noisy frame. The purpose of this operation is to reduce the influence of noise as much as possible and to prepare for the subsequent motion estimation. Motion estimation is performed between the prefiltered current frame and past denoised frames, and the estimation results are used to guide the Kalman filtering of the current noisy frame. In addition, bilateral filtering is also performed on the current noisy frame. After this processing, there are two denoised frames: one comes from Kalman filtering and the other comes from bilateral filtering. Finally, by weighting the two denoised frames, we obtain a satisfactory result.

Motion Estimation.
Motion estimation itself is a complex problem. Generally, motion estimation is performed directly on the video frames. When the video has relatively little noise, the estimation results are accurate. However, as the noise increases, the precision of motion estimation becomes quite low, and under heavy noise precise motion estimation becomes difficult. So, before motion estimation, we perform average filtering on the current noisy frame to suppress the influence of noise as much as possible, which is called prefiltering. After the prefiltering step, the heavy noise is suppressed by a large margin while the motion in the video is well preserved. Although the frame becomes quite blurred, motion estimation is not affected. Note that the prefiltering procedure is used only for motion estimation and does not contribute to the denoising of the image signal. Then, taking advantage of the strong correlations between adjacent frames, motion estimation based on block-matching is performed by comparing the current prefiltered frame with past denoised frames. Block-matching (BM) [21] is a matching approach that has been extensively used for motion estimation in video compression. Here, we use it to determine whether motion exists in a block.
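The prefiltering step can be illustrated with a simple mean filter. The following is a minimal sketch, assuming a grayscale frame stored as a list of rows and a square window that is truncated at the frame borders; the name `mean_filter` and the `radius` parameter are hypothetical, since the window size used for prefiltering is not stated here.

```python
def mean_filter(frame, radius=1):
    """Average each pixel over a (2*radius+1) x (2*radius+1) window.

    The window is truncated at the frame borders (an assumption of this
    sketch), so border pixels are averaged over fewer samples.
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += frame[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out
```

A larger `radius` suppresses more noise at the cost of more blur, which is acceptable here because the prefiltered frame is used only for motion estimation.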
An illustrative example of block-matching is given in Figure 2. Firstly, the current prefiltered frame and the past denoised frames are divided into a number of blocks of fixed size $N \times N$. Then, we compare each block in the current prefiltered frame with the blocks at the same position in the past denoised frames and use the $\ell_2$-distance as the measure of whether motion exists in the block, which is called the motion measure. The block distance can be calculated as
$$d_i^{\,j} = \left\| V(B^{j}_{\mathrm{current}}) - V(B^{j}_{\mathrm{past},i}) \right\|_2,$$
where $\|\cdot\|_2$ denotes the $\ell_2$-norm, and $V(B^{j}_{\mathrm{current}})$ and $V(B^{j}_{\mathrm{past},i})$ are the intensity gray level vectors of the $j$th block in the current prefiltered frame and in the $i$th past denoised frame, respectively. After calculating the block distances between the current prefiltered frame and each of the $M$ past denoised frames, the final motion measure of the $j$th block in the current prefiltered frame is obtained by averaging them as follows:
$$\bar{d}^{\,j} = \frac{1}{M} \sum_{i=1}^{M} d_i^{\,j}.$$
The averaged block distance measures the extent to which motion exists in the block of the current prefiltered frame. The larger the value is, the greater the likelihood of motion is. Therefore, by calculating the averaged block distances of all blocks in the current prefiltered frame, we obtain a global motion estimate.
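The motion measure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: frames are grayscale lists of rows, blocks are non-overlapping and co-located, and the names `block_distance` and `motion_measure` are hypothetical.

```python
import math


def block_distance(cur, past, top, left, n):
    """l2 distance between co-located n x n blocks of two frames."""
    total = 0.0
    for y in range(top, top + n):
        for x in range(left, left + n):
            diff = cur[y][x] - past[y][x]
            total += diff * diff
    return math.sqrt(total)


def motion_measure(cur, past_frames, n):
    """Map (block_row, block_col) to the block distance averaged over
    all past frames. Partial blocks at the right/bottom edges are skipped."""
    h, w = len(cur), len(cur[0])
    measure = {}
    for top in range(0, h - n + 1, n):
        for left in range(0, w - n + 1, n):
            dists = [block_distance(cur, p, top, left, n) for p in past_frames]
            measure[(top // n, left // n)] = sum(dists) / len(dists)
    return measure
```

Here `cur` would be the prefiltered current frame and `past_frames` the list of previously denoised frames; blocks with a large value are treated as likely motion blocks.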

Motion Estimation Based Kalman Filtering in Temporal Domain.
The discrete Kalman filter [17] is a set of mathematical equations that provides an efficient computational solution of the least squares method. It can estimate the state of a dynamic system from a series of incomplete noisy measurements by using a form of feedback control. This procedure consists of two consecutive stages: prediction and updating. The prediction stage projects forward the current state and error covariance estimates to obtain an a priori estimate for the next time step. The updating stage incorporates a new measurement into the a priori estimate to obtain an improved a posteriori estimate.
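The two stages can be sketched for a single pixel as follows. This is a minimal illustration, assuming a scalar state per pixel, an identity state transition, no control input, and an identity measurement model; the names `kalman_step` and `kalman_filter_frame`, and the per-pixel process noise map `q_map`, are hypothetical.

```python
def kalman_step(x_prev, p_prev, z, q, r):
    """One prediction/update cycle for a scalar state.

    x_prev, p_prev: previous a posteriori estimate and its variance
    z: new noisy measurement
    q: process noise variance, r: measurement noise variance (r > 0)
    """
    # Prediction: identity transition and no control input (assumed).
    x_prior = x_prev
    p_prior = p_prev + q
    # Update: blend prediction and measurement with the Kalman gain.
    gain = p_prior / (p_prior + r)
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post


def kalman_filter_frame(prev_est, prev_var, noisy, q_map, r):
    """Apply kalman_step independently to every pixel of a frame."""
    h, w = len(noisy), len(noisy[0])
    est = [[0.0] * w for _ in range(h)]
    var = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            est[y][x], var[y][x] = kalman_step(
                prev_est[y][x], prev_var[y][x], noisy[y][x], q_map[y][x], r
            )
    return est, var
```

A large process noise value drives the gain toward 1, so the estimate follows the new measurement, while a small value keeps the estimate close to the previous one and averages out noise over time.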
The prediction equations can be presented as follows:
$$X_k^- = A X_{k-1}^+ + B u_{k-1}, \qquad P_k^- = A P_{k-1}^+ A^T + Q_{k-1}.$$
In the above equations, the superscripts "−" and "+" denote "before" and "after" each measurement, respectively. $X_{k-1}^+$ and $P_{k-1}^+$ represent the estimated state matrix and state covariance matrix of the last state, respectively. $X_k^-$ and $P_k^-$ represent the a priori estimates of the state matrix and state covariance matrix for the current state. $A$ represents the state transition matrix, which determines the relationship between the present state and the previous one. Matrix $B$ relates the control input $u_{k-1}$ to the current state. $Q_{k-1}$ represents the covariance matrix of the process noise.
In our case, we try to estimate the current video frame based on the last one. So, the state matrix in the above equations is just the video frame matrix. In video sequences, there is no control input, which means $u_{k-1} = 0$. For the a priori estimate of the current state, we assume it is the same as the last state, that is, $A$ is the unit matrix. So, we obtain the following equations:
$$X_k^- = X_{k-1}^+, \qquad P_k^- = P_{k-1}^+ + Q_{k-1}.$$
The process noise in video sequences results only from motion. So, for any pixel $(x, y)$ in the $j$th block of the current noisy frame, we define $Q_{k-1}(x, y) = \bar{d}^{\,j}$, the motion measure of that block, which keeps the covariance of the motion region larger than that of the still region. The updating equations can be presented as follows:
$$Kg_k = P_k^- H^T \left( H P_k^- H^T + R \right)^{-1}, \qquad X_k^+ = X_k^- + Kg_k \left( Z_k - H X_k^- \right), \qquad P_k^+ = \left( I - Kg_k H \right) P_k^-.$$
The first task during the updating stage is to compute the Kalman gain, $Kg_k$, which is the blending factor that minimizes the a posteriori error covariance. In the above equations, $X_k^-$ and $P_k^-$ are the a priori estimates calculated in the prediction stage. Matrix $H$ describes the relationship between the measurement vector, $Z_k$, and the a posteriori state vector, $X_k^+$. $R$ is the covariance matrix of the measurement noise. $P_k^+$ is the a posteriori estimate of the state covariance matrix for the current state.
In our case, $Z_k$ and $X_k^+$ represent the current noisy and denoised frames, respectively.
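$H$ is the unit matrix. The measurement noise just represents the noise in the video sequences. So, we obtain the following equations:
$$Kg_k = P_k^- \left( P_k^- + R \right)^{-1}, \qquad X_k^+ = X_k^- + Kg_k \left( Z_k - X_k^- \right), \qquad P_k^+ = \left( I - Kg_k \right) P_k^-.$$
After Kalman filtering, we obtain a denoised frame in which the still region is denoised quite well. However, the moving region still contains much noise because the Kalman filter leaves the information of this region almost intact. Therefore, for the motion region, we use the bilateral filter to reduce its noise as much as possible.

Bilateral Filtering in Spatial Domain.
The bilateral filter was introduced by Tomasi and Manduchi [18] as a noniterative means of smoothing images while retaining edge detail. It involves a weighted convolution in which the weight for each pixel depends not only on its distance from the center pixel, but also on its relative intensity. So, for any pixel $(x, y)$ in the frame, its filtered intensity value $\hat{f}(x, y)$ can be calculated as follows:
$$\hat{f}(x, y) = \frac{\sum_{(u, v) \in \Omega_{x,y}} w_d(u, v)\, w_r(u, v)\, f(u, v)}{\sum_{(u, v) \in \Omega_{x,y}} w_d(u, v)\, w_r(u, v)}, \qquad w_d(u, v) = \exp\left( -\frac{(u - x)^2 + (v - y)^2}{2\sigma_d^2} \right), \qquad w_r(u, v) = \exp\left( -\frac{\left( f(u, v) - f(x, y) \right)^2}{2\sigma_r^2} \right).$$
In the above equation, $\Omega_{x,y}$ represents the neighbourhood centered on the pixel $(x, y)$. $w_d$ is the weighting coefficient that depends on the spatial distance from the center pixel, and $w_r$ is the weighting coefficient that depends on the intensity difference from the center pixel. $\sigma_d$ and $\sigma_r$ are the variation coefficients of the two weighting coefficients, which control their degree of attenuation.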
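The bilateral filter described above can be sketched as follows. This is a minimal illustration of the standard Gaussian form of the filter, assuming a grayscale frame stored as a list of rows and a square neighbourhood truncated at the borders; the parameter names `radius`, `sigma_d`, and `sigma_r` are hypothetical and their values are not taken from the paper.

```python
import math


def bilateral_filter(frame, radius, sigma_d, sigma_r):
    """Weighted average over a square neighbourhood, where each weight is
    the product of a spatial Gaussian and an intensity (range) Gaussian."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center = frame[y][x]
            num, den = 0.0, 0.0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    dist2 = (yy - y) ** 2 + (xx - x) ** 2
                    diff2 = (frame[yy][xx] - center) ** 2
                    weight = math.exp(-dist2 / (2.0 * sigma_d ** 2)) * math.exp(
                        -diff2 / (2.0 * sigma_r ** 2)
                    )
                    num += weight * frame[yy][xx]
                    den += weight
            # den >= 1 because the center pixel always has weight 1.
            out[y][x] = num / den
    return out
```

With a small `sigma_r`, pixels across a strong edge receive almost no weight, which is why edges survive the smoothing.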
Reducing the noise only in the moving region of the Kalman-filtered frame is complicated. So, we apply the bilateral filter to the whole current noisy frame. In this case, both the still region and the moving region are denoised. Then, by weighting the two denoised frames from Kalman filtering and bilateral filtering, an integrated denoised frame can be obtained, in which the still region comes from Kalman filtering and the moving region comes from bilateral filtering.

Weighted Average.
After Kalman filtering and bilateral filtering, we have two denoised frames. One is from Kalman filtering, in which the still regions are well denoised but the motion regions retain their noise almost intact. The other is from bilateral filtering, in which the motion regions are denoised to some extent. So, we integrate the two denoised frames by weighting them based on the motion estimation results. The weight is based on a Gaussian distribution, and for any pixel $(x, y)$ in the $j$th block, its weight value, $w(x, y)$, can be calculated as follows:
$$w(x, y) = \exp\left( -\frac{(\bar{d}^{\,j})^2}{2\sigma_w^2} \right),$$
where $\bar{d}^{\,j}$ is the motion measure of the $j$th block. Based on the above equation, the motion and still regions can be further distinguished effectively. As shown in Figure 3, the larger the motion measure is, the smaller the weight is. $\sigma_w$ is used to control the degree of attenuation.
Then, the weighted denoised frame can be calculated as follows:
$$F = W \odot F_{\mathrm{kalman}} + (1 - W) \odot F_{\mathrm{bilateral}}.$$
Here, $W$ represents the weight matrix whose entries are the weights $w(x, y)$ defined above, and $\odot$ denotes element-wise multiplication. $F_{\mathrm{kalman}}$ and $F_{\mathrm{bilateral}}$ represent the denoised frame matrices from Kalman filtering and bilateral filtering, respectively. $F$ is the desired weighted frame matrix. After the weighted average, both the motion region and the still region of the weighted frame have been denoised, as shown in Figure 4.
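The weighting step can be sketched as follows. This is an illustration only: the exact Gaussian form of the weight is an assumption (a zero-mean Gaussian of the block motion measure with attenuation parameter `sigma_w`), the frame dimensions are assumed to be multiples of the block size `n`, and the names `gaussian_weight` and `fuse_frames` are hypothetical.

```python
import math


def gaussian_weight(d_bar, sigma_w):
    """Weight in [0, 1] that decreases as the motion measure grows
    (assumed Gaussian form)."""
    return math.exp(-(d_bar ** 2) / (2.0 * sigma_w ** 2))


def fuse_frames(kalman, bilateral, measure, n, sigma_w):
    """Blend two denoised frames pixel by pixel.

    measure maps (block_row, block_col) to the motion measure of each
    n x n block; every pixel uses the weight of the block it belongs to.
    """
    h, w = len(kalman), len(kalman[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            wt = gaussian_weight(measure[(y // n, x // n)], sigma_w)
            out[y][x] = wt * kalman[y][x] + (1.0 - wt) * bilateral[y][x]
    return out
```

Still blocks (motion measure near zero) get a weight near 1 and take the Kalman result, while strongly moving blocks get a weight near 0 and take the bilateral result.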

Validation Criteria
To provide quantitative quality evaluations of the denoising results, two objective criteria, namely the PSNR and the SSIM [22][23][24], are employed. PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{L^2}{\mathrm{MSE}} \right),$$
where $L$ is the dynamic range of the image (for 8 bits/pixel images, $L = 255$) and MSE is the mean squared error between the original and distorted images. SSIM is first calculated within local windows using
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
where $x$ and $y$ are the image patches extracted from the local window from the original and distorted images, respectively. $\mu$, $\sigma^2$, and $\sigma_{xy}$ are the mean, variance, and cross-correlation computed within the local window, respectively, and $C_1$ and $C_2$ are small positive constants that stabilize the division. The overall SSIM score of a video frame is computed as the average of the local SSIM scores. PSNR is the most widely used quality measure in the literature, but has been criticized for not correlating well with human visual perception [25]. SSIM is believed to be a better indicator of perceived image quality [25]. It also supplies a quality map that indicates the variations of image quality over space. The final PSNR and SSIM results for a denoised video sequence are computed as the frame average over the full sequence.
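The PSNR definition can be sketched as follows. This is a minimal illustration, assuming grayscale frames of equal size stored as lists of rows; the names `mse` and `psnr` and the `peak` parameter are hypothetical, and returning infinity for identical frames is a convention of this sketch.

```python
import math


def mse(ref, test):
    """Mean squared error between two frames of equal size."""
    h, w = len(ref), len(ref[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            diff = ref[y][x] - test[y][x]
            total += diff * diff
    return total / (h * w)


def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak is the dynamic range L."""
    err = mse(ref, test)
    if err == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / err)
```

The sequence-level score would then be the average of `psnr` over all frames, as described above.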

Experiments and Results
In order to evaluate the performance of our proposed ST-KBM algorithm, we compare it with some state-of-the-art video denoising algorithms, such as ST-GSM [15] and VBM3D [13]. The original codes of these two algorithms can be downloaded online [26,27].
In the experiments, four video sequences with fixed backgrounds are selected from the publicly available video sequences [28]. The noisy video sequences are simulated by adding independent white Gaussian noise of given variance $\sigma^2$ to each frame. Table 1 shows the PSNR and SSIM results of the proposed ST-KBM, ST-GSM, and VBM3D for the four video sequences at five noise levels. When the noise level is relatively low, the proposed ST-KBM algorithm works well but still falls behind ST-GSM and VBM3D. However, when the noise level is high, it performs better than ST-GSM and VBM3D for most of the test sequences. In particular, the SSIM of ST-KBM is much better than that of the other two algorithms.
In Figure 5, we show the PSNR and SSIM from frame 200 to 300 of the test video sequences corrupted by noise with $\sigma = 100$. In terms of PSNR, our proposed ST-KBM algorithm performs slightly better than ST-GSM and VBM3D. In terms of SSIM, however, it clearly outperforms ST-GSM and VBM3D, which means that the video sequences denoised by the ST-KBM algorithm have better visual quality. Figure 6 demonstrates the visual effects of the three video denoising algorithms. In particular, we show frame 105 extracted from the Salesman sequence, together with a noisy version of the same frame and the denoised frames obtained by the three video denoising algorithms. It can be seen that our proposed ST-KBM algorithm is clearly effective at suppressing background noise while maintaining the structural information of the scene. This is further verified by examining the SSIM quality maps of the corresponding frames. The results show that our proposed ST-KBM algorithm is highly effective for heavily noisy video sequences and can achieve state-of-the-art denoising performance.
Moreover, to further demonstrate the practicability of the proposed ST-KBM algorithm, we carry out practical experiments, as shown in Figure 7. The natural noisy video sequence is captured in very low light, and the real information is badly damaged. It is worth mentioning that the noise in the sequence is mixed, including white Gaussian noise, Poisson noise, and other kinds of noise, which makes noise reduction more difficult. Objects in the ST-KBM denoised sequence, such as the resolution charts and color charts, have clearer shapes than those in the ST-GSM and VBM3D denoised sequences. The denoising results show that our proposed ST-KBM algorithm is also quite effective for mixed noise and can produce better visual quality than ST-GSM and VBM3D.

Conclusion
In this paper, we have presented an ST-KBM model for heavily noisy video signals that have a fixed background and applied it to the restoration of both video sequences corrupted by simulated additive white Gaussian noise and a natural noisy video sequence captured in low light. Thanks to the prefiltering operation, motion estimation by comparing the current prefiltered frame with previously denoised frames is performed effectively. Then, a Kalman filter and a bilateral filter are each applied to the current noisy frame. Finally, by weighting the denoised frames from Kalman filtering and bilateral filtering, a satisfactory result is obtained. The experimental comparisons with state-of-the-art algorithms show that our proposed ST-KBM is competitive for heavily noisy video sequences that have a fixed background in terms of both subjective and objective evaluations.