Low-Complexity Saliency Detection Algorithm for Fast Perceptual Video Coding

A low-complexity saliency detection algorithm for perceptual video coding is proposed; low-level encoding information is adopted as the characteristics of visual perception analysis. Firstly, this algorithm employs motion vector (MV) to extract temporal saliency region through fast MV noise filtering and translational MV checking procedure. Secondly, spatial saliency region is detected based on optimal prediction mode distributions in I-frame and P-frame. Then, it combines the spatiotemporal saliency detection results to define the video region of interest (VROI). The simulation results validate that the proposed algorithm can avoid a large amount of computation work in the visual perception characteristics analysis processing compared with other existing algorithms; it also has better performance in saliency detection for videos and can realize fast saliency detection. It can be used as a part of the video standard codec at medium-to-low bit-rates or combined with other algorithms in fast video coding.


Introduction
With the rapid developments of multimedia information processing and communication technology, video encoding has become the basic core technology of digital television, video conferencing, mobile media, 3D video coding, and so forth. During the past decades, in order to obtain video codec with low complexity, high quality, and high compression ratio, various technologies have been proposed for fast video coding [1].
Studies have shown that the human visual system (HVS) is sensitive to video scene content and assigns different visual importance to different regions [2,3]. Researchers have exploited visual attention in various multimedia processing applications such as image retargeting and video coding, and saliency detection models for perceptual video coding are an active research topic. One of the key processing steps is to obtain the region of interest (ROI) in accordance with visual perception characteristics promptly and effectively, using low-complexity calculations.
Saliency detection algorithms are widely used to extract the ROI in videos for various multimedia processing applications [4][5][6][7][8]. Pixel-precision moving zone detection can detect the moving foreground area, but its computational complexity makes it unsuitable for real-time encoding. Liu et al. [6] proposed a moving zone detection algorithm, but it relies mainly on motion vector information, so it cannot effectively detect moving objects in regions with global motion. Yuming et al. [7] proposed a video saliency detection algorithm based on feature contrast, but its computational efficiency needs further improvement.
The saliency detection algorithms mentioned above share a common problem: they are both time-consuming and computationally expensive. They give little consideration to the impact that the additional computation for visual perception analysis has on real-time coding performance. A complex saliency detection algorithm increases the computational burden of the video encoder, which hinders the use of video coding standards in real-time multimedia communication applications.
In this paper, a fast saliency detection algorithm based on low-level encoding information and the HVS is proposed. To simplify the visual perception analysis process, the algorithm correlates the encoding information in the video bitstream with visual perception characteristics. The spatial and temporal saliency detection is carried out by means of MV. The prediction modes and other auxiliary coding information save computing time in feature extraction for saliency detection. As almost no additional computation is added to the video codec, the computational complexity of saliency detection is lower than that of other existing algorithms. The saliency detection results compare favorably with those of the compared algorithms, so the proposed algorithm balances saliency detection accuracy against computational complexity.
This paper is organized as follows. Section 2 gives an overview of the proposed framework and describes each of its subsystems in detail. Section 3 evaluates the proposed algorithm, and Section 4 presents the conclusions and outlines further work.

The Proposed Saliency Detection Algorithm
Motion is a highly salient feature which grabs one's attention and keeps it locked on important features and objects. Interest in motion perception has a long history and it can be considered as a relatively well established discipline [9]. Motion perception is one of the most important visual processing mechanisms. The visual information related to temporal motion would generate stronger response in HVS. HVS always pays more attention to the objects with smooth movement (as shown in Figure 1). Spatial contrast is the most basic visual processing mechanism in HVS; it is the prerequisite for HVS perception of spatial shape such as texture and object (as shown in Figure 2).
The HVS responds more strongly to temporal motion information than to spatial visual information [10]. In the proposed algorithm, temporal saliency detection is therefore performed first, and spatial saliency detection is then applied to refine the results of the visual perception characteristics analysis.

Temporal Saliency Analysis and Detection.
Moving objects generally have larger MVs in a video frame. Figure 3 shows that the coding regions with larger MVs coincide with the ROI (such as head, face, shoulders, and arms), whereas the coding regions with small or zero MVs usually lie in the static background, which attracts little attention from the HVS. In short, because MV and visual attention are highly consistent, MV can be regarded as the temporal characteristic of visual perception. Many existing algorithms extract motion features for motion saliency detection, but they are only effective for videos with a static background. In the ideal case, a moving foreground object generates nonzero MVs, and a static background, in which there is no relative movement, produces zero MVs. In reality, however, nonzero MV noise is generated randomly because of external changes in illumination and internal changes in video encoding parameters (such as quantization steps, motion search scheme, and rate distortion optimization algorithm). An MV detection mechanism is therefore needed to remove the interference of MV noise.
Besides the MV noise, the translational MV interference from the background must be considered in temporal saliency detection. As shown in Figure 4, in the Foreman sequence the foreground object is the man's head and shoulders. However, because the camera rotates to the right, buildings on the right-hand side move into the scene and create globally distributed MVs. In the same way, the horizontal displacement of the camera in the Bus sequence causes obvious MVs along the roadside and the parked car, although the HVS is only interested in the moving bus on the road. Analogously, in the Stefan sequence the most interesting object is the tennis player, who generates many motion vectors, but because the camera moves, a large number of MVs also appear on the bleachers. Under these conditions with horizontal motion, the MV information does not match the visual attention, and motion detection based merely on the size of the MV leads to errors in the temporal saliency judgment. An MV detection mechanism is therefore also needed to remove the interference of the erroneous MVs generated by horizontal camera movement.
To improve the temporal detection accuracy, MV noise filtering and attenuation of the translational MV interference should be added. At the same time, the complexity of the processing procedure should be strictly controlled; otherwise it will degrade the real-time performance of the saliency detection algorithm.
Video coding removes redundant information in the spatial or temporal domain in order to represent a large amount of information with a small number of bits. In a block-based video coding framework, the coded object is usually divided into several macroblocks or subblocks; the macroblocks or subblocks that belong to the same object tend to have similar motion vectors or predictive coding modes and a similar structure [11]. Research shows that moving objects in a video sequence exhibit motion continuity and motion integrity. Motion continuity is reflected in the strong correlation between a macroblock in the current frame and the macroblock at the corresponding position in the previous frame. Motion integrity is reflected in the great structural similarity between adjacent macroblocks in the same frame.
Therefore, based on the video sequence motion continuity and high temporal correlation, the MV of the macro block (MB) with the same position in the previous frame can provide very important prior information for the current MV of encoding MB.
In Figure 5, $MB_t(x, y)$ represents the current coded MB, $t$ represents the current encoded video frame, and $(x, y)$ is the position coordinates. $\vec{V}_s$ is the MV of $MB_t(x, y)$. $MB_{t-1}(x, y)$ represents the MB in the previous frame with the same position coordinates as the current coded MB, and $\vec{V}_p$ is the MV of $MB_{t-1}(x, y)$. Statistics gathered over a large number of test sequences with the H.264/AVC reference software (JM18.7) show that $\vec{V}_s$ and $\vec{V}_p$ are highly correlated.
The Akiyo and Foreman sequences are taken as representatives of video sequences with gentle motion and active motion. With the quantization parameter (QP) set to 28 and 32, respectively, and the full search prediction method, the conditional probability $P(\vec{V}_s \mid \vec{V}_p)$ of $\vec{V}_s$ given $\vec{V}_p$ is measured. The statistical results are shown in Table 1.
The statistics in Table 1 show that if $\vec{V}_p = 0$, the probability of $\vec{V}_s = 0$ is more than 60%. If $\vec{V}_p \neq 0$, the probability that $\vec{V}_s \neq 0$ and lies within $\vec{V}_p (1 \pm 10\%)$ is nearly 80% for the gentle motion sequence and more than 65% for the active motion sequence. If $\vec{V}_p \neq 0$, the probability of $\vec{V}_s = 0$ is less than 20%.
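The kind of statistic reported in Table 1 can be illustrated with a short sketch. It assumes that one MV per macroblock is available for two consecutive frames as NumPy arrays; the function name `colocated_mv_stats`, the array layout, and the reading of "within (1 ± 10%)" as a tolerance on the MV magnitude are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def colocated_mv_stats(mv_prev, mv_cur, tol=0.10):
    """Estimate how the MV of a macroblock relates to the co-located MV
    in the previous frame.

    mv_prev, mv_cur: arrays of shape (rows, cols, 2), one MV per macroblock
    (hypothetical layout). tol: relative magnitude tolerance (assumed 10%).
    """
    mv_prev = np.asarray(mv_prev, dtype=float)
    mv_cur = np.asarray(mv_cur, dtype=float)
    mag_prev = np.linalg.norm(mv_prev, axis=-1)
    mag_cur = np.linalg.norm(mv_cur, axis=-1)

    prev_zero = mag_prev == 0
    prev_nonzero = ~prev_zero
    cur_zero = mag_cur == 0
    # Current MV is nonzero and its magnitude is within +/- tol of the previous one.
    similar = prev_nonzero & ~cur_zero & (np.abs(mag_cur - mag_prev) <= tol * mag_prev)

    def ratio(hit, cond):
        # Fraction of macroblocks satisfying `hit` among those satisfying `cond`.
        n = np.count_nonzero(cond)
        return np.count_nonzero(hit & cond) / n if n else float("nan")

    return {
        "p_cur_zero_given_prev_zero": ratio(cur_zero, prev_zero),
        "p_cur_similar_given_prev_nonzero": ratio(similar, prev_nonzero),
        "p_cur_zero_given_prev_nonzero": ratio(cur_zero, prev_nonzero),
    }
```

Running such a function over all P-frames of a sequence and averaging the ratios would give figures comparable in kind to those quoted above, although the exact counting rules used for Table 1 are not stated in the text.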
The statistics show that $\vec{V}_p$ can be taken as an important basis for determining whether $\vec{V}_s$ is MV noise. Because a moving object in a video sequence has strong motion continuity and correlation, and in order to reduce the misjudgment rate, the basic principle of MV noise filtering proposed in this paper is based on the average MV of a reference region in the previous frame, as follows.
If $\vec{V}_s$ is generated in the current encoding MB $MB_t(x, y)$, there is a high probability that an MV with similar direction and size exists in the reference region $C_{rr}$ at the corresponding position in the previous frame. If there is no MV in $C_{rr}$, $\vec{V}_s$ should be treated as MV noise and filtered out. How to determine the reference region $C_{rr}$ is therefore the key factor affecting the MV filtering results.
(2) Definition of the Reference Region $C_{rr}$. As shown in Figure 6, $MB_{t-1}(x, y)$ is the MB that has the same position coordinates as $MB_t(x, y)$ in the previous frame. The rectangular area surrounded by dashed lines is defined as the reference region $C_{rr}$.
According to the direction of $\vec{V}_s$ (horizontal motion, vertical motion, and oblique motion), $C_{rr}$ is determined as follows.
(i) $\vec{V}_s$ with Horizontal Motion. As shown in Figure 7, take horizontal motion of $\vec{V}_s$ towards the right as an example. In the previous reference frame $t-1$, find MB-1, which has the same position coordinates $(x, y)$ as the current encoding MB.
Firstly, take MB-1 as the starting point, then move horizontally by macroblocks in the direction opposite to $\vec{V}_s$ to reach MB-2 at $(x, y - i)$.
Secondly, centered at MB-2, extend vertically by macroblocks both upwards and downwards to obtain two further macroblocks.
Table 1: Probability of MV with the same position.
If $\vec{V}_s$ is a horizontal motion towards the left, use the same method to determine the rectangular reference region B; $C_{rr}$ is designated and surrounded by four macroblocks, MB-5, MB-6, MB-7, and MB-8.
In the above description, the position coordinates of MB-3 to MB-8 are given in Table 2.
In Table 2, $i$ is defined by formula (1), in which $|\vec{V}_x|$ denotes the magnitude of the horizontal component of $\vec{V}_s$ and $w$ denotes the width of the current coding block.
(ii) $\vec{V}_s$ with Vertical Motion. If $\vec{V}_s$ is a vertical downward motion, the processing steps to determine reference region C are as follows.
Firstly, take MB-1 as the starting point, then move vertically by macroblocks in the direction opposite to $\vec{V}_s$ to reach MB-2 at $(x - j, y)$.
Last, determine the rectangular reference region C; $C_{rr}$ is designated and surrounded by four macroblocks, MB-3, MB-4, MB-5, and MB-6 (as shown in Figure 8).
If $\vec{V}_s$ is a vertical upward motion, use the same method to determine the rectangular reference region D; $C_{rr}$ is designated and surrounded by four macroblocks, MB-5, MB-6, MB-7, and MB-8.
In the above description, the position coordinates of MB-3 to MB-8 are given in Table 3.

In Table 3, $j$ is defined by formula (2), where $h$ denotes the height of the current coding block.
(iii) $\vec{V}_s$ with Oblique Motion. If $\vec{V}_s$ is an oblique motion, determine the reference regions E, F, G, and H in the same way according to the motion direction of $\vec{V}_s$. $C_{rr}$ is given in Figure 9.
The position coordinates of MB-1 to MB-9 are given in Table 4.
In Table 4, $i$ and $j$ are defined by formulas (1) and (2). In the proposed algorithm, the reference region $C_{rr}$ is not fixed; its area and position change adaptively according to the size and direction of $\vec{V}_s$.
The MV noise filtering result $T_1(x, y, \mathrm{MV})$ is computed as

$$T_1(x, y, \mathrm{MV}) = \begin{cases} 3, & |\bar{V}_{rr}| = 0 \\ 2, & |\vec{V}_s| \geq |\bar{V}_{rr}| \\ T_2(x, y, \mathrm{MV}), & \text{otherwise} \end{cases} \quad (3)$$

In formula (3), $\bar{V}_{rr}$ is the averaged MV in $C_{rr}$; it is defined as

$$\bar{V}_{rr} = \frac{1}{\mathrm{num}_{rr}} \sum_{\vec{v}_{rr} \in C_{rr}} \vec{v}_{rr}$$

Here, $\vec{v}_{rr}$ is the MV of an MB in $C_{rr}$, $\mathrm{num}_{rr}$ is the number of summed MVs, and $(x, y)$ denotes the position coordinates of the current encoding MB.
If $|\bar{V}_{rr}| = 0$, $\vec{V}_s$ is considered to be caused by MV noise and should be filtered out. $\vec{V}_s$ is set to 0 and the current encoding block is marked as 3, $T_1(x, y, \mathrm{MV}) = 3$.
If $|\vec{V}_s| \geq |\bar{V}_{rr}|$, the current encoding MB has obvious motion characteristics compared with the MBs in $C_{rr}$; it belongs to the dynamic foreground region and should be marked as 2, $T_1(x, y, \mathrm{MV}) = 2$.
Otherwise, the current encoding MB has motion characteristics similar to those of the nearby MBs in $C_{rr}$, and its temporal saliency is undetermined. Translational MV checking must then be carried out to decide whether the current encoding MB belongs to the background region or to the foreground translational region.
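The three-way decision described above can be sketched as follows. The sketch assumes that the MVs of the macroblocks inside the reference region have already been collected into a list; the function name `mv_noise_filter`, the use of `None` for the undetermined case, and the choice to average the MVs as vectors before taking the magnitude are assumptions for illustration.

```python
import numpy as np

def mv_noise_filter(v_s, region_mvs):
    """Classify the current MV against the averaged MV of the reference region.

    v_s: (dx, dy) MV of the current encoding MB.
    region_mvs: iterable of (dx, dy) MVs of the MBs inside the reference region.
    Returns 3 (MV noise), 2 (dynamic foreground), or None when the block
    still needs translational MV checking.
    """
    v_s = np.asarray(v_s, dtype=float)
    region = np.asarray(region_mvs, dtype=float).reshape(-1, 2)
    # Averaged MV of the reference region (vector mean, an assumed reading).
    v_rr = region.mean(axis=0) if len(region) else np.zeros(2)
    mag_rr = np.linalg.norm(v_rr)

    if mag_rr == 0:
        return 3          # no supporting motion in the previous frame: noise
    if np.linalg.norm(v_s) >= mag_rr:
        return 2          # clearly stronger motion than the neighbourhood
    return None           # similar motion: undetermined, check translation


# Hypothetical usage: a lone nonzero MV over a still region is flagged as noise.
label = mv_noise_filter((3, 0), [(0, 0), (0, 0), (0, 0), (0, 0)])
```

Averaging magnitudes instead of vectors would avoid opposite MVs cancelling to zero; the text does not say which variant is intended.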

Translational MV Checking.
After MV noise filtering procedure, the translation MV interference attenuating step comes into consideration.
(1) Basic Principle of Translational MV Checking. In block-matching video coding, the first step is to compute the difference between the best matching block and the original block, namely, the prediction error. The smaller the prediction error, the fewer high-frequency coefficients remain after the discrete cosine transform (DCT), the higher the probability that all quantized coefficients are zero, and the fewer coding bits are needed, so the compression is higher. This means that the current block and the predicted block match better and have higher structural similarity, and the prediction coding is more effective.
In the H.264/AVC standard, taking 4×4 subblock coding as an example, the integer DCT can be described as

$$Y = \left(C_f X C_f^{T}\right) \otimes E_f$$

where $X$ is the 4×4 block being transformed, $C_f X C_f^{T}$ is the integer core transform, $E_f$ is the constant scaling matrix, and $\otimes$ represents element-by-element multiplication: each element of the core transform output is multiplied by the scaling factor at the same position in $E_f$.
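As a small numerical illustration of the 4×4 integer transform, the sketch below uses the core matrix and scaling factors ($a = 1/2$, $b = \sqrt{2/5}$) commonly given in descriptions of the H.264/AVC forward transform. The names `C_F`, `E_F`, and `forward_transform_4x4` are chosen here for illustration, and a real encoder folds the scaling into quantization rather than applying it in floating point.

```python
import numpy as np

# Integer core transform matrix of the 4x4 forward transform.
C_F = np.array([[1,  1,  1,  1],
                [2,  1, -1, -2],
                [1, -1, -1,  1],
                [1, -2,  2, -1]], dtype=float)

A = 0.5
B = np.sqrt(2.0 / 5.0)
# Per-row scaling factors; their outer product gives the scaling matrix
# with entries a^2, ab/2, and b^2/4.
_ROW_SCALE = np.array([A, B / 2, A, B / 2])
E_F = np.outer(_ROW_SCALE, _ROW_SCALE)

def forward_transform_4x4(x):
    """Return (C_F x C_F^T) scaled element by element with E_F."""
    x = np.asarray(x, dtype=float)
    core = C_F @ x @ C_F.T   # integer core transform
    return core * E_F        # element-wise scaling, not a matrix product


# A flat residual block has energy only in the DC coefficient.
flat = forward_transform_4x4(np.full((4, 4), 3.0))
```

A small prediction error therefore yields small coefficients, which is why well-matched blocks tend to quantize to all zeros.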
If the quantized coefficient $Z(u, v)$ of a transform coefficient $Y(u, v)$ of an encoding block is equal to zero, $|Y(u, v)|$ must fall below a threshold determined by the quantization step.
The sum of absolute differences (SAD) of the 4×4 subblock is obtained by adding the absolute values of the prediction errors over the block. In video coding standards based on block matching, SAD is therefore commonly used as the function that measures the degree of correlation between the current encoding block and the prediction block. A smaller SAD means a stronger correlation between the two blocks and a better match. In this paper, the foreground translational region detection is based on pixel region change detection theory [12].
The translational MV checking result $T_2(x, y, \mathrm{MV})$ is computed as

$$T_2(x, y, \mathrm{MV}) = \begin{cases} 1, & \mathrm{SAD}(x, y) \geq \overline{\mathrm{SAD}}_c \\ 0, & \mathrm{SAD}(x, y) < \overline{\mathrm{SAD}}_c \end{cases} \quad (10)$$

In formula (10), $(x, y)$ represents the position coordinates of the encoding block and $\overline{\mathrm{SAD}}_c$ is the decision threshold. $\mathrm{SAD}(x, y)$ is the sum of absolute differences between the current encoding block and the encoded block with the same position coordinates in the previous frame; it describes the degree of variation of the encoding block between two adjacent video frames. $\mathrm{SAD}(x, y)$ is defined as follows:

$$\mathrm{SAD}(x, y) = \sum_{k=1}^{M} \sum_{l=1}^{N} \left| s(k, l) - c(k, l) \right|$$

Here, $s(k, l)$ is the pixel value of the current encoding block, $c(k, l)$ is the pixel value of the corresponding block in the previous frame, and $M$, $N$ denote the partition dimensions of the current encoding block.
If the value of $\mathrm{SAD}(x, y)$ is high, a great difference exists between the two adjacent frames. The current encoding block is considered to be in the foreground translational region under a dynamic background, and $T_2(x, y, \mathrm{MV})$ is marked as 1.
If the value of $\mathrm{SAD}(x, y)$ is low, only a small difference exists between the two adjacent frames. The current encoding block is considered to be in the background region, and $T_2(x, y, \mathrm{MV})$ is marked as 0.
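The SAD test described above can be sketched in a few lines. The function names `block_sad` and `translational_check` are hypothetical, and treating a SAD exactly equal to the threshold as "high" is an assumption, since the text only distinguishes high from low values.

```python
import numpy as np

def block_sad(cur_block, prev_block):
    """Sum of absolute differences between the current block and the
    co-located block in the previous frame."""
    # Cast to a wide signed type so that uint8 pixels do not wrap around.
    cur = np.asarray(cur_block, dtype=np.int64)
    prev = np.asarray(prev_block, dtype=np.int64)
    return int(np.abs(cur - prev).sum())

def translational_check(sad_value, threshold):
    """Return 1 (foreground translational region) when the SAD reaches the
    threshold, otherwise 0 (background region)."""
    return 1 if sad_value >= threshold else 0


# Hypothetical 4x4 blocks whose pixels differ by 2 everywhere: SAD = 16 * 2.
cur = np.full((4, 4), 10, dtype=np.uint8)
prev = np.full((4, 4), 12, dtype=np.uint8)
sad = block_sad(cur, prev)
```

In an encoder the SAD is already available from motion estimation, so only the comparison itself would be new work.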
(3) Setting of the Self-Adaptive Dynamic Threshold. Video sequences differ in their degree of motion, and different encoding parameters, especially quantization steps, affect the coding distortion and change the value of $\mathrm{SAD}(x, y)$. How to judge $\mathrm{SAD}(x, y)$ is therefore one of the important factors affecting the performance of the proposed algorithm, and a fixed threshold will obviously cause judgment errors. To reduce the detection error caused by these uncertainties, the proposed algorithm performs translational MV interference detection with a self-adaptive dynamic threshold $\overline{\mathrm{SAD}}_c$, determined as the average $\mathrm{SAD}(x, y)$ value of all the encoding blocks in the background region of the previous frame. Let

$$\overline{\mathrm{SAD}}_c = \frac{1}{\mathrm{Num}} \sum_{(x, y) \in S_c} \mathrm{SAD}(x, y)$$

Here, $S_c$ represents the background region in the previous frame, $\sum_{(x, y) \in S_c} \mathrm{SAD}(x, y)$ is the sum of the $\mathrm{SAD}(x, y)$ values of the encoding blocks enclosed in $S_c$, and $\mathrm{Num}$ is the number of summed blocks.
(4) Temporal Saliency Flowchart. In summary, the computational formula of temporal saliency analysis and detection is as follows:

$$T(x, y, \mathrm{MV}) = \begin{cases} 3, & |\bar{V}_{rr}| = 0 \\ 2, & |\vec{V}_s| \geq |\bar{V}_{rr}| \\ 1, & \text{otherwise, and } \mathrm{SAD}(x, y) \geq \overline{\mathrm{SAD}}_c \\ 0, & \text{otherwise} \end{cases}$$

where $T(x, y, \mathrm{MV}) = 3$ means the current MV is MV noise; $T(x, y, \mathrm{MV}) = 2$ means the current encoding block belongs to the foreground dynamic region; $T(x, y, \mathrm{MV}) = 1$ means the current encoding block belongs to the foreground translational region; and $T(x, y, \mathrm{MV}) = 0$ means the current encoding block is in the background region.
It should be pointed out that after MV noise is filtered out ($T(x, y, \mathrm{MV}) = 3$), the current $\vec{V}_s$ is set to zero; the current encoding block then belongs to the background region and should be marked with 0.
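Putting the pieces together, the following sketch derives the adaptive threshold from the previous frame and then assigns the final temporal label of one block. The names `adaptive_sad_threshold` and `temporal_label`, the per-block array layout, and the `ValueError` raised when the previous frame has no background block are assumptions for illustration; noise blocks are returned as 0 because, as stated above, their MV is reset to zero and they are treated as background.

```python
import numpy as np

def adaptive_sad_threshold(prev_sad, prev_labels):
    """Average SAD over the blocks labelled as background (0) in the
    previous frame. Both inputs have one entry per encoding block."""
    prev_sad = np.asarray(prev_sad, dtype=float)
    background = np.asarray(prev_labels) == 0
    if not background.any():
        raise ValueError("previous frame has no background block")
    return float(prev_sad[background].mean())

def temporal_label(mag_vs, mag_vrr, sad_value, threshold):
    """Final temporal saliency label of one block.

    mag_vs: magnitude of the current MV.
    mag_vrr: magnitude of the averaged MV of the reference region.
    """
    if mag_vrr == 0:
        return 0   # MV noise: MV reset to zero, block treated as background
    if mag_vs >= mag_vrr:
        return 2   # foreground dynamic region
    # Similar motion to the neighbourhood: decide with the SAD test.
    return 1 if sad_value >= threshold else 0


# Hypothetical previous-frame data: three background blocks, one foreground.
thr = adaptive_sad_threshold([[10, 200], [30, 20]], [[0, 2], [0, 0]])
```

Because the threshold follows the background SAD of the previous frame, it adapts to changes in QP and scene activity without a hand-tuned constant.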
The flow chart of temporal saliency detection is shown in Figure 10.
According to the calculation procedure described above, the current encoding frame can be sorted into temporal visual characteristic regions of different significance, based on the low-level encoding information $\vec{V}_s$ of the current encoding block and its motion vector correlation with the adjacent blocks in $C_{rr}$ (shown in Figure 11).
As the calculation of SAD ( , ) should be performed in interframe prediction mode decision and motion estimation, no additional calculation cost will be caused with the adoption of this method, so it is quite applicable in occasions with limited calculation resources.

Spatial Saliency Analysis and Detection.
Because the HVS is also sensitive to changes in the spatial domain, spatial saliency must be detected as well in order to improve the visual perception analysis results, and the correlation between prediction mode and spatial visual features must be analyzed. We take the H.264/AVC standard as an example and discuss the correlation between prediction mode and visual spatial attention.
All the prediction modes of H.264/AVC coder are shown in Figure 12.
It has been verified in previous studies that the optimal prediction mode decision has the following rules [13].
In I-frame encoding of the H.264/AVC standard, smooth regions are suited to Intra 16 × 16, while regions with rich texture usually select Intra 4 × 4.
In P-frame encoding, the optimal prediction mode depends on how well the encoded MB is matched in prediction; the selected mode reflects the richness of the MB's content and is consistent with human visual selective attention.
As the HVS is relatively insensitive to smooth background regions, in the H.264/AVC standard the smooth regions usually choose Intra 16 × 16 in I-frame encoding or the macroblock prediction modes Inter 16 (skip, 16 × 16, 16 × 8, 8 × 16) in P-frame encoding.
The HVS usually assigns higher visual importance to foreground regions with abundant texture features and to moving objects; these regions therefore usually select Intra 4 × 4 in I-frame encoding or the subblock prediction modes Inter 8 (8 × 8, 8 × 4, 4 × 8, 4 × 4) in P-frame encoding (shown in Figure 13).
Although Intra mode is rarely used in P-frame coding, its selection means that new information has appeared or that the content of the current frame differs greatly from the previous frame. In Figure 13(d), 6 MBs (marked with red dots) select Intra 4 × 4 as the optimal prediction mode: because the woman suddenly raises her right arm, the moving arm is the ROI that attracts higher visual attention.
There is high consistency between prediction mode decision results and visual attention. Therefore, in the proposed algorithm, the prediction mode is regarded as spatial characteristic of visual perception analysis.
According to the analysis above, the current encoding frame can be sorted into spatial visual characteristic regions of different significance according to the optimal prediction mode decision results. The spatial saliency detection computational formula is as follows:

$$S(x, y, \mathrm{Mode}) = \begin{cases} 2, & \mathrm{Mode}_P \in \{\text{Intra}\} \\ 1, & \mathrm{Mode}_P \in \{\text{Inter 8}\} \text{ or } \mathrm{Mode}_I \in \{\text{Intra } 4 \times 4\} \\ 0, & \mathrm{Mode}_P \in \{\text{Inter 16}\} \text{ or } \mathrm{Mode}_I \in \{\text{Intra } 16 \times 16\} \end{cases}$$

$\mathrm{Mode}_P$ and $\mathrm{Mode}_I$ represent the optimal prediction modes selected by the current MB in P-frame and I-frame coding, respectively.
If $\mathrm{Mode}_P \in \{\text{Intra}\}$, that is, Intra 4 × 4 or Intra 16 × 16 is selected as the optimal prediction mode in P-frame coding, the spatial saliency is the highest and the current encoding block belongs to the region to which human vision is sensitive; this is expressed as $S(x, y, \mathrm{Mode}) = 2$.
If $\mathrm{Mode}_P \in \{\text{Inter 8}\}$ or $\mathrm{Mode}_I \in \{\text{Intra } 4 \times 4\}$, that is, the current encoding block takes a subblock prediction mode (8 × 8, 8 × 4, 4 × 8, 4 × 4) as the optimal prediction mode in P-frame coding or uses Intra 4 × 4 in I-frame coding, the block has abundant texture or changes in the spatial domain. It belongs to the attention region, its spatial saliency is high, and this is expressed as $S(x, y, \mathrm{Mode}) = 1$.
If $\mathrm{Mode}_P \in \{\text{Inter 16}\}$ or $\mathrm{Mode}_I \in \{\text{Intra } 16 \times 16\}$, the current encoding block is smooth and its spatial changes are slight. It has low spatial visual significance and belongs to the nonsignificant region, which is expressed as $S(x, y, \mathrm{Mode}) = 0$.
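The mode-to-saliency rules above amount to a small lookup, sketched below. The string identifiers for frame types and prediction modes (`"P"`, `"intra4x4"`, `"skip"`, and so on) and the function name `spatial_saliency` are hypothetical; a real encoder would expose its mode decision through its own enumeration.

```python
# Hypothetical mode identifiers grouped as in the text.
INTER_8 = {"8x8", "8x4", "4x8", "4x4"}          # subblock modes
INTER_16 = {"skip", "16x16", "16x8", "8x16"}    # macroblock modes
INTRA = {"intra4x4", "intra16x16"}

def spatial_saliency(frame_type, mode):
    """Map the optimal prediction mode of an MB to a spatial saliency
    level (2 = highest, 0 = nonsignificant)."""
    if frame_type == "P":
        if mode in INTRA:
            return 2   # Intra chosen in a P-frame: new or strongly changed content
        if mode in INTER_8:
            return 1   # fine partitioning: rich texture or motion detail
        if mode in INTER_16:
            return 0   # large partitions: smooth region
    elif frame_type == "I":
        if mode == "intra4x4":
            return 1   # textured region
        if mode == "intra16x16":
            return 0   # smooth region
    raise ValueError(f"unsupported combination: {frame_type!r}, {mode!r}")


# Hypothetical usage with the mode decisions of three MBs.
levels = [spatial_saliency("P", "intra4x4"),
          spatial_saliency("P", "8x4"),
          spatial_saliency("I", "intra16x16")]
```

Since the mode decision is a by-product of encoding, this step adds only a table lookup per macroblock.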

Combination of the Spatiotemporal Saliency
In this algorithm, according to the values of $T(x, y, \mathrm{MV})$ and $S(x, y, \mathrm{Mode})$, the priority level of the VROI is divided into 6 grades, from 5 (highest) down to 0. For example, if the current MB uses Intra mode in P-frame coding ($S(x, y, \mathrm{Mode}) = 2$), the MB is in the visually sensitive region and has the highest visual attention; it is marked with $\mathrm{VROI}(x, y, \mathrm{MV}, \mathrm{Mode}) = 5$, and so on.
The proposed algorithm framework is as depicted in Figure 14.

The Algorithm Performance Evaluation
3.1. Test Platform. In this paper, three existing algorithms [4,6,7] are used in the comparison experiment. The experimental environment is listed in Table 5.
We use 10 typical test sequences in multiple formats (such as 176 × 144, 352 × 288, and 416 × 240) covering different scenes, degrees of motion, and flatness, including daytime and nighttime videos, sports videos, television news, broadcasts, and video surveillance.

Saliency Detection Results.
According to the visual perception characteristics analysis and the saliency detection procedure mentioned previously, the VROI marking results are shown in Figure 15.
In the output saliency detection results, the MB luminance values are proportional to the priority level $\mathrm{VROI}(x, y, \mathrm{MV}, \mathrm{Mode})$: the higher the visual sensitivity of a region, the higher the luminance of the corresponding MBs. In Figure 15, MVD is the motion vector diagram, and PMD is the prediction mode diagram. The detection results are highly consistent with the human visual system.

Algorithm Complexity Analysis.
The computation time and the similarity measure method are adopted to evaluate the performance of the algorithm [13]. In the similarity measure method, the Kullback-Leibler (KL) distance is used to measure the similarity between the saliency distributions at human saccade locations and at random locations. Here $H$ and $R$ are the saliency distributions at human saccade locations and random locations, with probability density functions $h$ and $r$, respectively.
The saliency detection algorithm with the higher KL distance can discriminate human saccade locations from the random locations more easily, and this means better performance in saliency detection for videos [14].
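A minimal sketch of such a KL measure between two histograms follows. The paper's own formula is not reproduced in the text, so the symmetric form used here, the small `eps` added to avoid division by zero, and the name `symmetric_kl` are assumptions; the inputs stand for histograms of saliency values sampled at saccade locations and at random locations.

```python
import numpy as np

def symmetric_kl(h, r, eps=1e-12):
    """Symmetric KL distance between two discrete distributions.

    h, r: nonnegative histograms over the same bins (hypothetical inputs).
    """
    h = np.asarray(h, dtype=float) + eps
    r = np.asarray(r, dtype=float) + eps
    h = h / h.sum()   # normalize to probability distributions
    r = r / r.sum()
    kl_hr = np.sum(h * np.log(h / r))
    kl_rh = np.sum(r * np.log(r / h))
    return 0.5 * (kl_hr + kl_rh)


# Identical histograms give (numerically) zero distance; the more the two
# distributions differ, the larger the value.
same = symmetric_kl([1, 2, 3], [1, 2, 3])
far = symmetric_kl([0.9, 0.1], [0.1, 0.9])
near = symmetric_kl([0.6, 0.4], [0.4, 0.6])
```

A larger value indicates that the saliency map separates fixated locations from random ones more clearly, which is the sense in which a higher KL distance is read as better detection performance.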
The statistical data in Table 6 show that the proposed algorithm markedly improves both the speed of calculation and the performance of video saliency detection. Compared with [4,6,7], the calculation time for VROI detection is reduced by up to 10.69%, 14.66%, and 5.29%, while the KL distance is increased by 0.50, 1.02, and 0.24, respectively.
Especially for the Bus, Stefan, Foreman, and Paris sequences, which contain a large number of global motion vectors or a richly textured background, the proposed algorithm can analyze visual perception characteristics and extract the VROI quickly and accurately. In Table 6, the symbol "−" denotes a decrease and "+" denotes an increase. $\Delta\mathrm{Time}$ (%) and $\Delta\mathrm{KL}$ are defined in formula (18):

$$\Delta\mathrm{Time} = \frac{\mathrm{Time}_{\text{proposed}} - \mathrm{Time}_{\text{reference}}}{\mathrm{Time}_{\text{reference}}} \times 100\%, \qquad \Delta\mathrm{KL} = \mathrm{KL}_{\text{proposed}} - \mathrm{KL}_{\text{reference}} \quad (18)$$

In Table 6, the average computational time of the proposed algorithm is the shortest, and its average KL distance is the largest. This means that the proposed algorithm keeps the computational complexity under strict control and discriminates human saccade locations from random locations more quickly and accurately than the other algorithms. The experimental results demonstrate that the proposed algorithm performs best among the compared algorithms in video saliency detection.

Conclusion
In this paper, the interdependency between video encoding information and HVS characteristics is studied, and a video saliency detection algorithm is proposed that is based on visual perception characteristics analysis and on low-level encoding information obtained directly from the bitstream. The simulation results show that the proposed algorithm performs better than other existing ones. It can filter out motion vector noise, weaken the interference of translational motion vectors, and remove visual redundancy, so it can perform visual perception characteristics analysis and saliency detection quickly and effectively. The complexity of the proposed algorithm is low, and its detection results are more consistent with the HVS than those of other existing algorithms. It can be used conveniently in many Internet-based multimedia applications such as ROI-based video retrieval and video quality assessment. It can also be applied to video coding standards such as HEVC and H.264/AVC. In the future, the proposed video saliency detection algorithm can be combined with fast video coding technologies to realize fast HVS-based video coding for the latest video coding standard, HEVC, and the saliency detection technique can be taken as part of a standard video codec at medium-to-low bit-rates.