A Rate-Distortion Optimized Coding Method for Region of Interest in Scalable Video Coding

The support for region of interest (ROI) browsing, which allows dropping background part of video bitstreams, is a desirable feature for video applications. With the help of the slice group technique provided by H.264/SVC, rectangular ROI areas can be encoded into separate ROI slices. Additionally, by imposing certain constraints on motion estimation, ROI part of the bitstream can be decoded without background slices of the same layer. However, due to the additional spatial and temporal constraints applied to the encoder, overall coding efficiency would be significantly decreased. In this paper, a rate-distortion optimized (RDO) encoding scheme is proposed to improve the coding efficiency of ROI slices. When background slices are discarded, the proposed method uses base layer information to generate the prediction signal of the enhancement layer. Thus, the temporal constraints can be loosened during the encoding process. To do it in this way, the possible mismatch between generated reference frames and original ones is also considered during rate-distortion optimization so that a reasonable trade-off between coding efficiency and decoding drift can be made. Besides, a new Lagrange multiplier derivation method is developed for further coding performance improvement. Experimental results demonstrate that the proposed method achieves significant bitrate saving compared to existing methods.


Introduction
With the rapid development and continuous expansion of mobile communications, mobile internet services are becoming more and more popular. As a result, mobile video applications, such as mobile video broadcasting [1], mobile video conference [2], and mobile video surveillance [3], have become an active research area in recent years. However, due to the fact that mobile devices typically have limited communication bandwidth, constrained power capacity, and various display capabilities, there are several fundamental difficulties in deploying high-quality video service for mobile devices over wireless networks. Among them, one crucial problem for mobile video application is how to browse high-resolution (HR) videos on mobile devices with small screens. Traditional approaches typically downsize the video to achieve the required resolution, which will inevitably cause a loss of perceptual information and a waste of network bandwidth. Visual perception experiments show that small display screen is always a critical factor affecting the browsing experiences. In fact, different parts of a picture do not equally attract people's attention. People are likely to pay more attention to a certain area, which is called region of interest (ROI), than to other areas of a picture. Thus, it is beneficial to optimize multimedia systems according to ROIs of video content, for example, to make ROI areas have better video quality [4] and to make background areas droppable when needed [5].
Scalable video coding extension of the H.264/MPEG-4 AVC video compression standard (H.264/SVC or SVC for short) [6] enables various functionalities to make encoded bitstreams more adaptive to dramatic variation of resource constraints, such as bandwidth, display capability, and power consumption. The base layer of SVC is compatible with H.264/AVC, so that it is easy to meet the requirement for the compatibility when upgrading video broadcasting infrastructures [7]. SVC is capable of encoding original video into different layers. Typically, it generates a high-quality video bitstream that contains one or more substreams, each of which corresponds to a degraded version of the original video signal with lower spatial resolution or lower temporal resolution or lower picture fidelity. Substreams can be extracted at media gateways according to network bandwidth and end device capability.
In addition to spatial, temporal, and quality scalabilities, SVC also supports ROI scalability [5]. With the help of the slice group technique (also known as Flexible Macroblock Ordering (FMO) [8]), macroblocks (MBs) in ROI and background areas can be coded into ROI slices and background slices, respectively. A low bitrate substream, which contains ROI slices, can be extracted from the high-quality bitstream without any transcoding operation [6]; therefore SVC can provide ROI browsing functionality for multimedia communication systems. As illustrated in Figure 1, the base layer of a bitstream can be coded with relatively low spatial resolution or modest fidelity to provide basic video quality for devices with small screen or low bandwidth. The enhancement layer can be coded with high resolution or high fidelity to provide higher video quality. The enhancement layers may be composed of ROI slices and background slices to provide ROI scalability with FMO, which is allowed in the scalable baseline profile of H.264/SVC. With the help of spatial scalability, one may choose either the QCIF (Quarter Common Intermediate Format) base layer or the CIF (Common Intermediate Format) enhancement layer according to resource constraints such as network bandwidth, screen size, and power consumption, as illustrated in scenarios (a) and (c). Additionally, ROI scalability introduces a new scenario (scenario (b)), in which both the base layer and the ROI of the enhancement layer are delivered, and one may choose between the whole scene with low resolution and the zoomed ROI area with high resolution for a better experience.
To develop a video application system with ROI browsing functionality based on SVC (like the application in Figure 1), ROI areas should first be detected in each frame, and then they can be coded into ROI slices using FMO. The detection of ROI of a picture has been widely studied with various attention models [9][10][11]. Given the fact that the ROI of a certain picture may be quite different for different people, it is practical and reasonable for a video application system to allow users to freely select some initial objects of interest and then track these objects in subsequent video frames to locate ROI areas. Existing tracking methods can be used to perform the tracking operation. These methods can be categorized into pixel domain (Pel-domain) [12][13][14] and compressed domain (Com-domain) [15][16][17] approaches. Generally speaking, the Pel-domain approaches can achieve better tracking accuracy than the Com-domain ones, yet with higher complexity. For the simulations in this paper, a color based Monte Carlo tracking technique introduced by Pérez et al. [14] is applied for ROI tracking.
To support ROI scalability, ROI slices should be self-contained, in other words, decodable in the absence of other slices of the same picture [18]. Thus, during the encoding process, dependencies between different slices, such as dependencies introduced by intraprediction and motion vector prediction, should be prohibited [8]. Besides, in order to provide acceptable visual quality in case background slices are all discarded, several additional temporal constraints are suggested to avoid using background slices in reference pictures to predict the current ROI slice. The constrained motion vector method [19] presents an H.264/SVC compatible way where the MBs belonging to the current ROI slice use only the corresponding ROI slices in reference pictures as reference. Besides, since the reference frames should be upsampled with a 6-tap interpolation filter for quarter-pel motion estimation (ME), the fractional pixels located within two pixels of the ROI slice boundary must also be ignored during motion estimation. However, this may significantly decrease the coding efficiency. Thus, Bae et al. [20] suggest using a half-sample interpolation method for fractional pixel interpolation, where the slice boundary is treated as a picture boundary. However, this method brings little improvement compared with the constrained motion vector method; therefore, it has not been adopted into the H.264/SVC standard. Generally speaking, the above approaches all make strict truncation of temporal prediction, which leads to significant degradation of coding performance. Fortunately, for the SVC application in Figure 1, since the base layer information is available when the ROI of the enhancement layer is being encoded, it is better to adopt a more flexible method to improve the coding performance of the enhancement layer ROI slice.
In this paper, an efficient ROI coding algorithm under the SVC scalable baseline profile is proposed. When the enhancement layer contains ROI slices, information of the base layer is adopted to improve their coding efficiency.
The framework of the proposed algorithm is illustrated in Figure 2. The ROI area of an input picture is coded as a ROI slice by the proposed enhancement layer encoder. It improves the coding efficiency by using rate-distortion optimized (RDO) mode decision, which takes into consideration the error propagation due to the loss of the background slices, instead of directly restricting the motion vectors. A new Lagrange multiplier derivation method, which is associated with the proportion of ROI area, is also developed and used in the RDO model to further improve the ROI slice encoding performance.
The remainder of this paper is organized as follows. The proposed ROI coding method is introduced in Section 2. Then Section 3 shows some experimental results to verify the benefits of the proposed RDO method and is followed by the conclusion in Section 4.

ROI Coding for H.264/SVC
In H.264/SVC, a rectangular ROI area of a picture can be coded into a separate slice using FMO technique. However, to support ROI browsing functionality, additional effort should be made to ensure that ROI slices are independent of background data. Several approaches can be used to encode ROI slices. Most of them apply constraints to temporal prediction to enable fully independent decoding of ROI slices, for example, the constrained motion vector method [10,19] and the half-sample interpolation method [20]. However, in H.264/SVC, such strict constraints may severely degrade the coding efficiency for enhancement layer users. Fortunately, since the corresponding base layer data is always contained in an enhancement layer bitstream, the base layer information is available to be adopted to further improve the coding efficiency of the ROI area of the enhancement layer. In this section, the existing ROI coding approaches are introduced first, and then the proposed RDO based framework for coding ROI slices is presented.

The Constrained Motion Vector Method.
In H.264/SVC, motion estimation (ME) and motion compensation (MC) are performed using motion vectors with the accuracy of quarter-pixel luma samples. If the motion vector represents a fractional pixel position, interpolation is performed to generate the predicted signal value.
As shown in Figure 3, the half-pixel samples are interpolated first from neighboring integer-pixel samples using a 6-tap finite impulse response (FIR) filter [21]. This means that each half-pixel sample is a weighted sum of 6 neighboring integer samples. Once all the half-pixel samples are available, the quarter-pixel samples are interpolated with neighboring half- and/or integer-pixel samples using bilinear interpolation.
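With the integer samples along the filter direction labeled $E, F, G, H, I, J$ as in the H.264/AVC standard, the half-pixel sample $b$ between $G$ and $H$ is interpolated by
$$b_1 = E - 5F + 20G + 20H - 5I + J, \tag{1}$$
$$b = (b_1 + 16) \gg 5. \tag{2}$$
Since the derivation process of some fractional pixels within two pixels inside the ROI boundary (labeled as "unavailable fractional pixel" in Figure 3) depends on pixels outside the ROI region, Hannuksela et al. [19] proposed not to use them as reference during ME/MC; therefore, the dependency between ROI and background can be removed.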
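As a rough illustration of why those positions become unavailable, the following Python sketch applies the 6-tap filter with the standard H.264/AVC coefficients (1, -5, 20, 20, -5, 1) along a single row and lists the half-pel positions whose filter support leaves the ROI. The function names, the one-dimensional simplification, and the inclusive ROI interval `[lo, hi]` are assumptions made for this example; they are not taken from the paper or from the JSVM software.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264/AVC luma half-pel filter coefficients

def half_pel(row, i):
    """Half-pel sample between row[i] and row[i + 1] (1-D case).

    Uses the integer samples row[i-2] .. row[i+3], then rounds, shifts
    right by 5 (divide by 32), and clips to the 8-bit range.
    """
    acc = sum(t * row[i - 2 + k] for k, t in enumerate(TAPS))
    return min(255, max(0, (acc + 16) >> 5))

def unavailable_half_pels(lo, hi):
    """Indices i (lo <= i < hi) whose half-pel sample between i and i + 1
    needs integer samples outside the ROI interval [lo, hi]."""
    return [i for i in range(lo, hi) if i - 2 < lo or i + 3 > hi]

row = [10, 10, 10, 10, 50, 90, 90, 90, 90, 90]
print(half_pel(row, 4))              # 75
print(unavailable_half_pels(2, 9))   # [2, 3, 7, 8]
```

In this toy row the ROI spans indices 2 to 9, and the two half-pel positions nearest each boundary are the ones the constrained motion vector method would have to exclude.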

The Half-Sample Interpolation Method.
To loosen the restriction imposed on the ME process, the half-sample interpolation method [20] modifies the fractional pixel interpolation process by extending the pixels on slice boundaries before applying the 6-tap interpolation. For example, with the H.264/AVC sample labels $E, F, G, H, I, J$ and with $E$ and $F$ lying outside the ROI, the half-sample $b$ in Figure 3 is generated by
$$b = (20(G + H) - 10I + 2J + 16) \gg 5. \tag{3}$$
In the actual implementation, the half-sample interpolation method is only adopted when generating the reference signal for ROI areas. The original interpolation method in H.264/SVC is still used to generate reference frames of background slices so that better coding efficiency can be achieved.
The above two methods aim to prohibit using samples in the background area of the reference frame during ME/MC, which also prevents many MBs from being coded with a suitable prediction block that is located in, or overlaps with, the background area. Therefore the coding efficiency is severely reduced compared with the original coding method.
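The boundary handling of the half-sample interpolation method can be sketched in the same one-dimensional setting. The coefficient pattern 20, 20, -10, 2 in (3) is what results when each tap that falls outside the ROI is replaced by the tap mirrored about the interpolated position; that mirroring rule is an interpretation adopted for this sketch, and the function name and the inclusive ROI interval `[lo, hi]` are likewise hypothetical.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264/AVC luma half-pel filter coefficients

def half_pel_roi(row, i, lo, hi):
    """Half-pel sample between row[i] and row[i + 1] using ROI samples only.

    Assumption of this sketch: a tap outside the ROI interval [lo, hi] is
    replaced by the tap at the mirrored index 2*i + 1 - k, i.e. mirrored
    about the half-pel position. The ROI is assumed to be at least six
    samples wide so that the mirrored tap always lies inside it.
    """
    acc = 0
    for k, t in zip(range(i - 2, i + 4), TAPS):
        if not lo <= k <= hi:
            k = 2 * i + 1 - k          # mirror about the half-pel position
        acc += t * row[k]
    return min(255, max(0, (acc + 16) >> 5))

row = [0, 0, 40, 60, 80, 100, 120, 140]   # ROI covers indices 2..7
# At the left ROI boundary the two outer taps are mirrored, which gives
# (20*(40 + 60) - 10*80 + 2*100 + 16) >> 5.
print(half_pel_roi(row, 2, 2, 7))   # 44
# Away from the boundary the ordinary 6-tap filter is applied.
print(half_pel_roi(row, 4, 2, 7))   # 90
```

Because no background sample is read, every half-pel position inside the ROI stays usable, at the cost of a prediction signal that differs from the standard interpolation near the boundary.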

Proposed RDO Based ROI Coding Framework.
In application scenario (b) illustrated in Figure 1, when background slices of the enhancement layer are dropped, pixels in them cannot be used as reference to decode the ROI area of the enhancement layer, but the base layer information is still available. So the background pixels of the base layer can be reconstructed and used to generate reference frames for ROI areas of the enhancement layer using error concealment techniques. However, the mismatch between original and error-concealed reference frames may cause severe error propagation; therefore, the coding modes of MBs (which may use the error-concealed blocks as reference) should be selected carefully. Let $F$ and $\tilde{F}$ denote the original reference frame (ORF) and the error-concealed reference frame (named the virtual reference frame, VRF), respectively, where $\tilde{F}$ may be generated using base layer information through error concealment techniques. In the proposed RDO framework, the mismatch between $F$ and $\tilde{F}$, together with the source error introduced by quantization in the encoding loop, is considered as the total distortion. The RDO evaluation for mode decision is based on this total distortion. Furthermore, the Lagrange multiplier is modified to take account of the proportion of ROI area for better performance.

Generation of Virtual Reference Frame (VRF).
Figure 4 shows the generation of intercoded VRF $n$ at the encoder side. Since the background slice is assumed to be discarded, the pixels belonging to the background slice are estimated using the base layer information with the same error concealment method that the decoder uses (in this paper, the well-known BL-skip method [22] is adopted); the pixels belonging to the ROI slice are generated by motion compensation using their own motion vectors and residuals while taking the former VRF $n - 1$ as reference. Then, VRF $n$ also serves as the reference frame of the following VRFs. The generation of an intracoded VRF is similar to that of an intercoded VRF except that upsampled textures are directly used to imitate the background slice. Notice that, in the actual implementation, the upsampled motion vectors (MVs), residuals, and textures can be easily obtained when calculating the cost of "base layer mode." Thus, only one additional MC operation for each MB is needed to generate the VRF.
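The assembly of an intercoded VRF described above can be sketched as follows. This is a simplified illustration, not the JSVM implementation: frames are plain lists of pixel rows, motion vectors are integer-pel only, the block size defaults to 2 instead of 16, and the `concealed` frame is a stand-in for whatever the decoder-side concealment (BL-skip in the paper) would produce from base layer data. All names are hypothetical.

```python
def build_vrf(prev_vrf, concealed, roi, mvs, residuals, mb=2):
    """Assemble one virtual reference frame (VRF) as a list of pixel rows.

    prev_vrf  : previous VRF, used as the motion-compensation reference
    concealed : frame estimated from base-layer data only (assumed to be
                the output of the decoder's error concealment)
    roi       : (top, left, bottom, right), inclusive, aligned to `mb`
    mvs       : {(y, x): (dy, dx)} integer motion vector per ROI block
    residuals : {(y, x): 2-D list} decoded residual per ROI block
    """
    h, w = len(prev_vrf), len(prev_vrf[0])
    top, left, bottom, right = roi
    vrf = [r[:] for r in concealed]          # background: concealed pixels
    for by in range(top, bottom + 1, mb):
        for bx in range(left, right + 1, mb):
            dy, dx = mvs[(by, bx)]
            res = residuals[(by, bx)]
            for y in range(by, by + mb):
                for x in range(bx, bx + mb):
                    ry = min(h - 1, max(0, y + dy))   # clamp to the frame
                    rx = min(w - 1, max(0, x + dx))
                    val = prev_vrf[ry][rx] + res[y - by][x - bx]
                    vrf[y][x] = min(255, max(0, val))
    return vrf

prev = [[10 * (x + 1) for x in range(4)] for _ in range(4)]
concealed = [[7] * 4 for _ in range(4)]
vrf = build_vrf(prev, concealed, roi=(0, 0, 1, 1),
                mvs={(0, 0): (0, 1)},
                residuals={(0, 0): [[1, 1], [1, 1]]})
for r in vrf:
    print(r)
```

The ROI block is predicted from the previous VRF rather than from the original reference frame, so any mismatch accumulated in earlier VRFs carries forward, which is the drift the encoder needs to track.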

Proposed RDO Mode Decision.
In the mode decision process of a macroblock, the coding mode with the minimum RD cost is selected:
$$J = D + \lambda R, \tag{4}$$
where $D$ and $R$ are the distortion introduced and the bits consumed by the coding mode under consideration, respectively, and $\lambda$ is the Lagrange multiplier.
For an MB in a ROI slice, the proposed mode decision scheme considers both the distortion introduced by the difference between the reconstructed MB and the original MB (termed source distortion) and the mismatch between the reference MB in $F$ and in $\tilde{F}$ (termed mismatch distortion). So the RD cost function $J$ for mode decision becomes
$$J = D_s + D_m + \lambda R, \tag{5}$$
where $D_s$ stands for the source distortion and $D_m$ is the mismatch distortion.
For an MB in a background slice, a basic assumption is that users who receive enhancement layer background slices should also receive ROI slices. Such an assumption is reasonable considering that ROI slices are more important and are therefore better protected than background slices. Thus, $D_m$ becomes zero, and the cost function for mode decision reduces to its original form:
$$J = D_s + \lambda R. \tag{6}$$
Figure 5 depicts in detail the implementation of the proposed RDO mode decision of an MB.
(1) First, given a mode $m$, the best motion vector $\mathrm{mv}_{best}$ for each partition is selected using the original reference frame. Let the corresponding predictor be $P$.
(2) Then, calculate the source distortion $D_s$ and the cost bits $R$ through the encoding process of mode $m$.
(3) If the current MB belongs to the ROI area, find the new predictor $\tilde{P}$ from the previous VRF with motion vector $\mathrm{mv}_{best}$ and calculate the mismatch error $D_m = (P - \tilde{P})^2$, accumulated over the pixels of the MB. Note the following distinction: for distortion calculation in mode decision, the previous VRF is used as reference, while for distortion calculation in ME, the previous ORF is still used.
(4) Calculate the RD cost using (5) or (6), and return to step (1) for the next mode.
(5) Finally, the mode with the minimum $J$ among all candidate modes is selected as the best mode.
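The five steps above can be sketched as a small Python routine. It assumes that motion estimation and trial encoding have already been done for each candidate mode, so that the source distortion, the bit count, and the two predictors are given as inputs; the dictionary keys and function names are hypothetical, and treating an intra mode as having identical (here empty) predictors, and hence zero mismatch, is an assumption of this example.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def choose_mode(candidates, lam, in_roi):
    """Return the (name, cost) pair with the minimum RD cost J.

    Each candidate is a dict with the hypothetical keys
      'name'     : mode label
      'd_source' : source distortion D_s of the reconstructed MB
      'bits'     : bits R spent by the mode
      'pred_orf' : predictor fetched from the ORF with mv_best
      'pred_vrf' : predictor fetched from the VRF with the same mv_best
    """
    best = None
    for c in candidates:
        # The mismatch distortion D_m only counts for ROI macroblocks.
        d_mismatch = ssd(c['pred_orf'], c['pred_vrf']) if in_roi else 0
        cost = c['d_source'] + d_mismatch + lam * c['bits']
        if best is None or cost < best[1]:
            best = (c['name'], cost)
    return best

cands = [
    {'name': 'inter16x16', 'd_source': 100, 'bits': 20,
     'pred_orf': [50, 52, 54, 56], 'pred_vrf': [40, 42, 44, 46]},
    # No temporal reference, so no mismatch (assumption of this sketch).
    {'name': 'intra', 'd_source': 180, 'bits': 60,
     'pred_orf': [], 'pred_vrf': []},
]
print(choose_mode(cands, lam=4, in_roi=False))  # ('inter16x16', 180)
print(choose_mode(cands, lam=4, in_roi=True))   # ('intra', 420)
```

In this toy case the inter mode wins for a background MB, but inside the ROI its predictor differs so much between the ORF and the VRF that the costlier intra mode becomes the better choice.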
The benefit of the proposed RDO method is illustrated through the rate-distortion (RD) performance comparison in Figures 6 and 7 for both spatial and quality SVC. In Figure 6, the spatial SVC bitstream contains a QCIF base layer and a CIF enhancement layer. In Figure 7, the quality SVC bitstream contains a CIF base layer and a CIF enhancement layer. The intra period is set to 30. Four pairs of quantization parameters (QP) are chosen for the test: for spatial SVC, the QP pairs for the QCIF base layer and the CIF enhancement layer are (22, 26), (26, 30), (30, 34), and (34, 38), and, for quality SVC, the QP pairs are (30, 26), (34, 30), (38, 34), and (42, 38). The original method in H.264/SVC, which uses the ORFs without any constraints on temporal prediction, is simulated as the anchor. Three collections of data are presented in each figure, where "mdrdo" stands for the proposed RDO based mode decision method, and "mv constrain" and "half-interpolation" stand for the ROI coding methods mentioned in Section 2.1.
Coding efficiency in the following two scenarios is considered. The first is that enhancement layer slices are received completely, which means that the quality of the full resolution enhancement layer (labeled "Enc full") should be considered. The second scenario is that background slices are all discarded, which means that only the quality of ROI areas (labeled "Dec ROI") affects the user experience. The average bitrate savings of the above three coding methods relative to the "orig." method are presented; they are calculated with the Excel add-in proposed in VCEG-AE07 [23], and a larger Δ-bitrate value means worse performance.
From Figures 6(a) and 7(a), we can see that the coding efficiencies of those three methods (for the whole SVC bitstream) are all lower than that of the "orig." method, which uses the perfectly decoded reference frames as prediction. However, the proposed method achieves significant improvement compared with the other two methods because of the use of a better reference for coding ROI slices. The average performance gain compared with the one termed "half-interpolation" over the tested sequences is about 5% (Figure 6(a)) and 7% (Figure 7(a)) for spatial and quality SVC, respectively.
Although those methods are inferior to the original method when enhancement layer slices are all received, in the most important ROI browsing scenario, in which all the background slices may be discarded, their RD performance is much better than that of the "orig." method. As illustrated in Figures 6(b) and 7(b), the bitrate saving of the proposed method is up to 50% (about 30% and 4% on average for spatial and quality SVC, respectively) compared with the "orig." method. The lower gain for quality SVC compared with spatial SVC is consistent with the common knowledge that a better concealment quality will be obtained when the base layer and enhancement layer have the same resolution; thus, acceptable quality may be obtained even with the "orig." method for the quality coding configuration (CIF-CIF). Still, the proposed method outperforms the "mv constrain" and "half-interpolation" methods, and the average gain compared with "half-interpolation" is 3% (Figure 6(b)) and 2.5% (Figure 7(b)) for spatial and quality SVC, respectively.

The Selection of Lagrange Multiplier.
In rate-distortion optimization, the Lagrange multiplier should be carefully selected to ensure that the most suitable modes are chosen. In this paper, a refined Lagrange multiplier selection method for background and ROI slices is proposed to further improve the RD performance of ROI slices. In H.264/AVC, the Lagrange multiplier $\lambda$ can be calculated as follows.
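Supposing $D$ and $R$ in (4) to be differentiable everywhere, the minimum cost $J$ is given by setting its derivative to zero, thus leading to
$$\frac{dJ}{dR} = \frac{dD}{dR} + \lambda = 0 \;\Rightarrow\; \lambda = -\frac{dD}{dR}. \tag{7}$$
Then, the Lagrange multiplier $\lambda$ for single layer video coding can be solved through the rate model $R_{single}$ (8) and the distortion model $D_{single}$ (9) [24]:
$$R_{single}(D) = a \log_2\!\left(\frac{b}{D}\right), \tag{8}$$
$$D_{single}(Q) = \frac{Q^2}{12}, \tag{9}$$
where $a$ and $b$ are two constants and $Q$ is the quantization step. According to (8) and (9), the derivatives of $R$ and $D$ can be calculated by
$$\frac{dR}{dQ} = -\frac{2a}{Q \ln 2}, \qquad \frac{dD}{dQ} = \frac{Q}{6}. \tag{10}$$
Putting (10) into (7) and letting $c = \ln 2 / (12a)$, the $\lambda$ for single layer is finally derived as
$$\lambda_{single} = c \cdot Q^2, \tag{11}$$
where $c$ is a constant, which is experimentally suggested to be 0.85 [24], though others proposed 0.68 [25].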
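For a numerical feel of the single-layer multiplier, the sketch below evaluates $\lambda = c \cdot Q^2$ with $c = 0.85$, using the mapping $Q = 2^{(QP - 12)/6}$ between quantization parameter and quantization step that the text relies on later in (16). The function names are hypothetical, and the QP values are simply the enhancement layer QPs of the spatial SVC test pairs.

```python
def qstep(qp):
    """Quantization step as used in the text: Q = 2 ** ((QP - 12) / 6)."""
    return 2.0 ** ((qp - 12) / 6.0)

def lambda_single(qp, c=0.85):
    """Single-layer Lagrange multiplier, lambda = c * Q ** 2."""
    return c * qstep(qp) ** 2

for qp in (26, 30, 34, 38):
    print(qp, round(lambda_single(qp), 2))
```

Because the step doubles every 6 QP units, the multiplier grows by a factor of four over the same interval, so the bit cost weighs far more heavily in the mode decision at coarse quantization.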
During the development of H.264/SVC, this single layer $\lambda$ was used directly in the H.264/SVC reference software, the Joint Scalable Video Model (JSVM) [26]. However, applying a multiplier derived for single layer coding to a multilayer scenario is inappropriate, since the correlation between layers is not considered in the Lagrange multiplier selection. To improve the overall coding performance, an encoder-only optimization contribution for RDO in SVC was presented by Li et al. [27] and adopted into later JSVM versions as an option. According to this method, the Lagrange multiplier is derived as in (12), where one factor denotes the resolution ratio of the two layers and $(Q + \Delta Q)$ and $Q$ are the quantization steps for the base layer and enhancement layer, respectively. Similarly, for a specific user who requires base layer slices together with ROI slices, the best Lagrange multiplier for ROI slices can be derived as follows.
Let the joint cost be given by (13), where $J_b$ and $J_{roi}$ are the RD cost functions for the base layer and for the enhancement layer with ROI area, respectively, $w_b$ and $w_{roi}$ are the contribution weights of the base layer cost and the ROI area cost, respectively, and $w_b + w_{roi} = 1$. Similar to [27], a term denoting the resolution ratio between the enhancement layer ROI area and the base layer is introduced as an approximation of the bitrate ratio between the ROI slice and the base layer slice. The term $D_m$ is the mismatch distortion (as described in (5)) introduced by the difference between the ORF and the VRF. $D_m$ is much larger than $D(Q)$ and thus can be regarded as independent of $Q$. Therefore, by putting (8) and (9) into (13) and then setting the derivative of $J$ to zero, the Lagrange multiplier $\lambda_{roi}$ can be solved as in (14). Considering that the base layer Lagrange multiplier $\lambda_b$ is determined by the single layer $\lambda$ selection method (11), putting $\lambda_b = c \cdot (Q + \Delta Q)^2$ into (14) yields $\lambda_{roi}$ as in (15). Note that the derived $\lambda_{roi}$ is similar to (13). However, the rate ratio has a different form since the area of the ROI has been taken into consideration.
Let $QP_b$ and $QP_e$ denote the quantization parameters for the base layer and enhancement layer, respectively. Then, according to the relationship between quantization steps and quantization parameters, which has been defined in the H.264/AVC standard [28], $QP_b$, $QP_e$, and $Q$ conform to
$$Q + \Delta Q = 2^{(QP_b - 12)/6}, \qquad Q = 2^{(QP_e - 12)/6}. \tag{16}$$
Equation (14) can be simplified according to the quantization parameter difference between the base and enhancement layers, namely, $\Delta QP = QP_b - QP_e$. Finally, the modified Lagrange multiplier $\lambda_{roi}$ for the SVC enhancement layer when ROI is enabled is obtained as in (17).
Similar to the RDO mode decision part, the performance of the proposed RDO method with the modified Lagrange multiplier is shown in Figures 8 and 9 for spatial and quality SVC, respectively. The same coding parameters are used. "Enc full" and "Dec ROI" denote the scenarios when background slices are received and discarded, respectively. The proposed "mdrdo" method is simulated separately with the Lagrange multiplier of [27] and with the proposed $\lambda_{roi}$ (17). The "mdrdo" method with the original single layer Lagrange multiplier (11) serves as the anchor. All the values in Figures 8 and 9 are negative, which means that both Lagrange multiplier modification methods benefit ROI and enhancement layer coding compared with the original Lagrange multiplier method. Compared with the multiplier of [27], the proposed $\lambda_{roi}$ achieves a better performance for ROI, and the average gain is about 6% for spatial SVC (Figure 8(b)) and 7% for quality SVC (Figure 9(b)). Since $\lambda_{roi}$ is always smaller than the multiplier of [27], the proposed $\lambda_{roi}$ shifts the RDO process, whenever the background slices are discarded or kept.
Then, a new Lagrange multiplier estimation algorithm is derived to improve the coding efficiency of ROI slices. Compared with the existing constraint-based methods, such as the constrained motion vector method and the half-sample interpolation method, experimental results show that the proposed method achieves significant bitrate saving while maintaining both higher objective and subjective video quality.

Figure 1: Application scenarios for SVC with ROI.

Figure 2: Architecture of ROI enabled SVC coding system.

Figure 5: The proposed RD cost calculation process for a macroblock.

Figure 8: RD performance for spatial SVC with new $\lambda$.

Figure 9: RD performance for quality SVC with new $\lambda$.
Figure 10: Bitrate savings for enhancement user of proposed method; bitrate savings for ROI user of proposed method.

Figure 11: RD performance for quality SVC with proposed RDO framework.

Figure 12: Visual quality comparison under "Dec ROI" scenario for the "bus" sequence.