Efficient Real-Time Video-in-Video Insertion into a Pre-Encoded Video Stream

This work relates to the development and implementation of an efficient method and system for fast real-time Video-in-Video (ViV) insertion, enabling a video sequence to be inserted efficiently into a predefined location within a pre-encoded video stream. The proposed method and system are based on dividing the video insertion process into two steps. The first step (i.e., the Video-in-Video Constrained Format (ViVCF) encoder) includes the modification of the conventional H.264/AVC video encoder to support the visual content insertion Constrained Format (CF), including generation of isolated regions without using Flexible Macroblock Ordering (FMO) slicing, and to support the fast real-time insertion of overlays. Although the first step is computationally intensive, it needs to be performed only once, even if different overlays have to be modified (e.g., for different users). The second step, which performs the ViV insertion (i.e., the ViVCF inserter), is relatively simple (operating mostly in the bit domain) and is performed separately for each different overlay. The performance of the presented method and system is demonstrated and compared with the H.264/AVC reference software (JM 12); according to our experimental results, the bit-rate overhead is low, while there is substantially no degradation in the PSNR quality.


Introduction
Video-in-Video (ViV) insertion into a pre-encoded video sequence is a very desirable feature for various future applications, including TV services for mobile device users (such as commercial video insertion, subtitling, and advertising). However, the traditional approaches have so far failed to provide an efficient solution for the fast real-time insertion of overlays. The previous works on Video-in-Video (ViV) transcoders, such as [1][2][3] proposed by the Technion Signal and Image Processing Laboratory, use two full decoders to extract the coding domain data (i.e., motion vectors, coding modes, etc.) and the raw video sequences from both the compressed original video stream and the inserted video content. Once the extraction is complete, the desired video content is inserted into the original video stream, and the combined video sequence is then compressed by an encoder that reuses the coding domain data. According to [1][2][3], the encoder run time can be reduced by about 60% compared to the original JVT encoder (when the picture size of the inserted video content is 1/4 of the original video stream resolution), which corresponds to a saving of about 39% in the run time of the cascaded H.264/AVC encoder and decoder (based on the Relative CPU (RCPU) performance compared to the conventional H.264 decoder). However, this is still far from satisfactory, and a much more significant run-time reduction has to be achieved.
In this work, we develop and implement an efficient method and system for fast real-time Video-in-Video (ViV) insertion for H.264/AVC. With the proposed method, a video sequence is efficiently inserted into a predefined location within a pre-encoded video stream in order to provide various content (e.g., advertisements inserted into a TV video stream). According to our experimental results, the proposed ViV insertion method achieves a significant performance gain (in terms of the bit rate and insertion run time) over the conventional brute-force approaches. In addition, the proposed ViV method supports multiple rectangular overlays of various sizes (e.g., 16N × 16M sizes, where N and M are integers).
According to our ViV method, the video insertion process is performed in two steps.
(a) The first step (i.e., the ViVCF encoder) includes modification of the conventional H.264/AVC video encoder to support the visual content insertion constrained format (CF), including generation of isolated regions without using FMO slicing, and to support the fast real-time insertion of overlays. This step is computationally intensive, but it needs to be performed only once, even when different overlays have to be modified (e.g., for different users).
(b) The second step, which performs the ViV insertion, is relatively simple and not computationally intensive. This step is performed separately for each different overlay.
This work is organized as follows: Section 2 describes the H.264 baseline profile ViVCF, presenting the IPCM isolation in Section 2.1, inter-isolation in Section 2.2, Luma intra-isolation in Section 2.3, Luma intra-prediction in Section 2.4, generation of the ViV inserter profiles in Section 2.5, and the ViV inserter in Section 2.6. The experimental results and conclusions are presented in Sections 3 and 4, respectively.

H.264 Baseline Profile ViVCF
In order to meet the industry requirements, we focus the development and implementation of our proposed efficient real-time ViV insertion method and system on transferring the majority of all ViV processing to Encoder 1 (the "Mainstream") and to Encoder 2 (the "ROAD", i.e., "Region-of-Advertising"), as presented in Figure 1. This reduces the insertion process to direct ViV insertion operations, which in turn enables the ViV insertion process to consume fewer computational resources.
In the proposed scheme presented in Figure 1, the ROAD 4 × 4 region (enclosed by a thick line in the "Encoder 1" block) is intra-isolated by Intra-Pulse Code Modulation (IPCM) macroblocks (MBs). The ROAD IPCM-coded MBs (shown in grey within the "ViV Inserter" block) are placed on the top and left ROAD borders so that the ROAD MBs can be decoded independently of the original (ORIG) MBs. In turn, the ORIG IPCM-coded MBs (shown in Figure 2) are placed under the bottom and right ROAD borders so that the ORIG MBs can be decoded independently of the ROAD MBs. The advantage of the proposed IPCM isolation is the relatively easy implementation of the ROAD insertion process. In this way, the ROAD inserter is free of any complicated decoder and encoder data operations (such as MC, ME, CAVLC, CAVLD, and CABAC operations). A detailed review of the ViV inserter operations is presented in Section 2.6.

IPCM Isolation.
The H.264/AVC standard includes the Intra-Pulse Code Modulation (IPCM) macroblock mode [4], in which the values of samples are sent directly, that is, without prediction, transformation, quantization, and the like. An additional motivation for using the IPCM macroblock mode is that it allows regions of the picture to be represented without any fidelity loss. The IPCM isolation is the simplest way to avoid corruption propagation from the ROAD to the ORIG MBs, and vice versa. However, the IPCM mode is inefficient by definition (i.e., it requires 384 bytes for each 4:2:0 16 × 16 MB), so it should be used only when other MB isolation techniques are not allowed. Thus, we use the IPCM isolation to validate the proposed concept of the ROAD insertion (Figure 2 represents the general scheme of the IPCM isolation).
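The 384-byte figure quoted above follows directly from the 4:2:0 sample layout at 8 bits per sample. The short sketch below reproduces that arithmetic and, as a rough illustration, estimates the raw sample payload of an IPCM border around a ROAD. The function names are hypothetical, and the border count (one row plus one column of MBs sharing a corner) is an assumption about the layout in Figure 2, not a value taken from the text; MB header and alignment bits are ignored.

```python
def ipcm_mb_bytes(bit_depth: int = 8) -> int:
    """Raw sample payload, in bytes, of one 16x16 IPCM macroblock in 4:2:0."""
    luma_samples = 16 * 16
    # Cb and Cr are each subsampled by 2 horizontally and vertically.
    chroma_samples = 2 * (8 * 8)
    return (luma_samples + chroma_samples) * bit_depth // 8


def ipcm_border_bytes(road_w_mbs: int, road_h_mbs: int) -> int:
    """Payload of an IPCM border made of one MB row and one MB column.

    Assumption: the row and the column share a single corner MB.
    """
    border_mbs = road_w_mbs + road_h_mbs - 1
    return border_mbs * ipcm_mb_bytes()


print(ipcm_mb_bytes())          # 384
print(ipcm_border_bytes(4, 4))  # 2688
```

Even for a small 4 × 4 MB ROAD, the border alone costs a few kilobytes per frame under these assumptions, which illustrates why IPCM is reserved for cases where no other isolation technique applies.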

Inter-Isolation.
The main idea of inter-isolation, that is, isolation when inter-modes are used, is to prevent all MBs outside the ROAD area from having motion vectors that point inside the ROAD area. The motion estimation (ME) method currently implemented in the H.264/AVC JM reference software uses three functions: one for the integer search and two for the subpixel motion vector (MV) search (one for the 16 × 16 block partition and one for the other partitions). When the integer MV points to the ROAD border, the subpixel ME is disabled for the current MB partition. Figure 3 shows an example [5] of the motion vector restriction in MB partitions whose vectors originally pointed inside the ROAD area.
As a result, all those vectors were changed so that they point outside the ROAD area. Figure 4 presents the nine available 4 × 4 Luma intra-prediction modes of H.264/AVC [5,6]; the arrows in Figure 4 indicate the direction of prediction in each mode. The encoder may select, for each block, the prediction mode that minimizes the residual between the predicted block and the block to be encoded.
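The motion vector restriction described for inter-isolation can be illustrated with a simple rectangle test. The sketch below is not the JM implementation: the names `Rect`, `mv_allowed`, and `subpel_allowed` are hypothetical, vectors are in integer-pixel units, and the `margin` of 3 pixels (meant to stand in for the samples an interpolation filter would read around the reference block) is an assumption.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int
    h: int


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one pixel."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def reference_block(block: Rect, mv: Tuple[int, int]) -> Rect:
    """Area of the reference picture addressed by an integer-pel vector."""
    return Rect(block.x + mv[0], block.y + mv[1], block.w, block.h)


def mv_allowed(block: Rect, mv: Tuple[int, int], road: Rect) -> bool:
    """A block outside the ROAD must not reference samples inside it."""
    return not overlaps(reference_block(block, mv), road)


def subpel_allowed(block: Rect, mv: Tuple[int, int], road: Rect,
                   margin: int = 3) -> bool:
    """Disable sub-pixel refinement when the reference block touches the
    ROAD border, modeled here by growing the block by `margin` pixels."""
    ref = reference_block(block, mv)
    grown = Rect(ref.x - margin, ref.y - margin,
                 ref.w + 2 * margin, ref.h + 2 * margin)
    return not overlaps(grown, road)


road = Rect(32, 32, 64, 64)    # ROAD covers x, y in [32, 96)
block = Rect(96, 32, 16, 16)   # mainstream block just right of the ROAD
print(mv_allowed(block, (-1, 0), road))     # False
print(subpel_allowed(block, (0, 0), road))  # False
print(subpel_allowed(block, (8, 0), road))  # True
```

An encoder following this idea would discard disallowed candidates during the integer search, so that the best remaining vector points outside the ROAD area.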

4 × 4 Luma Intra-Isolation.
Since the MBs in each Intra-Slice depend on one another, we cannot allow the mainstream encoder (Encoder 1) to choose certain intra-modes. Otherwise, the ROAD area will be affected at least by the left and top-neighbor MBs. For this reason, we restrict the encoder to evaluating only some of the modes for the ROAD MBs and for the MBs located below and to the right of the ROAD area. Figure 5 represents an example of the 4 × 4 Luma intra-isolation process.
According to Figure 5, the following operations are performed.
(a) Applying the VERTICAL, VERTICAL-LEFT, and DIAGONAL DOWN-LEFT intra-prediction modes for all 4 × 4 blocks adjacent to the RIGHT ROAD border (vertically positioned MBs, as depicted in Figure 5).
(b) [...] adjacent to the BOTTOM ROAD border (horizontally positioned MBs, as depicted in Figure 5).
(c) Applying the VERTICAL, VERTICAL-LEFT, DIAGONAL DOWN-LEFT, HORIZONTAL, and HORIZONTAL-UP intra-prediction modes for the 4 × 4 block adjacent to the BOTTOM-RIGHT ROAD border (e.g., the MB having the "54" value, as depicted in Figure 5).
[...] Figure 6, which presents an example of the 16 × 16 Luma intra-isolation process as follows.

ISRN Signal Processing
(a) Applying the VERTICAL intra-prediction mode for all MBs adjacent to the RIGHT ROAD border (vertically positioned MBs, as depicted in Figure 6).
(b) Applying the HORIZONTAL intra-prediction mode for all MBs adjacent to the BOTTOM ROAD border (horizontally positioned MBs, as depicted in Figure 6).
Similarly to the mainstream encoder (Encoder 1), the ROAD encoder (Encoder 2) should also isolate several MBs. For this purpose, we restrict several encoder modes for the ROAD MBs that are adjacent to the left and top image borders, as presented, for example, in Figure 7.
According to Figure 7, the following operations are performed.
[...] intra-prediction modes for all LEFT ROAD 4 × 4 blocks (vertically positioned MBs, as depicted in Figure 7).
Further, we disable several 16 × 16 search directions for the MBs neighboring the top and left image borders, as shown in Figure 8, which presents an example of the 16 × 16 Luma intra-isolation process as follows.
(a) Coding the TOP-LEFT MBs in the IPCM mode for the I-slice, or inter-coding these MBs for the P-slice.
(b) Applying the VERTICAL intra-prediction mode for all LEFT ROAD MBs (vertically positioned MBs, as depicted in Figure 8).
(c) Applying the HORIZONTAL intra-prediction mode for all TOP ROAD MBs (horizontally positioned MBs, as depicted in Figure 8).
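The 16 × 16 intra-isolation rules listed above for both encoders can be summarized as a small lookup. This is a minimal sketch under stated assumptions: the function names are hypothetical, positions are in MB units, and "adjacent to the RIGHT/BOTTOM ROAD border" is interpreted as the MB column immediately to the right of the ROAD and the MB row immediately below it. A return value of `None` means this sketch imposes no restriction.

```python
from typing import Optional


def mainstream_i16_mode(mb_x: int, mb_y: int, road_x: int, road_y: int,
                        road_w: int, road_h: int) -> Optional[str]:
    """Forced Intra 16x16 mode for a mainstream MB (Encoder 1)."""
    in_road_rows = road_y <= mb_y < road_y + road_h
    in_road_cols = road_x <= mb_x < road_x + road_w
    if mb_x == road_x + road_w and in_road_rows:
        return "VERTICAL"    # right of the ROAD: predict from above only
    if mb_y == road_y + road_h and in_road_cols:
        return "HORIZONTAL"  # below the ROAD: predict from the left only
    return None


def road_i16_mode(mb_x: int, mb_y: int, slice_type: str) -> Optional[str]:
    """Forced coding choice for a ROAD MB (Encoder 2), in ROAD coordinates."""
    if mb_x == 0 and mb_y == 0:
        # Top-left MB: IPCM in an I-slice, inter-coded in a P-slice.
        return "IPCM" if slice_type == "I" else "INTER"
    if mb_x == 0:
        return "VERTICAL"    # left ROAD column
    if mb_y == 0:
        return "HORIZONTAL"  # top ROAD row
    return None


# A 4x4-MB ROAD whose top-left MB sits at (4, 4) in the mainstream frame.
print(mainstream_i16_mode(8, 5, 4, 4, 4, 4))  # VERTICAL
print(mainstream_i16_mode(5, 8, 4, 4, 4, 4))  # HORIZONTAL
print(road_i16_mode(0, 0, "P"))               # INTER
```

In both encoders the forced mode draws its prediction only from samples that belong to the same stream, which is what keeps the two regions independently decodable.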

16 × 16 Luma Intra-Prediction.
As an alternative to the 4 × 4 Luma intra-isolation described in Section 2.3.1 above, the entire 16 × 16 Luma component of a macroblock may be predicted [4,5]. For this purpose, four modes can be used, as shown in Figure 9.

Generation of the ViV Inserter Profiles.
According to the proposed method, the profiles are generated by the encoders (for the "Mainstream" and the "ROAD", Figure 1) in order to enable the ROAD insertion process.

Mainstream Profiles Generation.
To achieve easy and fast operation of the ViV inserter, the mainstream encoder ("Encoder 1", Figure 1) should generate and update at least five different profiles, as illustrated in Figure 10.
The first profile (provided as a "profiler 1.dat" file for each compressed frame of the mainstream) includes a set of bit pointers, which determine what portion of a video stream should be copied or skipped. This profile also includes flags that indicate when the remaining four profiles (out of five) should be used. The second profile is provided as a "profiler 2.dat" file and includes the number of nonzero (NNZ) DCT coefficients for each 4 × 4 mainstream macroblock that is adjacent to the top-left borders of the ROAD outside area. The third profile is provided as a "profiler 3.dat" file and includes the 4 × 4 Luma intra-prediction modes. The fourth profile is provided as a "profiler 4.dat" file and is used for the motion vectors, which should be updated according to the predefined motion vector restrictions. Finally, the fifth profile (provided as a "profiler 5.dat" file) is used for the baseline encoder mode and includes the Quantization Parameter (QP) of the left outside borders of the ROAD area.
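To make the role of the first profile more concrete, the sketch below shows how a list of bit pointers could drive a copy-or-skip pass over one compressed frame. The actual layout of the ".dat" files is not specified here, so the `Segment` record, its fields, and the representation of the bitstream as a string of '0'/'1' characters are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_bit: int  # first bit of the segment (inclusive)
    end_bit: int    # one past the last bit of the segment (exclusive)
    copy: bool      # True: copy to the output, False: skip


def apply_bit_pointers(frame_bits: str, segments: List[Segment]) -> str:
    """Copy or skip portions of a frame as directed by the bit pointers."""
    out = []
    for seg in segments:
        if not 0 <= seg.start_bit <= seg.end_bit <= len(frame_bits):
            raise ValueError(f"segment out of range: {seg}")
        if seg.copy:
            out.append(frame_bits[seg.start_bit:seg.end_bit])
    return "".join(out)


frame = "1111000011110000"
profile_1 = [
    Segment(0, 4, copy=True),    # MBs before the ROAD: copied as is
    Segment(4, 12, copy=False),  # bits to be replaced by the overlay: skipped
    Segment(12, 16, copy=True),  # MBs after the ROAD: copied as is
]
print(apply_bit_pointers(frame, profile_1))  # 11110000
```

Because the pass only slices and concatenates bit ranges, it needs no entropy decoding, which matches the description of the inserter as operating mostly in the bit domain.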

ROAD Profiles Generation.
The ROAD encoder ("Encoder 2", Figure 1) uses the same profile generation approach as the mainstream encoder ("Encoder 1"). The difference lies in the specific macroblocks that should be updated, as illustrated in Figure 11. Thus, in Profile 1, we specify the bit-counter position of the first ROAD macroblock (MB) of each macroblock line of the ROAD frame (i.e., macroblocks no. 0, 4, 8, and 12). Further, in Profile 5, we specify the quantization parameter (QP) at the end of each such macroblock line (this is necessary because the QPs of adjacent ROAD and mainstream macroblocks can differ). On the other hand, at the bottom and right borders of the ROAD, we [...]

ViV Inserter.
The following Sections 2.6.1 and 2.6.2 present the ViV inserter instructions and the implementation of the ViV inserter, respectively.

ViV Inserter Instructions.
The proposed isolation schemes and the corresponding profile generation (as described in Sections 2.1 to 2.5 above) make it possible to distinguish between four different ViV insertion locations, as depicted in Figure 12, which presents four different cases. "Case 1" refers to the general insertion scheme, where the ROAD video can be placed in the majority of the mainstream zones. For this case, we can change the ROAD area location (presented by the bright-colored blocks) according to the dark-colored MBs, as also shown by arrows.
In addition to the above general case ("Case 1"), there are three special ROAD position cases. In "Case 2", the right border of the ROAD area coincides with the mainstream right border, which requires a slight change of the isolation scheme. The same approach is used in the remaining two cases: in "Case 3", the bottom borders coincide, and in "Case 4", both the right and the bottom borders coincide. The indication of the particular case number ("Case 1", "2", "3", or "4") is conveyed to the ViV inserter within the profile ".dat" file. In turn, the ViV inserter uses this indication to select the corresponding insertion scheme.
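The distinction between the four cases depends only on whether the ROAD touches the right and/or bottom border of the mainstream frame. A minimal sketch of that classification follows; the function name and the use of MB units for all arguments are assumptions, and in the described system the case number is read from the profile file rather than recomputed by the inserter.

```python
def insertion_case(main_w: int, main_h: int,
                   road_x: int, road_y: int,
                   road_w: int, road_h: int) -> int:
    """Return the insertion case (1 to 4); all arguments are in MB units."""
    if road_x < 0 or road_y < 0 or \
            road_x + road_w > main_w or road_y + road_h > main_h:
        raise ValueError("ROAD does not fit inside the mainstream frame")
    at_right = road_x + road_w == main_w
    at_bottom = road_y + road_h == main_h
    if at_right and at_bottom:
        return 4  # both the right and bottom borders coincide
    if at_right:
        return 2  # right borders coincide
    if at_bottom:
        return 3  # bottom borders coincide
    return 1      # general case


# CIF mainstream (352x288 pixels, i.e. 22x18 MBs) with a 4x4-MB ROAD.
print(insertion_case(22, 18, 5, 5, 4, 4))    # 1
print(insertion_case(22, 18, 18, 5, 4, 4))   # 2
print(insertion_case(22, 18, 5, 14, 4, 4))   # 3
print(insertion_case(22, 18, 18, 14, 4, 4))  # 4
```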

Implementation of the ViV Inserter.
According to the proposed ViVCF scheme (as presented in Figure 1), the ViV inserter has four major inputs: two H.264 ViVCF coded video streams (i.e., "mainstream.264" and "ROAD.264"), and two sets of description profiles (i.e., the "profilers.dat" files provided by Encoder 1 and Encoder 2). Once the ViV inserter has received these two streams and their corresponding description profiles, it can initiate the ViV insertion process. Figure 13 shows the MB map of a typical H.264/AVC ViVCF frame.
This map represents all possible MBs (shown in various gray-scale colors) that may be affected by the isolation process. Only the affected MBs need special processing in the ViV insertion process. The required ViV insertion instructions, flags, and relevant coefficients are provided by the set of profiles within the profile ".dat" files, as described in Sections 2.1 to 2.5 above. All other (nonaffected) MBs are copied from the two H.264/AVC CF streams. The copy process is performed according to the bit counters, which can be provided within the "profile 1.dat" file (Figure 1). As a result, this video-in-video insertion scheme makes it possible to place the ROAD video stream into any zone (portion) of the mainstream, according to Cases 1-4 presented in Figure 12 above. In turn, the ViV inserter is able to select the corresponding insertion scheme. Further, in Figure 14, we present an example of the proposed ViV inserter implementation, which divides both the given mainstream and ROAD frames into a number of virtual slices. Each virtual slice includes all required data: the start and end bit counters in the given frame, the stream type, and the corresponding variables, which can be updated from other profiles.
Table 5: Experimental results for the "NEWS", "TEMPETE", "PARIS", "ICE", "CREW", "BUS", and "MOBILE" video sequences.
In addition, Figure 15 schematically illustrates an example of the ViV insertion process at the MB level. This process includes five major steps as follows.

(a) For P SLICE: since the "mb_skip_run" parameter does not exist in the first ROAD MB, when the previous MB in the mainstream is coded, a corresponding parameter (i.e., the bit "1") is added.
[...]
Similarly, Table 4 presents experimental results for the "CREW" video sequence (average bit-rate overhead: 2.7% compared to the bit rate of 972.8 Kbits/sec of JM 12). Table 5 summarizes the experimental results (obtained under the test conditions specified in Table 1) for seven different video sequences: "NEWS", "TEMPETE", "PARIS", "ICE", "CREW", "BUS", and "MOBILE".

Experimental Results
As is clearly seen from the experimental results above, the proposed ViV insertion scheme incurs a low bit-rate overhead, which varies from 0.1% to 3.8% only. Also, based on the conducted experiments, the average PSNR quality remains substantially the same compared to JM 12: NEWS: 40.1 [...]
It should be noted that the proposed method for performing the efficient real-time Video-in-Video (ViV) insertion can be implemented in a similar manner for Scalable Video Coding (SVC) schemes [8][9][10][11][12], and particularly for Region-of-Interest (ROI) video-coding systems and applications [13][14][15][16][17][18][19].

Conclusions
In this work, we presented an efficient method for fast real-time Video-in-Video (ViV) insertion, which enables a video sequence to be inserted efficiently into a predefined location within a pre-encoded video stream (e.g., for inserting advertisements into a TV stream). The proposed ViV insertion method and system achieve a significant performance gain over the conventional brute-force approaches in terms of the bit rate and insertion run time. Also, the proposed ViV insertion method and system support multiple rectangular overlays of various sizes (e.g., 16N × 16M sizes, where N and M are integers). According to the experimental results, the bit-rate overhead is low (up to 3.8%), while there is substantially no degradation in the PSNR quality.