Performing optimal bit-allocation with 3D wavelet coding methods is difficult because energy is not conserved after applying the motion-compensated temporal filtering (MCTF) process and the spatial wavelet transform. The problem has been solved by extending the 2D wavelet coefficient weighting method to account for the complicated pixel connectivity resulting from the lifting-based MCTF process. However, the computation of the weighting factors demands much computing time and a vast amount of memory. In this paper, we propose a method that exploits the sparsity of the matrices involved to compute the subband weighting factors efficiently. The proposed method is applied to both the 5–3 and the 9–7 temporal filters in the lifting-based MCTF process. Experiments on various video sequences show that the computing time and memory needed for the computation of the weighting factors are dramatically reduced by the proposed approach.
1. Introduction
Scalable video coding, which is developed to support multiple resolutions in a single stream, is an excellent solution to the compression task of video broadcasting over heterogeneous networks [1, 2]. Compared with H.264/SVC [3], which was recently developed on top of the prevailing conventional closed-loop codec H.264/AVC, the multiresolution property of the 3D wavelet representation based on motion-compensated temporal filtering (MCTF) is a more natural solution to the scalability issue in video coding [4–7]. However, to compete with the great success of scalable coding methods based on H.264, the MCTF-based 3D wavelet video codec must be constantly improved.
For both H.264/AVC and wavelet-based encoders, the coding performance depends greatly on the technique used to solve the optimal bit-allocation problem. The discrete cosine transform (DCT), the orthogonal transform used in H.264/AVC, benefits the search for the optimal bit-allocation because the variances of the distortions caused by quantization are equal in the pixel and frequency domains. However, the wavelet transforms used in video/image coding (e.g., the 5–3 and 9–7 filters) are usually not orthogonal, so the variances of the distortions in the pixel and frequency domains are not the same. For image coding, the problem has been elegantly solved [8, 9] by assigning different weights to the subbands so that the reconstruction error variance equals the quantization error variance in the frequency domain. The method was directly extended to 3D wavelet coding [10], but the results are not satisfactory, because the energy difference between the pixel and frequency domains results not only from using a biorthogonal wavelet but also from the MCTF process, which imposes a different connectivity status (single-connected, multiple-connected, or unconnected) on each pixel during motion prediction. Thus, the direct extension of the methods in [8, 9] from 2D to 3D wavelet coefficients cannot solve the 3D wavelet optimal bit-allocation problem, because it does not take the pixel connectivity in the MCTF process into account.
To fix this problem, a method that accounts for both the biorthogonal property of the wavelet and the pixel connectivity resulting from motion compensation in the MCTF process was proposed to equalize the distortion variances in the pixel and frequency domains by assigning weights to the subbands, thereby solving the bit-allocation problem in 3D wavelet coding [11]. Although the results show that the method is effective, computing the weighting factors requires a great deal of time and memory, which makes the method a heavy load on the encoding process. To make the computation of the weighting factors more efficient, we must reduce the time and memory the process consumes. For this purpose, this paper proposes a special data structure and the corresponding operations to save time and memory during the computation of the subband weighting factors.
The remainder of this paper is organized as follows. In Section 2, we briefly review the MCTF-EZBC (embedded zero block coding) [12] wavelet coding scheme. In Section 3, we summarize how the spatial and temporal weighting factors are derived. In Section 4, we explain the proposed method for saving time and memory during the computation of the subband weighting factors. In Section 5, we report the experimental results comparing the time and memory used by the traditional method and by our method. Finally, the conclusion is given in Section 6.
2. An MCTF-EZBC Wavelet Video Coding Scheme
In an MCTF-EZBC wavelet video coding scheme, the video frames are first decomposed into multiple wavelet subbands by spatial and temporal wavelet transforms; then quantization and entropy coding are applied sequentially to the wavelet subbands. According to the order of the spatial and temporal decompositions, wavelet coding schemes fall into two categories: 2D+T (spatial filtering first) and T+2D (temporal filtering first). However, regardless of whether the 2D+T or the T+2D scheme is applied, the spatial and temporal filterings can be described independently.
The purpose of spatial filtering is to separate the low and high frequency coefficients of a video frame. The spatial filtering usually consists of multiple sequential 2D wavelet decompositions. In a 2D wavelet decomposition, the input signal, represented by an N by N matrix, is decomposed into four N/2 by N/2 matrices, denoted LL, HL, LH, and HH. The first letter indicates whether the subband contains the low (L) or high (H) frequency coefficients after the horizontal 1D wavelet transform, and the second letter indicates whether it contains the low (L) or high (H) frequency coefficients after the vertical 1D wavelet transform. After the decomposition, the LL subband is taken as the input of the next-level spatial decomposition.
If we let ℋk0 and ℋk1 denote the analyzing matrices in the k-th spatial decomposition, the corresponding wavelet subbands can be computed as

$$\begin{bmatrix} F_{k0} & F_{k1} \\ F_{k2} & F_{k3} \end{bmatrix} = \begin{bmatrix} \mathcal{H}_{k0} \\ \mathcal{H}_{k1} \end{bmatrix} F_{(k-1)0} \begin{bmatrix} \mathcal{H}_{k0}^T & \mathcal{H}_{k1}^T \end{bmatrix},$$
where Fk0, Fk1, Fk2, and Fk3 correspond to the LL, HL, LH, and HH subbands, respectively. If another spatial decomposition is applied to the frame, the subband Fk0 is further decomposed by the (k+1)-th analyzing matrices ℋ(k+1)0 and ℋ(k+1)1. According to this definition, the subband F00 is the frame before any spatial decomposition, and any wavelet subband after the spatial filtering can be indexed by xy, which indicates that the subband is the y-th subband after the x-th spatial decomposition. Figure 1(a) shows the wavelet subbands and their indices after a spatial filtering consisting of two spatial decompositions.
An example of indexing the spatially and temporally decomposed subbands. The upper figure shows the indices of the spatially decomposed subbands, in which the subband indexed by (x,y) is the y-th subband after the x-th spatial decomposition. The lower figure shows the indices of the temporally decomposed subbands, in which the subband indexed by (m,n) is the n-th subband after the m-th temporal decomposition. Accordingly, any subband obtained by the MCTF can be indexed by (xy,mn).
To reconstruct F(k-1)0 from the wavelet subbands Fk0, Fk1, Fk2, and Fk3, the synthesis matrices 𝒢k0 and 𝒢k1 are used in the synthesis procedure, which can be represented by matrix operations as

$$F_{(k-1)0} = \begin{bmatrix} \mathcal{G}_{k0} & \mathcal{G}_{k1} \end{bmatrix} \begin{bmatrix} F_{k0} & F_{k1} \\ F_{k2} & F_{k3} \end{bmatrix} \begin{bmatrix} \mathcal{G}_{k0}^T \\ \mathcal{G}_{k1}^T \end{bmatrix}.$$
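As a concrete illustration, the analysis and synthesis steps above can be sketched in a few lines of NumPy. The orthonormal Haar filter is used purely as a stand-in for the actual 5–3 or 9–7 analysis/synthesis matrices (an assumption for illustration only), so the synthesis matrices are simply the transposes of the analysis matrices:

```python
import numpy as np

# One-level 2D wavelet analysis and synthesis in matrix form.
# Haar stands in for the 5-3/9-7 filters, so G = H^T here.
N = 4
H0 = np.zeros((N // 2, N))  # low-pass analysis matrix H_k0
H1 = np.zeros((N // 2, N))  # high-pass analysis matrix H_k1
for i in range(N // 2):
    H0[i, 2 * i] = H0[i, 2 * i + 1] = 1 / np.sqrt(2)
    H1[i, 2 * i], H1[i, 2 * i + 1] = 1 / np.sqrt(2), -1 / np.sqrt(2)

F = np.arange(N * N, dtype=float).reshape(N, N)  # frame F_{(k-1)0}
LL, HL = H0 @ F @ H0.T, H0 @ F @ H1.T            # F_k0, F_k1
LH, HH = H1 @ F @ H0.T, H1 @ F @ H1.T            # F_k2, F_k3

G0, G1 = H0.T, H1.T  # synthesis matrices G_k0, G_k1 (orthonormal case)
F_rec = G0 @ LL @ G0.T + G0 @ HL @ G1.T + G1 @ LH @ G0.T + G1 @ HH @ G1.T
assert np.allclose(F, F_rec)  # perfect reconstruction
```

For a biorthogonal filter such as the 5–3, 𝒢 would instead be the synthesis matrices of that filter, and the same reconstruction identity holds.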
Compared with the spatial filtering, which takes a single frame as its input, the temporal filtering takes multiple frames (T+2D) or the subbands with the same spatial indices (2D+T) as its input. The temporal filtering consists of multiple temporal decompositions. However, unlike the spatial filtering, which only needs several 2D wavelet transforms, the temporal filtering needs multiple 1D wavelet transforms together with motion estimation/compensation to specify the input coefficients of each wavelet transform. Thus, the temporal filtering is more complicated than the spatial filtering.
Usually, a temporal decomposition is implemented by the lifting structure, which consists of a predicting stage and an updating stage. As depicted in Figure 2, the H-frames (denoted by h) are first obtained in the predicting stage; then the L-frames (denoted by l) are obtained in the updating stage. The lifting structure can be described by matrix operations in which a frame is represented by a one-dimensional vector f. If each input frame of the temporal filtering has size N by N and is represented by an N by N matrix F, the vector f has length N² and is mapped from F by letting f[i·N+j] = F[i,j].
An example of the lifting structure used in the temporal decomposition. The subbands h, which contain high frequency coefficients, are first obtained after the prediction stage. Then, the subbands l, which contain the low frequency coefficients, are obtained after the update stage.
The motion vectors are computed before the predicting and updating stages. We use P_m^{x,y}, an N² by N² matrix, to denote the motion vectors obtained by predicting the y-th frame from the x-th frame in the m-th temporal decomposition. Then, the predicting stage of the m-th temporal decomposition can be written as

$$h_m^{2i+1} = f_{m-1}^{2i+1} - \left( \mathcal{H}_m[2i]\, P_m^{2i,2i+1}\, f_{m-1}^{2i} + \mathcal{H}_m[2i+2]\, P_m^{2i+2,2i+1}\, f_{m-1}^{2i+2} \right),$$
where ℋm is the corresponding coefficient of the wavelet filter. Since the index m represents the m-th lifting stage, the term fm-1i represents the i-th frame which is obtained after the (m-1)-th lifting stage. If the 5–3 wavelet transform is used, the value of ℋm in the predicting stage is −0.5.
In the updating stage, the inverse motion vector matrix U_m^{y,x}, which has the same size as P_m^{x,y} and can be calculated from it [13], is used to compute the decomposed L-frame as

$$l_m^{2i} = f_{m-1}^{2i} + \left( \mathcal{H}_m[2i+1]\, U_m^{2i+1,2i}\, h_m^{2i+1} + \mathcal{H}_m[2i-1]\, U_m^{2i-1,2i}\, h_m^{2i-1} \right),$$
where ℋm is the corresponding coefficient of the wavelet filter. If the 5–3 wavelet transform is used, the value of ℋm in the updating stage is 1.
To illustrate how motion vectors are represented with a matrix, an example is given in Figure 3. The two matrices are constructed as follows:

$$P^{2i,2i+1} = \begin{bmatrix} 0&1&0&0&0&0\\ 0&1&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&0&1&0\\ 0&0&0&0&0&1\\ 0&0&0&0&0&1 \end{bmatrix}, \qquad U^{2i+1,2i} = \begin{bmatrix} 0&0&0&0&0&0\\ 1&0&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&1&0&0\\ 0&0&0&0&1&0 \end{bmatrix}.$$
An example of MCTF motion estimation. The pixel connectivity types in the example are single-connected, multiple-connected, and unconnected. The corresponding prediction (P) and update (U) matrices are given in (5), where P2i,2i+1[x,y]=1 indicates that the x-th pixel in frame f2i+1 is predicted by the y-th pixel in frame f2i. Note that U2i+1,2i[x,y]=1 means the x-th pixel in frame f2i is updated by the y-th pixel in frame f2i+1.
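The two matrices in (5) can also be checked numerically. The sketch below (an illustration, not part of the codec) builds P and U and recovers the connectivity status of each pixel of f2i from the column sums of P:

```python
import numpy as np

# Prediction and update matrices from Eq. (5) for 6-pixel frames.
P = np.array([[0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1]], dtype=float)
U = np.array([[0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 1, 0]], dtype=float)

# P[x, y] = 1: pixel x of f^{2i+1} is predicted from pixel y of f^{2i}.
# Column sums of P classify the pixels of f^{2i}:
# 0 -> unconnected, 1 -> single-connected, >1 -> multiple-connected.
uses = P.sum(axis=0).astype(int).tolist()
print(uses)  # [0, 2, 1, 0, 1, 2]: pixels 1 and 5 are multiple-connected

# Each updated pixel of f^{2i} is driven by exactly one pixel of h^{2i+1}.
assert set(U.sum(axis=1).astype(int).tolist()) <= {0, 1}
```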
After the m-th temporal decomposition, if the temporal filtering requires another temporal decomposition, the L-frames generated by the current decomposition are taken as its input. So we let f_m^i, the i-th input frame of the (m+1)-th temporal decomposition, be l_m^{2i}, the 2i-th output frame of the m-th temporal decomposition. Although the H-frames do not participate in the (m+1)-th temporal decomposition, we still assign indices to them by letting f_m^{i+S} = h_m^{2i+1}, in which S is the number of input frames of the (m+1)-th temporal decomposition. Thus, any frame obtained by the temporal filtering can be indexed by mn, which indicates that it is the n-th subband after the m-th temporal decomposition.
Usually, the temporal decompositions are performed sequentially until only one L-frame remains. Since decomposed frames can be synthesized, only the frames that cannot be synthesized are necessary for reconstructing all the frames. Figure 1(b) shows, for a group of pictures (GOP) of size 8, the indices of the frames that cannot be synthesized from the decomposed frames.
To recover the original frames, synthesis is applied to the decomposed frames: the frame decomposed last is synthesized first, and vice versa. In the synthesis procedure, the inverse updating is performed first as

$$f_{m-1}^{2i} = l_m^{2i} - \left( \mathcal{G}_m[2i+1]\, U_m^{2i+1,2i}\, h_m^{2i+1} + \mathcal{G}_m[2i-1]\, U_m^{2i-1,2i}\, h_m^{2i-1} \right),$$
where 𝒢m is the coefficient of the wavelet transform used in the inverse updating stage. If the temporal 5–3 filter is used, the value of 𝒢m is −1 in the inverse updating stage. After the inverse updating stage, the inverse predicting stage is performed as

$$f_{m-1}^{2i+1} = h_m^{2i+1} + \left( \mathcal{G}_m[2i]\, P_m^{2i,2i+1}\, f_{m-1}^{2i} + \mathcal{G}_m[2i+2]\, P_m^{2i+2,2i+1}\, f_{m-1}^{2i+2} \right).$$
If the temporal 5–3 filter is used, the value of 𝒢m is 0.5 in the inverse predicting stage. For some wavelet filters, such as the 9–7 filter, a temporal decomposition needs a multilevel lifting structure, which can easily be obtained by cascading multiple one-level lifting structures.
After the spatial and temporal filterings, quantization and entropy coding are applied to the wavelet subbands. The coefficients of the subbands may be quantized by scalar or vector quantization, and the quantized coefficients are then losslessly coded by the entropy coder. This approach is common in still-image compression standards such as JPEG 2000 [14]. In the decoding process, the quantized coefficients are obtained by decoding the received bitstream, and the subbands are then rescaled according to the quantization step used in the encoder. The advantage of separating the quantization and the entropy coding is that the quality of the reconstructed video can be predicted from the quantization step.
The quantization and the entropy coding can also be combined in a bitplane coding method, such as the EZBC entropy coder [12]. In such methods, the rates allocated to the subbands are calculated first, and the entropy coder then encodes the subbands at these rates. The advantage of this scheme is that the rates of the subbands can be any nonnegative integers, so the performance of the bit-allocation can be improved accordingly.
Whether the quantization and entropy coding are combined or separate, the bit-allocation greatly affects the coding efficiency. However, because the energy in the pixel domain can be altered by the spatial wavelet transform, the temporal wavelet transform, and the motion estimation in the MCTF process, the bit-allocation is prevented from achieving the optimum. To preserve the energy between the pixel and the wavelet domains, we derive the weighting factors of the decomposed subbands, as explained in the next section.
3. Subband Weighting
The weighting factor indicates how much a unit quantization power in a subband contributes to the overall distortion of the reconstructed GOP. That is, for the subband indexed by (xy,mn), which is the y-th subband after the x-th spatial decomposition and the n-th subband after the m-th temporal decomposition (the indexing is depicted in Figure 1), a weighting γ(xy,mn) is given to satisfy

$$D = \sum_{(xy,mn)\in S} \gamma(xy,mn)\, D_{xy,mn},$$
where D is the variance of the distortion in the pixel domain, D_{xy,mn} is the variance of the subband's distortion in the wavelet domain, and S is the set of indices of the subbands that cannot be synthesized during the reconstruction. The weighting factor γ(xy,mn) can be computed by

$$\gamma(xy,mn) = \alpha(xy) \times \beta(mn),$$
where α(xy) is the spatial weighting factor and β(mn) is the temporal weighting factor with respect to the subband indexed by (xy,mn).
The spatial weighting factors can be computed directly from the error propagation model in [15]. To compute the temporal weighting factors, however, the effect of the motion compensation must also be considered, following the approach in [11]. In the following subsections, we explain how to derive the spatial and the temporal weighting factors, respectively.
3.1. Spatial Subband Weighting
In this subsection, we review the method used to derive the weighting factors of the spatial wavelet transforms in [15]. According to (2), the reconstructed subband F(k-1)0 can be written as

$$F_{(k-1)0} = \mathcal{G}_{k0} F_{k0} \mathcal{G}_{k0}^T + \mathcal{G}_{k0} F_{k1} \mathcal{G}_{k1}^T + \mathcal{G}_{k1} F_{k2} \mathcal{G}_{k0}^T + \mathcal{G}_{k1} F_{k3} \mathcal{G}_{k1}^T.$$
Let ΔF denote the error matrix resulting from the quantization of an image F; according to (10), we have

$$\Delta F_{(k-1)0} = \mathcal{G}_{k0}\, \Delta F_{k0}\, \mathcal{G}_{k0}^T + \mathcal{G}_{k0}\, \Delta F_{k1}\, \mathcal{G}_{k1}^T + \mathcal{G}_{k1}\, \Delta F_{k2}\, \mathcal{G}_{k0}^T + \mathcal{G}_{k1}\, \Delta F_{k3}\, \mathcal{G}_{k1}^T.$$
Using the property of Kronecker products,

$$V = A U B^T \implies v = (A \otimes B)\, u,$$
where v and u are column vectors constructed row-wise from the matrices V and U, respectively. Applying this identity to (11), we have

$$\Delta c_{(k-1)0} = (\mathcal{G}_{k0}\otimes\mathcal{G}_{k0})\Delta c_{k0} + (\mathcal{G}_{k0}\otimes\mathcal{G}_{k1})\Delta c_{k1} + (\mathcal{G}_{k1}\otimes\mathcal{G}_{k0})\Delta c_{k2} + (\mathcal{G}_{k1}\otimes\mathcal{G}_{k1})\Delta c_{k3},$$
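The Kronecker identity above is easy to verify numerically; note that row-wise vectorization matches NumPy's default C-order flattening:

```python
import numpy as np

# Check V = A U B^T  <=>  v = (A kron B) u for row-wise vectorization.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 6))
U = rng.standard_normal((4, 6))

V = A @ U @ B.T
v = np.kron(A, B) @ U.flatten()     # row-wise (C-order) vectorization
assert np.allclose(v, V.flatten())
```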
where Δc is the column vector constructed row-wise from the matrix ΔF. The reconstruction mean square error (MSE) of an image F is

$$\sigma_c^2 = \frac{1}{N^2} E\{\Delta c^T \Delta c\}.$$
The equation can be solved under the high-bit-rate assumption. At a high bit rate, the quantization errors of the wavelet subbands are assumed to be white and mutually uncorrelated [16]. Thus, for the vector representations of the errors in subbands x and y, the following identities are obtained:

$$E\{\Delta x^T \Delta x\} = E\{\mathrm{tr}(\Delta x\, \Delta x^T)\} = N^2 \sigma_x^2, \qquad E\{\Delta x^T \Delta y\} = E\{\mathrm{tr}(\Delta x\, \Delta y^T)\} = 0 \quad \text{when } x \neq y,$$
where σx² is the MSE of any element of the vector Δx. By substituting (13) into (14) and applying the identities in (15), we obtain

$$\begin{aligned}\sigma_{c_{(k-1)0}}^2 &= \frac{1}{N^2} E\{\Delta c_{(k-1)0}^T\, \Delta c_{(k-1)0}\} = \frac{1}{N^2} E\{\mathrm{tr}(\Delta c_{(k-1)0}\, \Delta c_{(k-1)0}^T)\}\\ &= \frac{1}{N^2}\Big[\sigma_{c_{k0}}^2\, \mathrm{tr}\big((\mathcal{G}_{k0}\otimes\mathcal{G}_{k0})(\mathcal{G}_{k0}\otimes\mathcal{G}_{k0})^T\big) + \sigma_{c_{k1}}^2\, \mathrm{tr}\big((\mathcal{G}_{k0}\otimes\mathcal{G}_{k1})(\mathcal{G}_{k0}\otimes\mathcal{G}_{k1})^T\big)\\ &\qquad\quad + \sigma_{c_{k2}}^2\, \mathrm{tr}\big((\mathcal{G}_{k1}\otimes\mathcal{G}_{k0})(\mathcal{G}_{k1}\otimes\mathcal{G}_{k0})^T\big) + \sigma_{c_{k3}}^2\, \mathrm{tr}\big((\mathcal{G}_{k1}\otimes\mathcal{G}_{k1})(\mathcal{G}_{k1}\otimes\mathcal{G}_{k1})^T\big)\Big]\\ &= \alpha'(k0)\,\sigma_{c_{k0}}^2 + \alpha'(k1)\,\sigma_{c_{k1}}^2 + \alpha'(k2)\,\sigma_{c_{k2}}^2 + \alpha'(k3)\,\sigma_{c_{k3}}^2.\end{aligned}$$
The above derivation shows that the MSE measured before the wavelet transform is a weighted sum of the subbands' MSEs after the wavelet transform; note that the weighting factors are determined only by the filters. As discussed in the previous section, the subband indexed by (k0) is further decomposed in the (k+1)-th spatial decomposition, so the spatial weighting factors of a multilevel decomposition can be obtained by recursive substitution. To summarize, the MSE of the original frame can be computed as

$$\begin{aligned}\sigma_{c_{00}}^2 &= \alpha'(10)\sigma_{c_{10}}^2 + \alpha'(11)\sigma_{c_{11}}^2 + \alpha'(12)\sigma_{c_{12}}^2 + \alpha'(13)\sigma_{c_{13}}^2\\ &= \alpha'(10)\big\{\alpha'(20)\sigma_{c_{20}}^2 + \alpha'(21)\sigma_{c_{21}}^2 + \alpha'(22)\sigma_{c_{22}}^2 + \alpha'(23)\sigma_{c_{23}}^2\big\} + \alpha'(11)\sigma_{c_{11}}^2 + \alpha'(12)\sigma_{c_{12}}^2 + \alpha'(13)\sigma_{c_{13}}^2\\ &= \left\{\sum_{k=1}^{s}\sum_{i=1}^{3}\left[\prod_{j=1}^{k-1}\alpha'(j0)\right]\alpha'(ki)\,\sigma_{c_{ki}}^2\right\} + \left\{\prod_{j=1}^{s}\alpha'(j0)\right\}\sigma_{c_{s0}}^2,\end{aligned}$$
in which s is the total number of spatial wavelet transforms. Thus, the multilevel spatial weighting factor α(xy) can be computed from the one-level spatial weighting factors α'(xy) as

$$\alpha(xy) = \begin{cases}\left[\prod_{j=1}^{x-1}\alpha'(j0)\right]\alpha'(xy), & x \in \{1,2,\dots,s\},\; y \in \{1,2,3\},\\[4pt] \prod_{j=1}^{s}\alpha'(j0), & x = s,\; y = 0.\end{cases}$$
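The spatial weighting computation above can be sketched as follows. The orthonormal Haar synthesis matrices stand in for the actual 5–3/9–7 matrices (an assumption for illustration), so every one-level weight α′ comes out as 1/4 and the multilevel weights sum to 1:

```python
import numpy as np

# One-level weights alpha' from the Kronecker traces, then multilevel
# weights by the recursive products above. Haar stands in for 5-3/9-7,
# and the same filter is assumed at every decomposition level.
N = 4
G0, G1 = np.zeros((N, N // 2)), np.zeros((N, N // 2))
for i in range(N // 2):
    G0[2 * i, i] = G0[2 * i + 1, i] = 1 / np.sqrt(2)
    G1[2 * i, i], G1[2 * i + 1, i] = 1 / np.sqrt(2), -1 / np.sqrt(2)

def one_level_weight(Ga, Gb, n):
    K = np.kron(Ga, Gb)
    return np.trace(K @ K.T) / n ** 2   # alpha'(k, i)

ap = [one_level_weight(a, b, N) for (a, b) in
      [(G0, G0), (G0, G1), (G1, G0), (G1, G1)]]  # LL, HL, LH, HH

s = 2  # two spatial decompositions, as in Figure 1(a)
alpha = {(x, y): ap[0] ** (x - 1) * ap[y]
         for x in range(1, s + 1) for y in (1, 2, 3)}
alpha[(s, 0)] = ap[0] ** s              # the remaining LL subband
print(sum(alpha.values()))  # ~1.0: orthonormal weights sum to one
```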
3.2. Temporal Subband Weighting
We follow the scheme proposed in [11] to explain how the temporal subband weighting factors are derived and computed. The temporal 5–3 filter, which is commonly used in wavelet video coding, is used to explain the procedure. The derivation of the weighting factors for the temporal 9–7 filter, the other filter commonly used in wavelet video coding, is detailed in [11].
We first explain how to compute the weighting factor β′(mn), which is expected to satisfy

$$\sum_{i=0}^{S-1} \sigma_{f_{m-1}^i}^2 = \sum_{i=0}^{S-1} \beta'(mi)\, \sigma_{f_m^i}^2,$$
where S is the total number of the input subbands in the m-th temporal decomposition.
If the temporal 5–3 filter is used in the MCTF process, the filter coefficients ℋm used in the predicting and updating stages are −0.5 and 1, respectively. To simplify the notation, ℋmPm is denoted by Pm′ and ℋmUm by Um′; then (3) and (4) can be written as

$$h_m^{2i+1} = f_{m-1}^{2i+1} - \left( P_m'^{2i,2i+1} f_{m-1}^{2i} + P_m'^{2i+2,2i+1} f_{m-1}^{2i+2} \right), \qquad l_m^{2i} = f_{m-1}^{2i} + \left( U_m'^{2i+1,2i} h_m^{2i+1} + U_m'^{2i-1,2i} h_m^{2i-1} \right).$$
Let Δf represent the quantization error resulting from lossy source coding, so that f+Δf denotes the reconstructed frame. From (21), the reconstruction error of the 2i-th frame is

$$\Delta f_{m-1}^{2i} = \Delta l_m^{2i} - U_m'^{2i-1,2i}\, \Delta h_m^{2i-1} - U_m'^{2i+1,2i}\, \Delta h_m^{2i+1}.$$
Substituting (22) into (20) for Δf_{m-1}^{2i} and Δf_{m-1}^{2i+2}, we obtain the reconstruction error of the (2i+1)-th frame:

$$\begin{aligned}\Delta f_{m-1}^{2i+1} &= \Delta h_m^{2i+1} + P_m'^{2i,2i+1}\, \Delta f_{m-1}^{2i} + P_m'^{2i+2,2i+1}\, \Delta f_{m-1}^{2i+2}\\ &= P_m'^{2i,2i+1}\, \Delta l_m^{2i} + P_m'^{2i+2,2i+1}\, \Delta l_m^{2i+2} - P_m'^{2i,2i+1} U_m'^{2i-1,2i}\, \Delta h_m^{2i-1}\\ &\quad + \left(I - P_m'^{2i,2i+1} U_m'^{2i+1,2i} - P_m'^{2i+2,2i+1} U_m'^{2i+1,2i+2}\right) \Delta h_m^{2i+1} - P_m'^{2i+2,2i+1} U_m'^{2i+3,2i+2}\, \Delta h_m^{2i+3}.\end{aligned}$$
Equations (22) and (24) represent a motion-dependent error propagation model for a one-level MCTF process using the 5–3 temporal wavelet filter. By (14), the reconstructed MSE of the 2i-th frame can be derived as

$$\begin{aligned}\sigma_{f_{m-1}^{2i}}^2 &= \frac{1}{N^2} E\{\Delta f_{m-1}^{2i\,T}\, \Delta f_{m-1}^{2i}\}\\ &= \frac{1}{N^2} E\Big\{\mathrm{tr}\big(\Delta l_m^{2i}\, \Delta l_m^{2i\,T}\big) + \mathrm{tr}\big(U_m'^{2i-1,2i}\, \Delta h_m^{2i-1} \Delta h_m^{2i-1\,T}\, U_m'^{2i-1,2i\,T}\big) + \mathrm{tr}\big(U_m'^{2i+1,2i}\, \Delta h_m^{2i+1} \Delta h_m^{2i+1\,T}\, U_m'^{2i+1,2i\,T}\big)\\ &\qquad\quad - 2\,\mathrm{tr}\big(U_m'^{2i-1,2i}\, \Delta h_m^{2i-1} \Delta l_m^{2i\,T}\big) + 2\,\mathrm{tr}\big(U_m'^{2i+1,2i}\, \Delta h_m^{2i+1} \Delta h_m^{2i-1\,T}\, U_m'^{2i-1,2i\,T}\big) - 2\,\mathrm{tr}\big(U_m'^{2i+1,2i}\, \Delta h_m^{2i+1} \Delta l_m^{2i\,T}\big)\Big\}.\end{aligned}$$
By defining 𝒯(𝒜)=tr(𝒜·𝒜T) and using the high-bit-rate assumption behind the identities in (15), the last three cross-terms vanish and (24) becomes

$$\sigma_{f_{m-1}^{2i}}^2 = \sigma_{l_m^{2i}}^2 + \mathcal{T}\big(U_m'^{2i-1,2i}\big)\, \sigma_{h_m^{2i-1}}^2 + \mathcal{T}\big(U_m'^{2i+1,2i}\big)\, \sigma_{h_m^{2i+1}}^2 = \omega\big(f_{m-1}^{2i}\mid l_m^{2i}\big)\, \sigma_{l_m^{2i}}^2 + \omega\big(f_{m-1}^{2i}\mid h_m^{2i-1}\big)\, \sigma_{h_m^{2i-1}}^2 + \omega\big(f_{m-1}^{2i}\mid h_m^{2i+1}\big)\, \sigma_{h_m^{2i+1}}^2.$$
Following the same derivation, the reconstruction error of the (2i+1)-th frame is

$$\begin{aligned}\sigma_{f_{m-1}^{2i+1}}^2 &= \mathcal{T}\big(P_m'^{2i,2i+1}\big)\sigma_{l_m^{2i}}^2 + \mathcal{T}\big(P_m'^{2i+2,2i+1}\big)\sigma_{l_m^{2i+2}}^2 + \mathcal{T}\big(P_m'^{2i,2i+1}U_m'^{2i-1,2i}\big)\sigma_{h_m^{2i-1}}^2\\ &\quad + \mathcal{T}\big(I - P_m'^{2i,2i+1}U_m'^{2i+1,2i} - P_m'^{2i+2,2i+1}U_m'^{2i+1,2i+2}\big)\sigma_{h_m^{2i+1}}^2 + \mathcal{T}\big(P_m'^{2i+2,2i+1}U_m'^{2i+3,2i+2}\big)\sigma_{h_m^{2i+3}}^2\\ &= \omega\big(f_{m-1}^{2i+1}\mid l_m^{2i}\big)\sigma_{l_m^{2i}}^2 + \omega\big(f_{m-1}^{2i+1}\mid l_m^{2i+2}\big)\sigma_{l_m^{2i+2}}^2 + \omega\big(f_{m-1}^{2i+1}\mid h_m^{2i-1}\big)\sigma_{h_m^{2i-1}}^2\\ &\quad + \omega\big(f_{m-1}^{2i+1}\mid h_m^{2i+1}\big)\sigma_{h_m^{2i+1}}^2 + \omega\big(f_{m-1}^{2i+1}\mid h_m^{2i+3}\big)\sigma_{h_m^{2i+3}}^2.\end{aligned}$$
By applying derivations similar to those in (25) and (26) to each input frame, we can relate the reconstruction errors before and after the temporal decomposition by a linear relation:

$$\begin{bmatrix}\sigma_{f_{m-1}^{2i-2}}^2\\ \sigma_{f_{m-1}^{2i-1}}^2\\ \sigma_{f_{m-1}^{2i}}^2\\ \sigma_{f_{m-1}^{2i+1}}^2\\ \sigma_{f_{m-1}^{2i+2}}^2\\ \sigma_{f_{m-1}^{2i+3}}^2\\ \sigma_{f_{m-1}^{2i+4}}^2\\ \vdots\end{bmatrix} = W_{m\to m-1}\begin{bmatrix}\sigma_{h_m^{2i-3}}^2\\ \sigma_{l_m^{2i-2}}^2\\ \sigma_{h_m^{2i-1}}^2\\ \sigma_{l_m^{2i}}^2\\ \sigma_{h_m^{2i+1}}^2\\ \sigma_{l_m^{2i+2}}^2\\ \sigma_{h_m^{2i+3}}^2\\ \vdots\end{bmatrix} = W_{m\to m-1}\, X_{m\to m-1}\begin{bmatrix}\sigma_{f_m^{2i-3}}^2\\ \sigma_{f_m^{2i-2}}^2\\ \sigma_{f_m^{2i-1}}^2\\ \sigma_{f_m^{2i}}^2\\ \sigma_{f_m^{2i+1}}^2\\ \sigma_{f_m^{2i+2}}^2\\ \sigma_{f_m^{2i+3}}^2\\ \vdots\end{bmatrix},$$
where

$$W_{m\to m-1} = \begin{bmatrix}
\omega(f_{m-1}^{2i-2}\mid h_m^{2i-3}) & \omega(f_{m-1}^{2i-2}\mid l_m^{2i-2}) & \omega(f_{m-1}^{2i-2}\mid h_m^{2i-1}) & 0 & 0 & 0 & 0 & \cdots\\
\omega(f_{m-1}^{2i-1}\mid h_m^{2i-3}) & \omega(f_{m-1}^{2i-1}\mid l_m^{2i-2}) & \omega(f_{m-1}^{2i-1}\mid h_m^{2i-1}) & \omega(f_{m-1}^{2i-1}\mid l_m^{2i}) & \omega(f_{m-1}^{2i-1}\mid h_m^{2i+1}) & 0 & 0 & \cdots\\
0 & 0 & \omega(f_{m-1}^{2i}\mid h_m^{2i-1}) & \omega(f_{m-1}^{2i}\mid l_m^{2i}) & \omega(f_{m-1}^{2i}\mid h_m^{2i+1}) & 0 & 0 & \cdots\\
0 & 0 & \omega(f_{m-1}^{2i+1}\mid h_m^{2i-1}) & \omega(f_{m-1}^{2i+1}\mid l_m^{2i}) & \omega(f_{m-1}^{2i+1}\mid h_m^{2i+1}) & \omega(f_{m-1}^{2i+1}\mid l_m^{2i+2}) & \omega(f_{m-1}^{2i+1}\mid h_m^{2i+3}) & \cdots\\
0 & 0 & 0 & 0 & \omega(f_{m-1}^{2i+2}\mid h_m^{2i+1}) & \omega(f_{m-1}^{2i+2}\mid l_m^{2i+2}) & \omega(f_{m-1}^{2i+2}\mid h_m^{2i+3}) & \cdots\\
0 & 0 & 0 & 0 & \omega(f_{m-1}^{2i+3}\mid h_m^{2i+1}) & \omega(f_{m-1}^{2i+3}\mid l_m^{2i+2}) & \omega(f_{m-1}^{2i+3}\mid h_m^{2i+3}) & \cdots\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}, \qquad X_{m\to m-1} = \begin{bmatrix}
0&0&0&0&\cdots\\ 1&0&0&0&\cdots\\ 1&0&0&0&\cdots\\ 0&0&0&0&\cdots\\ 0&0&0&0&\cdots\\ 0&1&0&0&\cdots\\ 0&1&0&0&\cdots\\ 0&0&0&0&\cdots\\ 0&0&0&0&\cdots\\ 0&0&1&0&\cdots\\ 0&0&1&0&\cdots\\ 0&0&0&0&\cdots\\ \vdots&\vdots&\vdots&\vdots&\ddots
\end{bmatrix}.$$
The matrix 𝒳m→m-1 adjusts the positions of the subbands. To measure the consequence of quantizing a subband after the temporal decomposition, we aggregate the errors it induces in all the subbands before the temporal decomposition; this is exactly the sum of the corresponding column of the matrix 𝒲m→m-1𝒳m→m-1. For example, the weighting factor of the reconstruction error resulting from subband f_m^{2i+1} is the sum of the values in the (2i+1)-th column of 𝒲m→m-1𝒳m→m-1 and is denoted by β′(m(2i+1)). The following equation shows how to compute the weighting factors β′:

$$\begin{bmatrix}\beta'(m0)\\ \beta'(m1)\\ \beta'(m2)\\ \beta'(m3)\\ \beta'(m4)\\ \beta'(m5)\\ \beta'(m6)\\ \vdots\end{bmatrix} = \begin{bmatrix}\sum_q (W_{m\to m-1}X_{m\to m-1})_{q1}\\ \sum_q (W_{m\to m-1}X_{m\to m-1})_{q2}\\ \sum_q (W_{m\to m-1}X_{m\to m-1})_{q3}\\ \sum_q (W_{m\to m-1}X_{m\to m-1})_{q4}\\ \sum_q (W_{m\to m-1}X_{m\to m-1})_{q5}\\ \sum_q (W_{m\to m-1}X_{m\to m-1})_{q6}\\ \sum_q (W_{m\to m-1}X_{m\to m-1})_{q7}\\ \vdots\end{bmatrix}.$$
If the temporal filtering consists of t temporal decompositions in total, the variances of the distortions of the frames in the pixel domain can be computed as

$$\begin{bmatrix}\sigma_{f_0^0}^2\\ \sigma_{f_0^1}^2\\ \sigma_{f_0^2}^2\\ \sigma_{f_0^3}^2\\ \sigma_{f_0^4}^2\\ \sigma_{f_0^5}^2\\ \sigma_{f_0^6}^2\\ \vdots\end{bmatrix} = \left(\prod_{m=1}^{t} W_{m\to m-1}X_{m\to m-1}\right)\begin{bmatrix}\sigma_{f_t^0}^2\\ \sigma_{f_t^1}^2\\ \sigma_{f_t^2}^2\\ \sigma_{f_t^3}^2\\ \sigma_{f_t^4}^2\\ \sigma_{f_t^5}^2\\ \sigma_{f_t^6}^2\\ \vdots\end{bmatrix} = \left(\prod_{m=1}^{t} W_{m\to m-1}X_{m\to m-1}\right)\begin{bmatrix}\sigma_{f^{(t0)}}^2\\ \sigma_{f^{(t1)}}^2\\ \sigma_{f^{((t-1)2)}}^2\\ \sigma_{f^{((t-1)3)}}^2\\ \sigma_{f^{((t-1)4)}}^2\\ \sigma_{f^{((t-1)5)}}^2\\ \sigma_{f^{((t-1)6)}}^2\\ \vdots\end{bmatrix}.$$
Note that the index of a subband f^i can be changed from i to (mn), which indicates that the subband is obtained as the n-th subband after the m-th temporal decomposition (see Figure 1(b)). Thus, the temporal weighting factor β(mn) used in (9) can be computed as

$$\beta(mn) = \sum_q \left( \prod_{m=1}^{t} W_{m\to m-1} X_{m\to m-1} \right)_{qn}.$$
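The column-sum computation of β′ can be sketched on a toy four-frame example. The ω values in W below are hypothetical placeholders, and X merely permutes the subbands from the interleaved (l, h, l, h) order used in W to the indexing of Section 2, in which the L-frames come first:

```python
import numpy as np

# Toy example of the column-sum rule for beta'. The omega entries in W
# are hypothetical illustration values, not derived from real motion.
# Columns of W refer to the subbands (l^0, h^1, l^2, h^3).
W = np.array([[1.0, 0.3, 0.0, 0.0],    # row: frame f^0
              [0.5, 0.8, 0.5, 0.1],    # row: frame f^1
              [0.0, 0.2, 1.0, 0.3],    # row: frame f^2
              [0.0, 0.1, 0.5, 0.8]])   # row: frame f^3

# X permutes (f_m^0, f_m^1, f_m^2, f_m^3) = (l^0, l^2, h^1, h^3)
# into the interleaved order (l^0, h^1, l^2, h^3) used by W.
X = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=float)

beta = (W @ X).sum(axis=0)   # beta'(m0), beta'(m1), beta'(m2), beta'(m3)
print(beta)
```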
With both the spatial and the temporal weighting factors derived, the performance of the bit-allocation method can be improved accordingly. In our previous work [11], we compared three bit-allocation methods, which consider spatial weighting factors, spatial-temporal weighting factors without motion vectors, and spatial-temporal weighting factors with motion vectors, respectively. The results showed that the bit-allocation method considering spatial-temporal weighting factors with motion vectors greatly outperforms the other two. However, computing the best-performing weighting factors costs considerable computing resources, so a method that reduces the required resources is proposed in the next section.
4. Fast Computation of the Weighting Factors
As discussed in the previous section, (16) is used to compute the spatial weighting factors, and (24) and (25) are used to compute the temporal weighting factors. However, obtaining these weighting factors requires a lot of computing resources. Let the size of an input frame be N by N. In (16), (24), and (25), if we use matrices to represent 𝒢⊗𝒢, P, and U, the size of each matrix is N² by N². If m bytes are used to store a coefficient, a matrix takes N²×N²×m bytes of storage. For a CIF frame (of size 352×288) and m = 4, the matrix representation needs 39204 megabytes for a single matrix. Thus, using the dense matrix representation to compute the weighting factors is extremely inefficient, since many N² by N² matrices must be stored.
Another issue that handicaps the matrix representation in weighting-factor computation is the time needed to perform matrix addition and multiplication. With the conventional matrix operations, defined as

$$(AB)[x,y] = \sum_{i=1}^{n} A[x,i]\, B[i,y], \qquad (A+B)[x,y] = A[x,y] + B[x,y],$$
the time complexity of the matrix multiplication in computing the weighting factors is bounded by O(N⁶). For a CIF frame, one matrix multiplication costs approximately 10¹⁵ real-number multiplications. Fortunately, these matrices are usually very sparse; that is, only a few of their coefficients are nonzero. By taking advantage of this property, we use an efficient data structure to store the data required to compute the weighting factors.
Thus, we propose the arraylist representation. An arraylist consists of a one-dimensional array of pointers to linked lists; Figure 4 demonstrates the structure. In the proposed data structure, two arraylists, which record the positions and the values of the nonzero coefficients, respectively, are used to represent a large sparse matrix. In Figure 4, the matrix P is an abbreviation of the matrix P2i,2i+1. The arraylists Prow,pos and Prow,val record the positions and the values of the nonzero coefficients in the matrix P, respectively. Because the matrix P has 6 rows, both arraylists have 6 elements. In the arraylist Prow,pos, the x-th linked list holds the locations of the nonzero elements in the x-th row of the matrix. The arraylist Prow,val has a structure similar to that of Prow,pos, but instead of holding the locations of the nonzero elements, it holds their values. Thus, if Prow,pos[x][y] holds the column position c, the element of the matrix P at row x and column c is a nonzero coefficient, and its value is stored in Prow,val[x][y].
An example of the arraylist representation. The matrix is P2i,2i+1 in (5). According to whether the row (above) or the column (below) is considered first, the arraylist representation can be categorized into the row representation and the column representation. The arraylists on the left store the locations of the nonzero coefficients, and the arraylists on the right store their values.
If the matrix is sparse enough, the proposed arraylist data structure greatly reduces the memory, because only the nonzero locations and values are stored. Assume that there are n nonzero coefficients in each row of an N² by N² matrix. To construct an arraylist, we need a one-dimensional array of N² elements and N² linked lists, one for each element of the array, where each linked list has n elements. Because our structure uses two arraylists, it needs 2(n+1)N²m bytes, where m is the number of bytes used to store a coefficient. Thus, our data structure reduces the size needed to represent an N² by N² matrix from mN⁴ bytes to 2(n+1)N²m bytes. In many cases, the proposed data structure reduces the memory needed to store a (352×288) by (352×288) matrix from 39204 megabytes to about 1 megabyte.
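As an illustration, the row representation can be sketched in a few lines. Python lists of lists stand in for the array of linked lists; this is a sketch of the structure, not the authors' implementation:

```python
# Build the row representation (P_row,pos and P_row,val) of a matrix:
# row_pos[x] holds the column positions of the nonzero entries in row x,
# and row_val[x] holds their values at the same list positions.
def to_arraylist(M):
    row_pos = [[] for _ in M]
    row_val = [[] for _ in M]
    for x, row in enumerate(M):
        for y, v in enumerate(row):
            if v != 0:                 # store nonzero entries only
                row_pos[x].append(y)
                row_val[x].append(v)
    return row_pos, row_val

# The prediction matrix P of Eq. (5):
P = [[0, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 1]]
pos, val = to_arraylist(P)
print(pos)  # [[1], [1], [2], [4], [5], [5]]
```

The column representation used later for multiplication is built in the same way from the transpose of the matrix.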
Having addressed the matrix storage, we use the matrices in (5) to explain how to perform matrix addition and multiplication with the proposed data structure. To simplify the notation, P and U represent P2i,2i+1 and U2i+1,2i in (5), respectively. In matrix addition, we first create two new arraylists Crow,pos and Crow,val to store the result of the addition, with Crow,pos = Prow,pos and Crow,val = Prow,val.
For all x∈{1,2,3,…,N²}, we move the linked lists Urow,pos[x] and Urow,val[x] to the tails of the linked lists Crow,pos[x] and Crow,val[x], respectively. Then, every linked list in Crow,pos is examined to check whether any two of its elements have the same value. If two such elements, denoted Crow,pos[x][m] and Crow,pos[x][n] with m<n, are found, the value Crow,val[x][m]+Crow,val[x][n] is assigned to the element Crow,val[x][m], and the elements Crow,pos[x][n] and Crow,val[x][n] are then removed for the consistency of the data structure. When no two elements of any linked list in Crow,pos have the same value, the matrix addition with the proposed data structure is complete, and the result of P+U is stored in Crow,pos and Crow,val. Figure 5 demonstrates the matrix addition using the proposed data structure.
An example of adding two matrices with the arraylist representation. First, two arraylists Crow,pos and Crow,val are created by cloning Prow,pos and Prow,val. For all x∈{1,2,3,…,N²}, the linked lists Urow,pos[x] and Urow,val[x] are moved to the tails of the linked lists Crow,pos[x] and Crow,val[x], respectively. For any linked list in Crow,pos, if two elements Crow,pos[x][m] and Crow,pos[x][n] (m<n) have the same value, the value Crow,val[x][m]+Crow,val[x][n] is assigned to Crow,val[x][m], and the elements Crow,pos[x][n] and Crow,val[x][n] are removed. This procedure continues until no two elements in any linked list have the same value. Then the arraylists Crow,pos and Crow,val are the arraylist representation of P+U.
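The addition procedure just described can be sketched as follows, with Python lists of lists standing in for the array of linked lists (a sketch, not the authors' implementation); the example adds the P and U of (5):

```python
# Sparse addition: concatenate the row lists of the two operands, then
# merge duplicate column positions by adding their values.
def sparse_add(a_pos, a_val, b_pos, b_val):
    c_pos = [list(r) for r in a_pos]       # clone A's arraylists
    c_val = [list(r) for r in a_val]
    for x in range(len(b_pos)):
        c_pos[x].extend(b_pos[x])          # append B's row to A's row
        c_val[x].extend(b_val[x])
        merged_pos, merged_val = [], []
        for p, v in zip(c_pos[x], c_val[x]):
            if p in merged_pos:            # duplicate position: add values
                merged_val[merged_pos.index(p)] += v
            else:
                merged_pos.append(p)
                merged_val.append(v)
        c_pos[x], c_val[x] = merged_pos, merged_val
    return c_pos, c_val

# Row representations of P and U from Eq. (5):
p_pos = [[1], [1], [2], [4], [5], [5]]
p_val = [[1.0], [1.0], [1.0], [1.0], [1.0], [1.0]]
u_pos = [[], [0], [2], [], [3], [4]]
u_val = [[], [1.0], [1.0], [], [1.0], [1.0]]

c_pos, c_val = sparse_add(p_pos, p_val, u_pos, u_val)
print(c_pos)  # [[1], [1, 0], [2], [4], [5, 3], [5, 4]]
print(c_val)  # [[1.0], [1.0, 1.0], [2.0], [1.0], [1.0, 1.0], [1.0, 1.0]]
```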
To compute the matrix product PU with the proposed data structure, the coefficients (PU)[x,y] are needed for all x,y∈{1,2,3,…,N²}. We create two empty arraylists Mrow,pos and Mrow,val, each of which has N² linked lists, to store the result of the multiplication. To compute (PU)[x,y], we examine the linked lists Prow,pos[x] and Ucolumn,pos[y] for elements with the same value and multiply their corresponding values in Prow,val[x] and Ucolumn,val[y]. The value of (PU)[x,y] is the sum of the products obtained by this process. To store the value of (PU)[x,y], the values y and (PU)[x,y] are appended to the tails of the x-th linked lists of Mrow,pos and Mrow,val, respectively. The result of performing the matrix multiplication with the proposed data structure is depicted in Figure 6.
An example of multiplying two matrices with the arraylist representation. To compute (PU)[x,y], we find the elements with the same value in the linked lists connected to Prow,pos[x] and Ucolumn,pos[y]. Whenever two such elements are found, their corresponding values are multiplied, and the sum of the products is the value of (PU)[x,y]. To store the matrix PU, two empty arraylists Mrow,pos and Mrow,val are created, and two elements are inserted to store each coefficient (PU)[x,y]: the element with the value y is inserted into the linked list connected to Mrow,pos[x], and the element with the value (PU)[x,y] is inserted into the linked list connected to Mrow,val[x]. After (PU)[x,y] has been computed and inserted for all x,y∈{1,2,3,…,N²}, the arraylists Mrow,pos and Mrow,val hold the arraylist representation of the matrix PU.
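The multiplication procedure can be sketched in the same style; the row representation of P is matched against the column representation of U (both taken from (5)):

```python
# Sparse product: row lists of P are matched against column lists of U;
# matching inner indices contribute a product to (PU)[x, y], and only
# nonzero results are stored in the output arraylists.
def sparse_mul(a_row_pos, a_row_val, b_col_pos, b_col_val):
    m_pos = [[] for _ in a_row_pos]
    m_val = [[] for _ in a_row_pos]
    for x in range(len(a_row_pos)):
        for y in range(len(b_col_pos)):
            s = 0.0
            for p, v in zip(a_row_pos[x], a_row_val[x]):
                if p in b_col_pos[y]:          # matching inner index
                    s += v * b_col_val[y][b_col_pos[y].index(p)]
            if s != 0.0:                       # append nonzero result
                m_pos[x].append(y)
                m_val[x].append(s)
    return m_pos, m_val

# Row representation of P and column representation of U from Eq. (5):
p_row_pos = [[1], [1], [2], [4], [5], [5]]
p_row_val = [[1.0], [1.0], [1.0], [1.0], [1.0], [1.0]]
u_col_pos = [[1], [], [2], [4], [5], []]
u_col_val = [[1.0], [], [1.0], [1.0], [1.0], []]

m_pos, m_val = sparse_mul(p_row_pos, p_row_val, u_col_pos, u_col_val)
print(m_pos)  # [[0], [0], [2], [3], [4], [4]]
```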
If we assume that a row of the matrix has n nonzero coefficients, computing the coefficients of a row for the matrix addition needs a matching algorithm between two sets with n elements, several (<n) real-number additions for the matched elements, and several (<n) moving and (<n) deleting operations on the linklists. Compared to the time complexity O(N⁴) of the conventional matrix addition, the time complexity O(nN²) of our method is much lower if the matrix is sparse enough. Meanwhile, for the matrix multiplication PU with the proposed data structure, the value of (PU)[x, y] is needed for every x, y ∈ {1, 2, 3, …, N²}, and (PU)[x, y] can be computed by a matching algorithm between two sets with n elements, several (<n) real-number multiplications, and a summation of these real numbers. Compared to the time complexity O(N⁶) of the conventional matrix multiplication, the time complexity O(nN⁴) of the proposed method is much lower if the matrix is sparse enough.
In addition to the matrix addition and multiplication, the operator 𝒯 defined by 𝒯(𝒜) = tr(𝒜·𝒜ᵀ) is also needed during the computation of the subband weighting factors. The value of 𝒯(𝒜), where 𝒜 is of size N² by N², can be computed by summing the squares of all coefficients in the matrix 𝒜. With the conventional matrix representation, computing the value of 𝒯(𝒜) needs N⁴ real-number multiplications and about N⁴ real-number additions, while with the proposed data structure it needs only nN² real-number multiplications and about nN² real-number additions, where n is the average number of nonzero coefficients in a row of the matrix.
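Because tr(𝒜·𝒜ᵀ) equals the sum of squares of all coefficients, the operator never requires forming 𝒜·𝒜ᵀ. A one-line sketch over the value arraylists (function name ours):

```python
def weight_trace(a_val):
    """T(A) = tr(A·Aᵀ): sum the squares of the stored nonzero values,
    skipping the zero coefficients entirely."""
    return sum(v * v for row in a_val for v in row)
```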
Although the arraylist representation is more complicated than the matrix representation, the proposed method reduces both the memory and the computing time when the matrices involved are sufficiently sparse. Because most of the matrices used to compute the weighting factors are very sparse, the proposed arraylist representation is extremely suitable for the weighting-factor computing process.
5. Experimental Results
We now demonstrate the efficacy of applying the arraylist representation to compute the weighting factors. In the experiment, we compare the time and memory used by the matrix representation and the arraylist representation during the computation of the weighting factors. The conventional matrix representation uses a two-dimensional array to store a matrix and the operations in (34) to calculate the addition and multiplication of two matrices. The arraylist representation uses two arraylists to store a matrix, and the operations are performed as described in the previous section.
A T+2D video codec is used in the experiment. Each GOP has 32 CIF frames, so a five-level MCTF process using the 5–3 or the 9–7 temporal wavelet filter is applied to the GOP. Then, each frame of a video sequence is decomposed by a three-level spatial wavelet transform using either the 5–3 or the 9–7 wavelet filter. To obtain the motion vectors used in the temporal filtering, full-search motion estimation with integer-pixel accuracy is applied to the dyadic wavelet coefficients. The block size is 4×4, and the search range in both the horizontal and vertical dimensions is [-16, 15].
After the MCTF process, the motion vectors and the wavelet filters used in the spatial and temporal filterings are used to compute the weighting factors. We demonstrate and compare the results of two different schemes. Scheme 1 computes the weighting factors with the conventional matrix representation, and Scheme 2 computes them with the proposed arraylist representation. The equipment used in the experiment is a workstation with two quad-core CPUs and 12 GB of memory in total. Both the 5–3 and the 9–7 temporal filters are tested on various 1/4QCIF (88×72) sequences.
Figure 7 compares the time consumed by Scheme 1 and Scheme 2. The consumed time is measured as the difference between the time stamps at the beginning and the end of computing the weighting factors. We observe that the time consumed by Scheme 2 is much less than that consumed by Scheme 1.
The time consumed to compute the weighting factors. The x-axis identifies whether Scheme 1 (the conventional matrix representation) or Scheme 2 (the proposed arraylist representation) is applied to compute the weighting factors, and the y-axis is the logarithm of the time consumed by the schemes. The size of the video sequence is 1/4QCIF (88×72), and the motion vectors are obtained from the T+2D MCTF. Each GOP has 32 frames. The experiment is executed on a workstation with two quad-core CPUs and 12 GB of memory in total. The consumed time is measured as the difference between the time stamps at the beginning and the end of the process that computes the weighting factors.
We also compare the amount of memory used by the two schemes. Figure 8 illustrates that the amount of memory used by Scheme 2 is much less than that used by Scheme 1. The amount of used memory is obtained by averaging the monitored size of the memory used to compute each weighting factor.
The amount of memory used to compute the weighting factors. The x-axis identifies whether Scheme 1 or Scheme 2 is applied in the process, and the y-axis gives the amount of memory used during the process. The experiment is executed on the same equipment and with the same settings as those used in Figure 7. The amount of memory used by the process is measured by averaging the monitored size of the memory used to compute each weighting factor.
For both the time and the memory needed to compute the weighting factors, the experimental results show that the proposed arraylist representation yields great reductions compared to the conventional matrix representation.
6. Conclusion
In this paper, we proposed an arraylist representation for matrices that reduces the time and memory consumed by the computation of the weighting factors. We employed the proposed method on a T+2D structure and showed that the time and memory needed to compute the weighting factors are greatly reduced compared to the conventional matrix representation. In future work, we will extend the method to handle the motion vectors obtained from subpixel motion estimation, so as to support more wavelet encoders.
References
[1] Y. Andreopoulos, A. Munteanu, J. Barbarien, M. van der Schaar, J. Cornelis, and P. Schelkens, "In-band motion compensated temporal filtering," Signal Processing: Image Communication, vol. 19, no. 7, pp. 653–673, 2004.
[2] Q. Zhang, Q. Guo, Q. Ni, W. Zhu, and Y.-Q. Zhang, "Sender-adaptive and receiver-driven layered multicast for scalable video over the Internet," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 4, pp. 482–495, 2005.
[3] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103–1120, 2007.
[4] J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 559–571, 1994.
[5] S.-J. Choi and J. W. Woods, "Motion-compensated 3-D subband coding of video," IEEE Transactions on Image Processing, vol. 8, no. 2, pp. 155–167, 1999.
[6] T. Rusert, K. Hanke, and J.-R. Ohm, "Transition filtering and optimized quantization in interframe wavelet video coding," in Proceedings of the SPIE Visual Communications and Image Processing (VCIP '03), vol. 5150, pp. 682–694, 2003.
[7] P. Chen and J. W. Woods, "Bidirectional MC-EZBC with lifting implementation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 10, pp. 1183–1194, 2004.
[8] B. Usevitch, "A tutorial on modern lossy wavelet image compression: foundations of JPEG 2000," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 22–34, 2001.
[9] D. S. Taubman and M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice, Kluwer Academic, 2002.
[10] R. Xiong, J. Xu, F. Wu, S. Li, and Y.-Q. Zhang, "Optimal subband rate allocation for spatial scalability in 3D wavelet video coding with motion aligned temporal filtering," in Proceedings of the SPIE Visual Communications and Image Processing (VCIP '05), vol. 5960, pp. 381–392, 2005.
[11] C.-C. Cheng, G.-J. Peng, and W.-L. Hwang, "Subband weighting with pixel connectivity for 3-D wavelet coding," IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 52–62, 2009.
[12] S. T. Hsiang and J. W. Woods, "Embedded image coding using zeroblocks of subband/wavelet coefficients and context modeling," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '00), pp. 662–665, May 2000.
[13] J.-R. Ohm, M. van der Schaar, and J. W. Woods, "Interframe wavelet coding—motion picture representation for universal scalability," Signal Processing: Image Communication, vol. 19, no. 9, pp. 877–908, 2004.
[14] M. J. Gormish, D. Lee, and M. W. Marcellin, "JPEG 2000: overview, architecture, and applications," in Proceedings of the IEEE International Conference on Image Processing (ICIP '00), pp. 29–32, September 2000.
[15] B. Usevitch, "Optimal bit allocation for biorthogonal wavelet coding," in Proceedings of the Data Compression Conference (DCC '96), pp. 387–395, 1996.
[16] N. Jayant and P. Noll, Digital Coding of Waveforms, Prentice Hall, Englewood Cliffs, NJ, USA, 1984.