Robust and Reversible Audio Watermarking by Modifying Statistical Features in Time Domain

Robust and reversible watermarking is a promising technique for many sensitive applications, such as lossless audio or medical image systems. This paper presents a novel robust reversible audio watermarking method that modifies statistical features in the time domain, shifting the histogram of these statistical values for data hiding. First, the original audio is divided into nonoverlapping equal-sized frames. In each frame, every group of three samples generates a prediction error, and a statistical feature value is calculated as the sum of all the prediction errors in the frame. The watermark bits are embedded into the frames by shifting the histogram of the statistical features. The watermark is reversible and robust to common signal processing operations. Experimental results show that the proposed method is not only reversible but also achieves satisfactory robustness to MP3 compression at 64 kbps and additive Gaussian noise at 35 dB.


Introduction
With the rapid development of Internet technology, the publication and dissemination of digital multimedia have become more and more convenient. However, the authenticity and security of digital multimedia remain a challenge for media owners [1]. Digital watermarking is an efficient approach to protecting the copyright of digital media. Reversible watermarking is one of the watermarking technologies used for data hiding: it embeds secret data into the host media and allows both the original media and the secret data to be extracted [2][3][4]. It is very useful in sensitive applications such as medical image systems, military imagery, and lossless audio [5]. Although many reversible watermarking methods exist, most of them are designed for a lossless environment and cannot resist any type of attack. As a result, the original media or the secret data cannot be recovered after the watermarked media undergo some changes [6].
In some cases, such as the copyright protection of digital media, the embedded data is expected to be robust to attacks such as lossy compression or additive noise. To this end, researchers have paid more attention to robust reversible watermarking. In robust reversible watermarking, the original media and the embedded data can both be recovered correctly when the watermarked media remain intact, and the embedded data can still be extracted without error even when the watermarked media undergo some attacks [7]. Until now, a few robust reversible image watermarking methods have been proposed, which can be classified into two groups.
(i) Blind watermarking scheme: in [7,8], Vleeschouwer et al. proposed a blind extraction scheme based on the patchwork theory and modulo-256 addition by using grayscale histogram rotation. This work is robust against JPEG compression, but the watermarked image has lower visual quality because the watermark embedding procedure causes salt-and-pepper noise in the watermarked image. Besides, the payload is low. To handle the salt-and-pepper noise problem, Zou et al. proposed a scheme that shifts the absolute mean values of the integer wavelet transform (IWT) coefficients in a chosen subband [9], and Ni et al. proposed a scheme that modifies the histogram of a robust statistical quantity in the spatial domain [10]. Since the embedding process may introduce error bits, error correction coding (ECC) has been used. Besides, these two methods suffer from unstable robustness and incomplete reversibility according to [11]. In [12], Zeng et al. enhanced the scheme of Ni et al. by introducing two thresholds and a new embedding mechanism. This method is blind and reversible. For a satisfactory performance, however, the two threshold values have to be carefully searched for different cover images.
(ii) Nonblind watermarking scheme: in [13], a nonblind scheme based on wavelet-domain statistical quantity histogram shifting and clustering (WSQH-SC) is proposed. A pixel adjustment is first applied to avoid overflow and underflow, and a location map is used to record the changed pixels. This method achieves good robustness against JPEG, JPEG2000, and additive Gaussian noise, but it is not blind, since the locations of the changed pixels need to be saved as side information and transmitted to the receiver side in order to recover the original image.
In [14], the Slantlet transform (SLT) was applied to image blocks, modifying the mean values of the HL and LH subband coefficients to embed the watermark bits, and a second stage of the SLT was applied to the LL1 subband, embedding another watermark bit into the HL2 and LH2 subbands. Because the coefficients and the mean values are fractional with many decimal places, the mean information was taken as side information to be sent to the receiver side for the recovery of the original cover image. To solve the nonblind extraction problem in [14], the authors in [15] applied the IWT to images and randomly selected 10 of the 16 coefficients in a block to compute the amplitude mean of the block, so that the mean information can be embedded into the image itself for blind extraction.
In [16], Coltuc and Chassery proposed a general framework for robust reversible watermarking by multiple watermarking. First, the watermark is embedded into the cover image with a robust watermarking method, and then a reversible watermarking method is adopted to embed the information needed to restore the original cover image into the robust watermarked image. Suppose $I$ and $I_1$ are the original image and the robust watermarked image after embedding a watermark $w$, respectively. The embedding distortion, $d = I - I_1$, is compressed and embedded into the robust watermarked image with the reversible watermarking method. At the receiver side, if there are no attacks, the robust watermarked image $I_1$ and the difference $d$ can be extracted since the embedding process is reversible. Then the original image $I$ can be recovered by $I = d + I_1$. Furthermore, the watermark can be extracted. If the watermarked image goes through a JPEG compression operation, the robust watermark can still be extracted. This framework is very instructive and achieves a higher payload and good robustness against JPEG compression.
In [17], a robust reversible audio method based on spread spectrum and amplitude expansion is proposed. A robust payload is embedded first using direct-sequence spread-spectrum modulation, with the sequence determined from the amplitude expansion in time and frequency of integer modified discrete cosine transform (MDCT) coefficients. Then a reversible payload is embedded into the apertures in the amplitude histogram that result from amplitude expansion of the integer MDCT coefficients, in order to recover the host audio. This method achieves robustness against some signal processing operations such as MP3 compression and additive noise, and if the watermarked audio remains intact, the host audio can be recovered perfectly.
In this paper, we propose a novel robust and reversible audio watermarking scheme based on a statistic feature and histogram shifting in the time domain. By shifting the histogram of the statistic features in the time domain, the proposed algorithm achieves good robustness and reversibility at the same time.
The rest of the paper is organized as follows. The foundation work is introduced in Section 2. The proposed watermarking algorithm is described in Section 3. Experimental results are presented in Section 4. Section 5 concludes this paper.

Algorithm's Principle
This section introduces the foundations of the proposed robust reversible digital audio watermarking scheme. First, a robust statistic feature in the time domain is introduced; then how to modify the statistic feature to embed a watermark bit is briefly described.

2.1. Robust Statistic Feature. Consider a time-discrete digital audio signal $x$; the host signal is first divided into nonoverlapping equal-sized frames of $S$ samples each. Figure 1 shows the use of $S$ samples as a frame and three samples as a group. For a sample group $(x_{i-1}, x_i, x_{i+1})$, the prediction value of the middle sample $x_i$ is calculated from its two immediate neighbors as

$$\hat{x}_i = \left\lceil \frac{x_{i-1} + x_{i+1}}{2} \right\rceil, \tag{1}$$

where $\lceil a \rceil$ means rounding $a$ to the nearest integer towards positive infinity. The prediction error of $x_i$ is

$$e_i = x_i - \hat{x}_i. \tag{2}$$

Since the samples in a group are often highly correlated, the prediction error $e_i$ is expected to be very close to zero. For a frame with $S$ samples, $S/3$ prediction errors can be computed. The sum of all the prediction errors in a frame, denoted by $E$, is called the statistic feature in this paper. The statistic feature of a frame is calculated as

$$E = \sum_{j=1}^{S/3} e_j, \tag{3}$$

where $e_j$ is the prediction error of the $j$th group in the frame. The proposed algorithm is built on this statistic property.
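As an illustration, the statistic feature described above (the per-frame sum of prediction errors, where each middle sample is predicted by the ceiling of the mean of its two neighbors) can be computed in a few lines of Python. This is a minimal sketch, assuming integer PCM samples, a frame length that is a multiple of three, and groups formed from consecutive nonoverlapping triples; the names `ceil_half` and `frame_features` are illustrative and do not come from the paper.

```python
def ceil_half(a):
    """Integer ceil(a / 2); exact for negative values as well."""
    return -((-a) // 2)


def frame_features(samples, frame_length):
    """Return one statistic feature per complete, nonoverlapping frame.

    Assumption of this sketch: each frame is split into consecutive
    triples (left, middle, right) and frame_length is a multiple of 3.
    """
    if frame_length % 3 != 0:
        raise ValueError("this sketch assumes a frame length divisible by 3")
    features = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        total = 0
        for g in range(start, start + frame_length, 3):
            left, middle, right = samples[g], samples[g + 1], samples[g + 2]
            # prediction error = middle sample minus its predicted value
            total += middle - ceil_half(left + right)
        features.append(total)
    return features
```

Because neighboring audio samples are strongly correlated, the individual errors are small, and the per-frame sums cluster around zero; trailing samples that do not fill a whole frame are simply ignored in this sketch.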

2.2. Watermark Statistic Feature. For each frame, one watermark bit is embedded by shifting the value of the statistic feature. The shifting operation is done by modifying the samples in the frame. Taking track 1 (downloaded from the website [18]) as the example clip, Figure 2 shows the distribution of the $E$ values when 300 samples are used as a frame and three samples as a group. The rule for modifying the statistic value follows the histogram shifting method. First, we scan all frames and find the maximum of the absolute $E$ values, denoted by $E_{\max}$. Then, a threshold $T$ is set to a positive integer bigger than $E_{\max}$. As a result, all $E$ values are within the range $[-T, T]$. For example, from Figure 2 we find that $E_{\max}$ is 446, so the threshold $T$ can be an integer such as 500. The watermarking rule is to keep the statistic feature within $[-T, T]$ if the watermark bit is "0" and to shift the statistic feature away from zero by a shifting quantity $T + G$ if the watermark bit is "1." To achieve stronger robustness, the parameter $G$ is a second threshold that is usually set bigger than $T$. To reduce the embedding distortion, if the embedded watermark bit is "1" and the original statistic feature belongs to $[0, T)$, the statistic feature is shifted to the region $[T + G, 2T + G]$; if the embedded watermark bit is "1" and the original statistic feature belongs to $(-T, 0)$, the statistic feature is shifted to the region $[-2T - G, -T - G]$. In this way, the bit-0 region and the bit-1 region are separated by the robust regions $(T, T + G)$ and $(-T - G, -T)$. For example, Figure 3 shows the distribution of the $E$ values of clip track 1 after embedding the watermark.
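The modifying rules are as follows.
If the embedded bit is "0," the frame is kept unchanged. If the embedded bit is "1," the samples in the frame are modified according to (4), where $x_i^k$ is the $i$th sample in the $k$th frame, the index $i$ is in $[1, S]$, and $S$ is the number of samples in a frame. The integer value $\beta$, given by (5), is the shifting quantity of a sample. At the receiver side, if the watermarked audio remains intact, the watermark bits can be extracted by

$$w_k = \begin{cases} 0, & E'_k \in [-T, T], \\ 1, & \text{otherwise}, \end{cases} \tag{6}$$

where $w_k$ is the $k$th hidden bit and $E'_k$ is the statistic feature of the $k$th watermarked frame. The original audio can then be recovered by reversing the shift of $\beta$ on the modified samples, as in (7).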
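The embedding and recovery rule for a single intact frame can be sketched in Python as follows. The exact sample modification and the exact per-sample shift are not reproduced in the text above, so this sketch rests on two labeled assumptions: only the middle sample of each three-sample group is changed, and the per-sample shift is the smallest integer whose total effect over all groups of the frame is at least the required feature shift (the sum of the two thresholds, called `T` and `G` here). All function names are hypothetical.

```python
def feature(frame):
    """Sum over groups of (middle sample - ceil(mean of its two neighbors))."""
    return sum(frame[g + 1] + (-(frame[g] + frame[g + 2])) // 2
               for g in range(0, len(frame), 3))


def per_sample_shift(T, G, frame_length):
    """ASSUMED form of the per-sample shift: ceil((T + G) / (frame_length / 3))."""
    groups = frame_length // 3
    return -(-(T + G) // groups)


def embed_bit(frame, bit, T, G):
    """Return the watermarked copy of one frame (len(frame) divisible by 3)."""
    E = feature(frame)
    if abs(E) > T:
        raise ValueError("T must exceed the largest absolute feature value")
    if bit == 0:
        return list(frame)                 # bit 0: frame stays unchanged
    beta = per_sample_shift(T, G, len(frame))
    sign = 1 if E >= 0 else -1             # shift the feature away from zero
    marked = list(frame)
    for m in range(1, len(marked), 3):     # ASSUMPTION: middle samples only
        marked[m] += sign * beta
    return marked


def extract_and_recover(frame, T, G):
    """Intact-audio case: return (bit, recovered original frame)."""
    E = feature(frame)
    if abs(E) <= T:
        return 0, list(frame)
    beta = per_sample_shift(T, G, len(frame))
    sign = 1 if E > 0 else -1
    restored = list(frame)
    for m in range(1, len(restored), 3):
        restored[m] -= sign * beta         # undo the embedding shift exactly
    return 1, restored
```

Under these assumptions, each changed middle sample moves its prediction error by exactly the per-sample shift, so the frame feature moves by at least the sum of the two thresholds and lands outside the bit-0 region. Because the shift is rounded up to an integer, the feature may move slightly further than that sum.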

Proposed Algorithm
The embedding and extraction processes are presented in detail as follows.
3.1. Watermark Embedding. Figure 4 shows the proposed watermark embedding process. The watermark is embedded with the following five steps.
Step 1. Divide the original audio $x$ into nonoverlapping frames of $S$ samples.
Step 2. Calculate the statistic feature value $E$ of each frame by referring to (1)-(3), and find the maximum absolute value $E_{\max}$.
Step 3. Set the threshold values $T$ and $G$ ($T > E_{\max}$ and usually $G > T$).
Step 4. If the watermark bit is "0," nothing is changed. If the bit is "1," shift the statistic feature value by the shifting quantity $T + G$ to embed the watermark bit, modifying the samples in the frame with the value $\beta$ by referring to (4).
Step 5. Combine the frames to get the watermarked audio.

3.2. Watermark Extraction. If the watermarked audio goes through some attacks (such as MP3 compression, additive noise, resampling, or requantization), the watermark can still be detected. To improve the accuracy of the watermark extraction, three extraction methods and a majority voting system are adopted to identify the extracted watermark from the distorted statistic feature values $E'$.
(i) Extraction 1. Redefine the bit-0 region as $[-T - G/2, T + G/2]$ and the watermark extraction as

$$w_k^{(1)} = \begin{cases} 0, & E'_k \in [-T - G/2,\; T + G/2], \\ 1, & \text{otherwise}. \end{cases} \tag{8}$$

(ii) Extraction 2. Redefine the bit-0 region as $[-T - G/3, T + G/3]$ and the watermark extraction as

$$w_k^{(2)} = \begin{cases} 0, & E'_k \in [-T - G/3,\; T + G/3], \\ 1, & \text{otherwise}. \end{cases} \tag{9}$$

(iii) Extraction 3. The k-means clustering algorithm is introduced to extract the bits. Figure 5 shows the distribution of the $E$ values after MP3 compression, and the watermark bits $w_k^{(3)}$ are extracted from the resulting clusters as in (10).
The majority voting system works as

$$w_k = \begin{cases} 1, & w_k^{(1)} + w_k^{(2)} + w_k^{(3)} \geq 2, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

Here $E'_k$ is the statistic feature of the $k$th frame of the received audio. Figure 6 shows the proposed watermark extraction process. If the watermarked audio remains intact, the watermark can be extracted correctly and the original audio can be recovered with the following steps.
Step 1. Divide the watermarked audio $x'$ into nonoverlapping frames of $S$ samples.
Step 2. Calculate the statistic feature values $E'$ of the frames by referring to (1)-(3).
Step 3. Extract the watermark with the three extraction methods and identify the watermark with the majority voting system by referring to (8)-(11).
Step 4. Recover the original audio by modifying the samples in the frame with the value $\beta$ by referring to (7).
Step 5. Combine the frames to get the original audio.
If the watermarked audio goes through some attacks, the original audio cannot be recovered exactly, so we focus on the watermark extraction, which proceeds as follows.
Step 1. Divide the watermarked audio $x'$ into nonoverlapping frames of $S$ samples.
Step 2. Calculate the statistic feature values $E'$ of the frames by referring to (1)-(3).
Step 3. Extract the watermark with the three extraction methods and identify the watermark with the majority voting system by referring to (8)-(11).
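The three-detector extraction with majority voting can be sketched in Python as follows. The two threshold detectors use the widened bit-0 regions described above (the first threshold plus one half or one third of the second threshold). The clustering detector is only an assumption of this sketch: it runs a plain two-cluster k-means on the absolute feature values, initialized at their minimum and maximum, since the clustering details are not given here. The names `threshold_bits`, `two_means_bits`, and `extract_bits` are hypothetical.

```python
def threshold_bits(features, bound):
    """Bit 0 if the feature lies in [-bound, bound], otherwise bit 1."""
    return [0 if abs(E) <= bound else 1 for E in features]


def two_means_bits(features, max_iter=50):
    """ASSUMED clustering detector: 1-D two-means on |feature| values."""
    mags = [abs(E) for E in features]
    low, high = float(min(mags)), float(max(mags))
    if low == high:
        return [0] * len(mags)             # nothing to separate
    for _ in range(max_iter):
        labels = [1 if abs(m - high) < abs(m - low) else 0 for m in mags]
        zeros = [m for m, lab in zip(mags, labels) if lab == 0]
        ones = [m for m, lab in zip(mags, labels) if lab == 1]
        new_low = sum(zeros) / len(zeros)
        new_high = sum(ones) / len(ones)
        if (new_low, new_high) == (low, high):
            break                          # centroids stopped moving
        low, high = new_low, new_high
    return labels


def extract_bits(features, T, G):
    """Majority vote over the three detectors, one bit per frame."""
    votes_1 = threshold_bits(features, T + G / 2)
    votes_2 = threshold_bits(features, T + G / 3)
    votes_3 = two_means_bits(features)
    return [1 if a + b + c >= 2 else 0
            for a, b, c in zip(votes_1, votes_2, votes_3)]
```

A frame whose distorted feature has drifted into the robust region can be classified differently by the individual detectors; the vote then settles the bit by agreement of at least two of them.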

Experimental Results
In this section, seven WAV audio files with a sample rate of 44.1 kHz and 16 bits per sample (tracks 1, 2, 3, 4, 5, 6, and 7 [18]) are used as example clips to evaluate the performance of the proposed algorithm. The payload of our method depends only on the frame length $S$; for a time-discrete digital audio signal $x$ of length $L$, the pure payload can be calculated by

$$C = \left\lfloor \frac{L}{S} \right\rfloor \text{ bits}. \tag{12}$$

In the experiment, the watermark is a pseudo-random sequence of 1000 bits. The imperceptibility is first analyzed by the SNR standard at different threshold values and different numbers of samples per frame. Then, robustness tests against MP3 compression, additive noise (AWGN), resampling (44.1-16-44.1 kHz), and requantization (16-8-16 bits) are reported, using the software CoolEditPro v2.1.
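Since one bit is embedded per frame, the payload is simply the number of complete frames. The short Python sketch below illustrates this for a hypothetical 60-second single-channel clip at 44.1 kHz; the clip duration and the function name `payload_bits` are assumptions made only for this example.

```python
def payload_bits(num_samples, frame_length):
    """One watermark bit per complete frame of frame_length samples."""
    if frame_length <= 0:
        raise ValueError("frame_length must be positive")
    return num_samples // frame_length


# Hypothetical clip: 60 seconds, one channel, 44.1 kHz sampling rate.
clip_samples = 60 * 44100
for frame_length in (300, 450, 600):
    print(frame_length, payload_bits(clip_samples, frame_length))
```

Doubling the frame length halves the payload, which is the capacity side of the trade-off between frame length and embedding distortion.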

4.1. Imperceptibility Test. The imperceptibility is measured by the embedding distortion. In the proposed scheme, the distortion is caused by the shifting quantity applied to the samples, which depends on the thresholds $T$ and $G$ and the frame length $S$. Since $T$ is set first, we only investigate the influence of $G$ and $S$ on the SNR.
Figure 7 plots the relationship between the SNR and the threshold $G$ for different clips at the same threshold $T$ and frame length $S$. From this figure we can conclude that the SNR drops as $G$ increases. The reason is that a larger $G$ requires a larger shifting quantity and therefore causes a larger embedding distortion.
Figure 8 plots the relationship between the SNR and the frame length $S$ for different clips at the same thresholds $T$ and $G$. We can see from this figure that the larger $S$ is, the higher the SNR. The reason is that, as $S$ increases, the shifting quantity for every single sample drops, so the embedding distortion is reduced and the SNR rises. Consequently, the frame length $S$ directly influences both the maximum embedding capacity and the SNR: the maximum embedding capacity is higher when $S$ is smaller according to (12), and the SNR is higher when $S$ is larger according to Figure 8. To balance the maximum embedding capacity against the SNR, we have found after a set of experiments that a value of $S$ within the range of 300 to 600 usually achieves a satisfactory result.
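For reference, the SNR between a host clip and its watermarked version can be computed as sketched below. The exact SNR formula used in the experiments is not stated in the text, so this sketch assumes the conventional definition (signal energy divided by the energy of the embedding distortion, in decibels); the function name `snr_db` is illustrative.

```python
import math


def snr_db(original, watermarked):
    """SNR in dB, assuming the conventional energy-ratio definition."""
    signal_energy = sum(s * s for s in original)
    noise_energy = sum((s - w) ** 2 for s, w in zip(original, watermarked))
    if noise_energy == 0:
        return math.inf                    # identical signals: no distortion
    return 10.0 * math.log10(signal_energy / noise_energy)
```

With this definition, a smaller per-sample shift gives a smaller distortion energy and hence a higher SNR, which matches the trend reported for longer frames.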

4.2. Robustness Testing. To test the robustness of the proposed scheme, a set of experiments was carried out on tracks 1-7. Table 1 shows the results, in which RP means resampling (44.1-16-44.1 kHz) operations and RQ means requantization (16-8-16 bits) operations. We can observe from this table that all the example clips achieve robustness against MP3 compression at 64 kbps. For track 1, the watermark bits can be correctly extracted even under MP3 compression at 48 kbps. The robustness against additive noise is also satisfactory: even with the noise intensity at 25 dB, the BER (bit error rate) values are less than 10% except for track 1. Besides, the watermark is fully robust against the resampling and requantization operations, and the hidden bits can be recovered without errors. As shown in Figure 3, the robustness of the proposed method originates from the robust region. The robust region depends on the threshold $G$: the larger $G$ is, the larger the robust region and the stronger the robustness. Figure 9 supports this conclusion. Figure 9 shows the BER at different thresholds $G$ for the same audio with the same threshold $T$. A lower BER means that stronger robustness is achieved. We can see that, as the threshold $G$ increases, the BER drops and the robustness rises. Taking track 1 as the example clip, Figure 10 shows the bit error rate of the extracted watermark with different thresholds $G$ against additive noise with the same $T$ and $S$ ($T = 500$, $S = 300$). We can see that the larger $G$ is, the smaller the bit error rate and the better the robustness. In an application, the parameter $G$ can therefore be adjusted to achieve the desired robustness. On the other hand, the SNR drops as $G$ increases. To balance the SNR against the robustness, we have found after a set of experiments that a value of $G$ within 3000 to 5000 usually achieves a satisfactory result.
To evaluate the effect of the frame length $S$ on the robustness performance, a set of experiments was carried out on track 1, track 6, and track 7. Table 2 lists the results. We can observe that, for the same audio with the same $T$ and $G$, such as $T = 500$ and $G = 3000$ for track 1, the robustness against MP3 compression strengthens as $S$ increases, whereas for track 6 and track 7 the robustness against MP3 compression drops as $S$ increases, so the effect of the frame length $S$ on the robustness against MP3 compression is unstable. The influence on AWGN is little.
For a fair comparison with the method in [17], we use the same host signals (tracks 32, 35, 65, 66, and 69) downloaded from the sound quality assessment material (SQAM) collection [19]. Table 3 shows the robustness testing results against MP3 compression and additive noise (AWGN) operations. We can observe that the method in [17] can carry 216 bits and resist MP3 compression at 128 kbps, while the proposed method can resist MP3 compression at 64 kbps with 1000 bits embedded. In addition, in our method the BER under AWGN of 35 dB is less than that of the method in [17]. In other words, the proposed method can provide a larger embedding capacity and obtain stronger robustness against MP3 compression and AWGN attacks. The imperceptibility is evaluated by using the ODG standard. The closer the ODG value is to 0, the better the imperceptibility. From the table it is noted that the imperceptibility of the proposed method is better except for the clips track 35 and track 66. The reason is that the $E_{\max}$ values of track 35 and track 66 are bigger. As a result, the thresholds $T$ and $G$ are also larger and more embedding distortion is caused.

Conclusions
In this paper, we proposed a robust and reversible audio watermarking method that shifts the histogram of the statistical feature values in the time domain. The statistical feature is calculated as the sum of the prediction errors in a frame. Since an audio clip has a large number of samples and each frame can hold enough elements, the statistical feature is robust to common signal processing operations. Considering that the distribution of the statistical feature values may be distorted to some extent, three extraction methods and a majority voting system are designed for the watermark detection. Experimental results have shown that thousands of bits can be reversibly embedded and that the watermark bits can resist MP3 compression at 64 kbps and additive noise of 35 dB. Compared with the existing method in [17], the proposed method can embed more watermark bits and achieve stronger robustness.

Figure 1: The use of $S$ samples as a frame and three samples as a group.

Figure 3: The distribution of the $E$ values of track 1 after embedding the watermark.

Figure 5: The distribution of the statistic feature values of track 1 after MP3 compression at 64 kbps.

Figure 7: Relationship between SNR and threshold $G$.

Figure 8: Relationship between SNR and the frame length $S$.

Figure 9: Relationship between MP3 bit rate and threshold $G$.

Figure 10: The BER values at different AWGN with different threshold $G$.
drops, so the effect of the frame length  on the robustness against MP3 compression is unstable.The influence on AWGN is little.
, Zeng et al. enhanced the scheme of Ni et al. by introducing two thresholds and a new embedding mechanism.This method is blind and reversible.For a satisfactory performance, the two threshold values have to be carefully searched for different cover images.

2.3. Prevention of Overflow/Underflow. For 16-bit digital audio, the permitted range of the sample values is $[-2^{15}, 2^{15} - 1]$. Watermark embedding modifies the sample values by the value $\beta$, so overflow or underflow does not occur if the original sample values belong to $[-2^{15} + \beta, 2^{15} - 1 - \beta]$. In fact, as the value $\beta$ is very small, the original sample values of most normal audio belong to this range. Therefore, in the proposed method, there is no overflow or underflow in most cases. If the audio cannot meet this condition, we can record the locations of the offending samples and modify their values into the range $[-2^{15} + \beta, 2^{15} - 1 - \beta]$; the locations can then be saved as side information and embedded into the audio.
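The overflow and underflow precaution can be sketched in Python as a preprocessing pass that runs before embedding. This is a minimal sketch under stated assumptions: samples are signed integers, the safe range leaves a margin equal to the per-sample shift (called `beta` here) on both sides, and the side information stores each adjusted index together with its original value so that the adjustment can be undone. Storing the original value alongside the location is an assumption of this sketch, and the function names are hypothetical.

```python
def clamp_for_embedding(samples, beta, bits=16):
    """Move out-of-margin samples into the safe range and log the changes."""
    low = -(1 << (bits - 1)) + beta
    high = (1 << (bits - 1)) - 1 - beta
    adjusted, side_info = [], []
    for index, value in enumerate(samples):
        clamped = min(max(value, low), high)
        if clamped != value:
            side_info.append((index, value))   # ASSUMPTION: keep original value
        adjusted.append(clamped)
    return adjusted, side_info


def undo_clamp(samples, side_info):
    """Restore the logged samples after the watermark has been removed."""
    restored = list(samples)
    for index, value in side_info:
        restored[index] = value
    return restored
```

For typical audio the side information is empty, because very few samples sit within the margin of the extreme values.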

Table 1: Performance of the proposed method.

Table 2: Performance of the proposed method with different length of a frame $S$ on track 1.