Speech Watermarking for Tampering Detection Based on Modifications to LSFs

There have been serious issues concerning the protection of speech signals from malicious tampering, and digital watermarking has received much attention as a way of solving this problem. This paper proposes a tampering detection approach based on speech watermarking by modifying the line spectral frequencies (LSFs). Watermarks are embedded into LSFs that are derived from linear prediction (LP) analysis with dither modulation-quantization index modulation (DM-QIM). Minor modifications to LSFs introduced by quantization not only make the watermarks inaudible to the human auditory system but also provide the possibility of robustness against meaningful processing and fragility against tampering. We evaluated the proposed approach objectively with respect to inaudibility, robustness, and fragility. The results indicated that the proposed approach for tampering detection not only satisfied inaudibility but also provided good robustness against meaningful processing and fragility against malicious tampering.


Introduction
The development of digital technologies has greatly stimulated the widespread use of multimedia information. These technologies, however, also enable digital signals to be delivered in a detached manner across time and space, which makes it easy to perform unforeseen operations (tampering) on them. In particular, advanced speech analysis/synthesis methods (e.g., STRAIGHT [1]) and their applications such as voice conversion [2,3] and speech morphing [4] are capable of producing tampered speech of high fidelity. Since tampered speech may cover up facts and mislead listeners, the problem becomes particularly serious in forensic investigations [5,6], where evidence sometimes needs to be recovered from digital media and serves as the basis for judicial proceedings. Correspondingly, it is necessary to investigate whether tampering has happened to the speech in order to ensure its integrity and originality.
Digital watermarking [7,8] has drawn increasing attention in speech protection in the past few years. It can hide digital data in the host signals at a low energy level while keeping the perceptual quality undistorted. Watermarking methods should satisfy four main requirements: (1) inaudibility, (2) blindness, (3) robustness, and (4) confidentiality. The importance of a particular requirement may vary with the application [9,10]. For tampering detection, an additional requirement, fragility, is vitally important. Fragility means that the embedded watermarks are easily destroyed once even a slight modification has been made to the watermarked signals [11]. Correspondingly, fragile watermarking has the ability to identify where tampering has occurred. However, speech signals usually need to be processed by speech codecs or other meaningful processing, and fragile watermarks are unlikely to survive such processing precisely because of their fragility [12]. In practice, effective watermarking methods for tampering detection should satisfy two conflicting requirements: robustness against meaningful processing and fragility against malicious tampering. Only then can the watermarking methods provide reliable and effective protection of speech signals.
Digital watermarking for speech signals is more challenging than image watermarking because of the extreme sensitivity of the human auditory system. Nonetheless, many successful watermarking algorithms for speech signals have been proposed. Sarreshtedari et al. [13] proposed embedding a compressed version of the speech into the original signal for tampering detection. Celik et al. [14] proposed a robust speech watermarking method that introduces small changes to the pitch (fundamental frequency). The stability of such features under low data rate compression makes the method effective for semifragile authentication. Karnjana et al. [15,16] proposed a scheme based on singular-spectrum analysis to detect acoustic feature-based tampering. Wu and Jay Kuo [17] implemented fragile speech watermarking based on odd/even modulation and exponential scale quantization.
The pseudorandom noise was embedded as the watermarks in the discrete Fourier transform (DFT) magnitude domain by roughly approximating the MPEG audio psychoacoustic model. Another method, proposed by Narimannejad and Mohammad [18], was based on phase quantization of the sinusoidal model, in which the watermarks were hidden via phase quantization of sine waves. Unoki and Hamada [19] proposed a digital audio watermarking method based on the characteristics of human cochlear delay. This approach was also successfully applied to speech signals for tampering detection [20].
We previously proposed a speech watermarking approach based on LSFs [21] and DM-QIM [22]. The quantization step of DM-QIM [23] was controlled to achieve a good balance between inaudibility and robustness. The evaluation results suggested that the approach could satisfy inaudibility and robustness. We also found that the approach was very sensitive to processing that could change the shape of the waveform or the values of the watermarked signal.
This characteristic inspired us to investigate whether the approach could be used as fragile watermarking for tampering detection. In this paper, we develop the approach for speech tampering detection. The rest of this paper is organized as follows. Section 2 describes the tampering detection scheme based on the proposed watermarking approach, including the watermark embedding process, the detection process, and the identification of tampering. Subsequently, in Section 3, evaluations concerning inaudibility, robustness, and fragility are carried out. The last section gives a summary of this paper.

Scheme of Tampering Detection
The overall scheme for tampering detection consists of three main parts: embedding, detection, and tampering identification. Figure 1 illustrates a block diagram of this scheme. The original signal $x(n)$ and watermarks $s(m)$ are used to construct the watermarked signal $y(n)$. Whether the received signal has been tampered with can be inferred from the detected watermarks, $\hat{s}(m)$, obtained with nonblind or blind detection.
2.1. Embedding Process.
Figure 1(a) shows the block diagram of the embedding process. The main process can be divided into frame segmentation, linear prediction (LP) analysis, parameter (LSF) extraction, watermark embedding, LP synthesis, and frame connection. We select line spectral frequencies (LSFs) as the carrier of watermarks since LSFs, which are converted from LP coefficients, are less sensitive to noise. We embed watermarks into LSFs as follows.
(i) The original signal, $x(n)$, is first segmented into nonoverlapping frames, and the frame number is indexed by $m$.
(ii) Each frame is analysed by $p$-th order LP analysis to extract the LP coefficients, $a_{mk}$ ($k = 1, 2, \ldots, p$), and the LP residue, $r_m$.
(iii) The LP coefficients, $a_{mk}$ ($k = 1, 2, \ldots, p$), within one frame are converted to LSFs, $f_{mk}$ ($k = 1, 2, \ldots, p$). The obtained LSFs are expressed in the angle domain. All LSFs within one frame satisfy the ordering property from 0 to $\pi$: $0 < f_{m1} < f_{m2} < \cdots < f_{mp} < \pi$.
(iv) The watermark bit, $s(m)$, for frame $m$ is first duplicated $p$ times for all the LSFs in the current frame, and then all LSFs are quantized with one of the DM-QIM quantizers, $Q_0$ and $Q_1$, in equations (1)-(3), depending on the value of $s(m)$. Here, $f_{mk}$ in equation (1) is the LSF to be quantized, $f_{mkw}$ is the quantized value of $f_{mk}$ after using $Q_w$ ($w$ = "0" or "1"), $\Delta$ in equations (2) and (3) is the quantization step, "$[\cdot]$" stands for the rounding function, and $b_i$ ($i$ = 0 and 1) denotes the dither vectors corresponding to $Q_0$ and $Q_1$ to embed "0" and "1."
(v) The modified LSFs, $f_{mkw}$, are converted back to LP coefficients, $a_{mkw}$.
(vi) The current frame is then synthesized with the LP coefficients, $a_{mkw}$ (obtained in step (v)), and the residue, $r_m$ (obtained in step (ii)).
(vii) The whole watermarked signal $y(n)$ is finally reconstructed from all watermarked frames using a nonoverlapping add function.
Figure 2 demonstrates an example of watermark embedding using a quantization step of 2.0°.
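Steps (iii) and (v) rely on converting between LP coefficients and LSFs. The paper does not say how this conversion is implemented, so the following is only a sketch of the textbook construction through the sum and difference polynomials. The function names `lpc_to_lsf` and `lsf_to_lpc` are hypothetical, and the code assumes an even LP order, a minimum-phase LP polynomial, and the sign convention $A(z) = 1 + \sum_k a_k z^{-k}$.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LP coefficients [1, a_1, ..., a_p] to ascending LSFs in radians.

    Assumes A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is minimum phase, so that the
    roots of the sum/difference polynomials lie on the unit circle.
    """
    a = np.asarray(a, dtype=float)
    a_ext = np.concatenate([a, [0.0]])
    p_poly = a_ext + a_ext[::-1]   # P(z) = A(z) + z^-(p+1) A(1/z)
    q_poly = a_ext - a_ext[::-1]   # Q(z) = A(z) - z^-(p+1) A(1/z)
    angles = np.concatenate([np.angle(np.roots(p_poly)),
                             np.angle(np.roots(q_poly))])
    eps = 1e-6                     # drops the trivial roots at z = 1 and z = -1
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])

def lsf_to_lpc(lsf):
    """Rebuild [1, a_1, ..., a_p] from ascending LSFs (even order p assumed)."""
    lsf = np.asarray(lsf, dtype=float)
    p_poly = np.array([1.0, 1.0])    # trivial factor (1 + z^-1) of P(z)
    q_poly = np.array([1.0, -1.0])   # trivial factor (1 - z^-1) of Q(z)
    for w in lsf[0::2]:              # 1st, 3rd, ... LSFs are roots of P(z)
        p_poly = np.convolve(p_poly, [1.0, -2.0 * np.cos(w), 1.0])
    for w in lsf[1::2]:              # 2nd, 4th, ... LSFs are roots of Q(z)
        q_poly = np.convolve(q_poly, [1.0, -2.0 * np.cos(w), 1.0])
    a_ext = 0.5 * (p_poly + q_poly)  # A(z) = (P(z) + Q(z)) / 2
    return a_ext[:-1]                # the last coefficient cancels to zero
```

With a 10th-order LP polynomial, `lpc_to_lsf` returns ten angles that satisfy the ordering property of step (iii), and `lsf_to_lpc` maps them (or their quantized versions) back to LP coefficients for synthesis.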
The positions of LSFs on the half unit circle reflect the formant information of the speech signal. The LP order, $p$, was ten; thus, ten LSFs were calculated. In this case, five formants were estimated. Watermark "1" was embedded into each LSF with equations (1) and (3). Since the quantization step of 2.0° was small, the original LSFs, as well as the positions of the formants, shifted only slightly from their previous positions. Accordingly, the sound quality was not seriously distorted.
In blind detection, watermarks should be detected without using the original signal $x(n)$.
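The quantization of step (iv) can be illustrated with a minimal sketch of a dither-modulated quantizer. Equations (1)-(3) are not reproduced here, so the form $Q_w(f) = \Delta\,[(f - b_w)/\Delta] + b_w$ and the dither values $b_0 = -\Delta/4$ and $b_1 = +\Delta/4$ are assumptions (a common DM-QIM choice), not necessarily the ones used in the paper. The function name `dm_qim` and the LSF values are hypothetical, and angles are given in degrees for readability.

```python
import numpy as np

def dm_qim(f, bit, delta):
    """Quantize LSFs onto the lattice that encodes `bit` (0 or 1).

    Assumed dithers: b_0 = -delta/4 and b_1 = +delta/4, so the two lattices
    are offset from each other by delta/2.
    """
    dither = -delta / 4.0 if bit == 0 else delta / 4.0
    f = np.asarray(f, dtype=float)
    return delta * np.round((f - dither) / delta) + dither

# Ten hypothetical LSFs of one frame, in degrees (ascending, between 0 and 180).
lsf_deg = np.array([12.3, 21.8, 35.1, 52.6, 70.4,
                    88.9, 105.2, 124.7, 143.3, 160.5])

# Embed bit "1" into every LSF of the frame with a 2.0 degree step.
marked_deg = dm_qim(lsf_deg, bit=1, delta=2.0)
max_shift_deg = np.max(np.abs(marked_deg - lsf_deg))  # never exceeds delta / 2
```

Under these assumptions each LSF moves by at most half the quantization step, which is consistent with the small shifts of the formant positions described above. The quantizer is non-decreasing, so the ordering of the LSFs is preserved, although two LSFs closer than one step could collapse onto the same value.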

Nonblind Detection.
Mathematical Problems in Engineering
The detailed procedures for nonblind detection in Figure 1(b) are as follows.
(i) The original signal, $x(n)$, and the watermarked signal, $y(n)$, are segmented into nonoverlapping frames, where the frame number is indexed by $m$.
(ii) A $p$-th order LP analysis is applied to the frames of $x(n)$ to obtain the LP residue, $r_m$.
(iii) The LP coefficients, $a_{mk}$ ($k = 1, 2, \ldots, p$), are calculated using the LP residue $r_m$ of the original frame and the current watermarked frame.
(iv) The LP coefficients, $a_{mk}$ ($k = 1, 2, \ldots, p$), are converted to LSFs, $f_{mk}$ ($k = 1, 2, \ldots, p$).
Since we embed the same bit into all LSFs of one frame in the embedding process, there is a possibility that not all the LSFs can be correctly detected. Thus, we use a majority decision to decide the embedded bit. According to the block diagram in Figure 3, each LSF within one frame is requantized with both quantizers in equation (4). We calculate the distances, $d_{mkw}$ ($k = 1, 2, \ldots, p$, $w$ = "0" and "1"), between the two quantized results, $f_{mkw}$ ($w$ = "0" and "1"), and the obtained LSF, $f_{mk}$ ($k = 1, 2, \ldots, p$), using equation (5). Each LSF indicates one embedded bit ("0" or "1") according to the quantizer that provides the shorter distance, using equation (6). We sum up the values of all detected bits to $L$ with equation (7), and the final decision on the embedded bit of the current frame is obtained by comparing the value of $L$ with $p/2$ using equation (8):

$$d_{mkw} = \left| f_{mkw} - f_{mk} \right|, \quad w = 0 \text{ and } 1, \quad k = 1, 2, \ldots, p. \tag{5}$$

The procedure for blind detection is similar to that for nonblind detection; the difference between them is in how the LSFs of one frame are obtained. In blind detection, the LP coefficients, $a_{mk}$ ($k = 1, 2, \ldots, p$), are directly calculated from each watermarked frame. The LSFs, $f_{mk}$ ($k = 1, 2, \ldots, p$), are converted from these LP coefficients. The embedded bit is calculated using the same equations (4)-(8) as in nonblind detection.
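The majority decision described for equations (4)-(8) can be sketched as follows. This is an illustration built from the prose description only: the quantizer form and its dithers ($\mp\Delta/4$) are assumptions, the names `dm_qim` and `detect_frame_bit` are hypothetical, and a tie ($L = p/2$) is resolved to "0" here because the tie rule is not stated.

```python
import numpy as np

def dm_qim(f, bit, delta):
    """Assumed DM-QIM quantizer with dithers -delta/4 (bit 0) and +delta/4 (bit 1)."""
    dither = -delta / 4.0 if bit == 0 else delta / 4.0
    f = np.asarray(f, dtype=float)
    return delta * np.round((f - dither) / delta) + dither

def detect_frame_bit(lsf, delta):
    """Detect the bit carried by one frame from its p LSFs by majority decision."""
    lsf = np.asarray(lsf, dtype=float)
    # Requantize with both quantizers and measure the distances (eqs. (4), (5)).
    d0 = np.abs(dm_qim(lsf, 0, delta) - lsf)
    d1 = np.abs(dm_qim(lsf, 1, delta) - lsf)
    # Each LSF votes for the quantizer that is closer (eq. (6)).
    votes = (d1 < d0).astype(int)
    # Sum the votes and compare with p / 2 (eqs. (7), (8)); a tie gives 0 (assumption).
    total = int(votes.sum())
    return 1 if total > lsf.size / 2 else 0
```

Because the two lattices are offset by $\Delta/2$ under these assumptions, a single LSF votes correctly as long as it has drifted by less than $\Delta/4$ from its embedded value, and the frame decision tolerates a minority of wrong votes.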

Tampering Identification.
Using the above nonblind and blind detection, we can detect the watermarks, $\hat{s}(m)$, from the watermarked signal, $y(n)$. To verify whether tampering has happened to the watermarked signal, $y(n)$, before it is received at the receiver side, we compare the original watermarks, $s(m)$, with the detected watermarks, $\hat{s}(m)$, and find the mismatches. According to Figure 1(c), this can be done simply with a bitwise exclusive-OR. If there are no mismatches, the received signal, $y(n)$, is regarded as the original, untampered signal; otherwise, each mismatch indicates possible tampering of the corresponding frame.
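As a minimal sketch of this comparison, the following code XORs the embedded and detected bit sequences and maps each mismatch to the time span of its frame. The function name `locate_tampering` is hypothetical, and the mapping from frame index to time assumes one bit per nonoverlapping frame, so that a frame lasts `1 / bit_rate` seconds.

```python
import numpy as np

def locate_tampering(s, s_hat, bit_rate):
    """Return (start, end) times in seconds of frames whose bits mismatch."""
    s = np.asarray(s, dtype=int)
    s_hat = np.asarray(s_hat, dtype=int)
    mismatch = np.bitwise_xor(s, s_hat)   # 1 where embedded and detected bits differ
    frame_duration = 1.0 / bit_rate       # assumes one bit per frame
    return [(int(m) * frame_duration, (int(m) + 1) * frame_duration)
            for m in np.flatnonzero(mismatch)]

embedded = [1, 0, 1, 1, 0, 0, 1, 0]
detected = [1, 0, 1, 0, 1, 0, 1, 0]       # frames 3 and 4 were disturbed
suspect_spans = locate_tampering(embedded, detected, bit_rate=4)
```

An empty list means no mismatch was found; otherwise, each returned span marks a frame that may have been tampered with.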

Evaluations
We carried out three experiments with respect to (1) inaudibility, (2) robustness against codecs and meaningful processing, and (3) fragility against tampering, to evaluate the performance of the proposed approach [8]. The ATR dataset (B set) consisting of 12 speech stimuli (Japanese sentences, 20 kHz, 16 bits) [24] was used to evaluate the proposed method. This dataset is also widely used to investigate speaker properties, e.g., speaker individuality and acoustic/phonetic features. Therefore, it is quite suitable for evaluating the tampering detection performance of the proposed method. Each stimulus was clipped to a duration of 8.1 s and embedded with watermarks at different bit rates. The bit rates in our experiments were set to 4, 8, 16, 32, 64, 128, 256, 512, and 1024 bps. The embedded watermark was the 122 × 77 bitmap image in Figure 4.
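The relation between bit rate and frame length can be made concrete with a short calculation. It assumes, as in the embedding process, that each nonoverlapping frame carries one bit, so the frame length is the sampling rate divided by the bit rate. The function name `frame_layout` is hypothetical, and rounding the frame length down to a whole number of samples is an assumption.

```python
def frame_layout(fs_hz, bit_rate_bps):
    """Return (frame length in samples, frame duration in seconds)."""
    frame_len = fs_hz // bit_rate_bps   # whole samples per frame (rounded down)
    return frame_len, frame_len / fs_hz

# Frame lengths for the bit rates used in the experiments at 20 kHz.
for rate in (4, 8, 16, 32, 64, 128, 256, 512, 1024):
    length, duration = frame_layout(20000, rate)
    print(f"{rate:5d} bps: {length:5d} samples per frame ({duration * 1000:.2f} ms)")
```

Higher bit rates therefore mean shorter frames for the LP analysis, which is one way to read the trade-off between payload and sound quality reported in the evaluations.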
The LP order is important for the performance of the proposed method. A high LP order helps to follow the details of the spectral contour, whereas a low LP order provides global frequency information only. Under low-order LP analysis, each LSF carries more information than under high-order LP analysis. As a result, the sound distortion caused by quantizing the LSFs of low-order LP analysis will be severe. On the other hand, most processing introduces distortions into the watermarked signal; if the LP order is high enough to follow all the spectral details, any distortion will disturb the LSF deviation, which obstructs watermark detection. In this respect, the LP order should be low to achieve robustness. Following the above analysis, we selected a suitable LP order for the proposed method based on preliminary experiments. The LP order was finalized as 10 to balance inaudibility and robustness. The quantization step in QIM also affects the performance of the proposed method and involves a trade-off between the conflicting requirements of inaudibility and robustness. A small quantization step provides better sound quality of the watermarked signal; however, the robustness is degraded. In this work, we chose 1.0° as a suitable quantization step to achieve a good balance between inaudibility and robustness.

Evaluations for Inaudibility.
The log-spectrum distortion (LSD) [25] and the perceptual evaluation of speech quality (PESQ) [26] were adopted to check the inaudibility of the proposed method. The LSD is a distance measure (in decibels (dB)) between the spectra of the original signal and the watermarked signal. An LSD of 1 dB was chosen as the criterion, and a lower value indicates less distortion. The PESQ, recommended in ITU-T Recommendation P.862, is a family of standards for automated assessment of speech quality. The results of PESQ are Objective Difference Grades (ODGs), which are graded from −0.5 (very annoying) to 4.5 (imperceptible), corresponding to Mean Opinion Score (MOS) values. An ODG of 3.0 (slightly annoying) was set as the criterion, and a higher value indicates better quality. Figure 5 shows an example of embedding watermark "1" into one frame of the original signal using a quantization step of 1.0°. The waveforms and the spectra of the original signal and the watermarked signal are shown in the top two panels, and the differences between them are shown in the bottom panel. One can see that the differences between the original signal and the watermarked signal in both the time domain and the frequency domain were negligible, which indicated that the proposed method introduced almost imperceptible distortion to the human auditory system. The objective evaluation results are provided by LSD (Figure 6(a)) and PESQ (Figure 6(b)).
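For readers who want to reproduce the LSD measure, the sketch below uses one common definition: the root-mean-square difference of the log power spectra in each frame, averaged over frames. The definition in [25] may differ in detail, and the frame length, hop size, Hann window, and the small constant `eps` are assumptions chosen for illustration. The function name `log_spectral_distance` is hypothetical.

```python
import numpy as np

def log_spectral_distance(x, y, frame_len=512, hop=256, eps=1e-12):
    """Mean over frames of the RMS log-power-spectrum difference, in dB."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = min(len(x), len(y))
    if n < frame_len:
        raise ValueError("signals are shorter than one analysis frame")
    window = np.hanning(frame_len)
    per_frame = []
    for start in range(0, n - frame_len + 1, hop):
        px = np.abs(np.fft.rfft(window * x[start:start + frame_len])) ** 2
        py = np.abs(np.fft.rfft(window * y[start:start + frame_len])) ** 2
        diff_db = 10.0 * np.log10((py + eps) / (px + eps))  # eps avoids log(0)
        per_frame.append(np.sqrt(np.mean(diff_db ** 2)))
    return float(np.mean(per_frame))
```

Under this definition, identical signals give 0 dB, and doubling the amplitude of a signal gives about 6.02 dB, so the 1 dB criterion corresponds to a small average spectral deviation.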
The straight blue dashed lines in each subfigure indicate the criteria for LSD (≤ 1 dB) and PESQ (≥ 3.0 ODG). Since the embedding process is the same for nonblind and blind detection, the evaluation results for LSD and PESQ are the same for both. As we can see, the sound quality got worse as the bit rate increased. Nevertheless, for all bit rates from 4 bps to 1024 bps, the watermarked signals satisfied the criteria for both LSD and PESQ. These results indicated that, for a quantization step of 1.0°, the proposed approach (with nonblind and blind detection) could satisfy inaudibility for all bit rates.

Evaluations for Robustness.
Robustness of the proposed approach was evaluated from two aspects: (a) robustness against different speech codecs and (b) robustness against general processing. We adopted the bit detection rate to measure the robustness; a higher bit detection rate suggests better robustness. The bit detection rate is defined in equation (9), where $s(m)$ represents the embedded watermarks, $\hat{s}(m)$ represents the detected watermarks, and $W$ is the length of the watermarks. The symbol "$\oplus$" denotes the exclusive-OR operation; that is, if the bit values of $s(m)$ and $\hat{s}(m)$ are the same ($s(m) = 1$ and $\hat{s}(m) = 1$, or $s(m) = 0$ and $\hat{s}(m) = 0$), $s(m) \oplus \hat{s}(m)$ equals 0; otherwise, $s(m) \oplus \hat{s}(m)$ equals 1:

$$\text{Bit detection rate} = \frac{W - \sum_{m=1}^{W} s(m) \oplus \hat{s}(m)}{W} \times 100\ (\%). \tag{9}$$
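The bit detection rate described here is the percentage of watermark bits that are detected correctly, with the exclusive-OR counting the mismatches. A minimal sketch of this calculation follows; the function name `bit_detection_rate` is hypothetical.

```python
import numpy as np

def bit_detection_rate(s, s_hat):
    """Percentage of bits in s_hat that match the embedded bits s."""
    s = np.asarray(s, dtype=int)
    s_hat = np.asarray(s_hat, dtype=int)
    if s.shape != s_hat.shape:
        raise ValueError("embedded and detected watermarks must have equal length")
    errors = int(np.sum(np.bitwise_xor(s, s_hat)))   # number of mismatched bits
    return 100.0 * (s.size - errors) / s.size
```

For example, two mismatches in a four-bit watermark give a bit detection rate of 50%, well below the 90% criterion used in the robustness figures.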

Robustness against Speech Codecs.
In general, speech codecs can be classified into waveform-based and parameter-based schemes. Watermarking methods are thus required to be robust against both kinds of speech codecs. We chose two typical speech codecs, G.711 (waveform-based) and G.729 (parameter-based), to evaluate the robustness of the proposed approach. Figure 7 presents the bit detection results for normal detection without any modifications (Figure 7(a)), detection after G.711 (Figure 7(b)), and detection after G.729 (Figure 7(c)). The straight blue dashed lines in each subfigure indicate the criterion of a 90% bit detection rate. As we can see from Figure 7(a), the nonblind approach had almost 100% bit detection rates for all bit rates, while for the blind detection, the bit detection rates were a little lower. For bit detection after G.711 and G.729, the nonblind approach in Figures 7(b) and 7(c) was robust against these speech codecs. However, the bit detection rates were dramatically reduced in the blind detection approach (Figure 7). We investigated why there are such big differences in bit detection between the nonblind approach and the blind approach. In the nonblind approach, the original LP residue can be used to calculate the LSFs in the detection process. Thus, the LSFs obtained in this approach were almost the same as those in the embedding process, which facilitates watermark detection. However, in the blind approach, the LSFs are directly derived from the watermarked signal without the assistance of the original LP residue. The LP analysis calculates the LP coefficients (LSFs) based on the squared error criterion, and thus, the LSFs derived directly from the watermarked signal differ from the modified LSFs in the embedding process. As a result, the watermarks cannot be detected accurately (Figure 8).

Robustness against General Processing.
We evaluated the robustness of the proposed approaches against several kinds of meaningful processing, which are listed below:
(1) Scaling by 0.5 and 2.0
(2) Resampling at 12 kHz and 24 kHz
(3) Requantization with 8 bits and 24 bits
(4) Spectrum modification by short-time Fourier transform (STFT)
The 16 × 16 bitmap image (i.e., watermarks) in Figure 9(a) was embedded into the original signal at 4 bps. In this case, each embedded bit accounts for a 0.25 s speech segment when locating the tampering. In fact, 0.25 s is too short for a meaningful tampering of speech content. Therefore, an embedding bit rate of 4 bps can locate tampering in the time domain with sufficient precision in practice. We processed the middle segment of the watermarked signals with the above processing and detected the embedded watermarks from the processed signals. The watermarks should be correctly detected if the watermarking method is robust against this processing.
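Two of the listed operations, scaling and requantization, are simple enough to sketch directly. The code below is only an illustration of what such processing might look like: it assumes the signal is stored as floating-point samples in the range [−1, 1), the function names `scale` and `requantize` are hypothetical, and the exact tools and settings used in the experiments are not specified in the text. Resampling and STFT-based spectrum modification are not covered here.

```python
import numpy as np

def scale(y, gain):
    """Amplitude scaling, e.g. gain = 0.5 or 2.0."""
    return gain * np.asarray(y, dtype=float)

def requantize(y, bits):
    """Round samples in [-1, 1) to a uniform grid with the given bit depth."""
    y = np.asarray(y, dtype=float)
    levels = 2 ** (bits - 1)                     # 128 for 8 bits
    codes = np.clip(np.round(y * levels), -levels, levels - 1)
    return codes / levels

samples = np.array([0.0, 0.5, -0.25, 0.123456])  # hypothetical watermarked samples
coarse = requantize(samples, 8)                  # error up to 1/256 per sample
fine = requantize(samples, 24)                   # error up to about 3e-8 per sample
```

The difference in rounding error between 8 and 24 bits gives an intuition for why requantization with 8 bits is the harder case in the results that follow.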
Figure 9 illustrates all the nonblind and blind detection results. The nonblind detection performed better than the blind detection. The bit detection rate for each subfigure is listed in Table 1. For the nonblind detection, the proposed approach had very good robustness against general speech processing, since almost all the bit detection rates were over 90%, the exception being requantization with 8 bits. For the blind detection, the bit detection rates were slightly lower, as shown in Figure 9(b). The bit detection rates under, e.g., Figures 9(c), 9(d), 9(f), 9(h), and 9(i) were still satisfactory. Nevertheless, the proposed blind approach was not robust against resampling at 12 kHz and requantization with 8 bits.
The main reason for this was that resampling or requantization at a lower rate introduced distortions into the watermarked signals.

These results suggested that the proposed approach, with both nonblind and blind detection, was fragile against these kinds of tampering. Therefore, it was easy to identify tampering from such results.

Conclusions
This paper proposed a watermarking-based tampering detection approach for speech signals. Watermarks are embedded by modifying the line spectral frequencies (LSFs) using dither modulation-quantization index modulation (DM-QIM). We evaluated the proposed approach by carrying out three objective evaluations, i.e., of inaudibility, robustness, and fragility.
The evaluation results suggested that the proposed approach could satisfy inaudibility and provided good robustness. Furthermore, it was also fragile against malicious tampering. Therefore, it is effective for speech tampering detection.

Figure 1(b) outlines two schemes of detection: (i) nonblind detection (top side), in which both the original signal $x(n)$ and the watermarked signal $y(n)$ are available, and (ii) blind detection (bottom side).

Figure 5: Waveform and spectral differences between the original signal and the watermarked signal (LP order: 10th; quantization step: 1.0°). (a) Waveform of the original signal, (b) waveform of the watermarked signal: embed "1," (c) difference in waveform, (d) spectrum of the original signal, (e) spectrum of the watermarked signal: embed "1," and (f) difference in spectrum.

Figure 6: Inaudibility performance of the proposed approach measured by (a) LSD and (b) PESQ.

Fragility. Similar to the robustness evaluation, we embedded the same 16 × 16 bitmap image in the fragility evaluation. We manually modified the middle segment of the watermarked signals with the malicious tampering listed below and then checked whether the embedded watermarks were destroyed:
(3) Reverberation (time: 0.3 s)
(4) Filtering with a low-pass filter (order: 32nd; normalized cutoff frequency: 0.99)
(5) Filtering with a high-pass filter (order: 32nd; normalized cutoff frequency: 0.01)
(6) Concatenation with original speech
The detected images are shown in Figure 10, and we calculated the bit detection rate of each image (Table 2). It is found that the bit detection rates after tampering were ...

Table 1: Bit detection rates for robustness tests.

Table 2: Bit detection rates for fragility tests.