Parameterization of LSB in Self-Recovery Speech Watermarking Framework in Big Data Mining

Privacy is a major concern in big data mining. In this paper, we propose a novel self-recovery speech watermarking framework that takes trustworthy communication in big data mining into consideration. In the framework, the watermark is a compressed version of the original speech and is embedded into the least significant bit (LSB) layers. At the receiver end, the watermark is used to detect the tampered area and recover the tampered speech. To fit the complexity of scenarios in big data infrastructures, the number of LSB layers is treated as a parameter. This work derives explicit mathematical relationships between the number of LSB layers and the other parameters. Once the number of LSB layers has been chosen, the best choices of the other parameters are deduced using the exclusion method. Additionally, we observed that six LSB layers are the limit for watermark embedding when the total number of bit layers is sixteen. Experimental results indicated that when the number of LSB layers changed from six to three, the imperceptibility of the watermark increased, while the quality of the recovered signal decreased accordingly. This is a trade-off, and the number of LSB layers should be chosen according to the application conditions in big data infrastructures.


Introduction
In recent years, the rapid development of the Internet and mobile phones has resulted in an explosion of data. Although it has become convenient to obtain information, digital data can be replaced with fake information, potentially by an adversary, or even lost as a result of poor communication conditions. Therefore, how to guarantee data integrity and recover tampered data has become an important problem in big data mining infrastructures [1][2][3]. Watermarking, defined as the art of embedding a secret message into the original signal, is an effective way to solve this problem [4,5].
Self-recovery watermarking techniques first became popular in the image domain [6][7][8], and the pioneering study on image watermarks dates back to the last century [9]. A variety of methods exist for watermark embedding and data recovery, such as the discrete cosine transform [10,11], multiple watermarks [12], and source-channel coding [13].
With an increasing amount of audio and speech data, the security and privacy of speech have become an urgent problem. However, self-recovery methods are less explored in the speech domain. Because the human auditory system is more sensitive than the human visual system, more accurate schemes are needed to recover tampered speech data. Traditional research has focused on detecting the tampered area without further recovering the tampered speech [14], which limits its application. A fragile segment-based watermarking scheme for speech tampering detection and recovery was proposed in [15]. The algorithm can both detect the tampered area and recover the lost data, but it has two shortcomings: the tampering coincidence problem and the watermark data waste problem. To solve both shortcomings at the same time, a novel method using the reference-sharing mechanism [16] was proposed in [17]. Moreover, Reed-Solomon codes have been exploited to design an effective speech self-recovery scheme [18]. In addition, many works explore different approaches to watermark embedding based on intrinsic characteristics of the signal, such as synthesized echoes [19,20], spread spectrum techniques [21,22], and patchwork watermarking methods [23,24].
To adapt to the complicated scenarios in big data mining infrastructures, this paper discusses the influence of the number of LSB layers used for watermark embedding. Our work is based on the speech self-recovery framework proposed in [17], in which six LSB layers are used for watermark embedding, chosen empirically. In this paper, the number of LSB layers is treated as a parameter, and its relationship with the other parameters is discussed. By exploring the quantitative relationship among the number of LSB layers, the maximum quantized bits, and the hash bits, the best choices of the other parameters are deduced by the exclusion method as the number of LSB layers changes. We also observed that three to six LSB layers should be chosen for watermark embedding when the total number of bit layers is sixteen. Within this range, the imperceptibility of the watermark and the quality of the recovered signal change in opposite directions, so the number of LSB layers should be chosen according to the big data infrastructure at hand. Moreover, as the tampered rate increases, fewer reserved areas can provide effective reference bits, which may degrade the quality of the recovered signal.
This paper makes three contributions. First, once the number of LSB layers is fixed, the best choices of the other parameters are deduced using the exclusion method. Second, we find that six LSB layers are the limit for watermark embedding, which is verified through experiments. Third, the trade-off between the imperceptibility of the watermarked speech signal and the quality of the recovered speech signal is discussed; different numbers of LSB layers should be chosen to balance it in different big data infrastructures.
The remainder of the paper is organized as follows. Section 2 introduces the framework for speech watermark embedding and tampered speech recovery, including the parameterization of LSB and its relationship with the other parameters. Experimental results are presented in Section 3. Section 4 discusses how to choose the number of LSB layers for different big data infrastructures, together with other aspects of the proposed scheme. Section 5 concludes the paper.

The Speech Self-Recovery Framework
The speech watermarking scheme consists of two procedures: watermark embedding and tampered area recovery. This section also describes the parameterization of LSB. The details are as follows.
2.1. Watermark Embedding Procedure
Assume that an original 16-bit speech signal sampled at 8 kHz has $N$ samples. In the algorithm, a frame consists of 64 neighboring samples, so there are $\lceil N/64 \rceil$ frames in total. If $N$ is not a multiple of 64, zeros are appended to the end of the signal until its length is divisible by 64. These frames are then permuted randomly according to a secret key, which is known to both sides to protect privacy. A frame group consists of 16 neighboring frames in the random permutation, so the total number of frame groups is $\lceil N/1024 \rceil$. Both the embedding and the recovery procedures are carried out within one frame group.
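The framing and grouping steps can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name `frame_and_group` is hypothetical, and Python's `random.Random` seeded with the key stands in for the shared secret-key permutation, which in practice would need a cryptographically secure generator.

```python
import random

FRAME_LEN = 64          # samples per frame (from the text)
FRAMES_PER_GROUP = 16   # frames per frame group (from the text)


def frame_and_group(samples, key):
    """Zero-pad to a multiple of 64, cut into frames, permute, and group.

    `key` seeds Python's PRNG as an assumed stand-in for the secret key;
    a real system would use a cryptographic generator instead.
    """
    padded = list(samples) + [0] * ((-len(samples)) % FRAME_LEN)
    frames = [padded[i:i + FRAME_LEN] for i in range(0, len(padded), FRAME_LEN)]
    order = list(range(len(frames)))
    random.Random(key).shuffle(order)          # key-dependent frame permutation
    groups = [order[g:g + FRAMES_PER_GROUP]    # 16 neighbours in permuted order
              for g in range(0, len(order), FRAMES_PER_GROUP)]
    return frames, groups


# A 5 s signal at 8 kHz has 40000 samples, i.e. 625 frames in 40 groups.
frames, groups = frame_and_group([0] * 40000, key=2024)
print(len(frames), len(groups))   # 625 40
```

The last group may hold fewer than 16 frames, which is consistent with the group count $\lceil N/1024 \rceil$ given above.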
Out of the 16 bit layers, $L$ LSB layers are dedicated to watermark embedding, while the remaining $16 - L$ most significant bit (MSB) layers remain unchanged during the entire procedure. The watermark consists of two parts, reference bits and hash bits, which are introduced below.
In each frame of a frame group, the amplitude of the original signal is divided by 16 to obtain the compressed information, a 64-dimensional vector:
$$\mathbf{x}_k = [x_{1,k}, x_{2,k}, \ldots, x_{64,k}]^{T}, \tag{1}$$
where $x_{i,k}$ ($i = 1, 2, \ldots, 64$, $k = 1, 2, \ldots, 16$) is the compressed information and $k$ is the index of each frame. The vectors are then randomly permuted according to a secret key to form a 1024-dimensional vector:
$$\mathbf{x} = [\mathbf{x}_{k_1}^{T}, \mathbf{x}_{k_2}^{T}, \ldots, \mathbf{x}_{k_{16}}^{T}]^{T}, \tag{2}$$
where the subscripts $k_1, k_2, \ldots, k_{16}$ indicate the random permutation of frames. Next, 368 reference values are calculated in each frame group in the following linear manner:
$$[r(1), r(2), \ldots, r(368)]^{T} = \mathbf{A}\mathbf{x}, \tag{3}$$
where $\mathbf{A}$ is a random $368 \times 1024$ matrix in which each row has unit Euclidean norm. To generate $\mathbf{A}$, a $368 \times 1024$ matrix $\mathbf{A}_0$ is first produced whose elements are independent and identically distributed (i.i.d.) zero-mean Gaussian random variables. Then the elements of $\mathbf{A}$ are obtained as
$$A(i,j) = \frac{A_0(i,j)}{\sqrt{\sum_{j'=1}^{1024} A_0(i,j')^{2}}}, \tag{4}$$
where $A(i,j)$ and $A_0(i,j)$ are the elements of $\mathbf{A}$ and $\mathbf{A}_0$, respectively. According to the central limit theorem, the reference values approximately follow a zero-mean Gaussian distribution. There are 368 reference values and 16 frames in each frame group, so each frame carries 23 randomly assigned reference values. This reference-sharing mechanism effectively avoids both the tampering coincidence problem and the watermark data waste problem: as long as not all 16 frames in a group have been tampered with, it is possible to achieve high recovered quality.
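A minimal sketch of the reference-value computation for one frame group, under stated assumptions: the helper names are hypothetical, a seeded `random.Random` plays the role of the secret key, and the assignment of 23 reference values to each frame is modeled as a key-dependent shuffle of the 368 indices. Otherwise it follows the text: divide amplitudes by 16, concatenate the 16 frames in permuted order, and project with a row-normalized Gaussian matrix.

```python
import math
import random

ROWS, COLS = 368, 1024        # reference values and samples per frame group


def make_projection(key, rows=ROWS, cols=COLS):
    """i.i.d. zero-mean Gaussian entries, each row scaled to unit Euclidean norm."""
    rng = random.Random(key)   # assumed stand-in for the secret key
    matrix = []
    for _ in range(rows):
        row = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        norm = math.sqrt(sum(v * v for v in row))
        matrix.append([v / norm for v in row])
    return matrix


def reference_values(group_frames, perm, matrix):
    """Divide amplitudes by 16, concatenate frames in permuted order, project."""
    x = [s / 16 for k in perm for s in group_frames[k]]
    return [sum(a * v for a, v in zip(row, x)) for row in matrix]


def share_references(key, rows=ROWS, frames=16):
    """Hypothetical assignment of the 368 reference indices, 23 per frame."""
    idx = list(range(rows))
    random.Random(key).shuffle(idx)
    per = rows // frames
    return [idx[k * per:(k + 1) * per] for k in range(frames)]
```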
To meet the storage constraint, the real-valued reference values must be converted into integers, so the next step is to quantize them. Each reference value $r(j)$ is converted into an integer $\hat{r}(j)$ within $[-R_{\max}, R_{\max}]$, which can be represented by $Q$ bits (the maximum quantized bits), so that
$$2R_{\max} + 1 \le 2^{Q}. \tag{7}$$
Thus, each frame carries $23 \times Q$ reference bits, and each frame group carries $368 \times Q$ reference bits.
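The text does not spell out the quantizer, so the sketch below is only one plausible design, stated as an assumption: the reference values are treated as zero-mean Gaussian with a known spread `sigma`, the range is taken as $R_{\max} = 2^{Q-1} - 1$ (the largest symmetric range that fits in $Q$ bits), and the $2R_{\max}+1$ integers index equal-probability intervals. The receiver-side estimate returns the median of the selected interval, in line with the description of the recovery procedure. All function names are hypothetical.

```python
from statistics import NormalDist


def r_max_for(q_bits):
    """Largest symmetric range [-R_max, R_max] that fits in Q bits (assumed)."""
    return 2 ** (q_bits - 1) - 1


def quantize(r, q_bits, sigma):
    """Map a real reference value to an integer in [-R_max, R_max].

    Assumed quantizer: 2*R_max + 1 equal-probability intervals of N(0, sigma^2).
    """
    r_max = r_max_for(q_bits)
    levels = 2 * r_max + 1
    u = NormalDist(0.0, sigma).cdf(r)
    return min(int(u * levels), levels - 1) - r_max


def dequantize(q, q_bits, sigma):
    """Receiver-side estimate: the median of the interval indexed by q."""
    r_max = r_max_for(q_bits)
    levels = 2 * r_max + 1
    return NormalDist(0.0, sigma).inv_cdf((q + r_max + 0.5) / levels)


def to_bits(q, q_bits):
    """Offset-binary code of q using exactly Q bits."""
    return format(q + r_max_for(q_bits), f"0{q_bits}b")


# Finer quantization (larger Q, hence larger R_max) gives smaller errors.
values = [i / 10 for i in range(-30, 31)]
for q_bits in (7, 16):
    err = max(abs(dequantize(quantize(v, q_bits, 1.0), q_bits, 1.0) - v) for v in values)
    print(q_bits, round(err, 4))
```

The loop illustrates the argument used later for choosing $Q$: a larger $Q$ narrows every interval, so the worst-case estimation error shrinks.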
For each frame, the index of the frame is represented by 64 bits, called position bits. The $64 \times (16 - L)$ bits in the MSB layers of a frame are called MSB bits. The 64 position bits, the $64 \times (16 - L)$ MSB bits, and the $23 \times Q$ reference bits are fed into a hash function to produce $H$ hash bits. To guarantee privacy, $H$ label bits are randomly generated, and the exclusive-or (XOR) of the hash bits and the label bits gives the check bits:
$$c_k(t) = h_k(t) \oplus l(t), \quad t = 1, 2, \ldots, H, \tag{8}$$
where $h_k(1), h_k(2), \ldots, h_k(H)$ are the hash bits of the $k$th frame, $l(1), l(2), \ldots, l(H)$ are the label bits, which are the same for every frame, and $c_k(1), c_k(2), \ldots, c_k(H)$ are the check bits of the $k$th frame.
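The hash-and-XOR step can be sketched as below. The text does not name the hash function, so SHA-256 truncated to $H$ bits is an assumed stand-in (it limits $H$ to 256); bits are handled as '0'/'1' strings for readability, and the helper names are hypothetical. The illustrative sizes use $L = 6$, $Q = 16$, $H = 16$, one combination satisfying $64L = 23Q + H$, where $L$ is the number of LSB layers, $Q$ the maximum quantized bits, and $H$ the number of hash bits.

```python
import hashlib
import secrets


def hash_bits(position, msb_bits, ref_bits, h):
    """First h bits of SHA-256 over position, MSB, and reference bits.

    SHA-256 is an assumed stand-in; the text does not name the hash function.
    """
    if not 0 < h <= 256:
        raise ValueError("SHA-256 yields at most 256 bits")
    payload = format(position, "064b") + msb_bits + ref_bits   # 64 position bits first
    digest = hashlib.sha256(payload.encode("ascii")).digest()
    return "".join(format(byte, "08b") for byte in digest)[:h]


def xor_bits(a, b):
    """Bitwise XOR of two equal-length '0'/'1' strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


H = 16                                                     # illustrative: L = 6, Q = 16
label = "".join(secrets.choice("01") for _ in range(H))    # shared by every frame
msb = "0" * (64 * (16 - 6))                                # 64 * (16 - L) MSB bits
refs = "1" * (23 * 16)                                     # 23 * Q reference bits
check = xor_bits(hash_bits(5, msb, refs, H), label)        # check bits of frame 5
```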
In each frame, the $23 \times Q$ reference bits and the $H$ check bits are embedded as the watermark into the $L$ LSB layers of the frame; they are used to detect and recover tampered speech at the receiver end. The $16 - L$ MSB layers remain unchanged to preserve imperceptibility. The watermark embedding procedure is shown in Figure 1.
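Embedding the $23Q + H = 64L$ watermark bits into a frame amounts to overwriting the $L$ low-order bits of each 16-bit sample, where $L$ is the number of LSB layers. The sketch below assumes signed 16-bit samples and a sample-by-sample bit order; the text does not fix the bit order, and the function names are hypothetical. The final check confirms that the $16 - L$ MSB layers are untouched.

```python
import random


def embed_lsb(frame, bits, L):
    """Write 64*L watermark bits into the L LSB layers of a 64-sample frame.

    Samples are signed 16-bit integers; the bit order (sample by sample, most
    significant of the L bits first) is an assumption for illustration.
    """
    if len(bits) != len(frame) * L:
        raise ValueError("need exactly 64 * L bits")
    mask = (1 << L) - 1
    out = []
    for i, s in enumerate(frame):
        u = s & 0xFFFF                                   # two's-complement pattern
        u = (u & ~mask & 0xFFFF) | int(bits[i * L:(i + 1) * L], 2)
        out.append(u - 0x10000 if u & 0x8000 else u)     # back to signed
    return out


def extract_lsb(frame, L):
    """Read the L LSB layers back as a bit string."""
    return "".join(format(s & ((1 << L) - 1), f"0{L}b") for s in frame)


rng = random.Random(0)
frame = [rng.randint(-32768, 32767) for _ in range(64)]
L = 6
payload = "".join(rng.choice("01") for _ in range(64 * L))   # 23*Q + H = 384 bits
marked = embed_lsb(frame, payload, L)
assert extract_lsb(marked, L) == payload
assert all((a >> L) == (b >> L) for a, b in zip(frame, marked))   # MSB layers intact
```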

2.2. Tampered Area Recovery Procedure
After receiving a speech signal that may have been tampered with, the receiver first divides it into frames and frame groups according to the secret key, which is known to both sides.
For each frame, the $64 \times (16 - L)$ MSB bits and the $23 \times Q$ reference bits are extracted from the received speech and, together with the 64 position bits of the frame, are fed into the same hash function to obtain $H$ hash bits. The label bits are then calculated with the XOR operator:
$$l_k'(t) = h_k'(t) \oplus c_k'(t), \quad t = 1, 2, \ldots, H, \tag{9}$$
where $h_k'(1), h_k'(2), \ldots, h_k'(H)$ are the hash bits calculated at the receiver end, $c_k'(1), c_k'(2), \ldots, c_k'(H)$ are the check bits extracted from the received signal, and $l_k'(1), l_k'(2), \ldots, l_k'(H)$ are the label bits of the $k$th frame. Because of the properties of the XOR operator, the label bits of all frames should be identical if no tampering has occurred. If a frame has been tampered with, its label bits differ from those of the other frames. Even though the receiver does not know the label bits in advance, the tampered area can be detected by comparing the label bits of all frames. In addition, owing to the properties of the hash function, the probability of a tampered frame being falsely judged as reserved is $2^{-H}$, which is extremely low when $H$ is large enough. False detection is therefore very unlikely, and the reference values of the reserved frames can be extracted correctly to recover the tampered speech.
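The label comparison can be sketched as a majority vote over the per-frame labels obtained by XORing recomputed hash bits with extracted check bits. Majority voting is an assumed decision rule (the text only says that the labels are compared) and presumes that most frames are intact; the helper names are hypothetical.

```python
from collections import Counter


def xor_bits(a, b):
    """Bitwise XOR of two equal-length '0'/'1' strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def tampered_frames(hash_per_frame, check_per_frame):
    """Recover each frame's label, then flag frames that disagree with the majority.

    Majority voting is an assumed decision rule; it presumes that most frames
    in the comparison are intact.
    """
    labels = [xor_bits(h, c) for h, c in zip(hash_per_frame, check_per_frame)]
    majority, _ = Counter(labels).most_common(1)[0]
    return [k for k, lab in enumerate(labels) if lab != majority]


def false_accept_probability(h):
    """Chance that a tampered frame still yields the shared label: 2**-h."""
    return 2.0 ** -h
```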
After detecting which frames have been tampered with, the next step is to recover the tampered content. In a frame group, if all 16 frames are reserved, no recovery is needed; if all 16 frames have been tampered with, recovery fails. Otherwise, assuming that there are $n$ ($1 \le n \le 15$) tampered frames in a frame group, only the $M = 23(16 - n)$ reference values carried by the $16 - n$ reserved frames can be used:
$$[r(j_1), r(j_2), \ldots, r(j_M)]^{T} = \mathbf{A}^{(R)}\mathbf{x}, \tag{10}$$
where $r(j_1), r(j_2), \ldots, r(j_M)$ are the reference values embedded into the reserved frames and $\mathbf{A}^{(R)}$ is the matrix formed by the rows of $\mathbf{A}$ corresponding to the reserved frames. Equation (10) can be rewritten as
$$[r(j_1), r(j_2), \ldots, r(j_M)]^{T} = \mathbf{A}^{(R,R)}\mathbf{x}_R + \mathbf{A}^{(R,T)}\mathbf{x}_T, \tag{11}$$
where $\mathbf{x}_R$ and $\mathbf{x}_T$ are the compressed information of the reserved frames and the tampered frames, respectively, and $\mathbf{A}^{(R,R)}$ and $\mathbf{A}^{(R,T)}$ are the matrices formed by the columns of $\mathbf{A}^{(R)}$ corresponding to $\mathbf{x}_R$ and $\mathbf{x}_T$, respectively. Note that the reference values extracted from the LSB layers are quantized and cannot be used directly for recovery. Instead, each unquantized reference value is estimated from its quantized version, using the same $R_{\max}$ and $Q$ as in the embedding procedure; the estimate $r_e(j)$ is the median value of the corresponding quantization interval. The estimation error depends on $R_{\max}$: the larger $R_{\max}$ is, the narrower the intervals are, and the smaller the error. Therefore, (11) can be rewritten as
$$[r_e(j_1), r_e(j_2), \ldots, r_e(j_M)]^{T} - \mathbf{A}^{(R,R)}\mathbf{x}_R \approx \mathbf{A}^{(R,T)}\mathbf{x}_T. \tag{14}$$
In (14), $\mathbf{A}^{(R,R)}$ and $\mathbf{A}^{(R,T)}$ are known to the receiver through the secret key, $\mathbf{x}_R$ can be calculated from the reserved frames, and $r_e(j_1), r_e(j_2), \ldots, r_e(j_M)$ are the estimates described above. Only $\mathbf{x}_T$ is unknown, and it is obtained by solving (14) at the receiver end, for which compressed sensing and compositive reconstruction can be used.
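With $n$ tampered frames, (14) has $23(16 - n)$ equations and $64n$ unknowns, so it is overdetermined only for $n \le 4$. In that regime, ordinary least squares is a simple substitute for the compressed-sensing and compositive reconstruction named in the text; the sketch below (hypothetical helper names, toy dimensions, pure-Python normal equations) illustrates that case only. For $n \ge 5$ the system is underdetermined and needs a method such as compressed sensing instead.

```python
import random


def is_overdetermined(n_tampered):
    """Eq. (14) has 23*(16 - n) equations and 64*n unknowns."""
    return 23 * (16 - n_tampered) >= 64 * n_tampered


def lstsq(A, b):
    """Least squares via the normal equations and Gaussian elimination."""
    m, n = len(A), len(A[0])
    M = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         + [sum(A[k][i] * b[k] for k in range(m))] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))   # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                               # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def recover_tampered(r_est, A_rr, x_r, A_rt):
    """Solve A_rt x_T ~= r_est - A_rr x_r for x_T in the least-squares sense."""
    rhs = [r - sum(a * v for a, v in zip(row, x_r)) for r, row in zip(r_est, A_rr)]
    return lstsq(A_rt, rhs)
```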
Next, $\mathbf{x}_R$ and $\mathbf{x}_T$ are combined into a 1024-dimensional vector $\hat{\mathbf{x}}$, which is the recovered compressed information of the original speech signal. Finally, the amplitude of the original signal is obtained by multiplying by 16. The entire recovery procedure is shown in Figure 2.

2.3. Parameterization of LSB in the Framework
The best choices of parameters corresponding to different numbers of LSB layers are deduced in this subsection. In terms of quantity, the watermark in each frame consists of $23 \times Q$ reference bits and $H$ check bits (one per hash bit), so the $64L$ bits in the LSB layers of a frame equal the sum of the reference bits and the hash bits:
$$64L = 23Q + H. \tag{15}$$
To ensure the imperceptibility of the watermark, $L$ should not exceed 8, which is half of all the bit layers. There are three variables in (15), and all of their candidate values are listed in Table 1, in which the values before and after "/" give the corresponding values of the respective variables.
The exclusion method is used to determine the best choice of each parameter.
First, at the receiver end, the reference values are estimated from their quantized versions, which generally causes errors related to $R_{\max}$: the larger $R_{\max}$ is, the finer the segmentation, and the better the estimate. In addition, $R_{\max}$ and $Q$ are related by (7): the larger $Q$ is, the larger $R_{\max}$ can be, and the smaller the error. In other words, $Q$ should be large enough to keep the error small, so the conditions $Q = 1, 2, 3, 4, 5$ are excluded.
Second, the hash bits are used to detect the tampered area. Owing to the properties of the hash function, the probability of a tampered frame being falsely judged as reserved is $2^{-H}$ when $H$ hash bits are used, so $H$ should be large enough to keep this probability low. Accordingly, the conditions $H = 3, 6, 8, 11, 13$ are excluded, because the corresponding false-judgment probabilities are too high for our framework.
Finally, when $L$ is fixed, the larger $Q$ is, the larger $R_{\max}$ is, and the fewer the errors, so smaller values of $Q$ are excluded in this step. Consequently, the conditions $Q = 6, 9, 12, 14, 15, 17, 20$ are excluded, and the remaining best choices of the parameters are listed in Table 2.
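The three exclusion steps can be replayed mechanically from $64L = 23Q + H$, where $L$ is the number of LSB layers, $Q$ the maximum quantized bits, and $H$ the number of hash bits. The sketch assumes that Table 1 lists every pair with $Q \ge 1$ and $1 \le H \le 63$ satisfying this relation, and that the second step removes every $H < 16$; under these assumptions it reproduces exactly the excluded values quoted above ($Q = 1$ to $5$; $H = 3, 6, 8, 11, 13$; $Q = 6, 9, 12, 14, 15, 17, 20$). Function and constant names are hypothetical.

```python
def candidates(L, max_hash=63):
    """All (Q, H) with 64*L = 23*Q + H, Q >= 1 and 1 <= H <= max_hash (assumed Table 1)."""
    pairs = []
    for q in range(1, 64 * L // 23 + 1):
        h = 64 * L - 23 * q
        if 1 <= h <= max_hash:
            pairs.append((q, h))
    return pairs


TOO_FEW_QUANT_BITS = set(range(1, 6))   # step 1: Q = 1..5 give large estimation errors
MIN_HASH_BITS = 16                       # step 2 (assumed threshold): drop H < 16


def best_choice(L):
    """Steps 1-3: drop small Q, drop small H, then keep the largest remaining Q."""
    kept = [(q, h) for q, h in candidates(L)
            if q not in TOO_FEW_QUANT_BITS and h >= MIN_HASH_BITS]
    return max(kept) if kept else None


print({L: best_choice(L) for L in range(3, 9)})
# {3: (7, 31), 4: (10, 26), 5: (13, 21), 6: (16, 16), 7: (18, 34), 8: (21, 29)}
```

Under these assumptions, $L = 6$ keeps $Q = 16$ and $H = 16$, so 368 reference bits and 16 check bits exactly fill the 384 LSB bits of a frame, with a false-judgment probability of $2^{-16}$. No candidate survives for $L = 1$ or $L = 2$.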

Experimental Results
Both objective and subjective experiments were carried out, using the theoretical conclusions above on the best choices of parameters. The signal-to-noise ratio (SNR) of the watermarked and recovered speech signals was calculated to draw several conclusions, and the waveforms of the original, watermarked, and recovered signals are shown. In the subjective experiments, listening tests were carried out to verify the imperceptibility of the watermarked speech signal.
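For reference, the SNR used throughout this section can be computed as below, assuming the standard definition $\mathrm{SNR} = 10\log_{10}\left(\sum_i s_i^2 / \sum_i (s_i - \hat{s}_i)^2\right)$ in dB, where $s$ is the original signal and $\hat{s}$ the watermarked or recovered one. The text does not state its exact formula, and `snr_db` is a hypothetical name.

```python
import math


def snr_db(reference, test):
    """SNR in dB of `test` relative to `reference` (standard definition, assumed)."""
    signal_power = sum(s * s for s in reference)
    noise_power = sum((s - t) ** 2 for s, t in zip(reference, test))
    if noise_power == 0:
        return math.inf          # identical signals
    return 10 * math.log10(signal_power / noise_power)
```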

3.1. Objective Experimental Results
A 5-second, 16-bit speech signal sampled at 8 kHz, as in [17,18], was used as the test sample in our experiments. The theoretical results above regarding the choices of the maximum quantized bits and the hash bits were used in the corresponding experiments. For 8, 7, 6, 5, 4, and 3 LSB layers, the SNR values of the watermarked signal and the recovered signal were calculated. Because the algorithm is randomized, every experiment was repeated ten times; the mean results are given in Table 3 and plotted in Figure 3. Several conclusions can be drawn from Table 3 and Figure 3. If more LSB layers are dedicated to watermark embedding, the SNR of the watermarked speech signal decreases, because the original signal is changed more. In contrast, the SNR of the recovered speech signal does not change monotonically with the number of LSB layers. Up to six LSB layers, the SNR of the recovered speech signal increases with the number of LSB layers, because more bits are used to represent the reference values, which reduces the error in estimating them.
If more than six LSB layers are used, both the SNR of the watermarked speech and that of the recovered speech decrease as the number of LSB layers increases. The drop in the recovered SNR occurs because fewer MSB layers remain unchanged during the procedure, which leads to larger changes in the original signal. In summary, the recovered speech signal reaches its highest SNR with six LSB layers, so no more than six LSB layers should be chosen.
Our results are compared with those in [17], where six LSB layers are used; this setting is a special case of our framework. The comparison is shown in Figure 4. Although the SNR of the recovered speech with five to three LSB layers is slightly lower than that in [17], the SNR of the watermarked speech is much higher. In other words, we extend the results in [17] by allowing different numbers of LSB layers for watermark embedding.
In addition, as the tampered rate increases, the SNR of the recovered speech decreases, because fewer reference values in the reserved area can be used for recovery. Listening tests showed that when the SNR is larger than 7.2 dB, the recovered signal is understandable, although its naturalness varies considerably. When the SNR is smaller than 7.2 dB, the recovered speech signal is incomprehensible, and such cases are treated as failures. Accordingly, our framework could recover the tampered speech signal successfully when the tampered rate was smaller than 30%.
To further verify the above discussion of the quality of the watermarked and recovered speech, the first 20% of the signal was muted. The waveform of the original signal is shown in Figure 5(a), and the waveforms of the watermarked and recovered signals with 8, 7, 6, 5, 4, and 3 LSB layers are shown in Figures 5(b), 5(c), 5(d), 5(e), 5(f), and 5(g), respectively. The spectrograms of the corresponding waveforms are shown in the Appendix. Figure 5 indicates that as the number of LSB layers decreases from eight to three, the watermarked speech signal becomes more similar to the original one, whereas the quality of the recovered speech signal first improves and then degrades. The recovered speech has the highest quality when six LSB layers are used.

3.2. Subjective Experimental Results
Subjective listening tests were carried out to assess the perceptibility of the watermarked speech signal. Five sentences were randomly chosen from the CASIA-863 speech synthesis database. Ten participants aged 22 to 25, all with normal hearing, were trained to evaluate the quality of the watermarked speech.
The subjective difference grade (SDG) is one of the most widely used subjective methods for evaluating the quality of a watermarked speech signal. The grade ranges from 5.0 to 1.0, that is, from imperceptible to very annoying, as listed in Table 4. In the tests, the original and the watermarked speech signals were presented to the ten participants, who graded the differences according to Table 4. The average SDG scores over five sentences and ten participants are given in Table 5. The mean opinion score (MOS), i.e., the average grade, ranged from 4.9 to 5.0 for all watermarked signals, indicating that all of them were almost imperceptible. The MOS increased as fewer LSB layers were used, indicating that fewer LSB layers result in higher quality of the watermarked speech signal.
We also employed the ABX method, another subjective quality assessment technique, to evaluate the watermarked speech signals. In each test, the original speech signal A and the watermarked signal B were presented to the ten participants, followed by a third signal X, randomly chosen to be either A or B, and the participants were asked to identify X. The percentage of correct identifications was used to judge whether the watermark was perceptible: a result of 50%, the probability of a random guess, suggests that the differences between the original and the watermarked signals are imperceptible. The results, given in Table 5, show that the correct percentage ranged from 46% to 55%, indicating that the watermarked speech was almost imperceptible. With fewer LSB layers, the ABX result was generally closer to 50%, again indicating that fewer LSB layers result in higher quality of the watermarked speech signal.

Discussion
If fewer than six LSB layers are used, the imperceptibility of the watermark and the quality of the recovered signal change in opposite directions as the number of LSB layers increases. For the watermarked signal, more LSB layers result in lower quality: although the MOS and ABX listening results show no obvious difference, the SNR values of the watermarked signals differ clearly. For the recovered signal, more LSB layers result in higher quality, as seen in the higher SNR values and in the waveforms of the recovered speech.
Beyond six LSB layers, both the imperceptibility of the watermark and the quality of the recovered signal decline as more LSB layers are used. Consequently, using more than six LSB layers for watermark embedding is not recommended when the framework is used in big data scenarios.
In conclusion, three to six LSB layers are recommended in big data mining infrastructures, and the specific choice of parameters depends on the requirements of the real application. Because of the complexity of big data applications, it is important to choose the number of LSB layers for watermark embedding so as to tune the trade-off between the imperceptibility of the watermark and the quality of the recovered speech. Concretely, if the imperceptibility of the watermark is the priority and the quality of the recovered speech is less important, fewer LSB layers are recommended; otherwise, more LSB layers should be chosen. Additionally, as the tampered rate increases, fewer reserved areas can provide effective reference bits, which lowers the SNR of the recovered speech signal. Therefore, when tampering is unlikely or the tampered rate is estimated to be low, as in a stable communication environment, fewer LSB layers are suitable. When the communication environment is poor and the tampered rate is estimated to be high, more LSB layers should be chosen. Once the number of LSB layers is chosen, the other parameters are determined as shown in Table 2. In other words, choosing different numbers of LSB layers makes it possible to extend the self-recovery speech watermarking framework to the complicated scenarios of big data infrastructures.
As mentioned before, only erasure of the watermarked speech was carried out in our experiments. However, the proposed framework is also applicable to other kinds of falsification, because hash bits are used to detect the tampered areas: once a frame is judged to be tampered with, its label bits differ from those of the other frames, and its reference values are not used in the subsequent recovery steps.
In the above experiments, continuous tampering was carried out. It should be noted that discrete tampering is

Figure 3: SNR of watermarked and recovered signal.

Figure 4: SNR of watermarked and recovered signal versus tampered area with different LSB layers.

Figure 5: The waveform of the original, watermarked, and recovered signal. (a) The original signal. (b) The watermarked and recovered signal when LSB = 8. (c) The watermarked and recovered signal when LSB = 7. (d) The watermarked and recovered signal when LSB = 6. (e) The watermarked and recovered signal when LSB = 5. (f) The watermarked and recovered signal when LSB = 4. (g) The watermarked and recovered signal when LSB = 3.

Table 1: All choices of parameters.

Table 2: Best choices of parameters. Once the LSB layers are chosen, the other parameters can be obtained by looking up this table.

Table 3: The SNR of watermarked and recovered signal.

Table 4: The standard of subjective evaluation.

Table 5: The evaluation of watermarked signal.