Audio Watermarking Scheme Based on Singular Spectrum Analysis and Psychoacoustic Model with Self-Synchronization

This paper proposes a blind, inaudible, and robust audio watermarking scheme based on singular spectrum analysis (SSA) and psychoacoustic model 1 (ISO/IEC 11172-3). SSA is used to analyze the host signals and to extract the singular spectra. A watermark is embedded into the host signals by modifying the part of the singular spectrum that lies in the convex part of the singular spectrum curve so that this part becomes concave. This modification affects both the inaudibility and the robustness of the watermarking scheme. To satisfy both properties, the modified part of the singular spectrum is determined by a novel parameter selection method based on the psychoacoustic model. The test results show that the proposed scheme achieves not only inaudibility and robustness but also blindness. In addition, a variant of the proposed scheme can extract the watermark without knowing the frame positions in advance and without embedding an additional synchronization code into the audio content.


Introduction
Over the last decade, music sharing via the Internet has caused the music industry to lose annual sales of more than 3 billion US dollars [1] because the Internet is a good distribution system; that is, it distributes audio signals widely and very rapidly. In addition, all digital products share a special characteristic: they are expensive to produce for the first copy but cheap to reproduce [2]. One potential solution for protecting digital content is audio watermarking [3]. Audio watermarking has also been proposed as a solution for other purposes, such as ownership protection, content authentication, broadcast monitoring, information carrier, and covert communication [4-8].
The audio watermarking system consists of two processes: the embedding process and the extraction process, as illustrated in Figure 1. The first process embeds the watermark into the host audio signal. The second process extracts the watermark from the watermarked signal. Normally, the embedding process is frame-based. Therefore, the extraction process requires the frame positions in order to extract the watermark. This requirement raises the frame synchronization problem, which is discussed in detail in Section 3. Audio watermarking systems can be characterized by a number of properties [4]. Among them, there are five important properties [3, 9].
(i) Inaudibility. It is the property that the watermark does not affect the perceptual quality of the host signal.
(ii) Robustness. It is the ability to extract the watermark correctly when attacks are performed on the watermarked signals.
(iii) Blindness. It is the ability to be independent of the host signal in the extraction process. The system is blind when the extraction process does not require the host signal for comparison in order to correctly extract the watermark. If the extraction process requires the host signal, as illustrated by the dashed line in Figure 1, it is nonblind.
(iv) Confidentiality. It is the property that keeps the watermark secret.
(v) Capacity. It is the quantity of hidden information that is embedded into the host signals.
These required properties normally conflict with each other. Some techniques that obtain high robustness may suffer in inaudibility [10]. Some techniques are good at inaudibility but do not meet the blindness property [11]. Some with high capacity are not robust [12]. The method based on least-significant-bit coding [13] obtains good inaudibility but loses robustness. The phase-coding method [14] achieves inaudibility but has low capacity. The phase modulation method [15] achieves inaudibility but does not satisfy the blindness property. The trade-off between inaudibility and robustness can be found in the methods based on adaptive phase modulation as well [16]. The method based on cochlear delay characteristics [17] is robust and inaudible. However, it has a significant trade-off between inaudibility and capacity. In addition, the blind cochlear-delay-based scheme reduces the sound quality of the watermarked signals, compared with nonblind ones. The methods based on echo hiding [18, 19] are blind and robust, but they perform poorly in inaudibility and confidentiality. The spread-spectrum-based technique is good in robustness but poor in inaudibility and capacity [20]. These examples show that balancing the required properties has always been a difficult task.
A literature review of audio watermarking suggests that the schemes based on singular-value decomposition (SVD) are robust [10, 11, 21-26]. In general, an SVD-based scheme extracts the singular values from the host signals and slightly changes some of those values with respect to the watermark bit. It is robust because the singular values are largely unchanged under common signal processing [27]. However, the balance between inaudibility and robustness for some audio signals needs to be improved further because such schemes have not taken human perception into consideration.
The motivation for this work started from the idea of exploiting the advantages of the SVD-based method and combining it with a human perceptual model. We turn to SSA, which is SVD-based, and adopt it as the main analysis tool. We choose SSA because, when a signal is analyzed, the singular values can be interpreted and have physical meanings [28]. The physical meanings are important because they help us understand the relationship between SSA and the perceptual model. Recently, we proposed audio watermarking schemes based on SSA [28, 29] and showed the benefits of using SSA over SVD. To verify the effectiveness of the SSA-based scheme, we used differential evolution to adjust the balance. The results were quite successful [29]. However, because the search space was very large, the embedding process was time-consuming.
This work aims to show that SSA equipped with the perceptual model also gives a good balance between inaudibility and robustness. It proposes a novel audio watermarking scheme based on SSA and the human perceptual model. It also proposes a new method for automatic frame detection; that is, the frame positions are not required in the extraction process.
The rest of this paper is organized as follows. The proposed scheme and the necessary background information are detailed in Section 2. Section 3 shows that we can slightly modify the proposed scheme to make it a self-synchronized one. The performance evaluation and experimental results are given in Section 4. The observations from the experiments are discussed in Section 5. Last, the whole work is summarized in Section 6.

Proposed Scheme
The proposed scheme is mainly based on the SSA-based audio watermarking scheme proposed by Karnjana et al. [28, 29]. The first two subsections are part of the embedding process, and the last two subsections are part of the extraction process. The proposed scheme with self-synchronization is provided in Section 3.

Embedding Process.
The embedding process consists of two major parts, as shown in Figure 2. The first part is a core structure. In this core structure, the basic SSA is mainly used to analyze the host signals and to extract the singular spectra. The basic SSA has been shown experimentally to be useful for extracting meaningful information from signals [30, 31]. The second part, which is shown in the gray box, is the parameter selection method based on a psychoacoustic model. In this work, we adopt the psychoacoustic model 1 (ISO/IEC 11172-3) [32] for the proposed scheme. Brief details of the psychoacoustic model and the parameter selection method are provided in the next subsection.
The core structure of the embedding process consists of six steps which are described as follows.
(1) The host audio signal is segmented into nonoverlapping frames. The number of frames is equal to the number of watermark bits since one bit is embedded into one frame. Let $F = [f_0\ f_1\ f_2\ \cdots\ f_{N-1}]^{T}$ denote a frame of size $N$, where $N$ is greater than 2.
Note that the embedding capacity is the sampling frequency of the host signal divided by $N$.
(2) The trajectory matrix $\mathbf{X}$ which represents each frame $F$ is constructed. The construction of $\mathbf{X}$ is done as follows:
$$
\mathbf{X} =
\begin{bmatrix}
f_0 & f_1 & \cdots & f_{K-1} \\
f_1 & f_2 & \cdots & f_{K} \\
\vdots & \vdots & \ddots & \vdots \\
f_{L-1} & f_{L} & \cdots & f_{N-1}
\end{bmatrix},
$$
where $L$, called the window length of the matrix formation, is the only parameter of the basic SSA and is not greater than $N$, and $K = N - L + 1$.
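(3) The SVD is performed on the trajectory matrix $\mathbf{X}$ to obtain the singular spectrum, that is, the singular values $\sqrt{\lambda_i}$.
(4) The watermark bit is embedded by modifying the singular spectrum. When the watermark bit is 1, the singular values on the interval $[u + 1, l - 1]$ are modified as follows:
$$
\sqrt{\lambda_i^{*}} = \sqrt{\lambda_i} + 0.9 \times \left(\sqrt{\lambda_u} - \sqrt{\lambda_i}\right).
$$
When the watermark bit is 0, the singular values are left unchanged. In this step, there are two parameters, $u$ and $l$, and these parameters are determined by the parameter selection algorithm based on the psychoacoustic model.
(5) The modified trajectory matrix is constructed by SVD reversion, and then it is hankelized. The hankelization of a modified trajectory matrix $\mathbf{Y}$ to a signal $G = [g_0\ g_1\ g_2\ \cdots\ g_{N-1}]^{T}$ is defined as follows:
$$
g_k =
\begin{cases}
\dfrac{1}{k+1} \displaystyle\sum_{m=1}^{k+1} y^{*}_{m,\,k-m+2}, & 0 \le k < L^{*} - 1, \\[2ex]
\dfrac{1}{L^{*}} \displaystyle\sum_{m=1}^{L^{*}} y^{*}_{m,\,k-m+2}, & L^{*} - 1 \le k < K^{*}, \\[2ex]
\dfrac{1}{N-k} \displaystyle\sum_{m=k-K^{*}+2}^{N-K^{*}+1} y^{*}_{m,\,k-m+2}, & K^{*} \le k < N,
\end{cases}
$$
where $y_{ij}$ is the element at row $i$ and column $j$ of the matrix $\mathbf{Y}$, $L^{*} = \min(L, K)$, $K^{*} = \max(L, K)$, and $y^{*}_{ij} = y_{ij}$ if $L < K$ and $y^{*}_{ij} = y_{ji}$ otherwise.

The psychoacoustic model 1, which is deployed in MPEG-1 Layer 1, is adopted in order to deliver a signal-to-mask ratio (SMR) of the analyzed signal, and then the SMR is used as a criterion for selecting the parameters $u$ and $l$.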
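Returning to steps (2) and (5) of the core structure, the matrix formation and its inverse can be illustrated with a minimal sketch. It assumes the standard Hankel form of the trajectory matrix and the usual antidiagonal averaging for hankelization; the function names are hypothetical and plain Python lists stand in for matrices. The last helper only restates the capacity remark (one bit per frame).

```python
def trajectory_matrix(frame, window_length):
    """Build the L x K Hankel trajectory matrix of a frame, with K = N - L + 1."""
    n = len(frame)
    if not 1 <= window_length <= n:
        raise ValueError("window length L must satisfy 1 <= L <= N")
    k = n - window_length + 1
    # Row i holds samples f_i ... f_{i+K-1}, so each antidiagonal is constant.
    return [list(frame[i:i + k]) for i in range(window_length)]


def hankelize(matrix):
    """Average every antidiagonal of an L x K matrix, giving N = L + K - 1 samples."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0.0] * (rows + cols - 1)
    counts = [0] * (rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            sums[i + j] += matrix[i][j]
            counts[i + j] += 1
    return [s / c for s, c in zip(sums, counts)]


def embedding_capacity_bps(sampling_rate_hz, frame_size):
    """One bit per frame: capacity is the sampling rate divided by the frame size."""
    return sampling_rate_hz / frame_size


frame = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]      # toy frame, N = 6
X = trajectory_matrix(frame, window_length=3)  # 3 x 4 matrix
restored = hankelize(X)                        # an unmodified X maps back to the frame
```

Because an unmodified trajectory matrix is already Hankel, hankelizing it returns the original frame; only after the singular values are changed does the averaging alter the samples.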
Basically, the psychoacoustic model 1 is built on three psychoacoustic principles: the absolute threshold of hearing, simultaneous masking, and the upward spread of masking [33]. It consists of five steps [33, 34], as shown in Figure 3. According to the standard ISO/IEC 11172-3, the process is summarized as follows. First, the FFT and the power spectral density (PSD) of the signal are calculated, and then the PSD is normalized to the maximum sound pressure level (SPL) of 96 dB. Next, the PSD is used to identify the tonal (more sinusoid-like) and nontonal (more noise-like) components of the signal. This identification is used for the calculation of the masking levels due to the tonal and nontonal maskers. Then, the irrelevant maskers are removed by applying two psychoacoustic principles in the following manner. The maskers which are lower than the absolute threshold of hearing are removed, and only the strongest masker within a distance of 0.5 Bark is kept. Subsequently, these surviving maskers are used to calculate the individual masking levels. Finally, we combine all masking levels to calculate the global masking level. The output of the psychoacoustic model is the SMR. The SMR is defined as the difference between the SPL of the global masking level and the PSD of the analyzed signal. Figure 4 shows an example of the SMR (red line) of one frame.
In perceptual audio coding, such as MP3 compression, the SMR is used to allocate the quantization bits. The frequency components with lower SMR are assigned smaller numbers of bits since the human auditory system is less sensitive to those frequency components. In this work, the SMR is used as guidance to determine the appropriate parameters because embedding the watermark into the components with low SMRs helps to improve the inaudibility. The algorithm that we use to deliver the parameters $u$ and $l$ consists of the following five steps:
(1) We first calculate the SMR of each frame. According to the standard, the frame size used for this calculation is 512 samples. Note that the frame size from the segmentation of the core structure need not be the same as that of this psychoacoustic model.
(2) We use the SMRs obtained from the previous step to calculate the average SMR of the host signal.
(3) We identify the frequency band $[f_1, f_2]$ with the average SMR lower than a predefined SMR level. If there is more than one such band, the band with the lowest frequency is selected. If the bandwidth $f_2 - f_1$ is wider than a predefined bandwidth, it is limited to the predefined bandwidth. In our simulation, the predefined bandwidth is 10 kHz. An example is shown in Figure 5.
(4) For each frame, we map the selected band to a singular-value interval. In this step, we have to find the relationship between the frequencies and the singular-value indices because the output of the psychoacoustic model, the SMR, is expressed as a function of frequency. When the basic SSA is used to decompose a signal, the singular values of the matrix representing the signal can be interpreted as the scale factors of the oscillatory components of the signal [28]. After analyzing each oscillatory component by the Fourier transform, we found that the frequency band of each oscillatory component is quite narrow compared with the signal bandwidth, as shown in Figure 6. We associate the index of the singular value with the peak frequency of its oscillatory component. Figure 7 shows an example of the relationship between the frequencies and the singular-value indices. Thus, to map the frequency range $[f_1, f_2]$ to the interval $[u, l]$, we first find the local minimum which is closest to $f_1$ and set $u$ to the index of this local minimum. Then, we find the local maximum that is closest to $f_2$ and lies on the right side of $u$, and set $l$ to the index of this local maximum. An example of this mapping is shown in Figure 8. Note that different frames may have different intervals, and the word frame in this step means the frame from the segmentation process.
(5) Finally, the parameters $u$ and $l$ for embedding the watermark are selected using the arithmetic mean of the boundaries of all intervals.
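Steps (3) and (5) above can be sketched as follows. This is only an illustration under stated assumptions: a "band" is taken to be a contiguous run of frequency bins whose average SMR lies below the predefined level, the bandwidth limit is applied by truncating the upper edge, and the averaged boundaries are rounded to integers. The function names and the toy inputs are hypothetical; computing the SMR itself and the frequency-to-index mapping of step (4) are not shown.

```python
def select_band(freqs_hz, avg_smr_db, smr_level_db, max_bandwidth_hz=10_000.0):
    """Return (f1, f2) for the lowest-frequency run of bins with average SMR
    below smr_level_db, or None if no bin qualifies (assumed interpretation)."""
    start = None
    for k, smr in enumerate(avg_smr_db):
        if smr < smr_level_db:
            if start is None:
                start = k              # first bin of the lowest-frequency band
        elif start is not None:
            end = k - 1                # band ended just before this bin
            break
    else:
        if start is None:
            return None
        end = len(avg_smr_db) - 1      # band runs to the last bin
    f1, f2 = freqs_hz[start], freqs_hz[end]
    # Limit the bandwidth to the predefined value (10 kHz in the simulation).
    return f1, min(f2, f1 + max_bandwidth_hz)


def mean_interval(intervals):
    """Arithmetic mean of the per-frame interval boundaries, rounded (assumption)."""
    n = len(intervals)
    u = round(sum(iv[0] for iv in intervals) / n)
    l = round(sum(iv[1] for iv in intervals) / n)
    return u, l


freqs = [1000.0 * k for k in range(6)]
band = select_band(freqs, [20.0, 5.0, 4.0, 30.0, 2.0, 1.0], smr_level_db=10.0)
params = mean_interval([(10, 40), (12, 44), (11, 42)])
```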

Extraction Process.
The plot of singular spectra is normally convex, as shown in Figure 9. However, after the watermark bit 1 is embedded into an interval $[u + 1, l - 1]$ of the singular spectrum of a host frame, the embedding process causes a concave part on that interval of the singular spectrum of the reconstructed, watermarked frame [29], as shown in Figure 10. We exploit this property to extract the watermark bit. The extraction process consists of five steps, as shown in Figure 11. The details of each step are as follows.
(1) We segment the watermarked signal into nonoverlapping frames. At this stage, we assume that we know the frame positions and the frame size.
(2) We construct the trajectory matrix in the same way we do in the embedding process.
(3) We perform the SVD operation on the trajectory matrix to obtain the singular spectrum.
(4) If the parameters $u$ and $l$ are not provided, automatic parameter estimation, which is illustrated in the gray box of Figure 11, is used to estimate them. The details of this parameter estimation are given in the upcoming subsection.
(5) We approximate all singular values on $[u + 1, l - 1]$ using the quadratic equation $y(x) = ax^2 + bx + c$, where $y$ is the singular value and $x$ is the index of the singular value. Since the coefficient $a$ of the quadratic indicates the curvature of the singular spectrum on that interval, the sign of $a$ is used to determine the watermark bit. A minus sign indicates concavity, or the watermark bit 1, and a plus sign indicates convexity, or the watermark bit 0.
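Step (5) can be illustrated with a short sketch of the least-squares quadratic fit. It assumes that the singular spectrum is held in a list whose positions play the role of the singular-value indices, and it uses the fact that, for equally spaced indices, the leading coefficient can be obtained by projecting onto a centered quadratic basis. The function names are hypothetical, and treating a zero coefficient as bit 0 is an arbitrary choice of this sketch.

```python
def quadratic_leading_coefficient(values):
    """Least-squares coefficient a of y = a*x^2 + b*x + c for x = 0, 1, ..., n-1."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three points are needed for a quadratic fit")
    mean_x = (n - 1) / 2
    centered_sq = [(x - mean_x) ** 2 for x in range(n)]
    m2 = sum(centered_sq) / n
    # (x - mean)^2 - m2 is orthogonal to the constant and linear terms on a
    # symmetric, equally spaced grid, so projecting onto it yields a directly.
    basis = [c - m2 for c in centered_sq]
    return sum(b * y for b, y in zip(basis, values)) / sum(b * b for b in basis)


def extract_bit(singular_values, u, l):
    """Return 1 if the spectrum on [u + 1, l - 1] is concave (a < 0), else 0."""
    segment = singular_values[u + 1:l]   # positions u+1 ... l-1
    return 1 if quadratic_leading_coefficient(segment) < 0 else 0


concave = [30.0 - (x - 5) ** 2 for x in range(11)]   # toy concave spectrum
convex = [100.0 / (x + 1) for x in range(11)]        # toy convex spectrum
bits = (extract_bit(concave, 0, 10), extract_bit(convex, 0, 10))
```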

Automatic Parameter Estimation.
To automatically estimate the parameters $u$ and $l$, we use the fact that when watermark bit 1 is embedded into a frame, there exists a concave part in the singular spectrum plot. In other words, when watermark bit 1 is embedded into the frame, we can find some pairs of indices $i$ and $j$, where $i < j$, such that the singular values on the interval $[i, j]$ are mostly above the line segment connecting the two singular values $\sqrt{\lambda_i}$ and $\sqrt{\lambda_j}$. Thus, this automatic parameter estimation estimates the parameters from the width of the concave part.
We first define the concavity density as a measure of the degree of concavity. Given a singular spectrum $\{\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_q}\}$, the concavity density $D_{i,j}$ of the singular values from $\sqrt{\lambda_i}$ to $\sqrt{\lambda_j}$ is defined with respect to the function defining the line connecting $\sqrt{\lambda_i}$ and $\sqrt{\lambda_j}$: it is positive when the singular values lie mostly above that line (concavity) and negative when they lie mostly below it (convexity).
Starting from the first singular value ($i = 1$), a sequence of singular values of length $n$ that is used to calculate the concavity density is shifted to the right by one singular-value point at a time to determine the set of concavity densities $\{D_{1,n}, D_{2,n+1}, \ldots, D_{i,i+n-1}, \ldots, D_{q-n+1,q}\}$. An example of the positive and the negative concavity density of two sequences of singular values is shown in Figure 12.
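The sliding computation can be sketched as below. The density used here, the mean signed height of the interior singular values above the chord joining the two end points, is only an assumed stand-in chosen so that concave stretches give positive values and convex stretches give negative ones; it should not be read as the exact definition used in this work. The function names are hypothetical.

```python
def concavity_density(values, i, j):
    """Assumed measure: mean signed height of values[i+1..j-1] above the chord
    that joins (i, values[i]) and (j, values[j])."""
    if j - i < 2:
        raise ValueError("the sequence needs at least one interior point")
    slope = (values[j] - values[i]) / (j - i)
    heights = [values[k] - (values[i] + slope * (k - i)) for k in range(i + 1, j)]
    return sum(heights) / len(heights)


def density_curve(values, length):
    """Shift a window of the given length one point at a time over the spectrum."""
    last_start = len(values) - length
    return [concavity_density(values, s, s + length - 1) for s in range(last_start + 1)]


convex = [1.0 / (k + 1) for k in range(10)]   # toy convex spectrum
concave = [-float(k * k) for k in range(10)]  # toy concave sequence
curve_convex = density_curve(convex, 4)
curve_concave = density_curve(concave, 4)
```

Averaging such curves over several window lengths, as described in the text, would amount to calling `density_curve` for each length and combining the results index by index.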
Figure 13 shows an example of the concavity density curve of the singular spectrum in Figure 12 when the sequence of singular values used to calculate the concavity density has a length of 30. It can be seen that the positive density roughly corresponds to the concave part of the singular spectrum. However, the concavity density depends upon the choice of the length of the sequence used to calculate it. In this work, we get around the problem by using the average density at different lengths. Then, the average-density curve is refined as follows. First, any negative-density value is ignored because it implies convexity. Second, any positive density curve that is narrower than $\Gamma \times (l - u)$, where $\Gamma$ is a user-defined real number around 1, is neglected because, practically, we can set the minimum value of $l - u$ in advance.
Subsequently, the indices at the rising and falling edges of the resulting density curve, together with an offsetting constant, are used to estimate the parameters $u$ and $l$ for the given frame. Finally, the parameters $u$ and $l$ for the watermarked signal are calculated by averaging the estimated parameters from all frames. The averaging algorithm is depicted in Figure 14 and detailed as follows.
Let $\hat{u}_{i,j}$ and $\hat{l}_{i,j}$, for $j = 1, 2, \ldots, J_i$, denote the estimated parameters of frame $i$. The subscript $j$ indicates that there can be more than one concave interval detected within one frame. The maximum number of intervals detected within frame $i$ is denoted by $J_i$.
The general idea of the averaging algorithm is as follows. Given two integer intervals $[u_1, l_1]$ and $[u_2, l_2]$, where $u_1$, $l_1$, $u_2$, and $l_2$ are integers and $l_1 - u_1 \le l_2 - u_2$, we quantify the overlap between those two intervals by an overlap degree, which is defined in terms of the maximum and minimum functions $\max(\cdot)$ and $\min(\cdot)$. Given the set of all estimated parameter intervals $[\hat{u}_{i,j}, \hat{l}_{i,j}]$, we can expect that the set contains many overlapping intervals. By the same token, we know that there is no overlap between the intervals $[\hat{u}_{i,j_1}, \hat{l}_{i,j_1}]$ and $[\hat{u}_{i,j_2}, \hat{l}_{i,j_2}]$ of the same frame when $j_1 \ne j_2$. Then, the averaging algorithm is just the process of recursively grouping the overlapping members of the set. The following is the procedure used in the averaging algorithm.
Figure 14: Averaging algorithm used in the automatic parameter estimation.
(1) We assign a frequency weight to each interval $[\hat{u}_{i,j}, \hat{l}_{i,j}]$ in the set. Initially, the frequency weight is set to 1.
(2) We calculate the overlap degree of a pair of estimated parameter intervals. If the overlap degree is greater than a predefined threshold, the two intervals are merged to create a new interval, and the two old intervals are removed. The frequency weight of the new interval is the sum of the frequency weights of the two old intervals. The average of the lower bounds and that of the upper bounds of the old intervals are used as the bounds of the new interval.
(3) Step (2) is repeated until the set has no overlapping members.
(4) The interval with the highest frequency weight is chosen as the estimated parameters $\hat{u}$ and $\hat{l}$. If there are multiple intervals with the highest frequency weight, the estimated parameters are randomly chosen from them.
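The four steps above can be sketched as follows. The overlap degree used here, the length of the intersection divided by the length of the shorter interval, is an assumption made for illustration; the threshold value in the example is likewise hypothetical. For reproducibility, the sketch breaks ties by taking the first interval with the highest weight, whereas the procedure above chooses randomly.

```python
def overlap_degree(a, b):
    """Assumed overlap degree: intersection length over the shorter interval's length."""
    (u1, l1), (u2, l2) = sorted([a, b], key=lambda iv: iv[1] - iv[0])
    shorter = l1 - u1
    if shorter <= 0:
        return 0.0
    return (min(l1, l2) - max(u1, u2)) / shorter


def average_intervals(intervals, threshold):
    """Merge overlapping intervals and return (u, l, weight) with the largest weight."""
    items = [(float(u), float(l), 1) for u, l in intervals]   # step (1)
    merged = True
    while merged:                                             # step (3)
        merged = False
        for p in range(len(items)):
            for q in range(p + 1, len(items)):
                a, b = items[p], items[q]
                if overlap_degree(a[:2], b[:2]) > threshold:  # step (2)
                    new = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, a[2] + b[2])
                    items = [it for r, it in enumerate(items) if r not in (p, q)]
                    items.append(new)
                    merged = True
                    break
            if merged:
                break
    return max(items, key=lambda it: it[2])                   # step (4)


estimate = average_intervals([(10, 30), (12, 32), (11, 29), (50, 60)], threshold=0.5)
```

Each merge removes one interval from the set, so the loop terminates after at most as many merges as there are intervals.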

Self-Synchronization
The embedding and extraction processes as described in the previous section are frame-based. That is, the host signal is divided into frames, and one watermark bit is embedded into one frame. Thus, to correctly extract the watermark, the extraction process must know the frame positions. The assumption that the extraction process knows the frame positions in advance may not be practical in some situations. For example, an attacker can attack watermarked signals by cutting a few audio samples, which is known as a cropping attack. This causes the extraction process to work improperly. How the frame positions are acquired is the frame synchronization problem.
There are two solutions to the frame synchronization problem [11]. The first solution is to bind the watermark to some invariant audio features of the host signal [35] or to perform self-synchronization [36-38]. The second solution is to embed a frame synchronization code into the host signal [39, 40].
From experiments, we found that the proposed scheme can automatically detect the watermarked frames. In order to do that, we need to modify the scheme slightly. To fully grasp the idea behind the new rules, let us start with the basic findings from this work.
Consider an audio signal with three frames of equal length $N$, where the watermark bit 1 is embedded in its middle frame by the method described in Section 2.1, as illustrated in Figure 15. Let $F_{k,N}$ denote the frame of length $N$ that starts at sample $k$. The starting and the last indices of the audio samples of the middle frame $F_{s,N}$ are denoted by $s$ and $s + N - 1$, respectively. According to the embedding and extraction processes, if we use the frame $F_{s,N}$ to construct the trajectory matrix, then we can detect the concave pattern in the singular spectrum plot. If $d$ is an integer which is less than $N$, then the frames $F_{s,N}$ and $F_{s-d,N}$ are overlapping. We discovered that the singular spectrum curve of the trajectory matrix constructed from the frame $F_{s-d,N}$ also has a concave part if the overlapping region is large enough. A similar effect occurs for the frame $F_{s+d,N}$ as well. In general, if we construct matrices from frames $F_{k,N}$ for $k = 0$ to $2N$, there are many matrices that we can interpret as having the watermark bit 1 embedded. Those matrices are the ones in which $k$ is in the neighborhood of $s$. This overlapping effect of embedding the watermark bit 1 is utilized in our automatic frame detection. We perform a scanning operation by first constructing the frames $F_{k,N}$ for $k = 0$ to the last possible frame and then extracting the watermark bit from those frames. This effect implies that we can localize the watermarked frame where watermark bit 1 is embedded by performing a scan operation. This is the reason why we need to modify the proposed scheme if we want to make it self-synchronizing. The modification is as follows.
We first divide the frame into 4 equal subframes, where each subframe has a length of $N$. Each watermark bit is represented by a four-bit string of either "0100" or "0110", depending upon the watermark bit. If the watermark bit is 0, the four bits "0100" are embedded into the 4 subframes. If the watermark bit is 1, "0110" is embedded into those subframes, as illustrated in Figure 16. For example, if the watermark bits are "001", then the subframe-embedding bits are "010001000110".
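The mapping from watermark bits to subframe-embedding bits is simple enough to state as a short sketch; the names below are hypothetical, and the example reproduces the "001" case given above.

```python
# Four-bit subframe patterns for each watermark bit, as described in the text.
SUBFRAME_PATTERNS = {"0": "0100", "1": "0110"}


def to_subframe_bits(watermark_bits):
    """Expand a watermark bit string into the bits embedded in the subframes."""
    return "".join(SUBFRAME_PATTERNS[bit] for bit in watermark_bits)


subframe_bits = to_subframe_bits("001")
```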
We define the subframe-scan operation (6) in terms of an indicator $c(u, l, F_{i,j})$, where $F_{i,j}$ is subframe $j$, for $j = 0$ to 3, of frame $i$, and $c(u, l, F_{i,j})$ is 1 if the singular spectrum curve of the matrix constructed from the subframe $F_{i,j}$ is concave on the interval $[u + 1, l - 1]$; otherwise, $c(u, l, F_{i,j})$ is 0. The meaning of this operator is that the scanner Scan[⋅], which operates on $N$ samples, scans through frame $i$ with a fixed scan-step size and returns 0 or 1, depending upon the characteristics of the singular spectra of the scanned subframes.
We use the first appearance of "1" in "0100" and "0110" as the synchronization point of watermark bits 0 and 1, respectively. If we can detect the next concavity, we interpret it as the watermark bit 1; otherwise, it is 0. Since the first detected concavity is used as the synchronization point, "0" is added at the first and the last positions of the four-bit patterns to ensure that all concavities are surrounded by convexities and that the distance between two concavities is far enough. This is the concept behind our proposed self-synchronization. An example of performing the subframe-scan operation according to (6) is shown in Figure 17.
Figure 16: Four bits of "0100" are embedded into 4 subsegments of a frame, which represents embedding "0" (a), and 4 bits of "0110" are embedded into 4 subsegments of a frame, which represents embedding "1" (b).
To detect a watermarked frame, we define another scanner, which operates on $4N$ samples, called the frame-scan operation. Given a watermarked audio signal $y = [y_0\ y_1\ \cdots\ y_{M-1}]^{T}$ of length $M$ greater than $4N$, the frame-scan operation scans from $y_0$ with a scan step of $\Delta$ until it detects the first watermarked frame.
Let four rectangular windows $W_k$, for $k = 1$ to 4, be applied to the scanned frame. The scanner stops scanning and declares a watermarked frame only when the conditions described in Table 1 are satisfied. The extracted watermark bit 0 is detected if and only if the concavity of the singular spectrum cannot be detected through the windows $W_1$, $W_3$, and $W_4$ but can be detected through window $W_2$. In comparison, the extracted watermark bit 1 is detected if and only if the concavity of the singular spectrum curve cannot be detected through the windows $W_1$ and $W_4$ but can be detected through windows $W_2$ and $W_3$. Otherwise, it continues scanning with a step size of $\Delta$. The frame-scan operation is restarted repeatedly until it reaches the end of the watermarked signal $y$.
Table 1: Conditions for stopping the frame-scan operation. "∘" means that the concavity of the singular spectrum can be detected, and "×" means that it cannot be detected.
Extracted bit   W1   W2   W3   W4
0               ×    ∘    ×    ×
1               ×    ∘    ∘    ×
An example of applying the four windows to one frame is shown in Figure 18. In this figure, the second and third subframes are embedded so that the frame-scan operation can decode the pattern of $W_k$ as the watermark bit 1.
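The stopping conditions of the frame-scan operation can be written as a small decision function. This is a sketch only: each argument is assumed to be a Boolean stating whether concavity was detected through the corresponding window, and the function name is hypothetical.

```python
def decode_frame(w1, w2, w3, w4):
    """Return the extracted bit for one window pattern, or None to keep scanning.

    Each argument is True if the concavity of the singular spectrum was
    detected through that window.
    """
    if w2 and not (w1 or w3 or w4):
        return 0          # concavity only through the second window
    if w2 and w3 and not (w1 or w4):
        return 1          # concavity through the second and third windows
    return None           # not a watermarked frame: continue scanning


decisions = [
    decode_frame(False, True, False, False),
    decode_frame(False, True, True, False),
    decode_frame(True, True, False, False),
]
```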

Evaluation
Twelve host signals from the RWC music-genre database (track numbers 01, 07, 13, 28, 37, 49, 54, 57, 64, 85, 91, and 100) [41] were used in our experiments. All have a sampling rate of 44.1 kHz, 16-bit quantization, and two channels. Unless stated otherwise, the hidden information was embedded in one channel, starting from the initial segment of the host signals. The frame size $N$ was set to 2450 samples, so the embedding capacity was 18 bits per second (bps). We chose this capacity because it is neither too low nor too high, and it seems reasonable for general applications. The window length $L$ for the matrix formation was 980. One hundred and fifty bits of the watermark were embedded in total. The audio duration of each signal was about 8.33 seconds.
The parameters $u$ and $l$, obtained from the parameter selection based on the psychoacoustic model 1, are shown in Table 2. The estimated parameters, obtained from the automatic parameter estimation, are shown in Table 3. We implemented the proposed scheme using an adaptive criterion for the predefined SMR level as follows. If the maximum SMR is greater than 25 dB, the level is set to 18. If the maximum SMR is less than 20 dB, the level is set to 12. Otherwise, the level is set to 15.
The proposed schemes were compared with the previously proposed schemes [28, 29] and the conventional SVD-based scheme [23]. There are three reasons for comparing with the conventional SVD-based scheme. First, it is one of a few blind SVD-based techniques. Second, its published results are promising. Last, both the SSA-based and SVD-based schemes belong to the same family of audio watermarking schemes; that is, they extract singular values from the host signals and embed the information into the signals by modifying those singular values. The following subsections report evaluations of the performance in terms of sound quality, robustness, and self-synchronization.
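The adaptive criterion for the predefined SMR level amounts to a three-way threshold, sketched below with a hypothetical function name. The handling of the boundary values (exactly 20 dB or 25 dB) follows the "otherwise" branch.

```python
def predefined_smr_level(max_smr_db):
    """Adaptive SMR level chosen from the maximum SMR of the host signal."""
    if max_smr_db > 25:
        return 18
    if max_smr_db < 20:
        return 12
    return 15


levels = [predefined_smr_level(v) for v in (30.0, 22.5, 10.0)]
```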

Sound-Quality Evaluation.
Three measures were chosen to evaluate the sound quality of watermarked signals: the evaluation of audio quality (EAQUAL) [42], the log-spectral distance (LSD), and the signal-to-distortion ratio (SDR). EAQUAL measures the degradation of the watermarked signal, compared with the original, and covers a scale, called the objective difference grade (ODG), from −4 (very annoying) to 0 (imperceptible).
The LSD is a distance measure between two spectra. Given that $P(\omega)$ and $\hat{P}(\omega)$ are the power spectra of the original and the watermarked signals, respectively, the LSD is defined by the following formula:
$$
\mathrm{LSD} = \sqrt{\frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 10 \log_{10} \frac{P(\omega)}{\hat{P}(\omega)} \right]^{2} d\omega}.
$$
The SDR is a power ratio between the signal and the distortion. Given the amplitudes of the original and watermarked signals, $s_{\mathrm{org}}(n)$ and $s_{\mathrm{wmk}}(n)$, the SDR is defined as follows:
$$
\mathrm{SDR} = 10 \log_{10} \frac{\sum_{n} s_{\mathrm{org}}(n)^{2}}{\sum_{n} \left[ s_{\mathrm{org}}(n) - s_{\mathrm{wmk}}(n) \right]^{2}}.
$$
The evaluation criteria for good sound quality are as follows. The ODG must be greater than −1 (not annoying), the LSD must be less than 0.4 dB, and the SDR must be greater than 25 dB. An ODG of −1 indicates that the noise in the watermarked signal is perceptible but not annoying. Based on our simulations, ODG values between 0 and −1 indicate excellent sound quality. We set the criteria for LSD and SDR to 0.4 dB and 25 dB, respectively, because we found from our preliminary experiments that either an LSD greater than 0.4 dB or an SDR lower than 25 dB can cause an annoying perception.
The comparison of the average ODGs, average LSDs, and average SDRs is shown in Table 4. The proposed scheme satisfies the inaudibility criteria and is considerably improved when compared with the SSA-based method [28]. Compared with the conventional SVD-based method and the SSA-based method with differential evolution [29], the proposed method is less inaudible. However, the difference in inaudibility among them is not significant. Based on our listening-test experiment [29], we found that signals that satisfy all of the conditions ODG > −1, LSD < 0.4 dB, and SDR > 25 dB are hardly distinguishable in terms of sound quality. Therefore, these results show that we can use the psychoacoustic model to deliver the parameters $u$ and $l$ in order to improve the sound quality of the watermarked signal obtained from the previously proposed SSA-based method [28]. However, the parameters determined by differential evolution give the best performance in terms of sound quality.
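As a minimal illustration of the SDR and of the three criteria, the sketch below assumes the usual definition of the SDR as the ratio of the signal energy to the energy of the difference between the original and watermarked samples, expressed in decibels. The function names and the toy signals are hypothetical; the ODG and LSD values are taken as given inputs because EAQUAL and the spectral analysis are not reproduced here.

```python
import math


def sdr_db(original, watermarked):
    """Signal-to-distortion ratio in dB under the assumed standard definition."""
    signal_energy = sum(s * s for s in original)
    distortion_energy = sum((s - w) ** 2 for s, w in zip(original, watermarked))
    if distortion_energy == 0:
        return math.inf    # identical signals: no distortion
    return 10 * math.log10(signal_energy / distortion_energy)


def meets_quality_criteria(odg, lsd_db, sdr):
    """Criteria from the text: ODG > -1, LSD < 0.4 dB, and SDR > 25 dB."""
    return odg > -1 and lsd_db < 0.4 and sdr > 25


toy_original = [1.0, -1.0, 1.0, -1.0]
toy_watermarked = [0.9, -0.9, 0.9, -0.9]
toy_sdr = sdr_db(toy_original, toy_watermarked)   # about 20 dB for this toy pair
```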

Robustness Evaluation.
The effectiveness of the proposed schemes in terms of robustness is measured by the watermark extraction precision. We use the bit-error rate (BER) to represent the watermark extraction precision. Given the embedded watermark bit-string $w(i)$ and the extracted watermark bit-string $\hat{w}(i)$ for $i = 1$ to $M$, where $M$ is the number of frames, the BER is defined as
$$
\mathrm{BER} = \frac{1}{M} \sum_{i=1}^{M} w(i) \oplus \hat{w}(i),
$$
where $\oplus$ is the bitwise XOR operator. The criterion for a robust scheme is that the BER must be less than 0.1 or 10%. At this level of BER, it is possible to reduce the BER further to close to 0 by adding an error correction code. Furthermore, at this level, the BER can be reduced practically and effectively by the embedding-repetition scheme. That is, a frame is segmented into several subframes, and a watermark bit is embedded repeatedly into those subframes. Then the majority rule is applied in the extraction process to decode the extracted watermark bit.
Five attacks were performed on the watermarked signals: Gaussian-noise addition with an average signal-to-noise ratio (SNR) of 36 dB, resampling at 16 and 22.05 kHz, band-pass filtering with 100-6000 Hz and −12 dB/Oct, MP3 compression at 128 kbps joint stereo, and MP4 compression at 96 kbps.
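The BER and the majority rule of the embedding-repetition scheme can be sketched as follows. The sketch assumes that the bits are given as lists of 0s and 1s and that the BER is the fraction of positions at which the two bit-strings differ; resolving a tied vote to 0 is an arbitrary choice made here. The function names and the toy bit-strings are hypothetical.

```python
def bit_error_rate(embedded, extracted):
    """Fraction of positions where the XOR of embedded and extracted bits is 1."""
    if not embedded or len(embedded) != len(extracted):
        raise ValueError("bit-strings must be nonempty and of equal length")
    errors = sum(w ^ w_hat for w, w_hat in zip(embedded, extracted))
    return errors / len(embedded)


def majority_vote(subframe_bits):
    """Decode one watermark bit from its repeated copies (ties resolve to 0)."""
    return 1 if 2 * sum(subframe_bits) > len(subframe_bits) else 0


embedded = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
extracted = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
ber = bit_error_rate(embedded, extracted)      # one error in ten bits
is_robust = ber < 0.1                          # criterion from the text
```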
The results from the robustness evaluation are shown in Table 5.The average BERs of the proposed schemes are less than 10% on almost all evaluation attacks except MP3 compression and the band-pass filtering (BPF).For MP3 and BPF, the average BERs are slightly above 10%.If we consider the overall average BERs, which is the average of BERs from all types of attacks, our proposed methods are still below 10% and less than that of the conventional SVD-based method.Table 6 shows the overall average of all methods.
Compared with the conventional SVD-based method, the proposed schemes are slightly less robust in the case of "no attack," "MP4," "AWGN," "RES16," and "BPF."However, the overall average BERs of the proposed schemes are better than that of the conventional SVD-based one.In general, when the BER is low enough (e.g., 10%), it can be reduced further by applying error correction code or by employing embedding repetitions.On the other hand, the proposed schemes outperform the conventional SVD-based method in the case of "MP3" and "BPF."Since the average BERs of the conventional SVD-based method in both cases are close to the chance level, they are hard to be improved further by those techniques.
Compared with the previously proposed SSA-based methods, the proposed scheme is somewhat less robust. Therefore, its overall performance seems to be slightly poorer than that of the SSA-based scheme with differential evolution. This issue is discussed in Section 5.
When the extraction process is not given the two embedding parameters in advance and must estimate them, the average BER increases by about 2%. The root-mean-square deviation between the estimated values and the actual values is about 2.83. Thus, the extraction process is sensitive to the correctness of the parameter values to some degree: when it extracts the watermark with less information, the BER increases.
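The root-mean-square deviation mentioned above can be computed as in the following sketch. The function name and the example values are hypothetical; they do not reproduce the parameter values of the paper.

```python
import math
from typing import Sequence


def rms_deviation(estimated: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square deviation between estimated and actual parameter values."""
    if len(estimated) != len(actual) or len(estimated) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values: each estimate is off by two index positions.
print(rms_deviation([12, 8], [10, 10]))  # 2.0
```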

Self-Synchronization Evaluation.
To test the self-synchronization, we implemented the scheme with the settings shown in Table 7. Each test signal is a randomly chosen segment of 98000 samples (about 2.2 seconds), and 10 bits of the watermark were embedded into the segment.
To detect the watermarked frames and to extract the watermark, we randomly choose the initial sample for the scan operation from the samples before the embedded segment, as depicted in Figure 19. The accuracy of the frame detection and the watermark extraction is defined as the number of correctly extracted watermark bits divided by the total number of embedded watermark bits. Since concavity can occur naturally on singular spectra, the proposed method may identify an unwatermarked segment as a watermarked frame; we call such a frame a misidentified frame. The false positive rate is defined as the number of misidentified frames divided by the total number of frames identified by the algorithm. The test results show that the accuracy of the frame detection and the watermark extraction is 80%, and the false positive rate is 6.42%.
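The two rates defined above are simple ratios. The sketch below shows them with hypothetical counts; the function names and the numbers are assumptions chosen for illustration and are not the raw counts behind the reported 80% and 6.42%.

```python
def extraction_accuracy(correct_bits: int, embedded_bits: int) -> float:
    """Correctly extracted watermark bits divided by all embedded watermark bits."""
    if embedded_bits <= 0:
        raise ValueError("embedded_bits must be positive")
    return correct_bits / embedded_bits


def false_positive_rate(misidentified_frames: int, identified_frames: int) -> float:
    """Misidentified frames divided by all frames identified by the detector."""
    if identified_frames <= 0:
        raise ValueError("identified_frames must be positive")
    return misidentified_frames / identified_frames


# Hypothetical counts: 8 of 10 embedded bits recovered,
# and 1 of 20 detected frames was not actually watermarked.
print(extraction_accuracy(8, 10))  # 0.8
print(false_positive_rate(1, 20))  # 0.05
```

Note that the two rates use different denominators: accuracy is relative to what was embedded, whereas the false positive rate is relative to what the detector reported.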

Discussion
Even though the proposed scheme satisfies the robustness and inaudibility criteria, other aspects need to be improved. In this section, five issues concerning the performance and limitations of the proposed scheme are discussed. The first two issues concern the performance of the proposed scheme. The next two concern the limitations of the currently proposed self-synchronized scheme. The last one is a general problem related to the confidentiality property.
First, we have shown that the psychoacoustic model can be used to determine the two embedding parameters. These parameters are host-signal-dependent and important because their values determine the balance between inaudibility and robustness. In our previously proposed method [29], they were determined by differential evolution.
(ii) The optimal parameters from the differential evolution depend on many factors, such as the simulations included in the optimizer [29]. Moreover, the cost function has two additional parameters. In this sense, using the psychoacoustic model reduces the number of scheme parameters.

Table 5: Robustness evaluation using BERs (%): comparison of the conventional SVD-based method (SVD) [23], SSA-based scheme (SSA) [28], SSA-based scheme with the differential evolution (SSA.DE) [29], proposed scheme without the automatic parameter estimation (Prop.), and proposed one with the automatic parameter estimation (Prop.APE) when attacks (i.e., MP3 and MP4 compression, white-Gaussian-noise addition (AWGN), resampling with 16

Method          Overall average BER (%)
SVD [23]        14.83
SSA [28]        2.57
SSA.DE [29]     4.50
Prop.           4.76
Prop.APE        6.78

However, the robustness of the proposed scheme is slightly poorer than that of the previously proposed methods. This is because only the SMR is used as guidance for the parameter determination. A low SMR can gain inaudibility but may lose robustness because lower SMRs are associated with lower singular-value indices. In addition, the components with lower SMRs are more likely to be destroyed by perceptual coding. To improve the robustness of the proposed scheme, we may include other masking phenomena, such as nonlinear excitatory masking, in the psychoacoustic model. This is left for future work.
Second, unlike the previously proposed scheme, this scheme does not modify the singular spectrum when the watermark bit 0 is embedded. We found that the effectiveness in terms of robustness is the same, whereas in terms of inaudibility the objective scores improve slightly, as shown in Table 9. The previously proposed schemes, especially the one with the differential evolution optimization, can benefit from this fact because the optimization function directly handles the tradeoff between inaudibility and robustness.
Third, the proposed self-synchronization is time-consuming. The extraction process with the self-synchronization takes up to (⌊3/⌋ + 1) times that of one without. In our simulation, the extraction process without the self-synchronization took about 1.6 seconds to extract one watermark bit, whereas the one with the self-synchronization took about 20 minutes. This explains why we simulated and evaluated the self-synchronization separately.
Fourth, although the synchronization rate of 80% of the proposed scheme with self-synchronization does not satisfy the criterion of a BER below 10%, it confirms the fundamental concepts on which the self-synchronized scheme is based. From our analysis, we found that the detection rate is determined by the algorithm that interprets the bit-string produced by the scan operation. In the proposed scheme, our algorithm uses the simplest rectangular windows to find the pattern in this bit-string. Even in the cases where the algorithm could not detect a watermark bit, we found that the bit-string correctly represented the concavity on the singular spectra. Therefore, more effective pattern recognition techniques could improve the detection rate.
Also, the false positive detection rate indicates that the algorithm sometimes detects a watermark bit where no hidden information is embedded. We investigated this problem by analyzing unwatermarked signals with the proposed automatic frame detection. We found that, in those false positive cases, there are some concavities on the singular spectra. If false positive detection is a serious concern, the problem can be solved by first detecting the natural concavity and then hiding the watermark only in the frames without concavity. Otherwise, a good pattern recognition technique is required, since we found that the bit-string patterns of natural concavity differ from those of an embedded watermark. This problem will be investigated further in the future.
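The pre-screening idea above, detecting natural concavity first and embedding only in frames without it, might look like the following sketch. Everything here is an assumption made for illustration: concavity is tested with the sign of the discrete second difference over an index range, a frame is rejected when at least half of the tested points are concave, and the function names and example spectra are hypothetical. The concavity measure actually used in the proposed scheme may differ.

```python
from typing import List, Sequence


def concave_fraction(spectrum: Sequence[float], start: int, stop: int) -> float:
    """Fraction of interior points of spectrum[start:stop] whose discrete
    second difference is negative, i.e., where the curve is locally concave."""
    segment = spectrum[start:stop]
    if len(segment) < 3:
        raise ValueError("need at least three singular values in the range")
    second_diff = [
        segment[i - 1] - 2 * segment[i] + segment[i + 1]
        for i in range(1, len(segment) - 1)
    ]
    return sum(d < 0 for d in second_diff) / len(second_diff)


def frames_without_concavity(
    spectra: Sequence[Sequence[float]], start: int, stop: int, threshold: float = 0.5
) -> List[int]:
    """Indices of frames whose spectrum is not naturally concave in the range.
    The threshold of 0.5 is an arbitrary illustrative choice."""
    return [
        k for k, s in enumerate(spectra)
        if concave_fraction(s, start, stop) < threshold
    ]


# Hypothetical singular spectra: the first decays convexly, the second is concave.
convex_frame = [16, 8, 4, 2, 1]
concave_frame = [20, 19, 17, 14, 10]
print(frames_without_concavity([convex_frame, concave_frame], 0, 5))  # [0]
```

Only frame 0 would be used for embedding in this example, at the cost of a lower payload whenever many frames are naturally concave.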
Fifth, since this work has shown that the watermarked signals can be scanned and analyzed completely blindly to detect and extract the watermark, the confidentiality of the watermark is in question. Therefore, if the secrecy of the watermark is a concern, we may need to encrypt the watermark with an encryption key before it is embedded into the host signals. Later, in the extraction process, a decryption key is required to decrypt the extracted, encrypted watermark and obtain the original one.
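As a rough illustration of encrypting the watermark bits before embedding, the sketch below XORs the bit-string with a key-derived keystream. It is an assumption, not the authors' method: the paper does not specify a cipher, the keystream here comes from Python's `hashlib.shake_256`, the same key serves for encryption and decryption, and the function names are hypothetical. A practical system should use a vetted cipher, and a keystream must never be reused across different watermarks.

```python
import hashlib
from typing import List, Sequence


def keystream_bits(key: bytes, n_bits: int) -> List[int]:
    """Derive n_bits pseudo-random bits from the key (illustrative only)."""
    n_bytes = (n_bits + 7) // 8
    stream = hashlib.shake_256(key).digest(n_bytes)
    # Unpack the bytes most-significant bit first.
    return [(stream[i // 8] >> (7 - i % 8)) & 1 for i in range(n_bits)]


def xor_with_key(bits: Sequence[int], key: bytes) -> List[int]:
    """Encrypt or decrypt a bit-string; XOR with the same keystream is its own inverse."""
    stream = keystream_bits(key, len(bits))
    return [b ^ k for b, k in zip(bits, stream)]


# Hypothetical 10-bit watermark and key.
watermark = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
key = b"example-key"
encrypted = xor_with_key(watermark, key)   # bits to embed into the host signal
recovered = xor_with_key(encrypted, key)   # applied after extraction
print(recovered == watermark)  # True
```

Because a single bit error in the extracted string flips only the corresponding decrypted bit, this kind of bitwise scheme does not amplify the BER of the extraction.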

Conclusion
The main objective of this work is to show that SSA, equipped with the psychoacoustic model, can give a good balance between inaudibility and robustness, so that it can overcome the problems in the previously proposed SSA-based method [28] and the SVD-based method. Even though the overall performance of the currently proposed schemes is poorer than that of the SSA-based one with differential evolution, the processing time is reduced considerably. Integrated with the psychoacoustic model, the SSA-based audio watermarking scheme achieves three required properties of an audio watermarking system: inaudibility, robustness, and blindness. This paper also presented a novel method for self-synchronization. The synchronization rate of the proposed self-synchronized scheme was about 80%. Improving the synchronization rate and reducing the computational time of the self-synchronized scheme are left for future work.

Figure 5: Average SMR (red) and an example of selecting the frequency band given an SMR criterion of 18 dB. If the selected bandwidth is wider than 10 kHz, it is limited to 10 kHz.

Figure 6: Examples of spectra of oscillatory components: the top-left panel is the spectrum of an audio signal, and the other panels show the spectra of some oscillatory components. The numbers labeled on the vertical axes are the orders of the components, which are the indices of their associated singular values. Compared with the spectrum of the audio signal, the spectra of the oscillatory components have narrower bandwidths.

Figure 12: Examples of regions with negative and positive concavity density.

Figure 15: Example of an audio clip with three frames and the three segments from which trajectory matrices are constructed.

Figure 18: Example of applying the four windows to the output of the scan operation.

Figure 19: The initial sample, which is before the watermarked segment, is randomly chosen for the scan operation.

Table 2: The two parameters obtained from the parameter selection based on the psychoacoustic model 1.

Table 3: Estimates of the two parameters obtained from the automatic parameter estimation in the extraction process.

Table 4: ODGs, LSDs, and SDRs: comparison of the conventional SVD-based method (SVD), SSA-based scheme (SSA), SSA-based scheme with the differential evolution (SSA.DE), proposed scheme without the automatic parameter estimation (Prop.), and proposed one with the automatic parameter estimation (Prop.APE).
and 22.05 kHz (RES 16 and RES 22.05, respectively), and band-pass filtering (BPF)) were performed. AV and SD are the average and standard deviation, respectively.
(i) The computational time is reduced considerably because the differential evolution optimization has a large search space. The comparison of the computational time is shown in Table 8. To determine the two parameters for one signal, differential evolution takes about 13 hours, whereas the

Table 6: Overall average BERs (%): comparison of the conventional SVD-based method (SVD), SSA-based scheme (SSA), SSA-based scheme with the differential evolution (SSA.DE), proposed scheme without the automatic parameter estimation (Prop.), and proposed one with the automatic parameter estimation (Prop.APE).

Table 7: Simulation conditions for the self-synchronized SSA-based audio watermarking scheme.