Robust and Blind Audio Watermarking Algorithm in Dual Domain for Overcoming Synchronization Attacks

How to effectively resist synchronization attacks is the most challenging topic in the research of robust watermarking algorithms. A robust and blind audio watermarking algorithm for overcoming synchronization attacks is proposed in the dual domain by jointly considering the time domain and the transform domain. Based on analysing the characteristics of synchronization attacks, an implicit synchronization mechanism (ISM) is developed in the time domain, which can effectively track the appropriate region for embedding and extracting watermarks. The data in this region are subjected to discrete cosine transform (DCT) and singular value decomposition (SVD) in turn to obtain the eigenvalue that carries the watermark. In order to extract the watermark blindly, the eigenvalue is quantized. A genetic algorithm (GA) is utilized to optimize the quantization step to balance transparency and robustness. The experimental results confirm that the proposed algorithm not only withstands various conventional signal processing operations but also resists malicious synchronization attacks, such as time scale modification (TSM), pitch-shifting modification (PSM), jittering, and random cropping. In particular, it can overcome TSM with strength from −30% to +30%, which is much higher than the standard of the International Federation of the Phonographic Industry (IFPI) and far superior to the other algorithms in related papers.


Related Works.
With the rapid development of network and computer technology, people edit, modify, store, and disseminate audio media easily by using various audio editing software [1][2][3]. While editing software brings us convenience, it also enables unauthorized users to perform a variety of infringements on audio media, such as malicious tampering, forgery, deletion, and unauthorized distribution. Sometimes, these infringements not only jeopardize the safety of personal property and the credibility of the audio media but may even endanger national public safety in acute cases [4][5][6]. How to effectively protect the security of audio media has become a research hotspot in information security, communication, and related fields. A robust audio watermarking algorithm focuses on preventing the watermarks hidden in the audio from being destroyed in complex environments [7,8], so it must not only withstand the conventional signal processing operations encountered when the audio is used normally but also be extremely resistant to the many malicious synchronization attacks that may change the structure of the audio.
Synchronization attacks may cause serious damage to the structure of the audio, resulting in extraction failure due to the inaccuracy of the embedding region [9][10][11][12], so they have become the most challenging attacks in the research of audio watermarking algorithms [13][14][15]. Hu et al. [16] proposed an audio watermarking algorithm based on lifting wavelet transform. The authors claimed that the algorithm had good robustness to some conventional signal processing attacks and synchronization attacks, and its payload capacity reached 43.07 bps when the SNR was over 21 dB. However, it can be seen from the experimental results that the algorithm's robustness against TSM still needs to be improved. Xiang and Huang [17] designed an audio watermarking algorithm with a constant watermark synchronization mechanism according to the insensitivity of the histogram shape of audio media. Hu and Chang [18] proposed a self-synchronous audio watermarking algorithm based on discrete wavelet transformation (DWT) and DCT.
This algorithm concealed the synchronous signal in the first approximation sub-band and recalibrated the embedding position by extracting the zero-crossing point of the synchronous signal. The experimental results showed that this algorithm was effective against some synchronization attacks but poor against some signal processing attacks. Wang et al. [19] proposed a robust audio watermarking algorithm which utilized the invariance of the exponential moment to enhance its robustness. However, the experimental results showed that it was poor against amplitude scaling and MP3 compression. Yuan et al. [20] put forward an audio watermarking algorithm that detected the mel-cepstrum coefficient as a synchronous signal when extracting the watermark in the DWT domain. Wang et al. [21] proposed a robust audio watermarking algorithm based on empirical mode decomposition. In this algorithm, the audio was evenly segmented into numerous fragments, and then each audio fragment was separated into two parts. One part was utilized to embed the synchronization code, and the other part was used to embed the watermark in the residue of higher-order statistics after empirical mode decomposition. If the synchronization codes could not be accurately acquired, watermark extraction would fail, which was a fatal shortcoming of this algorithm. Chen et al. [22] proposed an audio watermarking algorithm that embedded the watermark into the low-frequency coefficients of the audio in the DWT domain. This algorithm enhanced its robustness by increasing the embedding depth, but this also led to low transparency. In general, audio watermarking algorithms with the ability to resist synchronization attacks must have an effective synchronization mechanism, which can be used to track the embedding position [23,24].
However, most existing algorithms are usually robust to only one or two of these attacks, and some algorithms even lose robustness to conventional signal processing operations due to their excessive pursuit of robustness to some synchronization attacks. In addition, how to balance the overall performance of the algorithm by optimizing the parameters of the designed algorithm is also an issue with research significance.

Contributions.
Based on the above introduction, we can see that there are still many problems to be solved in resisting synchronization attacks. Our contributions in this paper are as follows.
(1) An ISM is developed to effectively search for the appropriate embedding region when embedding watermarks and to automatically track the region where the watermark is located when extracting watermarks. Based on analysing the characteristics of synchronization attacks, it is found that the shape of the voiced frame remains almost unchanged after being subjected to TSM, so the proposed ISM takes the sample point with the largest amplitude in the voiced frame as the synchronization mark to identify the embedding and extracting regions. When embedding watermarks, the appropriate region is searched out from the voiced frames by using the ISM, and the data in the chosen embedding region are further processed to carry the watermark. When extracting watermarks, the ISM automatically tracks the region where the watermark is located. (2) GA is utilized to optimize the key algorithm parameter to balance transparency and robustness. The data in the embedding region are processed by DCT and SVD in turn to obtain the eigenvalue that carries the watermark. In order to extract the watermark blindly, the eigenvalue is quantized when embedding or extracting the watermark, so the quantization step is an important parameter, which directly affects the transparency and robustness of the algorithm. We propose an optimal audio watermarking algorithm using GA to further enhance the overall performance of this algorithm.
Besides, this algorithm adopts several additional measures to improve the robustness, such as twice evenly segmenting the audio and embedding the same watermark into three voiced frames. The remainder of this paper is organized as follows. In Section 1, we review some related works on existing audio watermarking algorithms that can overcome synchronization attacks and then introduce the contributions of this proposed algorithm. Section 2 describes the proposed ISM and shows the implementation flowchart in detail. The principle of the proposed audio watermarking algorithm is elaborated in Section 3, which is divided into four subsections: the embedding principle, the extracting principle, the optimization of the quantization step, and the measure to further improve robustness. Section 4 evaluates the performance of this proposed algorithm and compares it with other algorithms from recent years. Finally, Section 5 draws the conclusion and outlines possible future research.

ISM for Tracking Embedding Region
Synchronization attacks may cause the position of the data in the audio to shift, which may lead to extraction failure because the location of the watermark cannot be obtained accurately [25]. Therefore, it is very important to design an effective synchronization mechanism for tracking the embedding region. If the data in the voiced frames are modified too much, the audio may not be usable because of the obvious degradation of audio quality, so a synchronization attack usually only modifies the data in redundant frames, not in voiced frames. TSM attacks of 10% and −10% are applied to an audio clip, respectively, and the waveform comparison is illustrated in Figure 1. It can be seen from the pictures that the absolute positions of the two voiced frames in this audio clip have all shifted on the time axis, but their shapes do not change much in the process of being stretched in Figure 1(b) or compressed in Figure 1(c), so it is relatively safe to conceal watermarks in these voiced frames.
If the watermark is only embedded in voiced frames and the embedding region in the audio is independent of the absolute position of the audio data on the time axis, the algorithm's ability to withstand synchronization attacks will be greatly improved. As long as the embedding region can be effectively tracked, the watermark can be accurately extracted. Based on the above analysis, an ISM is developed in our study, which can search out the appropriate embedding region when embedding watermarks and can effectively track the extracting region where the watermark is located when extracting watermarks. As shown in Figure 1, the regions between the two red dashed vertical lines are the embedding regions in the two voiced frames, and "*" indicates the synchronization mark, which is the position of the tracked sample point with the largest amplitude. It can be observed in Figure 1 that the proposed ISM accurately tracks the appropriate embedding region under TSM. Figure 2 shows the implementation flowchart of the ISM. Assume that the length of the voiced frame is N and the length of the region for embedding watermarks is Ne, with N ≥ Ne. The specific implementation process can be described as follows.
(i) Step 1: extract all the voiced frames with the length of N from the audio. (ii) Step 2: search for the sample point with the largest amplitude in each voiced frame and record its position as p.
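The steps above can be sketched as follows. The paper gives the remaining steps only in the Figure 2 flowchart, so how the window of length Ne is placed around the synchronization mark p is our assumption (here it is centred on p and clamped inside the frame):

```python
import numpy as np

def find_embedding_region(audio, N, Ne):
    """Implicit synchronization mechanism (ISM) sketch.

    Splits the fragment into frames of length N, picks the frame with the
    highest energy (the voiced frame), and places a window of length Ne
    around the sample with the largest absolute amplitude (the
    synchronization mark). The centring of the window is an assumption.
    """
    n_frames = len(audio) // N
    frames = audio[:n_frames * N].reshape(n_frames, N)
    # Voiced frame = frame with the largest energy.
    voiced_idx = int(np.argmax(np.sum(frames ** 2, axis=1)))
    voiced = frames[voiced_idx]
    # Synchronization mark = sample with the largest amplitude.
    p = int(np.argmax(np.abs(voiced)))
    # Clamp the window [p1, p2] of length Ne inside the frame.
    p1 = max(0, min(p - Ne // 2, N - Ne))
    p2 = p1 + Ne
    return voiced_idx, p, (p1, p2)
```

Because the mark is located relative to the voiced frame's own shape rather than by an absolute sample index, the same region is recovered even after the audio has been stretched or shifted on the time axis.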

Principle of Embedding Watermarks
The embedding algorithm mainly includes the following parts. Firstly, the proposed ISM is used to search for the best embedding region. Then, DCT is performed on the data in the embedding region to determine the frequency range for carrying the watermark. Finally, the DCT coefficients in the frequency range are processed by SVD to conceal the watermark by the quantization method. Figure 3 shows the principle diagram of the embedding algorithm.
Suppose that the binary watermark can be expressed as W0 = {w0(i), 1 ≤ i ≤ Lw}, where w0(i) ∈ {0, 1} and Lw is the length of W0. In order to improve the security, W0 should be encrypted before it is concealed into the audio.
Apply the logistic mapping formula to generate a chaotic sequence c(i) with the same size as W0.
Exclusive OR operation is performed on W 0 and c(i) to obtain the encrypted information W 1 , as shown in formula (3), where ⊕ stands for the exclusive OR operator. Triple key Ch(x 1 , α 0 , δ) is the unique correct key to decrypt W 1 .
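A minimal sketch of this encryption stage, assuming the standard logistic map x_{n+1} = α0·x_n·(1 − x_n) with seed x1, and interpreting δ as a number of discarded warm-up iterations (the exact roles of the triple-key components are not spelled out in the text):

```python
def logistic_keystream(x1, alpha0, delta, length):
    """Generate a binary chaotic sequence c(i) from the logistic map
    x_{n+1} = alpha0 * x_n * (1 - x_n).

    The first `delta` iterations are discarded to decorrelate the stream
    from the seed; this reading of delta is our assumption.
    """
    x = x1
    for _ in range(delta):
        x = alpha0 * x * (1 - x)
    bits = []
    for _ in range(length):
        x = alpha0 * x * (1 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits

def xor_encrypt(w0, c):
    """Formula (3): W1(i) = W0(i) XOR c(i)."""
    return [a ^ b for a, b in zip(w0, c)]
```

Since XOR is its own inverse, applying `xor_encrypt` again with the same keystream recovers W0, which is why Ch(x1, α0, δ) is the unique correct decryption key.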
Suppose that A is the original audio with K sample points, that is, A = {a(j), 1 ≤ j ≤ K}, where a(j) is the amplitude of the j-th sample point. Divide A into L1 audio fragments, namely, Al (1 ≤ l ≤ L1), and each audio fragment has K1 sample points, K1 = floor(K/L1). Then, A can be divided into two parts, namely, Ar and As, where Ar will be used for carrying the watermark, and As does not participate in the embedding process. Ar can be expressed in formula (6). The watermark W2 will be embedded into Ar; that is to say, each audio fragment Al needs to carry L2 bits of the watermark. To prevent the audio quality from decreasing too much, Al is divided into several audio frames with the length of N, and only the voiced frame Amax with the largest energy is used to carry the watermark. The proposed ISM is used to track the appropriate embedding region [p1, p2] in Amax.
We take the process of embedding the L2-bit binary watermark into Amax as an example to illustrate the core embedding scheme. Figures 4 and 5 show the main data and the flowchart of this core embedding scheme.
In our study, DCT is used to determine the frequency range where the watermark is located. Apply DCT on the data between [p1, p2] to obtain the DCT coefficients Adct, and then take the intermediate frequency coefficients Aif. The frequency range [f1, f2] for embedding the watermark can be calculated according to the formulas.
where fh is the maximum cutoff frequency of the audio, and its value is usually half of the sampling rate. Divide Aif into L2 data blocks, namely, Block(v) (1 ≤ v ≤ L2), and the length of each data block is Lsvd = floor(Lif/L2).
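The band selection can be illustrated as follows. `dct2` is a small NumPy stand-in for a library DCT-II routine, and the bin-to-frequency mapping f = fh·k/Ne is inferred from the numerical example given later in the paper (fs = 44.1 kHz, Ne = 4096, b0 = 300, Lif = 1024 yielding roughly [1.62, 7.13] kHz):

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II (a minimal NumPy stand-in for a library routine)."""
    N = len(x)
    n = np.arange(N)
    # basis[k, n] = cos(pi * (2n + 1) * k / (2N))
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    scale = np.full(N, np.sqrt(2.0 / N))
    scale[0] = np.sqrt(1.0 / N)
    return scale * (basis @ x)

def select_midband(region, fs, b0, L_if):
    """Apply DCT to the embedding region and keep the mid-frequency
    coefficients A_if = A_dct[b0 : b0 + L_if] that carry the watermark.
    Bin k is mapped to frequency f_h * k / Ne, with f_h = fs / 2."""
    Ne = len(region)
    A_dct = dct2(np.asarray(region, dtype=float))
    A_if = A_dct[b0 : b0 + L_if]
    f_h = fs / 2.0
    return A_if, (f_h * b0 / Ne, f_h * (b0 + L_if) / Ne)
```

Skipping the first b0 coefficients avoids the perceptually dominant low-frequency energy, while the upper limit keeps the watermark below the band that lossy codecs discard first.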
Apply SVD on Block(v) to obtain the eigenvalue λv, as shown in formula (8), where Uv is a single-element matrix, Vv is an orthogonal matrix with the dimension of Lsvd × Lsvd, and Sv is a row matrix with the dimension of 1 × Lsvd in which only the first element λv ≠ 0 and all other elements are equal to 0.
According to the stability characteristics of SVD, the eigenvalue λv usually does not change greatly when Block(v) changes slightly, so one bit of the binary watermark can be hidden in one eigenvalue. In order to realize blind extraction, cc = floor(λv/a) is obtained by quantization, where a is the quantization step. If the watermark bit is "0," cc is modified to be an even number; otherwise, cc is set to an odd number. The embedding rule is described in Table 1. Then, the modified eigenvalue λv′ is given by formula (9), and the modified data block Block′(v) can be reconstructed according to formula (10). Repeat the above process to modify all eigenvalues, and the L2-bit binary watermark can be concealed into all Block(v) (1 ≤ v ≤ L2). According to the process described above, each row of the binary data in W2 can be concealed into one Al (1 ≤ l ≤ L1); finally, W2 will be completely concealed into Ar. The embedding process can be described as follows.
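Since the SVD of a 1 × Lsvd row vector has a single nonzero singular value equal to the vector's Euclidean norm, the quantization-based embedding can be sketched as below. Table 1's exact rule is not reproduced in the text, so the choice of always rounding the index upward on a parity mismatch, and placing λv′ at the cell centre, is a common parity-quantization variant and an assumption on our part:

```python
import numpy as np

def embed_bit(block, bit, a):
    """Hide one watermark bit in the singular value of a 1 x L_svd block.

    lambda_v is the block's Euclidean norm; the quantization index
    cc = floor(lambda_v / a) is forced to the parity of `bit`, and the
    block is rescaled so its singular value lands at the cell centre.
    """
    lam = np.linalg.norm(block)
    cc = int(np.floor(lam / a))
    if cc % 2 != bit:        # parity mismatch: move to the next cell (assumed rule)
        cc += 1
    lam_new = (cc + 0.5) * a  # centre of the chosen quantization cell
    # Block'(v) = U * S' * V^T reduces to a simple rescaling of the block.
    return block * (lam_new / lam)
```

Placing λv′ at a cell centre gives the extractor a half-step margin of error before the recovered parity flips, which is what makes the scheme robust to mild signal processing.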
(i) Step 1: convert W 1 into W 2 with the size of L 1 × L 2 .
(ii) Step 2: divide the original audio A into L 1 audio fragments A l with the same length. (iii) Step 3: divide A l into audio frames with the length of N and find out the voiced frame A max .

Principle of Extracting Watermarks
The extracting algorithm is the inverse process of the embedding algorithm, and its principle is shown in Figure 6, in which the "core extracting scheme" is the most important part of the whole extracting algorithm. The process of extracting the watermark can be described as follows.
(i) Step 1: divide the carried audio A′ into L1 audio fragments Al′ with the same length. (ii) Step 2: divide Al′ into audio frames with the length of N and find out the voiced frame Amax′. (iii) Step 3: track the region containing the watermark by using the ISM in Amax′. (iv) Step 4: apply DCT on the data in the tracked region and take the intermediate frequency coefficients. (v) Step 5: divide the coefficients into L2 data blocks and apply SVD on each block to obtain the eigenvalues. (vi) Step 6: quantify all eigenvalues and judge their parity to obtain the L2-bit binary watermark; the extracting rule can be expressed in formula (11). (vii) Step 7: repeat Step 2 to Step 6 until all watermark bits are extracted. (viii) Step 8: decrypt and recover the watermark w1′ from w2′. Figure 7 shows the flowchart of the core extracting scheme. In particular, the key parameters in the extracting algorithm should be consistent with the corresponding parameters in the embedding algorithm, including N, Ne, L1, L2, b0, Lif, and a.
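The parity test of Step 6 (formula (11)) reduces to a one-liner, again using the fact that the single singular value of a 1 × Lsvd block is its Euclidean norm:

```python
import numpy as np

def extract_bit(block, a):
    """Formula (11)-style rule: recover the bit from the parity of
    floor(lambda' / a), where lambda' is the block's singular value."""
    lam = np.linalg.norm(block)
    return int(np.floor(lam / a)) % 2
```

No reference to the original audio is needed, only the shared quantization step a, which is what makes the extraction blind.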

Optimization of the Quantization Step
From the watermark embedding principle mentioned above, the quantization step is an important parameter, which directly affects the transparency and robustness of the algorithm. In order to balance the algorithm performance, GA is used to search intelligently for the optimal quantization step. The fitness function Fitness is constructed from SNR and BER, as shown in formula (12).
The optimization is subject to SNR > SNR0, where SNR0 is the lower threshold of transparency; the selected quantization step should not make the algorithm transparency lower than SNR0 dB. SNR and BER are defined in the formulas.
where A and A′ represent the original audio and the carried audio, respectively, and w(i) and w′(i) denote the original watermark and the extracted watermark. The population consists of C1 chromosomes; each chromosome has the length of C2, is encoded with a binary encoding approach, and can be converted into a quantization step. Formula (14) describes the transformation relationship between each chromosome CHr and each quantization step ar (1 ≤ r ≤ C1).
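The two metrics can be computed as:

```python
import numpy as np

def snr_db(original, watermarked):
    """SNR in dB between the original and the carried audio:
    10 * log10(signal energy / embedding-noise energy)."""
    original = np.asarray(original, dtype=float)
    noise = np.asarray(watermarked, dtype=float) - original
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

def ber(w, w_ext):
    """Bit error rate: fraction of mismatched bits between the embedded
    watermark and the extracted watermark."""
    w, w_ext = np.asarray(w), np.asarray(w_ext)
    return float(np.mean(w != w_ext))
```

A larger quantization step raises BER robustness but lowers SNR, which is exactly the trade-off the fitness function has to arbitrate.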
where B2D[CHr] means converting CHr from binary to decimal. The detailed process can be described as follows.
(i) Step 1: set the parameters, including the crossover probability pc and the mutation probability pm, and then generate an initial population POP0. (ii) Step 2: calculate the quantization step ar according to formula (14), execute the embedding algorithm proposed in Section 3.1, and evaluate the fitness after the carried audio is subjected to some attacks.
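The decoding of formula (14) presumably maps the binary chromosome onto a search range for the step; since the paper's exact scaling is not shown, the linear mapping and the range bounds a_min and a_max below are our assumption:

```python
def chrom_to_step(ch, a_min, a_max):
    """Formula (14)-style decoding: convert a binary chromosome CH_r
    (a list of 0/1 of length C2) into a quantization step a_r by mapping
    B2D[CH_r] linearly onto an assumed search range [a_min, a_max]."""
    c2 = len(ch)
    b2d = int(''.join(str(b) for b in ch), 2)  # B2D[CH_r]
    return a_min + (a_max - a_min) * b2d / (2 ** c2 - 1)
```

With this decoding, the all-zeros chromosome yields a_min, the all-ones chromosome yields a_max, and crossover/mutation explore the range at a resolution of (a_max − a_min)/(2^C2 − 1).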

Mathematical Problems in Engineering
3.4. Measure to Further Improve Robustness.
In order to improve the robustness of the algorithm, the same row of the binary watermark can be repeatedly embedded into the three voiced frames with the highest energy in Al. When extracting watermarks, three groups of binary watermarks are extracted from the three voiced frames, respectively, and compared bit by bit to obtain a more accurate group of binary watermarks according to formula (15), where w21′(u, v), w22′(u, v), and w23′(u, v) are the three groups of binary watermarks extracted from the three voiced frames, respectively.
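The bitwise comparison of formula (15) amounts to a majority vote over the three extracted copies:

```python
def majority_vote(w1, w2, w3):
    """Formula (15)-style bitwise majority over the three watermark copies
    extracted from the three highest-energy voiced frames: a bit is 1 if
    at least two of the three copies agree on 1."""
    return [1 if (a + b + c) >= 2 else 0 for a, b, c in zip(w1, w2, w3)]
```

This is the decoder of a triple repetition code: any single corrupted copy of a bit is outvoted by the two intact copies.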

Performance Evaluation
In this section, the performance of the proposed algorithm will be tested. The quality of the audio can be evaluated in three ways: SNR, the objective difference grade (ODG), which is one of the outputs of the perceptual evaluation of audio quality (PEAQ), and the mean opinion score (MOS). According to the standard of IFPI, SNR should be greater than 20 dB for the audio to have good transparency. BER can be used to evaluate the algorithm robustness; generally, a small BER means that the algorithm has strong robustness against various attacks. According to the standard of IFPI, the BER of the extracted watermark should be no more than 20% when the carried audio is attacked. NC can be used to compare the similarity between the original watermark and the extracted watermark, as shown in formula (16). When NC is close to 1, the original watermark is very similar to the extracted watermark.
The experimental parameters are as follows.

Transparency and Capacity.
The payload capacity of this algorithm can be calculated according to formula (17) as Capacity = (L1 × L2)/T, where T is the time length of the audio that carries the watermark. In our study, T is equal to 64 seconds, so the payload capacity is ((128 × 32 bit)/64 s) = 64 bps. The average values of the payload capacity (bps), SNR (dB), ODG, MOS of the audio, BER (%), and NC of the extracted watermark are listed in Table 2.
In our test, all the audio signals are processed with the four algorithms mentioned in Table 2, respectively, to obtain four groups of carried audio signals, which were provided to 20 listeners (10 males and 10 females, aged between 18 and 60 years) to obtain MOS scores. Table 2 shows that this algorithm has good transparency because the average SNR is up to 25.96 dB, ODG is −0.99, and MOS is 4.5 while the payload capacity is 64 bps, which is higher than the standard of IFPI. Most importantly, BER is equal to 0, and NC is equal to 1, which indicates that this algorithm has good robustness when there is no attack, so the extracted watermark image in Figure 8(a) is the same as the original image shown in Figure 8(p). Compared with the algorithms in papers [16] and [23], this proposed algorithm has a larger payload capacity and better transparency, and its robustness is stronger than that in paper [23]. Although the payload capacity of this algorithm is not as high as that of the algorithm in paper [22], its transparency is superior. Besides, this proposed algorithm is more robust than the other three algorithms, which will be discussed in Section 4.2. Figure 9 shows the waveform of the audio without attack before and after embedding the watermark (we only display an audio clip lasting about 3 seconds to clearly show the details), and the corresponding spectrograms are shown in Figure 10. There is no obvious change in the waveform or spectrogram of the audio before and after embedding the watermark, which indicates that the transparency of this algorithm is good.

Robustness.
This section evaluates the algorithm robustness by BER and NC when resisting various conventional signal processing operations and synchronization attacks and compares the experimental results with the algorithms in three related papers.

Conventional Signal Processing Operations.
Conventional signal processing operations are the most common attacks encountered by audio in the process of being used and spread, and they may cause damage to or even loss of the watermark hidden in the audio, so the watermarking algorithm must have strong robustness to withstand these attacks. These operations mainly include the types listed in Table 3.
BER (%) and NC of the extracted watermark are averaged and listed in Table 4 under these signal processing operations. The extracted images whose NC values are closest to the average value are shown in Figure 8. According to the experimental results in Figure 8 and Table 4, this algorithm has strong robustness against conventional signal processing operations, which can be summarized as follows.
When resisting noise corruption with 30 dB and 40 dB, requantization, low-pass filtering with a cutoff frequency of 12 kHz, and echo addition with 50 ms, the extracted images are almost the same as the original image. BER values are equal to 0, and NC values are equal to 1, which indicates that the proposed algorithm has excellent robustness against these attacks.
When resisting MP3 compression, noise corruption with 20 dB, low-pass filtering with cutoff frequency of 8 kHz, resampling, echo addition with 50 ms, and amplitude scaling, the extracted images are similar to the original image. BER values are below 1.28%, and NC values are above 0.9740, so the proposed algorithm has good robustness against these attacks.
When resisting low-pass filtering with a cutoff frequency of 4 kHz, the former half of the extracted watermark image is very clear, while the other half is completely blurred; NC is 0.7458, and BER is 25.64%, as shown in Figure 8(k). The reason for this phenomenon is mainly related to the algorithm parameters, including the length Ne of the data processed by DCT, the region [b0 + 1, b0 + Lif] for embedding the watermark, and the sampling rate fs of the audio. In our experiment, the sampling rate is 44.1 kHz, Ne = 4096, b0 = 300, and Lif = 1024, so the embedding frequency range [f1, f2] can be calculated as follows.
It can be seen from formulas (18) and (19) that the watermark is concealed in the frequency range of [1.62, 7.13] kHz in the audio, so low-pass filtering with a cutoff frequency higher than 7.13 kHz or lower than 1.62 kHz has almost no effect on the watermark.
When resisting low-pass filtering with a cutoff frequency of 8 kHz, the extracted watermark is relatively clear, NC is 0.9990, and BER is 0.05%. However, because the upper limit of the embedding region is 7.13 kHz, which is very close to the cutoff frequency (8 kHz) of the filter, and the low-pass filter has 3 dB amplitude attenuation near the cutoff frequency, there are still a few noise points in the extracted watermark image shown in Figure 8(l). When resisting low-pass filtering with a cutoff frequency of 12 kHz, the extracted watermark is the same as the original image, as shown in Figure 8(m); NC is equal to 1, and BER is equal to 0. When the audio is subjected to low-pass filtering with a cutoff frequency of 4 kHz, the frequency components above 4 kHz in the audio are removed, so the watermark in the frequency range of [1.62, 4] kHz can be extracted (the former half of the image in Figure 8(k) is very clear), while the watermark in the frequency range of [4, 7.13] kHz cannot be extracted (the other half in Figure 8(k) is completely blurred). In practical application, the cutoff frequency of the low-pass filter and the frequency range of the embedding region should be staggered by adjusting the algorithm parameters to prevent watermarks from being damaged.

Synchronous Attack.
Synchronization attacks are the most challenging type of attack in the research of robust watermarking algorithms.
In Table 5, there are four kinds of synchronization attacks with different strengths for testing the robustness, including TSM, PSM, jittering, and random cropping. After the audio is subjected to the above synchronization attacks, the BER (%) and NC values of the extracted watermark are, respectively, averaged and listed in Tables 6-9. The extracted images whose NC values are closest to the average value are shown in Figures 11-14.
(1) TSM. Table 6 shows the average BER (%) and NC of the extracted watermark under TSM with different strengths from −30% to +30%. It can be seen from the experimental results in Table 6 and Figure 11 that this algorithm has excellent robustness for overcoming TSM attacks with different strengths. BER values are all below 15.31%, which is far superior to the standard of IFPI. NC values are all above 0.8267, so the content of the extracted images can still be distinguished.
(2) PSM. When the audio is subjected to PSM, its playing time will not change, but the position and shape of the voiced frame will change slightly. Table 7 shows the average BER (%) and NC of the extracted watermark under PSM with different strengths from −5% to 5%. Although the extracted images are not very clear in Figure 12, their content can still be distinguished.
(3) Jittering. Table 8 shows the average BER (%) and NC of the extracted watermark under jittering with different strengths from 1/100000 to 1/500. The extracted images are shown in Figure 13. As shown in Figure 13(a), under the maximum attack strength (1/500), the extracted image contains more noise points, but its main feature can still be identified; NC is 0.9303, and BER is 3.68%. As the attack strength weakens, the extracted images become more and more similar to the original image in Figure 8(p), and the BER values become smaller and smaller, so this proposed algorithm has strong robustness against jittering.
(4) Random Cropping. The average BER (%) and NC of the extracted watermark under random cropping with different strengths are shown in Table 9. From the experimental results, the extracted images in Figure 14 are all relatively clear, NC values are above 0.9754, and BER values are below 1.90%, all of which show that this proposed algorithm is robust against random cropping.

Comparative Analysis of Robustness.
In order to compare the algorithm robustness with the related algorithms in papers [16], [22], and [23], Table 10 lists the BER (%) values of these algorithms when resisting signal processing operations and synchronization attacks. From the experimental results in Tables 2 and 10, the comparative analysis of these four algorithms is as follows. Compared with the algorithm in paper [16], this proposed algorithm has a larger payload capacity, higher transparency, and better robustness against synchronization attacks and conventional signal processing operations, except for some attacks, such as MP3 compression with 64 kbps, low-pass filtering with a cutoff frequency of 4 kHz, and PSM. According to the embedding principle of this algorithm, the frequency band where the watermark is located can be changed by modifying the algorithm parameters in practical application, so this proposed algorithm can in fact overcome low-pass filtering with a cutoff frequency of 4 kHz. In the following comparative analysis with the other two algorithms, this viewpoint will not be reiterated. The robustness of this proposed algorithm is far superior to the algorithms in papers [22] and [23], although the payload capacity is slightly lower than that in paper [22]. Synchronization attacks may change the overall structure of the audio, but they have little effect on the shape of the voiced frame. The proposed ISM in this algorithm can accurately track the position of the largest amplitude in the voiced frame to determine the embedding region where the watermark is located.
Therefore, this algorithm has strong robustness against various malicious attacks.

Security.
The watermark hidden in the audio is protected by encryption technology and information hiding technology, so it is necessary to analyse the security of this algorithm from the key space constructed by both. The proposed algorithm uses a triple key Ch(x1, α0, δ) to encrypt the watermark and seven key parameters (N, Ne, L1, L2, b0, Lif, a) to conceal the watermark. Ch(x1, α0, δ) and a are taken in the real field, so this algorithm has an infinite key space in theory. In practice, they are limited by the word length, so their key space is finite. In our test, the computer system is 64-bit; N, Ne, and Lif are all 16-bit, and L1, L2, and b0 are 10-bit, so the key space to encrypt the watermark can be calculated as 2^192, and the key space to conceal the watermark is 2^142. From the above analysis, even if the attacker obtains the principle of the algorithm, as long as these key parameters are not known, it is difficult for the attacker to obtain the watermark.
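The key-space totals follow from the stated word lengths; the breakdown below is our reading of how the bit widths combine:

```python
# Key-space tally under the word lengths reported in the text. The paper
# states only the totals (2^192 and 2^142); how the widths combine is our
# assumption.
encrypt_bits = 3 * 64                 # triple key Ch(x1, alpha0, delta), 64-bit each
conceal_bits = 3 * 16 + 3 * 10 + 64   # N, Ne, Lif (16-bit); L1, L2, b0 (10-bit); a (64-bit)
assert encrypt_bits == 192
assert conceal_bits == 142
```

Both spaces are far beyond brute-force range, and an attacker must recover the embedding parameters and the chaotic key simultaneously to read the watermark.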

Complexity.
The complexity of the algorithm is an important index for evaluating its performance. It is usually measured by the computational cost of embedding and extracting the watermark. In our experiment, the average time for extracting the watermark is 0.8544 s, and that for embedding the watermark without GA is 1.6246 s. When GA is used to search for the best algorithm parameter, the embedding time depends on the evolution time of GA. It can be seen that GA enables our proposed algorithm to achieve a good balance between transparency and robustness, but it also brings a large computational cost. When embedding the watermark, the average time for searching the embedding region in a voiced frame is 0.037 ms; when extracting the watermark, the average time for tracking the extracting region in a voiced frame is 0.036 ms. It can be seen that our proposed ISM takes up very little computational cost to find the synchronization marks.

Conclusions
In our study, it is found that the playing time of the audio becomes longer or shorter after being attacked by TSM, but the shape of the voiced frame basically does not change. Therefore, an ISM which can search for the embedding region where the watermark is located is developed, in which the sample point with the largest amplitude in the voiced frame is taken as the synchronization mark. GA is utilized to optimize the key algorithm parameter to balance transparency and robustness. Combining the "energy concentration" characteristic of DCT and the stability characteristic of SVD, a robust and blind audio watermarking algorithm with ISM and GA is proposed for overcoming malicious synchronization attacks and conventional signal processing operations. The following measures are taken to improve the algorithm robustness. Firstly, the proposed ISM can accurately track the region where the watermark is located; even if the structure of the audio changes slightly, the ISM can accurately find the synchronization mark in the voiced frame to track the region used to embed and extract the watermark. Secondly, GA is utilized to optimize the key algorithm parameter to balance transparency and robustness. Thirdly, the audio is divided evenly twice to avoid the drift of the embedding region caused by the change of the audio structure. At last, the watermark is repeatedly embedded in three voiced frames to improve the algorithm robustness; embedding the same watermark in the three voiced frames is equivalent to embedding the watermark with a triple repetition code. The experimental results confirm that this proposed algorithm has excellent robustness with a payload capacity of 64 bps; it can not only withstand conventional signal processing operations but also resist TSM, PSM, jittering, and random cropping. In particular, this algorithm even stands up to TSM with strength from −30% to +30%.
Although the proposed algorithm has excellent robustness when overcoming TSM, jittering, random cropping, and various conventional signal processing operations, its experimental results under PSM are not good enough, mainly because PSM shifts the synchronization mark in the voiced frame, which leads to error bits in the extracted watermark. Therefore, the performance of this algorithm is insufficient against some attacks, such as deliberately distorting the peak amplitude points to remove the synchronization mark, which will be further studied in our future work. In addition, GA is used to optimize the key algorithm parameter, which is very helpful for balancing transparency and robustness. However, GA needs a long evolution time to search for the optimal algorithm parameters, which greatly increases the computational cost; therefore, this algorithm is not suitable for applications with strict time requirements. In future research, we will strive to enhance the security and robustness against more types of synchronization attacks.

Data Availability

The data used to support the findings of the study are included within the article and are obtained from a public platform.

Conflicts of Interest
The authors declare that they have no conflicts of interest.