Commutative Watermarking-Encryption of Audio Data with Minimum Knowledge Verification

Introduction
Commutative watermarking-encryption (CWE) means the combination of watermarking and encryption in such a way that the encryption and watermarking functions commute; that is,

$$M_{WK}(E_K(O), m) = E_K(M_{WK}(O, m)), \qquad (1)$$

where $M$ is the watermarking function, $E$ is the encryption function, $K$ is the encryption key, $WK$ is the watermarking key, $O$ is the cleartext media data, and $m$ is the mark to be embedded. If encryption and watermarking do commute, their combination can serve as an important building block within a Digital Rights Management (DRM) System, as detailed further in Section 2. In the present paper, an existing CWE concept for still images [1] is extended to audio files. To the best of our knowledge, this is the first CWE scheme for audio files to appear in the literature. In addition, we show that the presented CWE scheme can be integrated into a modified version of a protocol due to Craver and Katzenbeisser [2], enabling zero-knowledge verification of the watermark, meaning a verifier can verify the presence of a watermark without disclosure of the mark $m$ or the watermarking key $WK$. The rest of the paper is organized as follows: in Section 2, we motivate the need for CWE schemes and identify some basic requirements. In Section 3, we briefly review existing CWE schemes for still images and encryption/watermarking techniques for audio files, with a special emphasis on those algorithms using techniques similar to those in our approach. In Section 4, we present our CWE scheme in detail. Section 5 provides experimental results on the robustness and fidelity of the watermarking part. Section 6 presents the integration of the CWE scheme into a zero-knowledge protocol for verifying the mark, and Section 7 concludes the paper.

Motivation for CWE
The concept of commutative watermarking-encryption (CWE) was first discussed in [3] with a special emphasis on watermarking in the encrypted domain. From the left-hand side of (1) it is clear that the watermarking function M must be able to act in the encrypted domain, which means that only a limited set of audiovisual features (if any) is available to the embedder and can be used to embed the mark.

Dispute Resolve Protocols.
The prime motivation to look at CWE schemes originates from the need to implement so-called Dispute Resolve Protocols, where a rights owner provides a digital media object O to a distributor, who in turn sells O to some customer. In this scenario, a number of attacks are possible, most importantly the case where the customer sells a copy of O in his own right. In particular, if such a copy is detected, the Dispute Resolve Protocol must be able to identify the rights owner as the rightful owner of O and to identify the customer as the offending party.
An obvious solution is that the rights owner embeds a watermark identifying herself as the rightful owner into O and provides the marked object to the distributor. The distributor in turn marks the object for each customer with an additional watermark uniquely identifying that customer. Unfortunately, in this scenario the distributor is able to generate multiple identical copies of a marked object and sell them to multiple customers. If these copies are marked with the identifier of one specific customer, the distributor can repudiate having generated the copies, and that customer could be held responsible for the distributor's offence.
The basic problem here is that the distributor has access to the marked object in plaintext. If a CWE scheme is available, however, the following protocol between a generic seller and a generic buyer becomes possible, as proposed in [4]:
(1) The seller encrypts O with her symmetric key. The result is the ciphertext of O under this key.
(5) The buyer removes his encryption and is in possession of the individually marked object.
If the distributor takes the role of the seller in this protocol and the rights owner performs the encryption and decryption operations in steps (1) and (4), respectively, the problem mentioned above can be solved, provided a CWE scheme for the media object O is available. The need for a CWE scheme becomes obvious in steps (3) and (4), where an encrypted media object is watermarked and the presence of a watermark is verified in an encrypted object, respectively. Moreover, steps (3) and (4) call for a public key watermarking scheme, where there is a private embedding key and a public detection key, or an asymmetric scheme, where it is possible to verify the existence of a watermark without fully disclosing the embedding key or the watermark itself.

DRM Systems.
In Digital Rights Management (DRM) Systems [5], encryption and watermarking are often combined in a natural way: the media data are transferred to a compliant media player in encrypted form, so that access to the plaintext data happens only under control of the compliant player. In addition, watermarks are embedded into the media data to have an additional layer of protection which is present even after the data have been decrypted. These watermarks can be used to claim copyright, enforce copying restrictions, or track illegal copies offered on the Internet. If a CWE scheme is used, compliant media players have the opportunity to detect and insert watermarks even in encrypted data. More generally, it should be possible to protect multimedia data throughout the distribution chain in a flexible way by allowing the encryption and watermarking operations to commute [6].

Searching in Encrypted Databases.
With the advent of cloud computing, new security challenges have arisen. For example, cloud computing clients need to secure their data, not only to protect their data from public attacks, but also to protect their data from their cloud service provider [7]. Thus, clients need to encrypt their data in the cloud. On the other hand, a cloud service provider or a client often has the need to search through the client data according to certain metadata or tags. It is therefore highly desirable to provide techniques which can protect the clients' privacy and offer a large amount of accessibility at the same time. CWE schemes can provide such a solution, if metadata are used as watermarks and embedded into the encrypted data.

CWE Schemes for Image Data.
To the best of our knowledge, no CWE schemes for audio data have been proposed so far. However, there have been a number of attempts aimed at still images, of which we only review the so-called invariant encryption approach, as it is also used in our audio CWE scheme. For a more comprehensive review of existing CWE schemes for still images, see [8].
The invariant encryption approach to CWE as introduced in [1] is to encrypt the media data completely (as opposed to the partial encryption approach, which leaves part of the data unencrypted to host the watermark), but to use a weaker cipher that leaves a feature space of the media data invariant. This invariant feature space can be used to embed a watermark. For example, a permutation cipher can be used for encryption, leaving the global first-order statistics of the image untouched. The invariant feature space is therefore represented by the image histogram and a histogram-based algorithm can be used to embed the mark. The advantage of the invariant encryption approach is that all media data are encrypted (and not just a subset as in partial encryption schemes). The disadvantage, besides using a weaker cipher, is an inherent lack of robustness of the watermark.
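The invariance property can be illustrated with a minimal Python sketch. It is not the cipher of [1]: the seeded NumPy permutation merely stands in for a keyed permutation cipher, and the function names, bin count, and value range are assumptions chosen for illustration.

```python
import numpy as np

def permute_samples(samples, seed):
    """Toy permutation cipher: reorder the samples, leave their values untouched."""
    rng = np.random.default_rng(seed)        # the seed plays the role of the key
    perm = rng.permutation(samples.size)
    return samples[perm]

def amplitude_histogram(samples, n_bins=64, value_range=(-32768, 32768)):
    """Global first-order statistics: number of sample values per bin."""
    counts, _ = np.histogram(samples, bins=n_bins, range=value_range)
    return counts

rng = np.random.default_rng(0)
plain = rng.integers(-2000, 2000, size=10_000)   # stand-in for pixel or audio values
cipher = permute_samples(plain, seed=42)

# The sample order changes, the histogram does not.
same_histogram = np.array_equal(amplitude_histogram(plain),
                                amplitude_histogram(cipher))
```

Any embedder that reads and modifies only the histogram therefore sees the same feature space before and after encryption.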

Audio Ciphering Methods.
Let the audio signal consist of a set of sample values indexed by a discrete time coordinate. Most existing audio ciphering methods like, for example, [9] or [10], substitute the audio sample values and change (i.e., flatten) the global histogram of the amplitudes of the sample values. The flattening of the histogram makes it impossible to use the histogram for embedding a watermark. In [11], however, a permutation cipher is used to permute the sample values in the time domain, thereby keeping the histogram invariant. This shows that it is possible to transfer the invariant encryption approach to audio data.

Audio Watermarking Methods.
From the host of existing audio watermarking methods (see [12] for an overview), the method by Xiang et al. [13] is the most important for our work, as it uses (a part of) the amplitude histogram for embedding the mark. The range of the audio sample values is split into equal-sized bins. The amplitude histogram is a vector whose entries give the number of samples falling into each bin. As embedding the mark has altered the mean value of the amplitude values, for extraction, the correct mean value has to be searched within a search space. For each mean value in the search space, the corresponding histogram is formed and the distance between the leading extracted bits and a known synchronization sequence sync of the same length is computed. The mean value associated with the minimum distance is used to extract the remaining watermark bits.
The described synchronization process helps to make the watermark robust against TSM attacks (cf. Section 5.2). Although the watermarking scheme is based on the histogram, it cannot be used in conjunction with a permutation cipher to form a CWE scheme, because only a certain number of sample values in a histogram bin are modified. Therefore, after application of the permutation cipher, different sample values than before are modified, which destroys the commutativity property. Moreover, the scheme by Xiang et al. does not use a secret watermarking key.

The Proposed CWE Scheme
The proposed scheme is based on the earlier ideas [1,13] described in Section 3. In order to apply them in the audio domain and in order to make the overall scheme more robust to TSM attacks, some modifications were necessary, which are described in the following paragraphs.

Ciphering Algorithm.
An analogue audio signal is transferred into the digital domain by sampling the time-continuous signal at a certain discrete sampling rate. At the same time, the obtained samples are quantized according to the bit depth available, the result being a set of sample values indexed by a discrete time coordinate. Common bit depths for representing audio are 16, 20, or 24 bit. The general idea is to permute the discrete points in time, while leaving the sample values untouched. In order to generate the permutations, the discrete version of Arnold's Cat Map [14] was used, because it is a well-known chaotic map used by many authors for generating permutations in image ciphering (see, e.g., [15]). The discrete Cat Map is a two-dimensional map defined on an $N \times N$ square grid by

$$\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & a \\ b & ab + 1 \end{pmatrix} \begin{pmatrix} x_n \\ y_n \end{pmatrix} \bmod N,$$

where $a$ and $b$ are parameters that can serve as the secret key if the function is used for encryption purposes. Two-dimensional permutations of the square grid can be quickly generated by repeated application of the Cat Map. Note, however, that there are only $N^2$ different keys. Therefore, it has been proposed in [16] to change the secret parameters in every iteration of the Cat Map. In order to apply the Cat Map to a discrete audio signal, the signal is rearranged into a square grid whose side length is the square root of the signal length. If the signal length is not a square number, the signal is padded with random sample values having the same probability distribution (i.e., the same histogram) as the original signal. This makes sure that the padded values cannot be distinguished from the original values by an attacker. Moreover, the original histogram is largely unchanged by the padding (cf. Figure 1). Figure 2 shows the effect of the Cat Map after five iterations on the waveform of an example signal. The resulting PSNR between original and enciphered signal is 16.47 dB.
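The ciphering step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the common generalized form of the discrete Cat Map with matrix [[1, a], [b, ab + 1]] modulo the grid side length, draws the padding by resampling from the signal itself, and uses hypothetical function names and parameter values.

```python
import math
import numpy as np

def to_square_grid(signal, rng):
    """Pad the signal to the next square length and reshape it to a square grid.
    Padding values are drawn from the signal itself, so they follow its histogram."""
    side = math.isqrt(signal.size)
    if side * side < signal.size:
        side += 1
    padding = rng.choice(signal, size=side * side - signal.size)
    return np.concatenate([signal, padding]).reshape(side, side)

def _targets(n, a, b):
    """Grid coordinates and their images under one Cat Map iteration."""
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return x, y, (x + a * y) % n, (b * x + (a * b + 1) * y) % n

def cat_map_encrypt(grid, a, b, iterations=1):
    """Move every sample from (x, y) to its Cat Map image, `iterations` times."""
    x, y, new_x, new_y = _targets(grid.shape[0], a, b)
    out = grid
    for _ in range(iterations):
        scrambled = np.empty_like(out)
        scrambled[new_x, new_y] = out[x, y]   # the map is a bijection (det = 1)
        out = scrambled
    return out

def cat_map_decrypt(grid, a, b, iterations=1):
    """Undo cat_map_encrypt by reading every sample back from its image position."""
    _, _, new_x, new_y = _targets(grid.shape[0], a, b)
    out = grid
    for _ in range(iterations):
        out = out[new_x, new_y]
    return out

rng = np.random.default_rng(7)
signal = rng.integers(-2000, 2000, size=1000)   # stand-in for audio samples
grid = to_square_grid(signal, rng)              # 32 x 32 grid, 24 padded samples
cipher = cat_map_encrypt(grid, a=3, b=5, iterations=5)
restored = cat_map_decrypt(cipher, a=3, b=5, iterations=5)
```

Because the map only moves samples around, the multiset of sample values, and hence the amplitude histogram, is identical before and after encryption.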

Basic Principles.
The design goals for the watermarking algorithm to be used within our proposed CWE scheme were as follows:
(i) The watermarking algorithm should commute with the permutation cipher in the sense of (1).
(ii) It should be robust against Time-Scale Modification (TSM) attacks (see Section 5.2).
(iii) It should be able to use a long watermarking key in order to prevent an attacker from inserting her own watermark.
These goals call for a combination of the watermarking concepts described in [1,13]: in order to have full commutativity with the permutation cipher, it is necessary to swap entire histogram bins. These swaps can be randomized using a secret watermarking key as described in [1]. However, this procedure can imply a substantial change of the histogram mean. In order to deploy a synchronization procedure for robustness against TSM attacks as described in [13] [...] swapped. However, it might be changed by a TSM attack, which could lead to the detector choosing the wrong bin pairs for extracting the mark. Therefore, the bin pairs used to embed the mark now need to serve as watermarking key, as opposed to an initial seed for a pseudorandom number generator as in [1].
Detection.
The detector needs to know the original mean value of the samples of the unmarked cover work, along with the synchronization sequence sync and the sequence of bin pairs used for embedding the mark, which serves as a watermarking key. As described in [13], for finding the correct mean value after a potential TSM attack, the detector first computes a search space of candidate mean values, ranging in fixed increments from (1 − Δ) times to (1 + Δ) times the original mean value, where Δ is a parameter governing the size of the search space. Now, for each member of the search space, the corresponding histogram part is formed and a synchronization sequence is extracted from it by comparing the leading histogram bin pairs given in the watermarking key, as many as sync has bits. For each histogram part, the extracted synchronization sequence is compared with sync using the Rogers-Tanimoto dissimilarity [17] given in (4).
In what follows, we provide a lower bound of the number of possible keys. The watermarking key consists of bin pairs, one per watermark bit, chosen from the relevant part of the amplitude histogram. We divide the bins of the relevant part into as many equal parts as there are bin pairs.
Assuming that, in each pair, the first bin comes from a different part and the second bin is chosen at a distance ≤ step_max from the first, the number of possibilities to choose a single bin pair is step_max times the number of bins in one part. Because there is one bin pair per watermark bit and the order of the pairs is important, we arrive at a bound for the number of keys. Typical parameter choices like 128 histogram bins, a watermark length of 32 bit, and step_max = 5 lead to a bound of about 2^232. Note that if the histogram pairs in the watermarking key are revealed, but not their order, as in the protocol described in Section 6.2, a watermark length of 48 bit is still sufficient to provide a key length of about 200 bit.
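The last figure can be checked with a few lines of Python. The sketch assumes that, once the bin pairs themselves are revealed, the remaining key space is simply the number of possible orderings of the pairs, that is, the factorial of the watermark length; this reading is an assumption, although it reproduces the quoted key length of about 200 bit. The function name is hypothetical.

```python
import math

def order_only_key_bits(mark_length):
    """log2 of the number of orderings (mark_length!) of the revealed bin pairs."""
    return math.lgamma(mark_length + 1) / math.log(2)

bits_48 = order_only_key_bits(48)   # roughly 203 bit
bits_32 = order_only_key_bits(32)   # a 32 bit mark would give a much shorter key
```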

Permutation Cipher.
As mentioned above in Section 4.1, the Cat Map suffers from a low number of possible keys, which can be remedied only if the key parameters are changed from iteration to iteration. Because in principle the required permutations can be generated in a different way, it is more interesting in this context to look at the security of permutation ciphers for audio files in general. In [18] the authors investigated the security of permutation ciphers as applied to square images and found that known-plaintext attacks become possible once the number of known plaintexts reaches the logarithm of the number of pixels, taken to the base of the number of greyvalues. The complexity of these attacks grows with the number of known plaintexts times the fourth power of the image side length, so that frequent key updates are required. Applying these results to an audio file means that a number of known plaintexts equal to the logarithm of the number of samples, taken to the base of the number of possible sample values, is sufficient to break the cipher. Because the bit depth of an audio file is usually higher than that of an image file (16 bit as opposed to 8 bit per sample or pixel, respectively), this logarithm is halved, and the key for the permutation needs to be updated twice as often as for an image file of comparable size.
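The factor of two follows directly from the change of logarithm base, as the following Python sketch shows. The function name and the example size of 2^20 samples are illustrative assumptions; the formula is the simplified bound quoted above, without rounding to whole plaintexts.

```python
import math

def known_plaintexts_needed(n_samples, bit_depth):
    """Logarithm of the number of samples to the base 2**bit_depth,
    i.e. to the base of the number of possible sample values."""
    return math.log2(n_samples) / bit_depth

n = 2 ** 20                                   # example size, about one million samples
image_like = known_plaintexts_needed(n, 8)    # 8 bit per pixel
audio_like = known_plaintexts_needed(n, 16)   # 16 bit per sample
ratio = image_like / audio_like
```

With half as many known plaintexts sufficing for an attack, the key has to be replaced after half as many encrypted files.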

Experimental Results
The following experiments were carried out with a collection of audio files provided by the European Broadcasting Union (EBU) for sound quality assessment. The audio files include artificially generated signals as well as speech, single instruments, and pop music (https://tech.ebu.ch/publications/tech3253).

Perceptibility.
In order to measure the perceptibility of an embedded mark, the Peak Signal-to-Noise Ratio (PSNR) between the marked work and the original work was computed, as is common in the literature. Figure 3 shows the PSNR between original file and marked file for increasing length of the watermark and seven example sound files. The parameter set used for embedding comprised 1500 histogram bins and step_max = 9; the two remaining parameters were set to 2.5 and 2. According to [19], noise becomes perceptible at a PSNR < 35 dB. Figure 3 therefore shows that with these parameters it is possible to embed up to 512 bit without the noise becoming perceptible.
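A minimal Python sketch of such a PSNR computation is given below. The exact PSNR definition used in the experiments is not stated here, so the peak value (the largest amplitude of 16 bit signed audio) and the function name are assumptions.

```python
import numpy as np

def psnr_db(original, marked, peak=32767.0):
    """Peak Signal-to-Noise Ratio in dB between two equally long sample arrays."""
    original = np.asarray(original, dtype=np.float64)
    marked = np.asarray(marked, dtype=np.float64)
    mse = np.mean((original - marked) ** 2)   # mean squared embedding error
    if mse == 0.0:
        return float("inf")                   # identical signals
    return float(10.0 * np.log10(peak ** 2 / mse))

reference = np.zeros(1000)
off_by_one = np.ones(1000)                    # every sample changed by one step
example_psnr = psnr_db(reference, off_by_one) # equals 20 * log10(32767)
```

Under this definition, a marked file would count as perceptibly distorted once the returned value drops below the 35 dB threshold mentioned above.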
If the histogram bins are broadened (i.e., if the number of histogram bins is decreased), the capacity goes down, as fewer bins are available for embedding, while more samples are affected by the embedding, making the watermark more perceptible, but also more robust. Figure 4 shows the PSNR values for 70 test files for 128 histogram bins and a 32 bit mark and for 224 histogram bins and a 56 bit mark, respectively. The solid lines indicate the average PSNR values. The watermarking parameters are chosen in such a way that a good robustness against TSM attacks is achieved (cf. Section 5.2); however, the noise introduced by the watermark is at the border of being noticeable.

TSM Attacks.
TSM attacks basically try to desynchronize embedder and detector by compressing or extending the time axis of the audio file. A common requirement is that an audio watermark should be able to survive a rescaling of about 10% [20]. Moreover, the human auditory system is relatively insensitive to TSM attacks, which makes even higher percentages seem realistic. In resample mode, certain audio samples are repeated or removed in order to stretch or compress the time axis. In pitch-invariant mode, the speed of the audio file is modified without changing the samples. In order to implement these attacks in practice, the popular open-source tool Audacity V2.1.1 (http://www.audacityteam.org) was used.
Figure 5 shows the Bit Error Rates (BER) when retrieving watermarks of 32 bit and 56 bit length after TSM attacks. For these parameter choices, the proposed watermarking algorithm is extremely robust against TSM attacks in resample mode, while the robustness against pitch-invariant mode is slightly worse, but still very good. In particular, the required robustness against 10% rescaling is fulfilled. In general, the robustness is quite sensitive to the choice of parameters. In particular, if the number of histogram bins is further increased, the robustness decreases accordingly.
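For reference, the Bit Error Rate used in this kind of evaluation can be computed as in the following Python sketch. The function name and the example bit sequences are illustrative and do not reproduce any measured result.

```python
def bit_error_rate(embedded_bits, extracted_bits):
    """Fraction of watermark bits that differ between embedding and extraction."""
    if len(embedded_bits) != len(extracted_bits):
        raise ValueError("bit sequences must have the same length")
    if not embedded_bits:
        raise ValueError("bit sequences must not be empty")
    errors = sum(e != x for e, x in zip(embedded_bits, extracted_bits))
    return errors / len(embedded_bits)

embedded = [1, 0, 1, 1, 0, 0, 1, 0]
extracted = [1, 0, 1, 0, 0, 0, 1, 1]          # two of eight bits flipped
example_ber = bit_error_rate(embedded, extracted)
```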

StirMark.
In order to evaluate the robustness against general signal manipulations, the well-known benchmarking tool StirMark for Audio, V1.3.2 (https://sourceforge.net/projects/stirMark) was used to simulate common attacks.
Although a low number of histogram bins was chosen (128 or 224) to achieve a higher robustness, the algorithm performed very differently, depending on the type of attack. For example, low-pass filtering or inserting a sine signal into the audio file is able to destroy the mark completely. On the other hand, sample manipulations like inserting zero samples, periodic deletion of samples (cropping, or CutSamples in StirMark), or manipulating the least significant bits (LSBs) of the samples do not affect the watermark. Table 1 (for 128 histogram bins) and Table 2 (for 224 histogram bins) give the detailed results of our experiments with StirMark. In general, it is hard to devise a CWE algorithm which is robust against a wide class of attacks, because in a CWE scheme the embedder must be able to operate in the encrypted domain and therefore cannot rely upon important perceptual features of the cover work. It is possible, however, to achieve a certain robustness against a well-defined class of attacks, as the proposed algorithm shows.

Commutativity with Encryption.
The presented CWE scheme is not fully commutative in theory because of the padding needed in the encryption step: as the embedding step changes the original histogram, the padding samples introduced by a mark-then-encrypt operation can be slightly different from the padding in an encrypt-then-mark operation. We tested the influence of this issue by verifying a mark which was embedded into seven test files in the following three scenarios:
(i) $M_{WK}(E_K(O), m)$: the mark $m$ is embedded into the encrypted cover work and then extracted.
(ii) $E_K(M_{WK}(O, m))$: the mark is embedded into the plaintext cover work. The marked cover work is encrypted and the mark is extracted.
(iii) $E_K^{-1}(M_{WK}(E_K(O), m))$: the mark is embedded into the encrypted cover work, then the work is decrypted and the mark extracted.
In all scenarios the mark could be extracted without any errors from all test files. In practice, the number of padding samples is very small compared to the overall number of samples and they are rarely influenced by the watermarking.
Note that the noncommutativity is not intrinsic to the overall scheme, but results from the special way the permutations are generated, namely, by deploying the two-dimensional discrete Cat Map. If the permutation is generated by some alternative mechanism like, for example, the one proposed in [21], which does not require padding, the scheme is fully commutative by construction.

Minimum Knowledge Verification of the Mark
The discussion in Section 2 has shown that in order to be useful in a generic buyer-seller protocol, there ought to be a way to verify the watermark without fully disclosing either the mark or the watermarking key. In [2], Craver and Katzenbeisser propose a probabilistic protocol which is in principle able to integrate any symmetric watermarking algorithm.
Here, we make a few modifications to this protocol to work with the proposed audio watermarking algorithm. These modifications strive to make full use of the special properties of the proposed CWE scheme and are able to eliminate a certain weakness of the scheme by Craver and Katzenbeisser.
The Protocol by Craver and Katzenbeisser.
In this protocol, a prover Alice wants to prove the presence of a watermark to some verifier Bob without disclosing the watermark or the watermarking key. The cover work is viewed as an array of samples. The watermark, which has the same length as the cover work, is embedded by the prover using some symmetric watermarking algorithm. The result is the marked work.
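Alice now generates some secret permutation and publishes the objects permuted with it. In order to prove the presence of the mark within the marked work to Bob, Alice and Bob engage in a multistep protocol. Each step consists of the following substeps:
(1) Alice generates two permutations whose composition equals the secret permutation. Then she computes the scrambled marked work and a second scrambled object for this step.
(2) Alice generates a so-called ownership ticket OT, which contains hash values of the scrambled objects, computed with a secure hash function, together with two ciphertexts, which are encrypted versions of the first and the second permutation, respectively. Alice sends OT to Bob.
(3) Bob flips a coin and, depending on the outcome, asks Alice to decrypt either the first or the second ciphertext for him.
(4) If the first ciphertext is opened, Bob can compute the scrambled objects by applying the inverse of the first permutation to the corresponding published permuted objects and verify the hash values contained in OT. Having thus verified that he is in possession of the correct scrambled work, Bob goes on to verify that the scrambled mark, obtained in the same way from the published permuted mark, is present within the scrambled work.
(5) If the second ciphertext is opened, Bob computes the scrambled objects by applying the second permutation directly and verifies the hash values contained in OT. In this case, Alice's knowledge of the secret permutation is verified.
Craver and Katzenbeisser go on to show that if these steps are repeated $k$ times, Alice has probability $2^{-k}$ to fool Bob into believing that her watermark is contained in the marked work.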
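The splitting of the secret permutation in substep (1) and the two ways in which Bob can arrive at the same scrambled object in substeps (4) and (5) can be sketched in Python. Hashing and encryption of the ticket are omitted; the names rho, sigma, and tau, the array representation of permutations, and the number of rounds are assumptions made for illustration.

```python
import numpy as np

def apply(perm, data):
    """Apply a permutation, given as an index array, to an array of samples."""
    return data[perm]

def split_permutation(rho, rng):
    """Return (sigma, tau) such that applying sigma and then tau equals rho."""
    tau = rng.permutation(rho.size)
    tau_inv = np.argsort(tau)             # inverse permutation of tau
    sigma = rho[tau_inv]
    return sigma, tau

rng = np.random.default_rng(3)
work = rng.integers(-2000, 2000, size=64)     # stand-in for a public object
rho = rng.permutation(work.size)              # Alice's secret permutation
published = apply(rho, work)                  # permuted object published by Alice

sigma, tau = split_permutation(rho, rng)      # fresh split for one protocol step

# If sigma is opened, Bob scrambles the public object directly.
scrambled_direct = apply(sigma, work)
# If tau is opened, Bob undoes tau on the published permuted object.
scrambled_via_tau = apply(np.argsort(tau), published)

rounds = 20
cheating_probability = 0.5 ** rounds          # 2^-k for k = 20 rounds
```

Either opened permutation alone looks random to Bob; only the pair together would reveal rho, which is why Alice opens just one of them per step.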
In our opinion, this protocol, while being very ingenious, has two drawbacks: first, the verifier gets to know the marked work together with its permuted version and can therefore get some information about the secret permutation (the same is true for a published graph and its permuted version, but in this case getting information about the secret permutation means being able to solve an instance of the graph isomorphism problem [22]).
Second, and more importantly, it is not clear how exactly in step (4) the presence of the scrambled mark in the scrambled work should be verified without disclosure of the watermarking key.

The Modified Version.
In our modified version of the protocol, we take advantage of the special structure of our watermarking algorithm and strive to eliminate the two drawbacks mentioned above. As in Craver and Katzenbeisser's original protocol, the prover Alice generates a secret permutation and a graph. She marks the cover work with the watermark using the algorithm described in Section 4.2 to get the marked work. She then publishes the marked work, the graph, the permuted graph, and the permuted watermarking key, but not the permuted marked work or the permuted mark. Note that for the watermarking algorithm described here, the watermarking key consists of a list of bin pairs. If this list is used for extracting the watermark in permuted form, the result will be the permuted watermark. The modified protocol now proceeds in a number of steps. Each step consists of the following substeps:
(1) Alice generates two permutations whose composition equals the secret permutation. Then she computes the scrambled graph and the scrambled mark for this step.
(4) If the first ciphertext is opened, Bob can compute the scrambled graph by applying the inverse of the first permutation to the published permuted graph and verify its hash value contained in OT. Having thus verified that he is in possession of the correct scrambled graph, Bob goes on to compute the scrambled key in the same way from the published permuted watermarking key and uses it to extract the scrambled mark from the marked work. He can check the correctness of the scrambled mark by verifying its hash value in OT.
(5) If the second ciphertext is opened, Bob computes the scrambled graph by applying the second permutation to the graph and verifies the hash value contained in OT. In this case, Alice's knowledge of the secret permutation is verified.
In this modified version of the protocol, it is not necessary to publish a permuted version of the marked work or of the mark, because the scrambled mark can be extracted directly from the marked work. Moreover, it is clear how to do this without knowledge of the watermarking key, as the scrambled key will yield the scrambled mark. Another interesting aspect of the modified protocol in connection with the watermarking algorithm described here is that it can be applied to a permuted marked work in exactly the same way, where the permutation applied to the marked work is independent of the secret permutation used in the interactive verification protocol.
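The property that a permuted key yields the permuted mark can be sketched in Python. The sketch assumes that permuting the watermarking key means reordering its list of bin pairs and that a bit is read as 1 when the first bin of a pair is higher than the second; the bin heights, bin pairs, and permutation below are made-up illustration values.

```python
def extract_bits(bin_heights, key_pairs):
    """Read one bit per bin pair: 1 if the first bin is higher than the second."""
    return [1 if bin_heights[a] > bin_heights[b] else 0 for a, b in key_pairs]

# Hypothetical histogram of a marked work (bin index -> number of samples).
bin_heights = {0: 120, 1: 95, 2: 80, 3: 130, 4: 60, 5: 75, 6: 40, 7: 55}
key_pairs = [(0, 1), (2, 3), (4, 5), (7, 6)]   # watermarking key: ordered bin pairs
scramble = [2, 0, 3, 1]                        # permutation applied to the key order

mark = extract_bits(bin_heights, key_pairs)
scrambled_key = [key_pairs[i] for i in scramble]
scrambled_mark = extract_bits(bin_heights, scrambled_key)
```

A verifier holding only the scrambled key learns which bins are paired, but not the order of the pairs, and thus recovers the mark only in scrambled form.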
The relevant part of the amplitude histogram consists of the bins covering an interval that extends symmetrically around zero up to a fixed multiple of the mean value of the absolute amplitude values. This condition makes sure that the bins in the relevant part of the histogram are "well filled," that is, every bin contains a large number of samples. To embed a watermarking bit, a triple of consecutive histogram bins is used. If the bit is 1, the ratio of twice the height of the second bin to the sum of the heights of the first and third bin has to satisfy a relation with a predefined threshold value. If the relation is not satisfied by the three bins, a certain number of samples is shifted from the first and third bin of the triple into the second bin by adding and subtracting, respectively, a bin width to the samples. An analogous process is carried out if the bit is 0.
Let the watermark consist of a sequence of bits. The leading bits of the watermark are used as synchronization sequence sync and should be known to the detector. As a first step, the embedder generates the amplitude histogram of the audio signal and forms its relevant part as described in Section 3.3. For each watermark bit, the embedder computes a histogram bin pair in the following way:
(i) Generate a random number between 1 and the number of bins within the relevant part.
(ii) Find the unused bin within the relevant part that corresponds to this random number; it becomes the first bin of the pair. Generate another random number step, such that 1 ≤ step ≤ step_max and the second bin of the pair, located step bins after the first, still lies within the relevant part.
(iii) If the second bin has not been used before and the heights of the two bins differ by more than a given threshold, save the pair. The watermarking bit is now embedded in the following way:
(a) For a bit value of 1, the first bin must be higher than the second. If this is not the case, swap the bins by assigning new values to all samples in the bins.
(b) For a bit value of 0, the first bin must be lower than the second. If this is not the case, swap the bins by assigning new values to all samples in the bins.
(iv) If the second bin has been used before or the heights of the two bins differ by no more than the threshold, generate a new random number.
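The embedding rule in (iii) and its commutativity with a permutation cipher can be sketched in Python. This is a simplified illustration, not the authors' implementation: the bin pairs are fixed by hand instead of being generated randomly, the relevant histogram part and the height threshold are left out, and the bin width, bin indices, and test signal are assumptions.

```python
import numpy as np

BIN_WIDTH = 100   # hypothetical bin width in sample-value units

def bin_index(samples):
    """Index of the histogram bin each sample value falls into."""
    return np.floor_divide(samples, BIN_WIDTH)

def bin_height(samples, b):
    """Number of samples in bin b."""
    return int(np.count_nonzero(bin_index(samples) == b))

def embed(samples, key_pairs, bits):
    """Embed one bit per bin pair by swapping entire bins where necessary."""
    marked = samples.copy()
    for (a, b), bit in zip(key_pairs, bits):
        h_a, h_b = bin_height(marked, a), bin_height(marked, b)
        wrong_order = (h_a <= h_b) if bit == 1 else (h_a >= h_b)
        if wrong_order:
            idx = bin_index(marked)
            in_a, in_b = idx == a, idx == b
            marked[in_a] += (b - a) * BIN_WIDTH   # move all samples of bin a to bin b
            marked[in_b] -= (b - a) * BIN_WIDTH   # and all samples of bin b to bin a
    return marked

def extract(samples, key_pairs):
    """Read the mark back by comparing the heights of each bin pair."""
    return [1 if bin_height(samples, a) > bin_height(samples, b) else 0
            for a, b in key_pairs]

rng = np.random.default_rng(1)
cover = rng.normal(0.0, 1000.0, size=20_000).astype(np.int64)
perm = rng.permutation(cover.size)            # stand-in for the permutation cipher

def encrypt(samples):
    return samples[perm]

key_pairs = [(0, 8), (-1, -9), (2, 12), (-3, -13)]   # disjoint bins, unequal heights
mark = [1, 0, 0, 1]

mark_then_encrypt = encrypt(embed(cover, key_pairs, mark))
encrypt_then_mark = embed(encrypt(cover), key_pairs, mark)
commutes = np.array_equal(mark_then_encrypt, encrypt_then_mark)
recovered = extract(encrypt_then_mark, key_pairs)
```

Because every sample of a swapped bin is remapped by the same value-dependent rule, the result does not depend on where in time a sample sits, which is exactly what makes the embedding commute with a permutation of the time axis.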
The Rogers-Tanimoto dissimilarity

$$d_{RT} = \frac{2(n_{01} + n_{10})}{2(n_{01} + n_{10}) + n_{11} + n_{00}} \qquad (4)$$

from sync is computed, where $n_{jk}$ is the number of occurrences where a sync bit is $j$ and the corresponding bit of the extracted synchronization sequence is $k$. The histogram part leading to the minimum dissimilarity is used to extract the remaining watermark bits.
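Equation (4) and the selection of the best candidate can be sketched in Python as follows. The function name, the candidate mean values, and the extracted sequences are hypothetical illustration data.

```python
def rogers_tanimoto(sync, candidate):
    """Rogers-Tanimoto dissimilarity between two equally long bit sequences."""
    if len(sync) != len(candidate) or not sync:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(s != c for s, c in zip(sync, candidate))   # n01 + n10
    matches = len(sync) - mismatches                            # n11 + n00
    return 2 * mismatches / (2 * mismatches + matches)

sync = [1, 0, 1, 1, 0, 1, 0, 0]

# Synchronization sequences extracted for three candidate mean values.
extracted = {
    950.0: [1, 1, 0, 1, 0, 0, 0, 1],
    1000.0: [1, 0, 1, 1, 0, 1, 0, 0],
    1050.0: [1, 0, 1, 0, 0, 1, 0, 0],
}

# The candidate with the smallest dissimilarity is used for the remaining bits.
best_mean = min(extracted, key=lambda mean: rogers_tanimoto(sync, extracted[mean]))
```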