A Relative Phase Based Audio Integrity Protection Method: Model and Strategy

Audio oriented integrity protection should consider the characteristics of audio signals based on the combination of audio application scenarios. However, in the current popular network interaction environment, traditional verification based solutions can no longer work. A kind of integrity protection scheme for audio business should be redesigned in these new scenarios. In this context, a method of audio integrity protection based on relative phase (RP-AIP) is proposed. Through the design of integrity object (I.O.), the integrity of audio can be abstracted as the completeness and accuracy of it. The I.O. is bound to the audio signal in a uniformly and randomly embedded manner, and the embedding rules are controlled by the relative phase characteristic of the audio itself. The model and strategy of RP-AIP are illustrated, and the corresponding process and algorithm are demonstrated as well. Simulation experiments illustrate the feasibility of the proposed solution and indicate the superiority of its performance.


Introduction
The rapid development of digital multimedia and Internet technology has facilitated the interaction of various multimedia services based on images, videos, and audio. In particular, the widespread adoption of new-generation instant messaging applications, such as "WhatsApp", "WeChat", and "LINE", has made voice-based communication increasingly popular. Since voice messaging is more convenient and easier to use than text or other media, it has quickly become a common feature of instant messaging products. Voice-class audio business has now become the mainstream of network interaction.
Voice messages not only carry the speaker's intention but also identify the speaker. It is generally acknowledged that voice-class interaction is more secure and reliable than text-class interaction, and it is even allowed to undertake functions with high trust requirements. For example, voice authentication has been recommended for account transfers in "WeChat". However, the original intention of the speaker is easily changed by cutting or tampering with the voice messages. Malicious cutting or tampering attacks on voice-class audio business have become a major new issue troubling audio communication [1].
On this issue, audio integrity protection can effectively resist malicious tampering and cutting. However, currently available audio protection solutions are generally oriented towards copyright [2][3][4]; more details can be found in the survey by R. D. Shelke et al. [5]. These studies discuss how to protect the integrity of audio signals not only in the time domain [6] but also in various transform domains [7][8][9][10]. The issue of audio integrity protection for network interaction scenarios is rarely studied. This kind of audio business has some special features, such as the following.

(i) Streaming Characteristic
This is the most important and essential characteristic that distinguishes audio from other interactive media. Compared to static texts and images, the streaming nature of audio means that integrity protection must accompany the media as it is transmitted. In other words, protection solutions based on an integrity measurement over the complete content (SHA-1, MD5, etc.) can no longer work. This characteristic poses new requirements and challenges for the integrity protection of audio.

Security and Communication Networks
How to design such a scheme that can accompany the media transmission and unlink the global association is the first essential issue to consider.

(ii) Imperceptibility
This characteristic ensures that the quality of experience (QoE) of users is not degraded. Compared to the human visual system (HVS), the human auditory system (HAS) is more sensitive and more easily perceives subtle changes in the audio signal. This is also why audio signal processing is more difficult than image and video signal processing. Time-domain-oriented signal processing tends to have better synchronization, but it is rarely used because of the serious signal distortion it causes. This characteristic contradicts the requirements of the previous one, which is another new difficulty we need to consider.

(iii) Self-Detectability
As streaming media, an audio signal does not persist: it disappears as it is listened to. Therefore, the integrity detection should be self-fulfilled, without relying on the original signal. That is to say, the integrity protection and detection processes should be synchronized with playback: the audio is detected while it is played.

(iv) Underdetermination
Due to the unpredictability of the transmission environment and noise effects, we do not emphasize the complete certainty of audio integrity protection. More specifically, we believe that audio-oriented integrity protection does not have to be completely determined. This is because our fundamental goal of audio integrity protection is to ensure the correctness and credibility of the audio information (e.g., the speaker's intention), rather than the audio signal medium itself.
For these reasons, a new method of audio integrity protection based on relative phase (RP-AIP) is proposed in this paper, which protects audio from being maliciously cut or tampered with and ensures that the audio information is correctly conveyed. Taking into account the above features, we first divide the audio into uniform segments and sample them to get discrete audio signals. Then, we transform them by the DFT (Discrete Fourier Transform) and the DCT (Discrete Cosine Transform), respectively. In the DFT domain, we obtain the relative phase relation of the host segment and use it as a rule. In the DCT domain, we embed a series of constructed eigenvalues as the integrity object (I.O.) into each segment. By this method, the integrity of the audio can be abstracted as the completeness and accuracy of the I.O. in each segment. This model breaks the direct relationship between integrity and audio media, so as to achieve synchronous detection of streaming media.
The remainder of this paper is organized as follows. Section 2 introduces related technologies, namely the audio masking effect and the noise model in DCT. Section 3 illustrates the model and analysis of RP-AIP. Section 4 presents the process and strategy of RP-AIP and describes them in detail. Section 5 illustrates the feasibility of our model with simulation experiments. Finally, Section 6 concludes the paper and identifies future directions.

Related Technologies
2.1. Audio Masking Effect. The HAS can be seen as a frequency analysis system containing about 26 band-pass filters that together cover roughly 20 Hz to 20 kHz. However, the HAS has difficulty distinguishing adjacent frequencies: if a weak sound lies in the frequency neighborhood of a strong sound, the strong one masks the weak one [11]. This is the audio masking effect, and the frequency neighborhood is called the critical band, measured in Bark. The audio masking effect can be described by a masking function (MF), which depends on the sound pressure (SP) of the audio and the distance (d) between the masked sound and the masker. The sound pressure is given by (1),
where SP(k) is the sound pressure, s(i) is a frame of the audio signal, N is the number of samples per frame, and h(i) is a weighted Hanning window given by (2). The closer the masked sound is to the masker, the greater the masking effect; the effect weakens with distance until the masked sound falls outside the critical band of the masker. Furthermore, the masking function of the masker can be expressed as shown in (3). It can be seen that the masker has no effect on audio outside [-3, 8] Bark.
The audio masking effect is the basis of audio embedding. Through it, data can be embedded without the human ear being aware of it. Therefore, the embedded data should depend on the original audio, and its distribution must be determined by the masking characteristic of the original audio signal.

Noise Model in DCT.
The DCT converts the original audio signal into DCT coefficients, as shown in (4) [12]. Embedding a weak signal in a strong one can be regarded as a change in the DCT coefficients, so the embedded data is essentially equivalent to additive noise.

$$S(k) = \delta(k) \sum_{n=0}^{N-1} s(n) \cos\frac{(2n+1)k\pi}{2N} \tag{4}$$

where s(n) is the discrete audio sequence in the time domain, S(k) is the corresponding DCT coefficient sequence, N is the number of samples, and δ(k) is a weight factor as shown in the following equation:

$$\delta(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & 1 \le k \le N-1 \end{cases} \tag{5}$$

Assuming that the data embedded on the i-th coefficient of S(k) is E(i), the inverse-transformed signal s'(n) can be calculated as shown in (6). In this way, we can find the change between the original and transformed signals.

$$s'(n) = \sum_{k=0}^{N-1} \delta(k) S(k) \cos\frac{(2n+1)k\pi}{2N} + \delta(i) E(i) \cos\frac{(2n+1)i\pi}{2N} \tag{6}$$

The difference $s'(n) - s(n) = \delta(i)E(i)\cos\Psi$, with $\Psi = (2n+1)i\pi/2N$, is the embedding noise at sample n, and the noise effect in the time domain can then be expressed as in (7). It can be seen from (6) and (7) that, in the DCT domain, the noise effect is related only to the embedded coefficient. Thus, we can evaluate the distortion of the original signal caused by the embedding.
In (8), the sum over n = 0, ..., N-1 of the embedding noise relative to s(n) is proportional to a ratio that depends only on the i-th coefficient, and the corresponding scaling factor is further defined in (9), where again $\Psi = (2n+1)i\pi/2N$. This scaling factor is the final effect of the embedding on the audio signal and also the indicator of signal distortion.
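The additive-noise view can be checked numerically. The following is a minimal sketch, assuming the orthonormal DCT-II with weight factor sqrt(1/N) for k = 0 and sqrt(2/N) otherwise; the 8-sample frame, the index i, and the embedded value are hypothetical values chosen only for illustration.

```python
import math

def delta(k, n_samples):
    """Weight factor of the orthonormal DCT-II (assumed form)."""
    return math.sqrt(1.0 / n_samples) if k == 0 else math.sqrt(2.0 / n_samples)

def dct(s):
    """Forward DCT: S(k) = delta(k) * sum_n s(n) * cos((2n + 1) k pi / (2N))."""
    n_samples = len(s)
    return [
        delta(k, n_samples)
        * sum(s[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_samples))
              for n in range(n_samples))
        for k in range(n_samples)
    ]

def idct(S):
    """Inverse DCT: s(n) = sum_k delta(k) * S(k) * cos((2n + 1) k pi / (2N))."""
    n_samples = len(S)
    return [
        sum(delta(k, n_samples) * S[k]
            * math.cos((2 * n + 1) * k * math.pi / (2 * n_samples))
            for k in range(n_samples))
        for n in range(n_samples)
    ]

def embedding_noise(n, i, e_i, n_samples):
    """Predicted change at sample n when e_i is added to coefficient i."""
    psi = (2 * n + 1) * i * math.pi / (2 * n_samples)
    return delta(i, n_samples) * e_i * math.cos(psi)

# Hypothetical frame and embedded value, for illustration only.
s = [0.10, 0.40, -0.20, 0.30, 0.00, -0.50, 0.25, 0.15]
i, e_i = 3, 0.05

S_marked = dct(s)
S_marked[i] += e_i            # additive embedding on the i-th coefficient
s_marked = idct(S_marked)

# Compare the observed change with the closed-form prediction.
max_error = max(
    abs((s_marked[n] - s[n]) - embedding_noise(n, i, e_i, len(s)))
    for n in range(len(s))
)
print(f"largest deviation from the predicted noise: {max_error:.2e}")
```

The observed change in every sample matches the closed-form term up to floating-point rounding, which is why the distortion can be analyzed from the embedded coefficient alone.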

Mathematical Modeling.
According to the audio noise model, we can establish the mathematical model shown in (10). We expect to find a binding relation that associates the I.O. with the audio signal. Then, the integrity of the host audio can be extracted and expressed accordingly.

$$\exists\, \otimes \quad \text{s.t.} \quad S' = S \otimes E \leftarrow \text{integrity of } S \tag{10}$$

where S refers to the original audio, E refers to the I.O., and S' refers to the transformed audio. Here ⊗ denotes the binding relation we seek, which is introduced later. In addition, E should be imperceptible but robust to ensure that S will not be distorted. Since the segments are evenly divided, the I.O. can be equally distributed in the audio data, and the size of the fragments can be adjusted to actual requirements. In this way, the integrity of the audio can be abstracted as the completeness and accuracy of the I.O. by uniform embedding.

Model
In addition, it is necessary to explain the function of Part D. Because of network congestion, accidental packet loss, and other network anomalies, the received I.O. may not be exactly the same as the original. However, the loss and damage of the I.O. caused by transmission anomalies are generally accidental and random, whereas those caused by a malicious attack are generally deterministic and directional. Moreover, as stated earlier, the fundamental goal of audio integrity protection is to ensure the correctness and credibility of the audio information, rather than the audio signal medium itself. To this end, a fuzzy vault resolves the difficulty, because it guarantees a certain degree of similarity rather than complete certainty.

Embedding Analysis.
From the analysis of the characteristics of audio integrity protection, we know that audio embedding should be transparent to the user. On the other hand, the distortion tolerance characteristic requires that the embedding process have some robustness [13]. These two requirements are contradictory: transparency usually means operating on the unimportant components of the audio signal, to which the human ear is not sensitive, such as the high frequency regions, whereas robustness generally depends on the important components of the audio signal, to which the human ear is often sensitive, such as the low frequency regions [14].
After the audio signal is sampled, it becomes a nonperiodic discrete signal in the time domain, and a set of coefficient matrices composed of DCT coefficients can then be obtained by the DCT. Here, the low frequency signal energy is gathered in the upper left corner of the matrix, and the high frequency energy distributed in the lower right corner is almost zero. Therefore, the eigenvalues can be embedded in either of these two regions: the low frequency region is robust because it carries a large amount of the main signal energy, while the high frequency region has good invisibility because changes there are not perceived. The question is then which embedding strategy is the fairest. To this end, we propose to use the relative phase relation of the audio signal's frequency spectrum to control the embedding position.

Process and Strategy of RP-AIP
4.1. Audio Segmentation. According to the scheme design, the eigenvalues need to be embedded evenly in the audio signal, which requires that the audio signal data be divided evenly in advance.
The number of segments can be determined from the number of elements of the I.O. and the size of the audio. In general, assuming that there are n elements in the I.O., we divide the audio signal data into n segments as well. This ensures that the whole I.O. can be completely embedded and bound with the audio.
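The segmentation step can be sketched as follows. This is only an illustrative sketch: the function name `split_into_segments` is hypothetical, and dropping the leftover samples when the length is not a multiple of n is an assumption that the text does not specify.

```python
def split_into_segments(samples, n_segments):
    """Split samples into n_segments equal-length segments.

    Leftover samples at the end are dropped (an assumption of this sketch).
    """
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    seg_len = len(samples) // n_segments
    if seg_len == 0:
        raise ValueError("audio is too short for the requested number of segments")
    return [samples[j * seg_len:(j + 1) * seg_len] for j in range(n_segments)]

# Stand-in for 1.0 s of audio sampled at 22050 Hz; an I.O. with 100 elements
# is assumed here purely for illustration.
audio = [0.0] * 22050
segments = split_into_segments(audio, 100)
print(len(segments), len(segments[0]))
```

Each element of the I.O. is then bound to exactly one segment, so a missing segment shows up as a missing element.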

Relative Phase Invariance.
As noted earlier, the human ear is very sensitive to relative phase changes of an audio signal but lacks the ability to perceive and distinguish the absolute phase. Therefore, for a given audio signal, the phase component is more important than the amplitude component, and any change or damage to the phase component will cause unacceptable distortion of the audio quality or even completely destroy the audio information. At the same time, communication theory also states that phase modulation is strongly robust to noise.
More specifically, the phase distribution of a segment of the audio signal is easy to obtain by the DFT. There is only one maximum point and one minimum point in the phase spectrum, as shown in Figure 2, and their relative positions are fixed (if there are multiple extreme points, the first one prevails). As mentioned above, the relative phase remains consistent under transformation. This characteristic makes random embedding of the I.O. possible.
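Locating the extreme points of the phase spectrum can be sketched as below. This is a minimal sketch using a direct DFT from the Python standard library; the helper names `phase_spectrum` and `extreme_positions` and the short test segment are hypothetical. Ties are resolved by taking the first occurrence, following the "first one prevails" rule.

```python
import cmath
import math

def phase_spectrum(segment):
    """Return the DFT phase (radians) of every frequency bin of the segment."""
    n = len(segment)
    phases = []
    for k in range(n):
        x_k = sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        phases.append(cmath.phase(x_k))
    return phases

def extreme_positions(phases):
    """Indices of the first maximum and the first minimum of the phase spectrum."""
    return phases.index(max(phases)), phases.index(min(phases))

# Hypothetical short segment, for illustration only.
segment = [0.9, 0.1, -0.4, 0.7, -0.8, 0.3, 0.5, -0.2, 0.6, -0.6, 0.2, 0.4]
phases = phase_spectrum(segment)
max_pos, min_pos = extreme_positions(phases)
print(f"maximum at bin {max_pos}, minimum at bin {min_pos}")
```

Bins whose magnitude is close to zero have numerically unstable phases, so a practical implementation may want to ignore them; the text does not discuss this case.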

Random Embedding Strategy.
We propose a strategy that determines the embedding position according to the position of the extreme points in the phase spectrum. Since the extreme points are random in the phase spectrum, the choice of the embedding position controlled by them is also random. Therefore, this strategy balances the transparency and robustness described previously. Furthermore, the relative phase invariance in the frequency spectrum involves only the audio segment itself, which guarantees self-detectability as well.
An 8×8 coefficient matrix D can be obtained from a DCT of an audio signal s(n), as shown in Figure 3. The most robust embedding position is Position (0,0), while the most transparent one is Position (7,7). Embedding at Position (0,0) improves robustness but adds more perceptual distortion, while embedding at Position (7,7) improves imperceptibility but offers less robustness.
Algorithm 1:
(1) receive the key vaults as kv(i);
(2) set the number of kv(i) as n;
(3) divide the audio signal equally into n segments as s(j);
(4) sample s(j) to c(j);
(5) ∀c(j), initialize j = 1;
(6) while !end do
(7) if j ≤ n then
(8) calculate the phase spectrum of s(j);
(9) determine the position of the maxima and minima in the phase spectrum;
(10) ...

To get a good compromise between these properties, the mid-band coefficients such as (2,2) (L) and (4,4) (H) are selected as the embedding position instead. Position L or H is the Boolean choice for random embedding: one of these two positions must be chosen, and the selection is random. Based on the relative phase relation discussed above, a random embedding strategy is proposed as follows. The flow chart of this strategy is shown in Figure 4, and its algorithm is illustrated in Algorithm 1.
(i) If the maximum point of the audio segment appears before the minimum point, the corresponding eigenvalue will be embedded into the low frequency region (Position L).
(ii) Otherwise, the corresponding eigenvalue will be embedded into the high frequency region (Position H).
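The two rules above can be sketched as a small selection function. The positions (2,2) and (4,4) are the mid-band coefficients named in the text; the function name and the example phase values are hypothetical.

```python
POSITION_L = (2, 2)  # mid-band coefficient on the low frequency side
POSITION_H = (4, 4)  # mid-band coefficient on the high frequency side

def choose_embedding_position(phases):
    """Pick Position L if the first maximum precedes the first minimum, else H."""
    max_pos = phases.index(max(phases))
    min_pos = phases.index(min(phases))
    return POSITION_L if max_pos < min_pos else POSITION_H

# Hypothetical phase spectra, for illustration only.
print(choose_embedding_position([0.1, 2.0, -1.5]))   # maximum first
print(choose_embedding_position([-1.5, 0.1, 2.0]))   # minimum first
```

Because the decision depends only on the segment's own phase spectrum, the receiver can repeat it without side information, provided the relative order of the extreme points survives the embedding.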
After determining the embedding position, the next step is to construct a fuzzy vault containing the secret sequence. On the finite field F, the secret sequence key ∈ F. By an encrypting set P (P ∈ F), the key can be packaged into the vault of P to generate the I.O. of the sender.
After embedding the I.O. into the audio, a new DCT coefficient matrix D' is generated. By applying the inverse DCT to D', we obtain a new audio signal s'(n) that contains the I.O. as well. This s'(n) is the transformed audio signal we are looking for.
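The embedding and its inversion can be sketched as follows. The text does not state how s(n) is arranged into the 8×8 matrix, so this sketch assumes that 64 consecutive samples are reshaped row by row and transformed with an orthonormal 2-D DCT; all function names are hypothetical. The round-trip check recovers the value from the difference between D and D', which requires the original segment and is shown only to confirm that the embedding is invertible.

```python
import math

N = 8  # block size of the coefficient matrix

# Orthonormal DCT-II basis: C[k][n] = delta(k) * cos((2n + 1) k pi / (2N)).
C = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N))
      for n in range(N)]
     for k in range(N)]

def matmul(a, b):
    """Plain matrix product of two N x N lists of lists."""
    return [[sum(a[r][m] * b[m][c] for m in range(N)) for c in range(N)]
            for r in range(N)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(block):
    """2-D DCT of an N x N block: D = C * block * C^T."""
    return matmul(matmul(C, block), transpose(C))

def idct2(coeffs):
    """Inverse 2-D DCT: block = C^T * D * C."""
    return matmul(matmul(transpose(C), coeffs), C)

def to_block(samples64):
    """Reshape 64 samples into an 8x8 block, row by row (assumed layout)."""
    return [list(samples64[r * N:(r + 1) * N]) for r in range(N)]

def embed(samples64, value, position):
    """Add value to the DCT coefficient at position and return the new samples."""
    d = dct2(to_block(samples64))
    r, c = position
    d[r][c] += value
    return [x for row in idct2(d) for x in row]

def extract(original64, marked64, position):
    """Recover the embedded value as the difference D' - D at position."""
    r, c = position
    return dct2(to_block(marked64))[r][c] - dct2(to_block(original64))[r][c]

# Hypothetical segment and eigenvalue, for illustration only.
segment = [math.sin(0.3 * t) for t in range(N * N)]
marked = embed(segment, 0.05, (2, 2))
recovered = extract(segment, marked, (2, 2))
print(f"recovered value: {recovered:.6f}")
```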

Detection and Extraction.
The detection and extraction of the I.O. can be achieved from the difference between D and D'. What the receiver receives as the I.O. is essentially a copy Q of the fuzzy vault P containing the secret sequence. On the finite field F, by matching Q (Q ∈ F) against P, the key can be parsed from the vault to obtain the I.O. of the receiver. After the I.O. is recovered, the integrity of the audio signal can be determined according to its completeness and accuracy. Figure 5 shows the reconstruction and detection process.
In Figure 5, (a) shows the reconstruction process of the host audio, where s'(n) is the reconstructed signal containing the I.O., and (b) shows the detection and extraction process of the I.O. Here $eV$ denotes the original eigenvalues and $EV$ its DCT transformation, while $\widetilde{eV}$ denotes the extracted eigenvalues and $\widetilde{EV}$ its DCT transformation. By matching $\widetilde{eV}$ against $eV$, the similarity between the received and original audio can be evaluated. According to the features of the extracted I.O., it can be determined whether the audio has been cut or tampered with. More details can be found in the experimental section.

Integrity Evaluation.
After the I.O. is extracted, the similarity between the original eigenvalues and the extracted eigenvalues can be calculated by their cosine distance d, and the integrity of the host audio can be evaluated on this basis, as shown in (11). When the weight coefficient takes the value 1, the integrity measure is a decimal within (0, 1). The greater its value, the higher the similarity and the better the audio integrity.
In (11), d is the cosine distance between the original and extracted eigenvalues, the weight coefficient has a default value of 1, and the i-th elements of the original and extracted eigenvalue sequences are compared element by element.
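The cosine measure between the original and extracted eigenvalues can be sketched as follows. This is the standard cosine similarity; how the weight coefficient enters the integrity measure is not reproduced here, and the function name and example vectors are hypothetical.

```python
import math

def cosine_similarity(original, extracted):
    """Cosine of the angle between two equal-length eigenvalue sequences."""
    if len(original) != len(extracted):
        raise ValueError("sequences must have the same length")
    dot = sum(a * b for a, b in zip(original, extracted))
    norm = (math.sqrt(sum(a * a for a in original))
            * math.sqrt(sum(b * b for b in extracted)))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for an all-zero sequence")
    return dot / norm

# Hypothetical binary eigenvalues (e.g. pixels of a black-and-white image).
original = [1, 0, 1, 1, 0, 1, 0, 0]
intact = list(original)
damaged = [1, 0, 0, 0, 0, 1, 0, 0]   # two elements lost

print(cosine_similarity(original, intact))
print(cosine_similarity(original, damaged))
```

For non-negative eigenvalues such as binary pixels the value lies between 0 and 1, which matches the range described above.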

Audio Quality Criterion.
The signal-to-noise ratio (SNR) is a widely accepted audio quality criterion for evaluating audio signal transformations, as shown in the following equation:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=0}^{N-1} s^2(n)}{\sum_{n=0}^{N-1} \left(s'(n) - s(n)\right)^2} \tag{12}$$

where s'(n) is the transformation of s(n).
From (12), we can see that the SNR depends only on the scaling factor defined in (9); that is to say, the audio quality after embedding the eigenvalues depends only on the embedding position i of the DCT coefficient. This shows that the embedding position is crucial to the imperceptibility and robustness of the host audio in our RP-AIP model. Therefore, the embedding rule based on the relative phase of the audio signal itself is highly significant.
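The quality criterion can be computed with a few lines of code. This is a minimal sketch of the standard SNR in decibels, taking the noise to be the difference between the transformed and original signals; the function name and sample values are hypothetical.

```python
import math

def snr_db(original, transformed):
    """SNR in dB: 10 * log10(sum s(n)^2 / sum (s'(n) - s(n))^2)."""
    if len(original) != len(transformed):
        raise ValueError("signals must have the same length")
    signal_energy = sum(x * x for x in original)
    noise_energy = sum((y - x) ** 2 for x, y in zip(original, transformed))
    if noise_energy == 0:
        return math.inf  # identical signals: no embedding noise at all
    return 10 * math.log10(signal_energy / noise_energy)

# Hypothetical signal with a uniform perturbation of 0.1 per sample.
s = [1.0, 1.0, 1.0, 1.0]
s_marked = [1.1, 1.1, 1.1, 1.1]
print(f"{snr_db(s, s_marked):.2f} dB")
```

A higher value means the embedded signal is closer to the original and therefore less perceptible.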

Embedding and Extraction of I.O.
In this experiment, it is assumed that the I.O. is a black-and-white image as shown in Figure 6, and the host audio is a set of audio clips of the left channel as shown in Figure 7 and Table 1. Firstly, we uniformly embedded the I.O. in the host audio and then recovered it in the opposite way.
According to (12), the SNR of different audio types is calculated as shown in Table 1, which is in line with the requirements of robustness and imperceptibility.

Integrity Protection Performance.
In the transmission process, the audio may suffer from network congestion, accidental packet loss, and other accidents caused by network anomalies, and even from filtering, compression, and A/D conversion. Therefore, the first thing to consider is whether the model can effectively distinguish between such accidents and malicious attacks, such as cutting and tampering. As stated above, the loss and damage of the I.O. caused by transmission anomalies are generally accidental and random, whereas those caused by malicious attacks are deterministic and directional. Thus, the two cases can be distinguished from the recovered I.O.

Audio Processing.
In this experiment, we simulated five common audio processing methods: smoothing/low-pass filtering (with a 4 kHz cutoff frequency), band-pass filtering (with 200 Hz to 2 kHz cutoff frequencies), MPEG-1 compression (with compression ratios of 10.5:1 and 12:1), and A/D conversion. The audio signal distortion is again evaluated by the similarity between the original and recovered I.O.
Figure 8 shows the simulation results. Among them, (a) is the recovered I.O. after smoothing/low-pass filtering, (b) is the one after band-pass filtering, (c) is the one after MPEG-1 compression (with a compression ratio of 10.5:1), (d) [...] Table 2 and presented by Figure 9.
It can be seen from Figure 8 and Table 2 that the RP-AIP model adapts well to common audio processing; that is to say, general audio processing does not lead to the destruction and loss of the I.O. It can be seen from Figure 9 that the RP-AIP model has robustness against common audio processing similar to that of the 0-coefficient embedding model. In addition, as far as smoothing/low-pass and band-pass filtering are concerned, RP-AIP performs significantly better than the other models under band-pass filtering, thanks to the random embedding in the mid-band coefficients. In general, RP-AIP exhibits sufficient robustness to general audio processing methods.

Malicious Attack.
In this experiment, we simulated three kinds of malicious attacks: cutting, tampering, and resampling. Tampering can be divided into self-media tampering and non-self-media tampering: self-media tampering uses the media itself for shifting, swapping, and so forth, while non-self-media tampering uses other media for replacement, coverage, and so forth. Resampling can also be divided into upsampling and downsampling: upsampling is an interpolation expansion of the original audio, while downsampling is an extraction from the original audio. Cutting and tampering are undoubtedly malicious attacks on audio that attempt to change the speaker's intention. Resampling attacks, however, especially upsampling, often affect only the length or quality of the audio and cannot change the speaker's intention.
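The attack types described above can be simulated on a list of samples as follows. This is an illustrative sketch only: the function names are hypothetical, upsampling is assumed to use linear interpolation, and downsampling is assumed to keep every factor-th sample; the paper does not specify the exact operations used in its experiments.

```python
def cut(samples, start, stop):
    """Cutting attack: remove samples[start:stop]."""
    return samples[:start] + samples[stop:]

def tamper_self(samples, a, b, length):
    """Self-media tampering: swap two non-overlapping regions of the same audio."""
    if a + length > b or b + length > len(samples):
        raise ValueError("regions must be ordered, non-overlapping and in range")
    out = list(samples)
    out[a:a + length] = samples[b:b + length]
    out[b:b + length] = samples[a:a + length]
    return out

def tamper_other(samples, start, foreign):
    """Non-self-media tampering: overwrite a region with samples from other media."""
    out = list(samples)
    out[start:start + len(foreign)] = foreign
    return out

def upsample(samples, factor):
    """Upsampling: insert linearly interpolated values between neighbours."""
    out = []
    for x, y in zip(samples, samples[1:]):
        out.extend(x + (y - x) * j / factor for j in range(factor))
    out.append(samples[-1])
    return out

def downsample(samples, factor):
    """Downsampling: keep every factor-th sample and discard the rest."""
    return samples[::factor]

# Hypothetical audio samples, for illustration only.
audio = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
print(cut(audio, 1, 3))
print(tamper_self(audio, 0, 3, 2))
print(upsample(audio, 2)[::2] == audio)   # the original samples survive
print(len(downsample(audio, 2)))          # half of the samples are lost
```

The last two lines illustrate the asymmetry noted in the text: every original sample is still present after interpolation, whereas extraction discards samples outright.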
Figure 10 shows the simulation results. Among them, (f) is the recovered I.O. after a cutting attack, (g) is the one after a self-media tampering attack, (h) is the one after a non-self-media tampering attack, (i) is the one after an upsampling attack, and (j) is the one after a downsampling attack. The similarity between the original and recovered I.O. and the comparison with the other two common audio embedding methods mentioned above can be found in Table 3 and are presented in Figure 11.
As can be seen from Figure 10 and Table 3, except for the upsampling attack, the RP-AIP model can effectively [...] It can be seen from Figure 11 that the RP-AIP model performs clearly better than the other two models. This is because attacks such as cutting and tampering are generally oriented to the time domain of the audio signal rather than the frequency domain, so the damage to the low and high frequency parts is random. An embedding method that considers only the low or only the high frequencies can guarantee only one aspect, while the random embedding method in the RP-AIP model combines the two frequency parts. This is also the innovation and highlight of our work.

Conclusion
Audio integrity protection for the network interaction environment is not only an effective way to protect audio information but also an important link in building trusted communication. Due to the real-time constraints of interaction scenarios, this kind of audio protection needs to take into account some new characteristics, such as the streaming characteristic, imperceptibility, self-detectability, and distortion tolerance, which are not covered by traditional schemes. This poses new challenges to the existing audio protection solutions.
To this end, a method of audio integrity protection based on relative phase (RP-AIP) is proposed in this work. In the RP-AIP model, we established the concept of the integrity object (I.O.) in order to abstract the integrity of the audio signal and transform it into a tangible representation. In addition, we fully considered the characteristics of the audio signal in the DFT and DCT transform domains and used the frequency characteristics of the audio itself to guide the random embedding of the I.O., thereby achieving its blind detection and extraction. The tangible expression of audio integrity and the relative phase based random embedding scheme are the highlights of our work.
The simulation experiments show that the RP-AIP model can guarantee the SNR of the audio signal and that the embedding of the I.O. does not cause perceptible signal distortion. In addition, it adapts well to common audio processing, such as smoothing/low-pass filtering, band-pass filtering, MPEG-1 compression with different ratios, and A/D conversion. Moreover, against malicious attacks such as cutting, tampering, and resampling, the RP-AIP model shows clear superiority over other similar protection solutions. Of course, research on audio integrity protection is still in its infancy, and more work will be done to make it more complete.

Figure 1: The overall structure of the RP-AIP model. The eigenvalues generated by the key and fuzzy vault are embedded into the DCT coefficient matrix, controlled by the relative phase relation indicated by the DFT phase spectrum.

Figure 2: The phase spectrum of a segment of the audio signal. The audio sampling frequency is 44100 Hz and the length is 0.5 seconds. The length of the intercepted fragment is 0.003 seconds, containing 120 sampling points.

Figure 4: The flow chart of the random embedding strategy. The choice of the embedding position between L and H is controlled by the relative phase relation of the host audio segment.

Figure 5: The process of the reconstruction and detection of the eigenvalues. (a) is the reconstruction process and (b) is the detection process.

Figure 6: The black-and-white image of 100×100 pixels as I.O. in these experiments.

Figure 7: The host audio clips in the experiment. The audio sampling frequency is 22050 Hz and the length is 1.0 seconds. (A1): a segment of voice message; (A2): a prompt tone of Windows 10; (A3): a piece of music.

Figure 8: The simulation results of the recovered I.O. of different audio processing methods.

Figure 10: The simulation results of the recovered I.O. of different malicious attacks.

Table 1: The SNR of different audio types after embedding I.O.

Table 2: The similarity between the original and recovered I.O. in audio processing experiments.

Table 3: The similarity between the original and recovered I.O. in malicious attack experiments.

As stated earlier, the I.O. is the tangible representation of the audio integrity; that is, changes in the I.O. directly reflect changes in the audio media. Therefore, we can evaluate the type and degree of the attack on the audio by observing the I.O. In the case of (f), since cutting directly leads to the loss of part of the I.O., the I.O. reconstructed at 100×100 pixels becomes an incomplete image. The missing part of the I.O. indicates that the corresponding part of the audio has been cut off. In the same way, the tampered part appears as random noise or as disordered content, as shown in (g) and (h), because the audio has suffered the corresponding tampering attack. For resampling attacks, however, RP-AIP shows different results, as shown in (i) and (j). This is because upsampling interpolates the original audio signal without causing loss of the I.O., while downsampling extracts from the original audio signal, which directly leads to the loss and destruction of the I.O. As stated in the Introduction, we are more concerned with whether an attack has changed the speaker's intention, which resampling generally does not do.