Detecting Steganography of Adaptive Multirate Speech with Unknown Embedding Rate

Steganalysis of adaptive multirate (AMR) speech is a significant research topic for preventing steganography-based cybercrimes in mobile speech services. Unlike existing work, this paper focuses on steganalysis of AMR speech with unknown embedding rate and presents three schemes based on support vector machines (SVMs) to address the problem. The first two schemes evolve from existing image steganalysis schemes and adopt different global classifiers. One is trained on a comprehensive speech sample set including original samples and steganographic samples with various embedding rates, while the other is trained on a particular speech sample set containing original samples and steganographic samples with uniform distributions of embedded information. Further, we present a hybrid steganalysis scheme, which employs Dempster–Shafer theory (DST) to fuse the evidence from multiple specific classifiers and provide a synthesized detection result. All the steganalysis schemes are evaluated using a well-selected feature set based on statistical characteristics of pulse pairs and compared with the optimal steganalysis that adopts specialized classifiers for the corresponding embedding rates. The experimental results demonstrate that all three steganalysis schemes are feasible and effective for detecting the existing steganographic methods with unknown embedding rates in AMR speech streams, while the DST-based scheme outperforms the others overall.

In today's mobile world, the adaptive multirate (AMR) codec has become a well-known and important compression standard for speech coding and has been widely employed not only in 3G and 4G speech services [26][27][28] but also in various mobile instant messaging apps (such as WhatsApp, Snapchat, LINE, and WeChat). Moreover, AMR is also a popular file format for storing AMR-encoded spoken audio and is supported by almost all mobile communication devices. Due to its increasing popularity and broad influence in mobile communications, AMR speech is naturally regarded as an ideal carrier by the steganographic research community, and some relevant studies have been successfully performed [29][30][31][32][33].
AMR is a typical codec based on an algebraic code-excited linear prediction algorithm, in which algebraic codebook indices (ACIs), also called fixed codebook indices (FCIs), occupy a large percentage of each speech frame [26][27][28]. Taking the AMR speech codec at 12.2 kbps mode [28] as an example, 140 of the 244 frame bits are allocated to FCIs, that is, FCIs account for 57.38% of all frame bits [33]. Therefore, they are popularly regarded as good candidates for steganographic carriers in the existing studies [29][30][31][32][33]. Geiser and Vary [29] first incorporated information hiding into speech coding of the AMR codec by modifying the fixed-codebook-search algorithm. Specifically, two secret bits can be hidden in a track by limiting the search range of the second FCI to two of eight candidate values. Their experimental results demonstrate that this method can offer a steganographic bandwidth of 2 kbit/s for the AMR speech codec at 12.2 kbps mode, while guaranteeing an imperceptible impact on speech quality and fairly small computational complexity. Following a similar idea, Miao et al. [30] proposed an adaptive suboptimal pulse combination constrained method for steganography in the AMR speech stream. Its main advantage over the previous method is that the steganographic capacity can be regulated by introducing an embedding factor $\eta$. For example, for the AMR speech codec at 12.2 kbps mode, $\eta$ can typically be set to 1, 2, or 4, so the steganographic bandwidths are correspondingly 1, 2, or 3 kbit/s [32,33]. It has been demonstrated that, by choosing a suitable $\eta$, this method can achieve a good trade-off between the distortion of speech quality and the embedding capacity [30].
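The bandwidth and bit-share figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the frame timing (20 ms frames, four subframes per frame) comes from the AMR specification rather than from the text above, and the function name is illustrative.

```python
# Sketch: relate hidden bits per track to steganographic bandwidth for
# AMR at 12.2 kbps. Frame timing is an assumption taken from the AMR
# specification (20 ms frames, 4 subframes per frame).
FRAMES_PER_SECOND = 50       # one frame every 20 ms
SUBFRAMES_PER_FRAME = 4
TRACKS_PER_SUBFRAME = 5      # five tracks per subframe at 12.2 kbps
FRAME_BITS = 244
FCI_BITS = 140


def bandwidth_bps(bits_per_track: int) -> int:
    """Hidden bits per second when every track carries bits_per_track bits."""
    tracks_per_second = (
        FRAMES_PER_SECOND * SUBFRAMES_PER_FRAME * TRACKS_PER_SUBFRAME
    )
    return bits_per_track * tracks_per_second


codec_rate_bps = FRAME_BITS * FRAMES_PER_SECOND      # codec bit rate
fci_share_percent = 100 * FCI_BITS / FRAME_BITS      # share of FCI bits
geiser_bps = bandwidth_bps(2)                        # two bits per track
```

Under these assumptions, one, two, and three hidden bits per track correspond to the 1, 2, and 3 kbit/s bandwidths mentioned for the two steganographic methods.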
To prevent potential cybercrimes based on the above steganographic methods, some steganalysis studies have accordingly been conducted. Miao et al. [31] first presented two steganalysis methods for AMR speech. One is the Markov-based method, which adopts Markov transition probabilities to evaluate the relationship between pulse positions in each track, while the other is the entropy-based method, which employs the joint entropy and the conditional entropy to measure the uncertainty of pulse positions [31]. However, these two kinds of statistical features are not accurate enough for characterizing AMR speech, because they ignore the fact that the pulse positions may often be interchanged in the AMR encoding process [33]. Moreover, Ren et al. [32] presented a steganalysis method called Fast-SPP, which employs the probabilities of same pulse positions (SPP) as the features to detect the existing steganographic methods [29,30]. However, the SPP features only reflect the distribution of two track pulses being in the same position, which is not comprehensive enough to characterize AMR speech [33]. In particular, if a steganographic method deliberately avoids the track pulses with the same positions and the ones that would become the same after the embedding operation, Fast-SPP cannot detect any abnormality [33]. Therefore, in our previous work [33], we presented more accurate and more complete features for steganalysis of AMR speech. To avoid the impact induced by possible interchange of pulse positions in each track, we employ the statistical features of pulse pairs to characterize AMR speech, including the probability distributions of pulse pairs reflecting the long-term distribution of speech signals, Markov transition probabilities of pulse pairs depicting the short-term invariant characteristic of speech signals, and joint probability matrices of pulse pairs characterizing the track-to-track correlation [33]. Moreover, to optimize the feature set as well as reduce its dimension, a feature selection mechanism using adaptive boosting (AdaBoost) [34][35][36][37][38] was designed. Employing the selected optimal feature set, a support-vector-machine (SVM) based steganalysis of AMR speech was presented. The experimental results show that the proposed method significantly outperforms the previous ones.
However, all the above steganalysis methods assume that the embedding rate (also called the usage rate of the cover, which is the ratio between the actually embedded bits and the total number of cover bits) of steganographic samples in a given test set is exactly known. In other words, they generally train specific classifiers for steganographic samples with predefined embedding rates, and each specialized classifier is expected to detect the steganographic samples with the corresponding embedding rate. Unfortunately, in practice, we usually cannot ascertain whether the steganographic operation has been performed on a given sample, let alone know the concrete embedding rate. Thus, it is necessary and significant to develop detection techniques for steganography with unknown embedding rate [39][40][41]. To the best of our knowledge, this paper is the first to address this problem in the speech steganalysis field. In the image steganalysis field, however, some pioneering researchers have presented two useful schemes for detecting image steganography with unknown embedding rate. Both schemes adopt global classifiers based on a machine-learning algorithm (e.g., SVM) as the detectors, but the compositions of their training sets are different. Specifically, the training set of the first scheme includes original (untouched) samples and steganographic samples with various embedding rates [40,41], while that of the other one consists of original samples and steganographic samples with uniform distributions of embedded data [40]. In this work, we first extend the two existing schemes to AMR speech steganalysis with unknown embedding rate, employing the state-of-the-art steganalysis features presented in our recent work [33]. Besides, incorporating Dempster-Shafer theory (DST) [42,43], we further present a hybrid steganalysis scheme for AMR speech based steganography with unknown embedding rate. DST, also called evidence theory, is a well-established framework for uncertain reasoning, which can fuse available evidence from different sources and arrive at a level of belief (confidence; trust) by considering all of them [42][43][44][45][46]. The main idea behind the presented steganalysis scheme is to employ a DST-based algorithm to combine all the evidence from a set of classifiers intended for detecting steganographic approaches with specific embedding rates and accordingly provide a synthesized judgement on whether hidden information is present. All three steganalysis schemes are evaluated with a large number of AMR-encoded speech samples and compared with the optimal steganalysis that uses every specialized classifier to detect the steganography with the corresponding embedding rate. The experimental results show that all these steganalysis schemes are feasible and efficient for detecting the state-of-the-art steganographic methods with unknown embedding rates in AMR speech streams, while the DST-based scheme achieves better detection performance than the other ones.
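The embedding rate defined above can be made concrete with a short calculation. The sketch below assumes a ten-second sample and a full capacity of 2 kbit/s (the bandwidth reported for Geiser's method); the function and variable names are illustrative.

```python
def embedding_rate(embedded_bits: int, capacity_bits: int) -> float:
    """Ratio between the embedded bits and the total number of cover bits."""
    if capacity_bits <= 0:
        raise ValueError("capacity_bits must be positive")
    return embedded_bits / capacity_bits


# Assumption: a ten-second sample with a full capacity of 2 kbit/s.
SAMPLE_SECONDS = 10
CAPACITY_BPS = 2000
capacity = SAMPLE_SECONDS * CAPACITY_BPS

# Payload sizes (in bits) for embedding rates of 10%, 20%, ..., 100%.
payloads = {rate: capacity * rate // 100 for rate in range(10, 101, 10)}
```

A detector that does not know the rate has to cope with every entry of `payloads`, from the smallest payload to the full capacity.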
The remainder of this paper is organized as follows. To make this paper self-contained, Section 2 first reviews the state-of-the-art steganalysis features based on statistical characteristics of pulse pairs. Section 3 presents the three steganalysis schemes for detecting AMR speech based steganography with unknown embedding rate. Section 4 evaluates the performance of the three steganalysis schemes through a set of comprehensive experiments, followed by concluding remarks in Section 5.

Steganalysis Features Based on Statistical Characteristics of Pulse Pairs
In this work, all the presented steganalysis schemes adopt the state-of-the-art detection features based on statistical characteristics of pulse pairs for AMR speech, which consist of long-term features, short-term features, and track-to-track features [33].
The probability distributions of the pulse pairs are employed to depict the long-term features of AMR speech. Assume that the given AMR speech sample to be detected has $N$ subframes and each subframe contains $T$ tracks. For the $j$th ($0 \le j \le T-1$) track in the $i$th ($0 \le i \le N-1$) subframe, two pulse positions can be extracted as a pulse pair $(p_{i,j}, p_{i,j+T})$. For a pulse pair $(u, v)$, its probability (denoted by $P_{(u,v)}$) of appearing in all subframes can be determined as follows:

$$P_{(u,v)} = \frac{1}{N}\sum_{i=0}^{N-1}\sum_{j=0}^{T-1}\Big[\big(\delta(p_{i,j}=u)\ \&\ \delta(p_{i,j+T}=v)\big)\ \|\ \big(\delta(p_{i,j}=v)\ \&\ \delta(p_{i,j+T}=u)\big)\Big], \tag{1}$$

where "&" is the binary AND operation, "‖" is the binary OR operation, and $\delta(x = y)$ is a characteristic function defined as follows:

$$\delta(x = y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$

Let the number of candidate positions for every pulse in each track be $m$; the number of possible pulse pairs (denoted by $K$) is

$$K = \frac{m(m+1)}{2}. \tag{3}$$

Therefore, there are $T \times K$ possible pulse pairs in each subframe. That is to say, the dimension of the long-term feature set (LTFS) for pulse pairs is $T \times K$.
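The long-term features can be illustrated with a small counting routine. This is a minimal sketch, not the authors' implementation: the input layout (one pulse pair per track and subframe), the per-track normalization by the number of subframes, and all names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations_with_replacement


def long_term_features(subframes, positions_per_track):
    """Order-insensitive pulse-pair probabilities, one block per track.

    subframes: list of subframes, each a list with one (p, q) pulse pair
        per track.
    positions_per_track: candidate pulse positions of every track.
    Returns a flat list with T * K entries, K = m * (m + 1) / 2.
    """
    n = len(subframes)
    if n == 0:
        raise ValueError("at least one subframe is required")
    counts = [Counter() for _ in positions_per_track]
    for subframe in subframes:
        for j, (p, q) in enumerate(subframe):
            # Sorting makes (p, q) and (q, p) the same pulse pair.
            counts[j][tuple(sorted((p, q)))] += 1
    features = []
    for j, positions in enumerate(positions_per_track):
        for pair in combinations_with_replacement(sorted(positions), 2):
            features.append(counts[j][pair] / n)
    return features


# Toy example: T = 2 tracks with m = 3 candidate positions each (K = 6).
toy_positions = [[0, 2, 4], [1, 3, 5]]
toy_subframes = [
    [(0, 2), (1, 1)],
    [(2, 0), (3, 5)],
    [(4, 4), (1, 1)],
    [(0, 2), (5, 3)],
]
features = long_term_features(toy_subframes, toy_positions)
```

In the toy data the pairs (0, 2) and (2, 0) of the first track are counted together, which is the property that makes pulse-pair statistics robust to interchanged pulse positions.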
According to the short-term invariance of speech signals [47], the pulse pair of a track in the current subframe is bound to have a strong correlation with the one of the same track in the prior subframe [33]. In this sense, for the $j$th ($0 \le j \le T-1$) pulse pairs (i.e., the pulse pairs of the $j$th tracks) in all subframes, the sequence of pulse-position pairs $S_j = \{s_{j,0}, s_{j,1}, \ldots, s_{j,N-1}\}$ can be considered as a Markov chain. Accordingly, the Markov transition matrix (MTM) can be employed to describe the transitive correlation of pulse-pair states in the given track. Moreover, as a first-order Markov chain, $S_j$ satisfies

$$P(s_{j,i+1} \mid s_{j,i}, s_{j,i-1}, \ldots, s_{j,0}) = P(s_{j,i+1} \mid s_{j,i}).$$

In the $j$th tracks of all subframes, the probability $P((u_1, v_1) \mid (u_2, v_2))$ that the pulse pair $(u_1, v_1)$ occurs after the pulse pair $(u_2, v_2)$ can be determined as follows:

$$P((u_1, v_1) \mid (u_2, v_2)) = \frac{\sum_{i=0}^{N-2} \delta\big(s_{j,i} = (u_2, v_2)\big)\ \&\ \delta\big(s_{j,i+1} = (u_1, v_1)\big)}{\sum_{i=0}^{N-2} \delta\big(s_{j,i} = (u_2, v_2)\big)}.$$

Further, the MTM for the $j$th track (denoted by $\mathbf{M}_j$) can be determined as follows:

$$\mathbf{M}_j = \begin{bmatrix} P(V_{j,0} \mid V_{j,0}) & \cdots & P(V_{j,K-1} \mid V_{j,0}) \\ \vdots & \ddots & \vdots \\ P(V_{j,0} \mid V_{j,K-1}) & \cdots & P(V_{j,K-1} \mid V_{j,K-1}) \end{bmatrix},$$

where $K$ is the number of all possible pulse-position pairs for the $j$th track, which can be determined by (3); $V_{j,k} = (u_{j,k}, v_{j,k})$ is the $k$th ($0 \le k \le K-1$) possible pulse-position pair for the $j$th track, where $u_{j,k}$ and $v_{j,k}$ are the potential pulse positions for the $j$th track. Moreover, assume that there are $m$ candidate positions for each pulse; these quantities satisfy a further relation. [The equation stating this relation is missing in the source text.] Since there are $K$ possible pulse-position pairs in each track, the size of each MTM is $K \times K$. Taking the MTMs of all $T$ tracks into account, the dimension of the feature set would be very large. However, the characteristics of all the MTMs are similar. Therefore, we adopt the average Markov transition probabilities (MTPs) as the steganalysis features instead. The average MTM (denoted by $\bar{\mathbf{M}}$) is determined as

$$\bar{\mathbf{M}} = \frac{1}{T}\sum_{j=0}^{T-1} \mathbf{M}_j.$$

Accordingly, the dimension of the short-term feature set (STFS) for pulse pairs is $K \times K$.
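The short-term features can be sketched as follows. This is an illustrative sketch under stated assumptions: pulse positions are given as track-local indices 0..m-1 so that all tracks share one state space, rows index the previous pulse pair and columns the next one, and rows of unseen states are left at zero. None of these choices is taken from the original work.

```python
from itertools import combinations_with_replacement


def pair_states(m):
    """Enumerate the m * (m + 1) / 2 unordered pairs of position indices."""
    return list(combinations_with_replacement(range(m), 2))


def transition_matrix(chain, states):
    """First-order transition probabilities of one track's pulse-pair chain."""
    index = {state: k for k, state in enumerate(states)}
    size = len(states)
    counts = [[0] * size for _ in range(size)]
    for prev, nxt in zip(chain, chain[1:]):
        counts[index[tuple(sorted(prev))]][index[tuple(sorted(nxt))]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def average_matrix(matrices):
    """Element-wise mean of equally sized square matrices."""
    t = len(matrices)
    size = len(matrices[0])
    return [
        [sum(mat[r][c] for mat in matrices) / t for c in range(size)]
        for r in range(size)
    ]


# Toy example with m = 2 (states (0,0), (0,1), (1,1)) and two tracks.
states = pair_states(2)
chain_a = [(0, 1), (1, 0), (1, 1), (0, 1)]
chain_b = [(0, 0), (0, 0), (0, 0), (1, 1)]
mtm_a = transition_matrix(chain_a, states)
mtm_b = transition_matrix(chain_b, states)
avg_mtm = average_matrix([mtm_a, mtm_b])
```

Averaging keeps the feature dimension at K × K regardless of the number of tracks, which is the motivation given in the text for using the average MTM.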
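Furthermore, the joint probability matrices of the pulse pairs in different tracks are employed to characterize the track-to-track features. To be specific, for the pulse pair of the $a$th track and the one of the $b$th track ($0 \le a, b \le T-1$), the joint probability matrix (JPM) $\mathbf{J}_{a,b}$ is

$$\mathbf{J}_{a,b} = \begin{bmatrix} P(V_{a,0}, V_{b,0}) & \cdots & P(V_{a,0}, V_{b,K-1}) \\ \vdots & \ddots & \vdots \\ P(V_{a,K-1}, V_{b,0}) & \cdots & P(V_{a,K-1}, V_{b,K-1}) \end{bmatrix},$$

where $K$ is the number of all possible pulse-position pairs for each track, which can be determined by (3); $V_{a,k}$ ($V_{b,k}$) is the $k$th ($0 \le k \le K-1$) possible pulse-position pair for the $a$th ($b$th) track; and $P(V_{a,g}, V_{b,h})$ is the joint probability of $V_{a,g}$ and $V_{b,h}$ ($0 \le g, h \le K-1$). Specifically, the joint probability of the pulse-position pair $(u_a, v_a)$ in the $a$th track and the pulse-position pair $(u_b, v_b)$ in the $b$th track can be determined as follows:

$$P\big((u_a, v_a), (u_b, v_b)\big) = \frac{1}{N}\sum_{i=0}^{N-1} \delta\big(s_{a,i} = (u_a, v_a)\big)\ \&\ \delta\big(s_{b,i} = (u_b, v_b)\big),$$

where $N$ is the number of subframes, $s_{a,i}$ ($s_{b,i}$) is the pulse pair in the $a$th ($b$th) track of the $i$th subframe ($0 \le i \le N-1$), $\delta(x = y)$ is the characteristic function defined in (2), and "&" is the binary AND operation.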
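A joint probability matrix for two tracks can be estimated by counting co-occurrences in the same subframe. The sketch below assumes track-local position indices 0..m-1 and an order-insensitive pulse-pair representation; the names are illustrative.

```python
from itertools import combinations_with_replacement


def joint_probability_matrix(chain_a, chain_b, states):
    """Joint probabilities of the pulse pairs of two tracks.

    chain_a, chain_b: pulse pairs of the two tracks, one per subframe.
    Entry [g][h] estimates the probability that track a shows state g
    and track b shows state h in the same subframe.
    """
    n = len(chain_a)
    if n == 0 or n != len(chain_b):
        raise ValueError("both chains need the same non-zero length")
    index = {state: k for k, state in enumerate(states)}
    size = len(states)
    counts = [[0] * size for _ in range(size)]
    for pair_a, pair_b in zip(chain_a, chain_b):
        g = index[tuple(sorted(pair_a))]
        h = index[tuple(sorted(pair_b))]
        counts[g][h] += 1
    return [[c / n for c in row] for row in counts]


# Toy example with m = 2 candidate positions per pulse (K = 3 states).
jpm_states = list(combinations_with_replacement(range(2), 2))
track_a = [(0, 1), (1, 0), (1, 1), (0, 1)]
track_b = [(0, 0), (0, 0), (0, 0), (1, 1)]
jpm = joint_probability_matrix(track_a, track_b, jpm_states)
```

All entries of one matrix sum to one, since every subframe contributes exactly one co-occurrence.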
Like the STFS above, we adopt the average JPM as the track-to-track feature set (TTFS) instead of all JPMs to reduce the computational complexity. Specifically, the average JPM (denoted by $\bar{\mathbf{J}}$) is the element-wise mean of the JPMs $\mathbf{J}_{a,b}$ over the track pairs, so the dimension of the TTFS is $K \times K$. Accordingly, the total dimension of all three feature sets is $T \times K + 2 \times K \times K$. Taking the AMR speech codec at 12.2 kbps mode as an example, there are five tracks in each subframe (i.e., $T = 5$), where the two pulses of a track share eight candidate positions, that is, $m = 8$. Thus, there are $K = 36$ possible pulse pairs in each track, and the total dimension of all feature sets is 2772. This feature set is still too large to be directly adopted in a machine-learning based steganalysis scheme, since very-high-dimensional features would not only cause huge computational costs in the detection phase but also be more likely to induce overfitting in the training phase [33]. Thus, a feature selection mechanism based on AdaBoost [34][35][36][37][38] is employed to optimize the feature set as well as reduce its dimension. In the previous work [33], this mechanism yielded a reduced feature set with the 498 most effective features for the AMR speech codec at 12.2 kbps mode, of which the composition is shown in Table 1. Given that the effectiveness of the selected feature set for steganalysis of AMR speech has been verified, we directly employ it in this paper.
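The dimension figures for the 12.2 kbps mode follow directly from the number of tracks and candidate positions. A short check, with illustrative function names:

```python
def num_pulse_pairs(m: int) -> int:
    """Number of unordered pulse pairs (repetition allowed) for m positions."""
    return m * (m + 1) // 2


def feature_dimensions(tracks: int, m: int) -> dict:
    """Dimensions of the long-term, short-term, and track-to-track sets."""
    k = num_pulse_pairs(m)
    return {
        "LTFS": tracks * k,
        "STFS": k * k,
        "TTFS": k * k,
        "total": tracks * k + 2 * k * k,
    }


dims = feature_dimensions(tracks=5, m=8)    # AMR at 12.2 kbps
selected_share = 498 / dims["total"]        # share kept after selection
```

With five tracks and eight candidate positions, the 498 selected features amount to roughly 18% of the full feature set.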

Steganalysis Schemes for Detecting AMR Speech Steganography with Unknown Embedding Rate
In this section, we present three steganalysis schemes for detecting AMR speech based steganography with unknown embedding rate. All of them employ SVM, a well-known machine-learning tool with excellent classification performance [48][49][50][51][52][53] that is widely used in the steganalysis field [17-20, 24, 25, 33]. The first two schemes are extended from the existing image steganalysis schemes [40][41][42]; both employ global classifiers to detect the steganography but adopt different training sets. As depicted in Figures 1 and 2, the first scheme trains the global classifier using a comprehensive speech sample set, including original samples and steganographic samples with various embedding rates, while the second one adopts a particular speech sample set, consisting of original samples and steganographic samples with uniform distributions of embedded data, to train the global classifier. For ease of description, we denote the first scheme as GC-M, meaning that it trains the global classifier on mixed samples with various embedding rates, and the second scheme as GC-U, meaning that it trains the global classifier on particular samples with uniform distributions of embedded data. In this work, for each AMR speech based steganographic method, the training set of GC-M involves the steganographic samples with the embedding rates from 10% to 100%. Moreover, to obtain the steganographic AMR speech samples with uniform distributions of embedded data for GC-U, we choose the tracks for hiding information in each subframe in a uniform random manner during the steganographic processes.
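Choosing the tracks that carry hidden bits "in a uniform random manner" can be read in more than one way. The sketch below shows one possible reading, which is an assumption rather than the authors' procedure: a fixed share of all (subframe, track) slots is drawn uniformly at random without replacement. The function name and the slot representation are hypothetical.

```python
import random

RATES_PERCENT = list(range(10, 101, 10))    # 10%, 20%, ..., 100%


def select_embedding_slots(num_subframes, tracks_per_subframe,
                           rate_percent, rng):
    """Pick the (subframe, track) slots that carry hidden bits.

    Slots are drawn uniformly at random without replacement so that the
    requested share of all tracks is used (an assumed interpretation).
    """
    slots = [
        (i, j)
        for i in range(num_subframes)
        for j in range(tracks_per_subframe)
    ]
    wanted = round(len(slots) * rate_percent / 100)
    return sorted(rng.sample(slots, wanted))


rng = random.Random(2017)
# Assumption: one frame at 12.2 kbps holds 4 subframes of 5 tracks each.
slots_30 = select_embedding_slots(4, 5, 30, rng)
slots_100 = select_embedding_slots(4, 5, 100, rng)
```

Under this reading, a GC-M training set would repeat the selection once per rate in `RATES_PERCENT`, whereas a specific classifier would be trained on a single rate.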
In addition, we present a steganalysis scheme based on Dempster-Shafer theory (DST) for AMR speech based steganography with unknown embedding rate, as shown in Figure 3. To make the paper self-contained, we first review DST briefly. DST is a well-established mathematical theory of evidence first presented by Dempster [42] and Shafer [43], which can combine the evidence from different sources to obtain the probability of a certain event [43].
Owing to its powerful reasoning function based on evidence combination, DST has been popularly employed in many fields, such as information fusion [44], classification [45], and intrusion detection [46]. Generally, DST is constructed on a finite set of $n$ possible elements under consideration (denoted by $\Theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$), called a frame of discernment. Note that $\Theta$ is exhaustive, and all elements in $\Theta$ are mutually exclusive. Let $2^{\Theta}$ be the set of all possible subsets of $\Theta$. A mass function $m$ that assigns a probability mass to each subset, also called a basic probability assignment, is defined as follows:

$$m: 2^{\Theta} \to [0, 1], \qquad m(\emptyset) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1,$$

where $\emptyset$ is the empty set. Each subset $A$ of $\Theta$ with $m(A) > 0$ is called a focal element, and its mass $m(A)$ represents the exact belief in the proposition described by $A$. Further, the belief function for a subset $A$ of $\Theta$, denoted by $\mathrm{Bel}(A)$, is the sum of the mass values of all its subsets; namely,

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B).$$

The plausibility function for a subset $A$ of $\Theta$, denoted by $\mathrm{Pl}(A)$, is the sum of the mass values of all the subsets of $\Theta$ that intersect $A$; namely,

$$\mathrm{Pl}(A) = \sum_{B \cap A \neq \emptyset} m(B).$$

Moreover, DST provides a combination rule to obtain a synthesized belief value for a subset by fusing the evidence from different sources. Formally, assume that $m_1, m_2, \ldots, m_q$ are mass functions on $\Theta$ from $q$ different pieces of evidence; the combination rule for a nonempty subset $A$ of $\Theta$ can be stated as follows:

$$m(A) = \frac{1}{1 - \kappa} \sum_{A_1 \cap A_2 \cap \cdots \cap A_q = A}\ \prod_{i=1}^{q} m_i(A_i), \qquad m(\emptyset) = 0,$$

where $\kappa$ is a conflict factor that measures the degree of conflict among all the evidence and can be determined as follows:

$$\kappa = \sum_{A_1 \cap A_2 \cap \cdots \cap A_q = \emptyset}\ \prod_{i=1}^{q} m_i(A_i).$$

Note that if $\kappa = 1$, the available evidence is completely contradictory and thereby cannot be combined directly.
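The belief, plausibility, and combination definitions of DST translate into a few lines of code. The following is a generic sketch in which a mass function is represented as a dictionary from `frozenset` to mass; this representation and the function names are choices made for the example.

```python
from itertools import product


def combine(*masses):
    """Dempster's rule for any number of mass functions.

    Each mass function maps frozenset (a subset of the frame) to its mass.
    Returns the combined mass function and the conflict factor.
    """
    combined = {}
    conflict = 0.0
    for combo in product(*(m.items() for m in masses)):
        intersection = frozenset.intersection(*(a for a, _ in combo))
        weight = 1.0
        for _, value in combo:
            weight *= value
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + weight
        else:
            conflict += weight       # mass falling on the empty set
    if 1.0 - conflict < 1e-12:
        raise ValueError("evidence is completely contradictory")
    fused = {a: w / (1.0 - conflict) for a, w in combined.items()}
    return fused, conflict


def belief(mass, subset):
    """Sum of the masses of all focal elements contained in subset."""
    return sum(v for b, v in mass.items() if b <= subset)


def plausibility(mass, subset):
    """Sum of the masses of all focal elements intersecting subset."""
    return sum(v for b, v in mass.items() if b & subset)


C, S = frozenset({"cover"}), frozenset({"stego"})
THETA = C | S
m1 = {C: 0.6, S: 0.3, THETA: 0.1}
m2 = {C: 0.7, S: 0.2, THETA: 0.1}
fused, conflict = combine(m1, m2)
```

In this toy case both sources lean towards the cover hypothesis, so the fused mass on `C` is larger than either input mass, while the disagreeing products are collected in `conflict`.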
In our work, the frame of discernment $\Theta$ for detecting AMR speech based steganography with unknown embedding rate is defined as $\Theta = \{C, S\}$, where $C$ and $S$ represent the cover (original) and steganographic samples, respectively, and accordingly, $2^{\Theta} = \{\emptyset, \{C\}, \{S\}, \{C, S\}\}$. As shown in Figure 3, we adopt the specific SVM-based classifiers for the embedding rates from 10% to 100% as ten independent evidence sources. That is to say, there are ten mass functions from the specific SVM-based classifiers for various embedding rates. Specifically, the $i$th ($1 \le i \le 10$) mass function from the classifier for the embedding rate of $10\% \times i$ is defined as follows:

$$m_i(A) = \begin{cases} P_i(C), & A = \{C\}, \\ P_i(S), & A = \{S\}, \\ 0, & \text{otherwise}, \end{cases}$$

where $P_i(C)$ ($P_i(S)$) is the confidence probability for the test sample belonging to the cover (steganographic) class, offered by the SVM-based classifier for the embedding rate of $10\% \times i$.
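For a two-element frame, fusing the classifier outputs has a simple closed form. The sketch below assumes that every classifier puts all its mass on the two singletons (cover and steganographic) with probabilities that sum to one, and that the sample is declared steganographic when the fused steganographic mass is the larger one; the decision rule and all names are assumptions for illustration.

```python
from math import prod


def fuse_binary(stego_probs):
    """Dempster combination of sources on the frame {cover, stego}.

    stego_probs: per-classifier probability that the sample is
        steganographic; the cover probability is taken as 1 - p.
    Returns (fused cover mass, fused stego mass, conflict factor).
    """
    all_stego = prod(stego_probs)
    all_cover = prod(1.0 - p for p in stego_probs)
    agreement = all_stego + all_cover     # equals 1 - conflict
    if agreement == 0.0:
        raise ValueError("evidence is completely contradictory")
    return all_cover / agreement, all_stego / agreement, 1.0 - agreement


def is_stego(stego_probs):
    """Assumed decision rule: pick the hypothesis with the larger fused mass."""
    cover_mass, stego_mass, _ = fuse_binary(stego_probs)
    return stego_mass > cover_mass


# Two sources first, then ten hypothetical classifier outputs.
two_cover, two_stego, two_conflict = fuse_binary([0.8, 0.6])
ten_outputs = [0.55, 0.6, 0.65, 0.7, 0.7, 0.75, 0.8, 0.8, 0.85, 0.9]
ten_cover, ten_stego, ten_conflict = fuse_binary(ten_outputs)
```

Because only the all-cover and all-stego products survive the intersection, the conflict factor grows quickly with the number of sources even when the classifiers largely agree.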

Performance Evaluation and Analysis
In this paper, all the SVM-based classifiers are implemented employing LibSVM [49], a popular open-source software library for SVM. Specifically, the classifiers are C-SVM classifiers with the RBF kernel, for which the default parameters are employed, that is, $C = 1$ and $\gamma = 1/1064$. Moreover, we collect a total of 3366 ten-second speech samples from audio materials for language learning, of which the components are shown in Table 2. Without loss of generality, we choose the AMR codec at 12.2 kbps mode as the cover codec. In the experiments, all steganalysis schemes are evaluated by detecting the state-of-the-art steganographic methods, namely, Geiser's method [29] and Miao's method at the modes of $\eta = 1$, 2, and 4 [30]. Prior to the steganographic experiments, we randomly select half (1683) of the total speech samples as the cover sample set for training (CSST) and take the remaining samples as the cover sample set for detection (CSSD). In the steganographic experiments, the embedded messages are all randomly produced. For the three steganalysis schemes, we define their training sets as follows:
(i) The training set of the first scheme (GC-M): for each steganographic method, the training set includes 1400 speech samples randomly selected from CSST and 1400 mixed steganographic speech samples at the embedding rates from 10% to 100%, where there are 140 speech samples at each embedding rate.
(ii) The training set of the second scheme (GC-U): for each steganographic method, the training set includes 1400 speech samples randomly selected from CSST and 1400 steganographic speech samples with uniform distributions of embedded messages.
(iii) The training sets of the third scheme (DST-based scheme): for each steganographic method, it is necessary to train the specific classifiers for different embedding rates. Accordingly, for each embedding rate, a training set needs to be created, which includes 1400 speech samples randomly selected from CSST and 1400 samples generated by performing the given steganographic method at the corresponding embedding rate.
In addition, to evaluate the steganalysis performance at the various embedding rates from 10% to 100%, we create ten detection sample sets for each steganographic method. Specifically, for each embedding rate, the detection sample set consists of 1400 speech samples randomly chosen from CSSD and 1400 speech samples generated by performing the given steganographic method at the corresponding embedding rate. Further, we evaluate the performance of the three steganalysis schemes by comparing them with the steganalysis based on specific classifiers (SCs) [33]. In all steganalysis experiments, we perform statistical analyses of accuracy (ACC, the proportion of correct detection results), false positive rate (FPR, the proportion of false positives out of all negatives), and false negative rate (FNR, the proportion of false negatives out of all positives). Figures 4, 5, 6, and 7, respectively, show the experimental results of detecting the four steganographic methods for the ten-second speech samples at the embedding rates from 10% to 100%, from which we can learn that all three steganalysis schemes in this paper are feasible and effective, while there are some differences in their detection performance. To be specific, the DST-based scheme outperforms GC-U and GC-M on the whole, as also shown in Tables 3-6, since the detection accuracies of the DST-based scheme are better than the others in most cases and closer to those of the scheme based on SCs overall. Moreover, the FPRs of the DST-based scheme are smaller than the others in every case. Note that, for a given steganographic method, the FPRs of each steganalysis scheme presented in this paper are almost the same at any embedding rate, since each scheme adopts the identical classifier to detect the cover samples. For embedding rates smaller than 40%, some detection accuracies of the DST-based scheme are very slightly lower than those of GC-U or GC-M. The main reason behind this phenomenon is that the detection accuracies of the specific classifiers are relatively low at these rates, so the evidence from them is more likely to be highly contradictory. Overall, since the embedding capacities of ten-second speech samples at embedding rates lower than 40% are very small, the detection performance of all the steganalysis schemes is limited there (in particular, the accuracies are lower than 80% for Geiser's method and Miao's method at the modes of $\eta = 1$ and 2). In this sense, how to further improve the steganalysis performance for relatively low embedding rates is still a question worthy of study.
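The three evaluation measures can be computed from the detection results as sketched below, treating steganographic samples as positives and cover samples as negatives. The function name and label encoding are illustrative.

```python
def detection_metrics(labels, predictions):
    """ACC, FPR, and FNR for binary detection results.

    labels, predictions: sequences with 1 for steganographic (positive)
        and 0 for cover (negative) samples.
    """
    tp = fp = tn = fn = 0
    for truth, guess in zip(labels, predictions):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif guess == 1:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "FNR": fn / (fn + tp) if fn + tp else 0.0,
    }


# Toy example: four cover samples followed by four steganographic ones.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_predictions = [0, 0, 0, 1, 1, 1, 0, 0]
metrics = detection_metrics(toy_labels, toy_predictions)
```

Since the FPR is computed on cover samples only, it does not depend on the embedding rate of the steganographic samples in the detection set.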
In addition, to comprehensively evaluate the performance of the presented schemes for detecting steganographic methods at variable embedding rates, we prepare a mixed detection sample set for each steganographic method, which consists of 1400 speech samples randomly chosen from CSSD and 140 steganographic samples generated by performing the given steganographic method at each embedding rate from 10% to 100%. Figure 8 shows the statistical results of these steganalysis experiments, from which we can learn that all three presented schemes achieve relatively good accuracies for detecting the existing steganographic methods.
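The sample-set sizes used in the experiments can be reproduced with a short bookkeeping sketch. It only tracks sample identifiers and embedding rates; drawing the carriers of the steganographic samples from the same cover pool is an assumption, and all names are illustrative.

```python
import random

RATES = list(range(10, 101, 10))    # embedding rates in percent


def split_cover(sample_ids, seed=0):
    """Randomly split the cover samples into two halves (CSST, CSSD)."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def mixed_set(cover_pool, n_cover=1400, per_rate=140, seed=0):
    """List of (sample_id, rate) entries; rate 0 marks a cover sample.

    Assumption: steganographic samples are generated from carriers that
    are drawn from the same cover pool.
    """
    rng = random.Random(seed)
    items = [(sid, 0) for sid in rng.sample(cover_pool, n_cover)]
    for rate in RATES:
        items.extend((sid, rate) for sid in rng.sample(cover_pool, per_rate))
    return items


csst, cssd = split_cover(range(3366))
mixed_detection = mixed_set(cssd)
```

The same helper with `cover_pool=csst` yields the composition of the GC-M training set, which also mixes 140 steganographic samples per embedding rate with 1400 cover samples.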

Figure 8: Experimental results for detecting steganographic methods at variable embedding rates.
Figure 1: The first steganalysis scheme (GC-M), which trains the global classifier on mixed samples with various embedding rates.

Table 2: The components of the adopted ten-second speech samples.

Table 3: Statistical results of accuracies for detecting Geiser's method.