Pitch shifting is a common voice-editing technique in which the original pitch of a digital voice is raised or lowered. It can be abused by a malicious attacker to conceal his/her true identity. Existing forensic detection methods are no longer effective for weakly pitch-shifted voice. In this paper, we propose a convolutional neural network (CNN) to detect not only strongly pitch-shifted voice but also weakly pitch-shifted voice whose shifting factor is less than ±4 semitones. Specifically, linear frequency cepstral coefficients (LFCC) computed from power spectra are considered, and their dynamic coefficients are extracted as the discriminative features. The CNN model is carefully designed with particular attention to the input feature map, the activation function, and the network topology. We evaluated the algorithm on voices from two datasets processed with three pitch-shifting software tools. Extensive results show that the algorithm achieves high detection rates for both binary and multiple classification.
Voice disguising [
The simplest form of electronic disguise is to change the playback speed of the target voice. Although the speaker's identity can be concealed, the rhythm of a voice disguised in this way is relatively unnatural, so attackers rarely adopt it in practice. Pitch shifting is a typical electronic disguising technique in which the pitch of the voice is changed while the duration is kept unchanged. Generally, the pitch-shifted voice is more natural in terms of timbre, tone, etc., and harder to detect. In this paper, we mainly focus on the identification of pitch-shifted voices.
Clark [
Recently, some studies on the detection of weakly pitch-shifted voices have been reported. Based on [
Convolutional Neural Networks (CNN) [
Although many methods have been proposed for pitch-shifting identification, there is still room to improve performance, especially when the suspected voices are weakly shifted. In this paper, a CNN model for pitch-shifting detection is proposed. By analyzing the principle of voice pitch shifting, LFCC and its first-derivative coefficients are used as identification features. Compared with other related works, the proposed CNN achieves remarkable performance in both binary and multiple classification. The main contributions of our work are summarized as follows. First, high accuracy is achieved in identifying weakly pitch-shifted voice; since the difference between the original voice and a weakly pitch-shifted voice is small, this identification was a challenging task in previous work. Second, a CNN architecture is employed to identify pitch-shifted voice, which improves performance compared with previous work, and the network architecture is carefully devised. Third, extensive experiments are conducted on two datasets and three pitch-shifting software tools, which indicates that the proposed method is highly robust.
The remainder of the paper is organized as follows. In Section
Voice pitch shifting can be performed in either the time domain or the frequency domain. Time-Domain Pitch Synchronous Overlap Add (TD-PSOLA) is a commonly used approach which works by windowing [
In this paper, we use semitone to measure the pitch of shifted voice. A semitone is the smallest interval between two tones. It is defined as the interval between two adjacent notes in a 12-tone scale [
where
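Assuming the standard equal-temperament definition used above, a shift of n semitones multiplies every frequency component by 2^(n/12). A minimal sketch of this mapping (the function name is illustrative):

```python
def semitone_ratio(n):
    """Frequency ratio produced by a pitch shift of n semitones.

    In 12-tone equal temperament each semitone multiplies frequency
    by 2**(1/12); n may be negative for a downward shift.
    """
    return 2.0 ** (n / 12.0)

# A +12 semitone shift (one octave) doubles the frequency,
# while a weak +3 semitone shift raises it by roughly 19%.
```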
We randomly choose a voice sample from the TIMIT [
Waveform and spectrogram of original voice and pitch-shifted voice. (a) Waveform; (b) Spectrogram.
LFCC is a cepstral feature widely used in voice identification that achieves significant performance [
The voice signal is first pre-processed with pre-emphasis and then windowed. Let
where
where
Finally, the DCT is applied to the Log-power of the
where
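The LFCC pipeline described above (power spectrum, linearly spaced triangular filterbank, log energies, DCT) can be sketched in pure Python. This is a minimal illustration: the naive DFT, filter spacing, and parameter defaults are assumptions for clarity, not the paper's exact settings.

```python
import math

def lfcc(frame, n_filters=20, n_ceps=20):
    """Sketch of LFCC extraction for one pre-emphasized, windowed frame."""
    N = len(frame)
    half = N // 2 + 1
    # Power spectrum |X[k]|^2 via a naive DFT (use an FFT in practice).
    power = []
    for k in range(half):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / N) for t in range(N))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / N) for t in range(N))
        power.append((re * re + im * im) / N)
    # Linearly (equally) spaced triangular filters, unlike MFCC's mel spacing.
    edges = [i * (half - 1) / (n_filters + 1) for i in range(n_filters + 2)]
    log_e = []
    for m in range(1, n_filters + 1):
        lo, ctr, hi = edges[m - 1], edges[m], edges[m + 1]
        e = 0.0
        for k in range(half):
            if lo < k < hi:
                w = (k - lo) / (ctr - lo) if k <= ctr else (hi - k) / (hi - ctr)
                e += w * power[k]
        log_e.append(math.log(e + 1e-12))     # log filterbank energy
    # DCT-II decorrelates the log energies into cepstral coefficients.
    return [sum(log_e[m] * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m in range(n_filters)) for c in range(n_ceps)]
```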
Since most pitch-shifting techniques do not fully model the temporal characteristics of voice [
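The first-derivative (delta) dynamic coefficients mentioned above are conventionally computed with a regression over neighboring frames; a minimal sketch, assuming the standard delta formula with edge replication (the window width is an illustrative choice):

```python
def delta(frames, width=2):
    """First-order (delta) coefficients of a cepstral sequence.

    frames is a list of per-frame coefficient vectors (e.g. LFCC);
    d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2).
    """
    T, D = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(T):
        out.append([sum(n * (frames[min(t + n, T - 1)][d] -
                             frames[max(t - n, 0)][d])
                        for n in range(1, width + 1)) / denom
                    for d in range(D)])
    return out
```

For a linearly increasing coefficient track the interior delta values are constant, which is a quick way to check the implementation.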
Convolutional neural networks have shown remarkable performance in various classification tasks. A CNN generally consists of an input layer, multiple hidden layers, and an output layer. The hidden layers are crucial to network performance and are typically combinations of different kinds of layers, such as convolutional layers, pooling layers, and fully connected layers [
The proposed network architecture is shown in Figure
Proposed CNN architecture.
In our network, each convolutional group includes two convolutional layers and a pooling layer. A convolutional layer consists of a set of linear convolutional filters which generate local feature maps. A two-dimensional convolutional layer performs a convolution on the input feature map with a specific kernel size. Let
where
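The two-dimensional convolution performed by each layer can be illustrated for a single channel; a minimal sketch of "valid" cross-correlation (no padding, stride 1), as typically implemented in CNN frameworks:

```python
def conv2d_valid(x, k):
    """'Valid' 2D cross-correlation of input x with kernel k,
    the per-channel operation inside a convolutional layer."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    return [[sum(x[i + u][j + v] * k[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(W - kw + 1)]
            for i in range(H - kh + 1)]
```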
Pooling layers are adopted after the convolutional layers to obtain more global information by combining the feature information extracted by the convolutional layers. Max pooling is commonly used: it is a downsampling operation in which the maximum value within a local window is taken as the output
where
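Max pooling as described above keeps only the largest value of each local window; a minimal sketch with non-overlapping 2×2 windows and stride 2, matching the pooling layers in the proposed network:

```python
def max_pool(x, size=2, stride=2):
    """2x2 max pooling with stride 2: each output element is the
    maximum of one non-overlapping local window, halving both
    spatial dimensions of the feature map."""
    return [[max(x[i + u][j + v]
                 for u in range(size) for v in range(size))
             for j in range(0, len(x[0]) - size + 1, stride)]
            for i in range(0, len(x) - size + 1, stride)]
```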
After the three convolutional groups, the fully connected layer acts as a “classification” map in the network, performing high-level reasoning and learning a distributed feature representation. Neurons in the fully connected (FC) layer are connected to all activations in the previous layer. However, overly complex networks reduce the generalization ability of the model. Dropout is a simple and effective regularization technique to prevent over-fitting [
Softmax can be considered an effective multiple-output competitive function whose outputs represent the likelihoods of the classes. Therefore, the dimension of its output equals the number of classes. Let
where
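The softmax function maps the final-layer logits to a probability distribution over the classes; a minimal sketch (the max-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import math

def softmax(z):
    """Softmax over logits z; outputs are positive and sum to 1,
    so they can be read as per-class likelihoods."""
    m = max(z)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]
```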
In summary, the architecture and parameters of the proposed network are shown in Table
Architecture and parameters of the proposed network.
No. | Layer | Kernel size/neuron numbers | Strides | Input channels | Parameters |
---|---|---|---|---|---|
1 | Convolutional 1 | (5,5) | (1,1) | 1 | 1664 |
2 | Convolutional 2 | (5,5) | (1,1) | 64 | 102464 |
3 | Pooling 1 | (2,2) | (2,2) | 64 | — |
4 | Convolutional 3 | (5,5) | (1,1) | 64 | 102464 |
5 | Convolutional 4 | (5,5) | (1,1) | 64 | 102464 |
6 | Pooling 2 | (2,2) | (2,2) | 64 | — |
7 | Convolutional 5 | (5,5) | (1,1) | 64 | 102464 |
8 | Convolutional 6 | (5,5) | (1,1) | 64 | 102464 |
9 | Pooling 3 | (2,2) | (2,2) | 64 | — |
10 | Flatten | 2496 | — | — | — |
11 | Fully connected | 4096 | — | — | 1.02 ∗ 10^7 |
12 | Softmax | number of classes | — | — | 4096 ∗ (number of classes) |
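The per-layer parameter counts in the table follow directly from the layer shapes (weights plus biases). A quick sanity check, assuming 64 filters of size 5×5 per convolutional layer, as the listed counts imply:

```python
def conv_params(kh, kw, in_ch, out_ch):
    """Parameters of a 2D conv layer: one kh*kw*in_ch weight tensor
    plus one bias per output channel."""
    return (kh * kw * in_ch + 1) * out_ch

def fc_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return (n_in + 1) * n_out

print(conv_params(5, 5, 1, 64))    # 1664  -> Convolutional 1
print(conv_params(5, 5, 64, 64))   # 102464 -> Convolutional 2-6
print(fc_params(2496, 4096))       # ~1.02e7 -> Fully connected
```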
The proposed identification algorithm is based on the first derivative of LFCC and a CNN classifier. With a group of equally distributed triangular filters, LFCC captures more characteristics in both low and high frequencies compared with other acoustic features such as MFCC. Thus, the difference between the original voice and the pitch-shifted voice is easier to distinguish. The CNN is considered to perform better in classification tasks, since its multi-layer processing takes less time and its subsampling layers give better feature extraction. The proposed algorithm consists of training and testing stages, as shown in Figure
Diagram of the proposed pitch-shifting identification algorithm.
In the training stage, voices pitch-shifted by different factors and the original voices are treated as separate classes. After extracting the first derivative of LFCC based on Equation (
In the testing stage, the first derivative of LFCC is first extracted and then fed into the trained CNN model. The probability given by softmax in Equation (
In the experiments, the proposed algorithm is evaluated on TIMIT [
For each voice sample, a 20-dimensional LFCC feature map is extracted by setting the length of frame
The detection rate is used to evaluate the performance of the proposed network. Let
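The two evaluation criteria can be computed as follows; this sketch assumes the standard conventions (detection rate as overall accuracy and FAR as the fraction of original samples wrongly flagged), since the paper's exact definitions are elided here:

```python
def detection_metrics(y_true, y_pred):
    """Detection rate and false alarm rate (FAR), in percent.

    Assumed label convention: 1 = pitch-shifted, 0 = original.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    rate = 100.0 * correct / len(y_true)
    originals = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    far = 100.0 * sum(p == 1 for _, p in originals) / len(originals)
    return rate, far
```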
In this paper, TanH is utilized as the activation function in the proposed network. We use the Adam algorithm [
The training process of the proposed network.
We randomly choose 100 voice samples from each sub-dataset of TIMIT, shifted with shifting factors from
Visualization of different feature maps by
In this case, as a comparison to [
Detection performance of strongly pitch-shifted voice in binary classification.
Pitch shifting software | Training dataset | Testing dataset | [ ] Rate | [ ] FAR | [ ] Rate | [ ] FAR | Proposed Rate | Proposed FAR |
---|---|---|---|---|---|---|---|---|
Audition | TIMIT | TIMIT | 99.86 | 0.02 |  |  | 99.54 | 0.10 |
Audition | TIMIT | UME | 97.60 |  |  | 1.19 | 95.89 | 1.52 |
Audition | UME | TIMIT |  | 0.36 | 98.58 |  | 97.51 | 1.45 |
Audition | UME | UME |  | 0.15 |  |  | 99.49 |  |
GoldWave | TIMIT | TIMIT |  |  | 99.94 | 0.01 | 99.58 | 0.05 |
GoldWave | TIMIT | UME |  |  | 96.82 | 2.04 | 96.29 | 1.53 |
GoldWave | UME | TIMIT |  | 0.05 | 98.45 |  | 98.44 | 1.17 |
GoldWave | UME | UME |  |  | 99.70 | 0.07 | 99.12 | 0.36 |
Audacity | TIMIT | TIMIT |  |  | 99.97 |  | 99.97 |  |
Audacity | TIMIT | UME | 99.13 | 0.44 | 97.57 | 2.10 |  |  |
Audacity | UME | TIMIT |  | 0.01 | 98.72 |  | 99.96 | 0.01 |
Audacity | UME | UME |  |  | 99.95 |  | 99.84 | 0.11 |
Bold values represent the best performance among the three methods under the same conditions (same row). For the detection rate (Rate), higher is better; for the false alarm rate (FAR), lower is better.
It can be seen that all the detection methods achieve detection rates higher than 95% and FARs lower than 2%. The method in [
Compared with binary classification, multiple classification is more practical for real forensic applications. In this case, we not only recognize whether the suspected voice is pitch-shifted but also determine the specific shifting factor. The results are presented in Figure
Detection rates of strongly pitch-shifted voice. (a-
In this case, we focus on weakly pitch-shifted samples shifted from
Detection performance of weakly pitch-shifted voice in binary classification.
Pitch shifting software | Training dataset | Testing dataset | [ ] Rate | [ ] FAR | [ ] Rate | [ ] FAR | Proposed Rate | Proposed FAR |
---|---|---|---|---|---|---|---|---|
Audition | TIMIT | TIMIT | 98.11 | 0.83 | 97.29 | 1.34 |  |  |
Audition | TIMIT | UME | 92.95 | 5.50 | 93.25 |  |  | 1.84 |
Audition | UME | TIMIT | 96.72 |  | 95.21 | 1.72 |  | 0.52 |
Audition | UME | UME | 97.70 | 0.88 |  |  | 96.82 | 0.91 |
GoldWave | TIMIT | TIMIT | 97.92 | 0.68 |  |  | 98.14 | 1.47 |
GoldWave | TIMIT | UME | 82.86 | 14.60 | 91.56 |  |  | 5.95 |
GoldWave | UME | TIMIT | 92.58 |  | 93.93 | 0.25 |  | 1.25 |
GoldWave | UME | UME | 98.39 |  |  | 0.14 | 97.79 | 0.92 |
Audacity | TIMIT | TIMIT | 98.27 | 0.32 |  |  | 99.10 | 0.29 |
Audacity | TIMIT | UME | 83.04 | 15.44 | 87.96 | 10.07 |  |  |
Audacity | UME | TIMIT | 91.89 | 0.06 | 91.84 |  |  | 0.33 |
Audacity | UME | UME | 98.89 |  |  |  | 98.39 | 0.87 |
As in the previous section, multiple classification is evaluated after the binary case. The results are shown in Figure
Detection rates of weakly pitch-shifted voice. (a-
Generally, in Figure
Hence, both binary and multiple classifications show that the proposed algorithm achieves good performance and has strong robustness in detecting weakly pitch-shifted voice.
In this paper, an algorithm for pitch-shifted voice identification is proposed. A convolutional neural network architecture is designed and adopted as the classifier to detect pitch-shifted voice, while linear frequency cepstral coefficients are extracted as acoustic features. The algorithm is evaluated on two datasets and three audio editing software tools. Extensive results indicate that the proposed algorithm achieves much better detection rates and FARs in most cases, and the proposed network shows better generalization ability compared with traditional classifiers such as GMM. In future work, network architectures that can replace handcrafted acoustic features are a direction worth studying.
The open-source databases used in this work are listed in the references.
The authors declare that they have no conflicts of interest regarding the publication of this paper.
This research was funded by the National Natural Science Foundation of China, grant numbers [61300055, 61672302]; Natural Science Foundation of Zhejiang, grant number [LY17F020010, LY20F020010]; Natural Science Foundation of Ningbo, grant number [2017A610123] and Zhejiang College Students Science and Technology Innovation Training Program, grant number [2018R405033].