Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and Cosine Distance Scoring



Introduction
Speaker recognition refers to recognizing unknown persons from their voices. With the use of speech as a biometric in access systems, more and more ordinary people have benefited from this technology [1]. An example is the automatic speech-based access system. Compared with the conventional password-based system, this system is more suitable for elderly people whose eyesight is poor and whose fingers are clumsy.
With the development of phone-based services, the speech used for recognition is usually recorded by phone. However, the quality of phone speech is low for recognition because the sampling rate of phone speech is only 8 kHz. Moreover, the ambient noise and channel noise cannot be completely removed. Therefore, it is necessary to find a speaker recognition model that is not sensitive to factors such as noise and low-quality speech.
In a speaker recognition model, the speech is first transformed into one or many feature vectors that represent unique information for a particular speaker irrespective of the speech content [2]. The most widely used feature vector is the short vector, because it is easy to compute and yields good performance [3]. Usually, the short vector is extracted by the Mel frequency cepstral coefficient (MFCC) method [4]. This method can represent the speech spectrum in a compact form, but the extracted short vector represents only the static information of the speech. To represent the dynamic information, the Fused MFCC (FMFCC) method [5] was proposed. This method calculates not only the cepstral coefficients but also the delta derivatives, so the short vector extracted by this method can represent both the static and dynamic information.
Both methods use the discrete Fourier transform (DFT) to obtain the frequency spectrum. The DFT decomposes the signal into a global frequency domain. If a part of the frequency range is corrupted by noise, the whole spectrum is strongly affected [6]. In other words, the DFT-based extraction methods, such as MFCC and FMFCC, are sensitive to noise. The wavelet packet transform (WPT) [7] is another tool used to obtain the frequency spectrum. Compared with the DFT, the WPT decomposes the speech into many small frequency bands that are independent of each other. Because of those independent bands, the ill effect of noise cannot spread over the whole spectrum. In other words, the WPT has antinoise ability. Based on the WPT, the wavelet packet entropy (WPE) method [8] was proposed to extract the short vector. References [8][9][10][11] have shown that the short vector extracted by WPE is insensitive to noise.
The i-vector is another type of feature vector. It is a robust way to represent a speech using a single high-dimension vector, and it is generated from the short vectors. The i-vector considers both the speaker-dependent and the background information, so it usually leads to good accuracy. References [12][13][14] have used it to enhance the performance of speaker recognition models. In particular, [15] uses the i-vector to improve the discrimination of low-quality speech. Usually, the i-vector is generated from short vectors extracted by the MFCC or FMFCC methods, but we employ the WPE to extract those short vectors, because the WPE can resist the ill effect of noise.
Once the speeches are transformed into feature vectors, a classifier is used to recognize the identity of the speaker based on those feature vectors. The Gaussian mixture model (GMM) is a conventional classifier. Because it is fast and simple, the GMM has been widely used for speaker recognition [4,16]. However, if the dimension of the feature vector is high, the curse of dimensionality degrades this classifier. Unfortunately, the i-vector is a high-dimensional vector compared with the short vector. Cosine distance scoring (CDS) is another type of classifier used for speaker recognition [17]. This classifier uses a kernel function to deal with the problem of high-dimension vectors, so it is suitable for the i-vector. In this paper, we employ the CDS for speaker classification.
The main work of this paper is to propose a new speaker recognition model by using the wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). WPE is used to extract the short vectors from speeches, because it is robust against noise. The i-vector is generated from those short vectors. It is used to characterize the speeches used for recognition and to improve the discrimination of low-quality speech. CDS is very suitable for high-dimension vectors such as the i-vector, because it uses a kernel function to deal with the curse of dimensionality. To improve the discrimination of the i-vector, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are added to the CDS. Our proposed model is evaluated on the TIMIT database. The results of the experiments show that the proposed model can deal with the low-quality speech problem and resist the ill effect of noise. However, the time cost of the new model is high, because extracting the WPE is time-consuming. This paper calculates the WPE in parallel to reduce the time cost.
The rest of this paper is organized as follows. In Section 2, we describe the conventional speaker recognition model. In Section 3, the speaker recognition model based on the i-vector is described. We propose a new speaker recognition model in Section 4, and the performance of the proposed model is reported in Section 5. Finally, we give a conclusion in Section 6.

The Conventional Speaker Recognition Model
The conventional speaker recognition model can be divided into two parts: short vector extraction and speaker classification. The short vector extraction transforms the speech into short vectors, and the speaker classification uses a classifier to produce the recognition result based on the short vectors.

Short Vector Extraction.
The Mel frequency cepstral coefficient (MFCC) method is the conventional short vector extraction algorithm. This method first decomposes the speech into 20-30 ms speech frames. For each frame, the cepstral coefficients are calculated as follows [18]:
(1) Take the DFT of the frame to obtain the frequency spectrum.
(2) Map the power of the spectrum onto the Mel scale using the Mel filter bank.
(3) Calculate the logarithm of the power spectrum mapped on the Mel scale.
(4) Take the DCT of the logarithmic power spectrum to obtain the cepstral coefficients.
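The four steps above can be sketched for a single frame as follows. This is only an illustrative sketch, not the implementation of [18]: the function names, the triangular filter construction, the common Mel formula 2595 log10(1 + f/700), the naive DFT, and the small floor applied before the logarithm are all assumptions made for the example.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: naive O(n^2) DFT; keep the non-redundant half as power."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filter_bank(num_filters, frame_len, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int((frame_len + 1) * hz / sample_rate) for hz in edges]
    bank = []
    for j in range(1, num_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (frame_len // 2 + 1)
        for k in range(left, centre):      # rising slope
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling slope
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    power = power_spectrum(frame)                                   # step 1
    bank = mel_filter_bank(num_filters, len(frame), sample_rate)
    mel_energy = [sum(w * p for w, p in zip(row, power)) for row in bank]  # step 2
    log_energy = [math.log(max(e, 1e-12)) for e in mel_energy]      # step 3 (floored)
    return [sum(le * math.cos(math.pi * m * (j + 0.5) / num_filters)
                for j, le in enumerate(log_energy))
            for m in range(num_coeffs)]                             # step 4 (DCT-II)
```

A practical system would replace the naive DFT with an FFT and apply a window to each frame; both are omitted here to keep the four steps visible.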
Usually, the lower 13-14 coefficients are used to form the short vector. The Fused MFCC (FMFCC) method is an extension of MFCC. Compared with MFCC, it further calculates the delta derivatives to represent the dynamic information of speech. The derivatives are defined as follows [5]:
$$d_i = c_{i+k} - c_{i-k}, \qquad dd_i = d_{i+k} - d_{i-k},$$
where $c_i$ is the $i$th cepstral coefficient obtained by the MFCC method and $k$ is the offset. $d_i$ is the $i$th delta coefficient and $dd_i$ is the $i$th delta-delta coefficient.
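As a small illustration of the delta derivatives, the sketch below assumes the simple two-point difference with offset k and clamps indices at the edges of the sequence. Both the difference form and the edge handling are assumptions of this example; [5] may define them differently.

```python
def delta(coeffs, k=1):
    """Two-point difference c[i+k] - c[i-k], with indices clamped at the edges."""
    n = len(coeffs)
    return [coeffs[min(i + k, n - 1)] - coeffs[max(i - k, 0)] for i in range(n)]


def delta_delta(coeffs, k=1):
    """Delta of the delta sequence."""
    return delta(delta(coeffs, k), k)


c = [1.0, 2.0, 4.0, 7.0]
print(delta(c))        # [1.0, 3.0, 5.0, 3.0]
print(delta_delta(c))  # [2.0, 4.0, 0.0, -2.0]
```

Concatenating the static coefficients with their delta and delta-delta sequences yields a short vector that carries both static and dynamic information.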
Speaker Classification.
The Gaussian mixture model (GMM) is the conventional classifier. It is defined as
$$p(\mathbf{x}) = \sum_{i=1}^{C} w_i\, N(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),$$
where $\mathbf{x}$ is a short vector extracted from an unknown speech. $N(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the $i$th Gaussian function in the GMM, where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are its mean vector and covariance matrix, respectively. $w_i$ is the combination weight of the Gaussian function and satisfies $\sum_{i=1}^{C} w_i = 1$. $C$ is the mixture number of the GMM. All of the parameters, such as the weights, mean vectors, and covariance matrices, are estimated by the EM algorithm [19] using the speech samples of a known speaker. In other words, $p(\mathbf{x})$ represents the characteristic of the known speaker's voice, so we use $p(\mathbf{x})$ to recognize the speaker of the unknown speeches. Assume that an unknown speech is denoted by $Y = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n\}$, where $\mathbf{y}_j$ represents the $j$th short vector extracted from $Y$. Also, assume that the parameters of $p(\mathbf{x})$ are estimated using the speech samples of a known speaker $s$. The recognition result $r$ is obtained by scoring $\mathbf{y}_1, \ldots, \mathbf{y}_n$ under $p(\mathbf{x})$ and subtracting $\theta$, where $\theta > 0$ is the decision threshold and should be adjusted beforehand to obtain the best recognition performance. If $r \le 0$, then the GMM decides that the unknown speech is not spoken by the known speaker $s$; if $r > 0$, then the GMM decides that the unknown speech is spoken by the speaker $s$.
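The GMM decision rule can be sketched as below. The exact scoring formula is not recoverable from the text, so this example assumes that the score is the average mixture density over the short vectors, and it assumes diagonal covariance matrices for brevity. The function names and the toy parameters are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (assumption of this sketch)."""
    log_p = -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                       for xi, m, v in zip(x, mean, var))
    return math.exp(log_p)


def gmm_density(x, weights, means, variances):
    """Weighted sum of the component Gaussians."""
    return sum(w * gaussian_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))


def gmm_decision(vectors, weights, means, variances, threshold):
    """Assumed score: average density of the short vectors minus the threshold.
    A positive result accepts the known speaker, a non-positive one rejects."""
    score = sum(gmm_density(y, weights, means, variances) for y in vectors) / len(vectors)
    return score - threshold


# Toy one-dimensional model with two mixtures.
weights, means, variances = [0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]]
print(gmm_decision([[0.1], [3.9]], weights, means, variances, 0.05) > 0)    # True
print(gmm_decision([[20.0], [25.0]], weights, means, variances, 0.05) > 0)  # False
```

For high-dimensional short vectors a practical implementation would work with log-likelihoods to avoid numerical underflow.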

The Speaker Recognition Model Using I-Vector
The speaker recognition model using the i-vector can be decomposed into three parts: short vector extraction, i-vector extraction, and speaker classification. Figure 1 shows the structure of the model. There are three types of speeches used for this model. The background speeches contain thousands of speeches spoken by many people, the known speeches are the speech samples of known speakers, and the unknown speeches are spoken by the speaker to be recognized. In the short vector extraction, all of the speeches are transformed into short vectors by a feature extraction method. In the i-vector extraction, the background short vectors are used to train the background model. The background model is usually represented by a GMM with 2048 mixtures, and all covariance matrices of the GMM are assumed to be the same for easy computation. Based on the background model, the known and unknown short vectors are used to extract the known and unknown i-vectors, respectively. Note that one i-vector corresponds to only one speech. In the speaker classification, a classifier is used to match the known i-vector with the unknown i-vector and produce the recognition result.

The Proposed Speaker Recognition Model
The accuracy of a recognition system usually drops off rapidly because of low-quality speech and noise. To deal with this problem, we propose a new speaker recognition model based on wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). In Section 4.1, we describe the WPE method and use it to extract the short vector. Section 4.2 describes how to extract the i-vector using those short vectors. Finally, the details of CDS are described in Section 4.3.

Short Vector Extraction.
This paper uses WPE to extract the short vector. The WPE is based on the wavelet packet transform (WPT) [20], so the WPT is described first. WPT is a local signal processing approach that is used to obtain the frequency spectrum. It decomposes the speech into many local frequency bands at multiple levels and obtains the frequency spectrum based on the bands. For a discrete signal such as digital speech, WPT is usually implemented by the Mallat fast algorithm [21]. In the algorithm, WPT is realized by a low-pass filter and a high-pass filter, which are generated by the scaling function and the corresponding mother wavelet, respectively. Through the two filters, the speech is iteratively decomposed into low-frequency and high-frequency components. We can use a full binary tree to describe the process of WPT. The tree structure is shown in Figure 2.
In Figure 2, the root is the speech to be analyzed. Each nonroot node represents a component. The left child is the low-frequency component of its parent and the right child is the high-frequency component of its parent. The left branch and the right branch are the low-pass and high-pass filtering processes followed by 2:1 downsampling, respectively. The filtering processes are defined as
$$w_{l+1}^{2j}(k) = (\mathbf{h} * w_{l}^{j})(2k), \qquad w_{l+1}^{2j+1}(k) = (\mathbf{g} * w_{l}^{j})(2k), \qquad 0 \le k \le N_{l+1} - 1,\ \ 0 \le j \le 2^{l} - 1,\ \ 0 \le l \le L - 1, \tag{4}$$
where $w_{l}^{j}$ is the $j$th frequency component at level $l$ (with $w_{0}^{0}$ being the speech itself), and $\mathbf{h}$ and $\mathbf{g}$ are the low-pass and high-pass filters, respectively. $N_l$ is the length of the frequency component at level $l$. $*$ is the convolution operation. $L$ is the total number of decomposition levels. Because the WPT satisfies the conservation of energy, each leaf node denotes the spectrum of a frequency band obtained by WPT. Based on the WPT, the wavelet packet entropy (WPE) method is proposed to extract the short vector, and in this paper we add a normalization step to the method to reduce the ill effect of the volume.
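The full binary tree of the WPT can be sketched as below. For brevity the example uses the two-tap Haar (db1) filter pair and circular boundary handling, and it returns the leaves in natural tree order; these are assumptions of the sketch, not choices stated by the paper, and the function names are hypothetical.

```python
import math

# Haar (db1) analysis filters, used here only to keep the example short.
HAAR_LOW = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
HAAR_HIGH = [1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0)]


def filter_and_downsample(signal, taps):
    """Circular filtering with the analysis taps, keeping every second output sample."""
    n = len(signal)
    return [sum(t * signal[(2 * k + i) % n] for i, t in enumerate(taps))
            for k in range(n // 2)]


def wavelet_packet_transform(signal, levels, low=HAAR_LOW, high=HAAR_HIGH):
    """Return the 2**levels leaf components of the full binary tree.
    The signal length is assumed to be divisible by 2**levels."""
    nodes = [list(signal)]
    for _ in range(levels):
        children = []
        for node in nodes:
            children.append(filter_and_downsample(node, low))   # left child
            children.append(filter_and_downsample(node, high))  # right child
        nodes = children
    return nodes


leaves = wavelet_packet_transform([float(i % 5) for i in range(32)], 4)
print(len(leaves), len(leaves[0]))  # 16 2
```

With an orthogonal filter pair such as Haar, the summed energy of the leaves equals the energy of the input, which is the conservation property mentioned above.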
The flow chart of WPE used in this paper is shown in Figure 3.
Assume that there is a digital speech signal that has finite energy and length. It is first decomposed into 20 ms frames, and then each frame is normalized. The normalization process is defined as
$$\hat{f}(n) = \frac{f(n) - \mu}{\sigma}, \qquad n = 0, 1, \ldots, N - 1,$$
where $f$ is a signal frame and $N$ is its length. $\mu$ is the mean value of the frame and $\sigma$ is its standard deviation. $\hat{f}$ is the normalized frame. After the normalization process, the WPT decomposes the frame at 4 levels using (4). Therefore, we finally obtain 16 frequency bands, and the frequency spectrums in those bands are denoted as $w_4^0, w_4^1, \ldots, w_4^{15}$, respectively. For each spectrum, the Shannon entropy is calculated. The Shannon entropy is denoted as
$$H_j = -\sum_{i=0}^{N_4 - 1} p_{j,i} \log p_{j,i}, \qquad j = 0, 1, \ldots, 15,$$
with
$$e_j = \sum_{i=0}^{N_4 - 1} \left| w_4^j(i) \right|^2, \qquad p_{j,i} = \frac{\left| w_4^j(i) \right|^2}{e_j},$$
where $e_j$ is the energy of the $j$th spectrum. $p_{j,i}$ is the energy distribution of the $j$th spectrum. $N_4$ is the length of each frequency spectrum. Finally, the Shannon entropies of all spectrums are collected to form a feature vector that is denoted as $[H_0, H_1, \ldots, H_{15}]^T$.
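The normalization and entropy steps can be sketched as follows, assuming the sub-band coefficients of a frame are already available as a list of lists. The function names are hypothetical, the population standard deviation (division by the frame length) is an assumption, and an all-zero band is given zero entropy by convention.

```python
import math


def normalize_frame(frame):
    """Zero-mean, unit-variance normalization of one frame."""
    n = len(frame)
    mean = sum(frame) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in frame) / n)
    if std == 0.0:
        return [0.0] * n  # constant frame: nothing to scale
    return [(x - mean) / std for x in frame]


def shannon_entropy(band):
    """Shannon entropy of the energy distribution of one sub-band."""
    energies = [c * c for c in band]
    total = sum(energies)
    if total == 0.0:
        return 0.0
    return -sum((e / total) * math.log(e / total) for e in energies if e > 0.0)


def wpe_vector(bands):
    """One entropy per sub-band; 16 bands give a 16-dimensional short vector."""
    return [shannon_entropy(band) for band in bands]


print(shannon_entropy([1.0, 1.0, 1.0, 1.0]))  # log(4), the maximum for 4 samples
print(shannon_entropy([3.0, 0.0, 0.0, 0.0]))  # zero: all energy in one sample
```

Because each entropy depends only on the relative energy distribution inside its own band, noise confined to one band changes only the corresponding entry of the vector.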

I-Vector Extraction.
The i-vector is a robust feature vector that represents a speech using a single high-dimension vector.
Because it considers the background information, the i-vector usually improves the accuracy of recognition [22]. Assume that there is a set of speeches. Those speeches are supplied by different speakers and all of the speeches are transformed into short vectors. In the i-vector theory, the speaker- and channel-dependent feature vector is assumed as
$$\mathbf{M} = \mathbf{m} + \mathbf{T}\,\mathbf{w}(U),$$
where $\mathbf{M}$ is the speaker- and channel-dependent feature vector. $\mathbf{m}$ is the background factor. Usually, it is generated by stacking the mean vectors of a background model. Assume that the mean vectors of the background model are denoted by $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_C$, where each mean vector is a row vector. $\mathbf{m}$ is denoted by $[\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_C]^T$. $\mathbf{T}$ is named the total variability matrix and represents a space that contains the speaker- and channel-dependent information. $\mathbf{w}(U)$ is a random vector having the standard normal distribution $N(\mathbf{0}, \mathbf{I})$. The i-vector is the expectation of $\mathbf{w}(U)$. $U$ is a set of speeches, all of which are transformed into short vectors. Assume that a background model is given, and $\boldsymbol{\Sigma}$ is initialized by the covariance matrix of the background model. $\mathbf{T}$ and $\mathbf{w}(U)$ are initialized randomly. $\mathbf{T}$ and $\mathbf{w}(U)$ are estimated by an iterative process described as follows:
(1) E-step: for each speech in the set $U$, calculate the parameters of the posterior distribution of $\mathbf{w}(U)$ using the current estimates of $\mathbf{T}$, $\boldsymbol{\Sigma}$, and $\mathbf{m}$.
(2) M-step: update $\mathbf{T}$ and $\boldsymbol{\Sigma}$ by a linear regression in which the $\mathbf{w}(U)$s play the role of explanatory variables.
(3) Iterate until the expectation of $\mathbf{w}(U)$ is stable.
The details of the estimation processes of T and w(U) are described in [23].
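For orientation, the posterior mean computed in the E-step is commonly written in the i-vector literature in the closed form below. It is quoted here as the standard form rather than as the exact expressions of [23], and the symbols $\gamma_t(c)$, $N_c(u)$, and $\tilde{\mathbf{F}}_c(u)$ are notation introduced only for this sketch.

```latex
% Baum-Welch statistics of one speech u with short vectors y_1, ..., y_n,
% where \gamma_t(c) is the posterior probability of background mixture c given y_t:
N_c(u) = \sum_{t=1}^{n} \gamma_t(c), \qquad
\tilde{\mathbf{F}}_c(u) = \sum_{t=1}^{n} \gamma_t(c)\,\bigl(\mathbf{y}_t - \boldsymbol{\mu}_c\bigr)

% Posterior mean of w(u), i.e. the i-vector of the speech u.
% N(u) is the block-diagonal matrix with blocks N_c(u) I, and
% \tilde{F}(u) stacks the centered first-order statistics \tilde{F}_c(u):
\mathbb{E}\bigl[\mathbf{w}(u)\bigr] =
  \bigl(\mathbf{I} + \mathbf{T}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{N}(u)\,\mathbf{T}\bigr)^{-1}
  \mathbf{T}^{T}\boldsymbol{\Sigma}^{-1}\tilde{\mathbf{F}}(u)
```

The matrix inverted here has the dimension of the i-vector (400 in the experiments of this paper), not of the stacked mean vector, which keeps the E-step tractable.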

Speaker Classification.
Cosine distance scoring (CDS) is used as the classifier in our proposed model. It uses a kernel function to deal with the curse of dimensionality, so CDS is very suitable for the i-vector. To describe this classifier easily, we take a two-class task as an example. Assume that there are two speakers denoted as $s_1$ and $s_2$. The two speakers speak $N_1$ and $N_2$ speeches, respectively. All speeches are represented by i-vectors and are denoted by $X = \{\mathbf{x}_j^i \mid i = 1, 2;\ j = 1, \ldots, N_i\}$, where $\mathbf{x}_j^i$ is the i-vector representing the $j$th speech sample of the speaker $s_i$. We also assume there is an unknown speech represented by the i-vector $\mathbf{y}$. The purpose of the classifier is to match the unknown i-vector with the known i-vectors and determine which speaker speaks the unknown speech. The result of the recognition is defined as
$$D_i(\mathbf{y}) = \frac{1}{N_i} \sum_{j=1}^{N_i} K(\mathbf{x}_j^i, \mathbf{y}) - \theta,$$
where $N_i$ is the total number of speeches supplied by the speaker $s_i$. $\theta$ is the decision threshold. If $D_i(\mathbf{y}) \le 0$, the unknown speech is not spoken by the known speaker $s_i$; if $D_i(\mathbf{y}) > 0$, then the speaker of the unknown speech is $s_i$. $K(\cdot, \cdot)$ is the cosine kernel and is defined as
$$K(\mathbf{x}, \mathbf{y}) = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert},$$
where $\mathbf{x}$ is the known i-vector and $\mathbf{y}$ is the unknown i-vector. Usually, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are used to improve the discrimination of the i-vector. Therefore, the kernel function is rewritten as
$$K(\mathbf{x}, \mathbf{y}) = \frac{(\mathbf{A}^T \mathbf{x})^T \mathbf{W}^{-1} (\mathbf{A}^T \mathbf{y})}{\sqrt{(\mathbf{A}^T \mathbf{x})^T \mathbf{W}^{-1} (\mathbf{A}^T \mathbf{x})}\ \sqrt{(\mathbf{A}^T \mathbf{y})^T \mathbf{W}^{-1} (\mathbf{A}^T \mathbf{y})}},$$
where $\mathbf{A}$ is the LDA projection matrix and $\mathbf{W}$ is the within-class covariance matrix used by WCCN. $\mathbf{A}$ and $\mathbf{W}$ are estimated using all of the i-vectors, and the details of LDA and WCCN are described in [24].
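A minimal sketch of the CDS decision with the plain cosine kernel is given below. It assumes the score is the average kernel value between the unknown i-vector and the known i-vectors of one speaker, and it omits the LDA and WCCN projections. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_kernel(x, y):
    """Cosine of the angle between two i-vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


def cds_decision(known_ivectors, unknown_ivector, threshold):
    """Average kernel value against one speaker's known i-vectors minus the
    threshold. A positive value accepts the speaker."""
    score = sum(cosine_kernel(x, unknown_ivector) for x in known_ivectors)
    return score / len(known_ivectors) - threshold


speaker = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]
print(cds_decision(speaker, [1.0, 0.05, 0.05], 0.5) > 0)  # True: same direction
print(cds_decision(speaker, [0.0, 0.0, 1.0], 0.5) > 0)    # False: nearly orthogonal
```

The kernel depends only on the angle between the vectors, so rescaling an i-vector does not change the decision.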

Experiment and Results
In this section, we report the outcome of our experiments. In Section 5.1, we describe the experimental dataset. In Section 5.2, we carry out an experiment to select the optimal mother wavelet for the WPE algorithm. In Sections 5.3 and 5.4, we evaluate the recognition accuracy of our model in clean and noisy environments. In Section 5.5, we evaluate the performance of the proposed model. Finally, the time cost of the model is measured in Section 5.6.

Experimental Dataset.
Our experiments are performed on the TIMIT speech database [25]. This database contains 630 speakers (192 females and 438 males) who come from 8 different English dialect regions. Each speaker supplies ten speech samples that are sampled at 16 kHz and last 5 seconds. All female speeches are used to obtain a background model that represents the common characteristic of the female voice. Likewise, all male speeches are used to generate another background model characterizing the male voice. 384 speakers (192 females and 192 males) are randomly selected and their speeches are used as the known and unknown speeches. The test results presented in our experiments are collected on a computer with a 2.5 GHz Intel Core i5 CPU and 8 GB of memory, and the experimental platform is MATLAB R2012b.

Optimal Mother Wavelet.
A good mother wavelet can improve the performance of the WPE algorithm. The performance of a mother wavelet depends on two important elements: the support size and the number of vanishing moments. If a mother wavelet has a large number of vanishing moments, the WPE ignores much of the unimportant information; if the mother wavelet has a small support size, the WPE accurately locates important information [26]. Therefore, an optimal mother wavelet should have a large number of vanishing moments and a small support size. In this view, the Daubechies and Symlet wavelets are good wavelets, because they have the largest number of vanishing moments for a given support size. Moreover, those wavelets are orthogonal and are suitable for the Mallat fast algorithm.
In this paper, we use the Energy-to-Shannon Entropy Ratio (ESER) to evaluate the Daubechies and Symlet wavelets and find the best one. ESER is a way to analyze the performance of a mother wavelet and has been employed to select the best mother wavelet in [27]. The ESER is defined as
$$\mathrm{ESER} = \frac{e}{H},$$
where $H$ is the Shannon entropy of the spectrum obtained by WPT and $e$ is the energy of the spectrum. High energy means that the spectrum obtained by WPT contains enough information about the speech. Low entropy means that the information in the spectrum is stable. Therefore, the optimal mother wavelet should maximize the energy and at the same time minimize the entropy. In this experiment, 8 Daubechies and 8 Symlet wavelets, which are denoted as db1-8 and sym1-8, respectively, are employed to decompose speeches that are randomly selected from the TIMIT database. We run the experiment 100 times and record the average ESER of those mother wavelets in Table 1.
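The ratio can be sketched for a single spectrum as below. The sketch assumes the entropy is the Shannon entropy of the normalized energy distribution of the coefficients; the function name is hypothetical, and returning infinity for zero entropy is a convention chosen for this example.

```python
import math


def eser(coefficients):
    """Energy-to-Shannon-entropy ratio of one set of wavelet coefficients."""
    energies = [c * c for c in coefficients]
    energy = sum(energies)
    if energy == 0.0:
        return 0.0
    entropy = -sum((e / energy) * math.log(e / energy) for e in energies if e > 0.0)
    if entropy == 0.0:
        return math.inf  # all energy in a single coefficient
    return energy / entropy


# Same energy, different spread: the concentrated spectrum scores higher.
print(eser([2.0, 0.0, 0.0, 0.0]) > eser([1.0, 1.0, 1.0, 1.0]))  # True
```

A mother wavelet would then be ranked by averaging this ratio over the spectrums of many randomly selected speeches.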
In Table 1, we find that db4 and sym6 obtain the highest ESER. In other words, db4 and sym6 are the best mother wavelets for the speech data. Reference [28] suggests that sym6 can improve the performance of the speaker recognition model. However, the Symlet wavelets produce complex coefficients whose imaginary parts are redundant for a real signal such as digital speech, so we abandon sym6 and choose db4.

The Accuracy of Speaker Recognition Model in Clean Environment.
This experiment evaluates the accuracy of the speaker recognition model. We randomly select 384 speakers (192 females and 192 males). For each speaker, half of the speeches are used as the unknown speeches and the other half are used as the known speeches. For each speaker, the speaker recognition model matches his/her unknown speeches with all of the known speeches of the 384 speakers and determines who speaks the unknown speeches. If the result is right, the model obtains one score; if the result is wrong, the model gets zero score. Finally, we count the score and calculate the mean accuracy that is defined as
$$\text{mean accuracy} = \frac{\text{total score}}{\text{total number of recognition tests}} \times 100\%.$$
In this experiment, we use four types of speaker recognition models for comparison. The first one is the MFCC-GMM model [4]. This model uses the MFCC method to extract 14D short vectors and uses a GMM with 8 mixtures to recognize the speaker based on those short vectors. The second one is the FMFCC-GMM model [16]. This model is very similar to the MFCC-GMM model, but it uses the FMFCC method to extract 52D short vectors. The third one is the WPE-GMM model [10]. This model first uses WPE to transform the speeches into 16D short vectors and then uses a GMM for speaker classification. The last one is the WPE-I-CDS model proposed in this paper. Compared with the WPE-GMM model, our model uses the 16D short vectors to generate a 400D i-vector and uses CDS to recognize the speaker based on the i-vector. We carry out each experiment in this section 25 times to obtain the mean accuracy. The mean accuracy of the above 4 models is shown in Figure 4.
In Figure 4, we find that MFCC-GMM obtains the lowest accuracy of 88.46%. The result of [4] shows that the MFCC-GMM model can obtain an accuracy higher than 90%. The difference arises because we use a GMM with 8 mixtures as the classifier, whereas [4] uses a GMM with 32 mixtures. A large mixture number can improve the performance of the GMM, but it also causes very high computational expense. WPE-I-CDS obtains the highest accuracy of 94.36%. This demonstrates the benefit of the i-vector theory. On the other hand, when 8 kHz speeches (low-quality speeches) are used, the accuracy of all speaker recognition models decreases. The accuracy of MFCC-GMM, FMFCC-GMM, and WPE-GMM decreases by about 6%. Comparatively, the accuracy of WPE-I-CDS decreases by only about 1%. This is because the i-vector considers the background information to improve the accuracy of the speaker recognition model, and the CDS uses LDA and WCCN to improve the discrimination of the i-vector. Reference [29] also reports that the combination of the i-vector and the CDS can enhance the performance of speaker recognition models used for low-quality speeches such as phone speeches.

The Accuracy of Speaker Recognition Model in Noisy Environment.
It is hard to find clean speech in real applications, because the noise in the transmission channel and the environment cannot be controlled. In this experiment, we add 30 dB, 20 dB, and 10 dB Gaussian white noise to the speeches to simulate noisy speeches. All noises are generated by MATLAB's Gaussian white noise function.
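The noise injection can be sketched as follows. The sketch assumes that the dB values denote the signal-to-noise ratio of the resulting noisy speech, which the text does not state explicitly, and it uses Python's standard random generator rather than the MATLAB function used in the paper. The function name is hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise so that the ratio of signal power to
    noise power is approximately snr_db decibels (assumed interpretation)."""
    if rng is None:
        rng = random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


clean = [math.sin(0.01 * t) for t in range(20000)]
noisy_30 = add_white_noise(clean, 30.0, random.Random(0))  # weak noise
noisy_10 = add_white_noise(clean, 10.0, random.Random(0))  # strong noise
```

Under this interpretation a smaller dB value means stronger noise, which matches the larger accuracy drop reported at 10 dB.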
For comparison, this experiment employs three i-vector based models: MFCC-I-CDS [30], FMFCC-I-CDS [31], and our WPE-I-CDS. The first two models are very similar to our proposed model, but they use MFCC and FMFCC to extract the short vectors, respectively. The accuracy of the 3 models in noisy environment is shown in Figure 5.
In Figure 5, the three models obtain high accuracy in clean environment. This also shows that the i-vector can improve the recognition accuracy effectively. However, when we use noisy speeches to test the 3 models, their accuracies decrease. When 30 dB noise is added to the speeches, the accuracy of the three models decreases by about 4%. This shows that all of the models can resist weak noise. However, when we increase the power of the noise, the accuracy of MFCC-I-CDS and FMFCC-I-CDS drops off rapidly. In particular, when the noise level reaches 10 dB, the accuracy of these two models decreases by more than 30%. Comparatively, the accuracy of WPE-I-CDS decreases by less than 12%. These results show that WPE-I-CDS is robust in noisy environment compared with MFCC-I-CDS and FMFCC-I-CDS. This is because the WPE uses the WPT to obtain the frequency spectrum, whereas MFCC and FMFCC use the DFT. The WPT decomposes the speech into many local frequency bands that limit the ill effect of noise, but the DFT decomposes the speech into a global frequency domain that is sensitive to noise.

The Performance of the Speaker Recognition Model.
Usually, the speaker recognition model is used in an access control system. Therefore, a good speaker recognition model should be able to accept the login of the correct people and at the same time reject the access of imposters, as a gatekeeper does. In this experiment, we use the receiver operating characteristic (ROC) curve to evaluate this ability of our model. The ROC curve shows the true positive rate (TPR) as a function of the false positive rate (FPR) for different values of the decision threshold and has been employed in [2].
In this experiment, we randomly select 384 speakers (192 males and 192 females) to calculate the ROC curve. Half of those speakers are used as the correct people and the other half are used as the imposters. We first use the speeches of the correct people to test the speaker recognition model and calculate the TPR, and then we use the speeches of the imposters to attack the speaker recognition model and calculate the FPR. Four models, namely MFCC-GMM, FMFCC-GMM, WPE-GMM, and our WPE-I-CDS, are used for comparison. To plot the ROC curve, we adjust the decision threshold to obtain different ROC points. The ROC curves of those 4 models are shown in Figure 6.
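The threshold sweep described above can be sketched as follows, assuming each trial yields a single score and that a trial is accepted when its score is at least the threshold. The function name and the toy scores are hypothetical.

```python
def roc_points(genuine_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and return the
    resulting (FPR, TPR) points, starting from the reject-everything point."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in genuine_scores) / len(genuine_scores)
        fpr = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((fpr, tpr))
    return points


genuine = [0.9, 0.8, 0.4]    # scores of the correct people
impostor = [0.7, 0.3, 0.1]   # scores of the imposters
for fpr, tpr in roc_points(genuine, impostor):
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1); a better model reaches a high TPR while the FPR is still low.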
A low FPR shows that the speaker recognition model can effectively resist the attacks coming from imposters, and a high TPR shows that the speaker recognition model can accurately accept the login of the correct speakers. In other words, a speaker recognition model is useful if its TPR is high for a low FPR. In Figure 6, when the FPR is higher than 0.45, all models obtain a high TPR, but WPE-I-CDS obtains a higher TPR than the other 3 models for a given FPR that is less than 0.45. This shows that WPE-I-CDS can achieve the access control task more effectively than the other models.

Time Cost.
This section tests the time cost of the fast MFCC-GMM, the conventional MFCC-I-CDS, and our WPE-I-CDS. We use 200 5-second-long speeches to test each model and calculate the average time cost. The result of this experiment is shown in Table 2.
In Table 2, MFCC-GMM does not employ the i-vector for speech representation, so it spends no time on i-vector extraction. In contrast, WPE-I-CDS must spend time extracting the i-vector. WPE-I-CDS also spends more time than MFCC-GMM on short vector extraction. This is because the WPT used by WPE is more complex than the DFT used by MFCC. On the other hand, the parameters of the GMM must be estimated beforehand, so MFCC-GMM spends time training the classifier. CDS has no such parameters to estimate, but it must spend time estimating the LDA and WCCN matrices in the classifier training step. In all, the i-vector improves the recognition accuracy at the cost of increased time consumption, and calculating the WPE costs too much time.
Parallel computation is an effective way to reduce the time cost, because the loops of the sequential computation can be finished at once by a parallel algorithm. For example, consider a signal that is decomposed by WPT at several levels. In the conventional sequential algorithm of WPT, the filtering process has to be run repeatedly at each decomposition level, so the total time cost of WPT is the cost of one filtering process multiplied by the number of repetitions. If independent computational cores are used to implement the WPT with a parallel algorithm, the filtering processes at a level run at the same time and the time complexity of WPT is reduced accordingly. This paper uses 16 independent computational cores to implement the WPE parallel algorithm, and the last line of Table 2 shows that the time cost of WPE is greatly reduced.
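The idea of running the independent filtering processes of one level at the same time can be sketched as below. This is only an illustration of the structure: it uses the Haar filter pair and a thread pool from the Python standard library, which are assumptions of the example. In CPython, threads do not speed up this pure-Python arithmetic, so a real implementation would rely on separate processes or native parallel code.

```python
import math
from concurrent.futures import ThreadPoolExecutor

SQRT2 = math.sqrt(2.0)


def split(node):
    """One Haar filtering step: low-pass and high-pass halves of a node."""
    low = [(node[2 * k] + node[2 * k + 1]) / SQRT2 for k in range(len(node) // 2)]
    high = [(node[2 * k] - node[2 * k + 1]) / SQRT2 for k in range(len(node) // 2)]
    return low, high


def parallel_wpt(signal, levels, workers=16):
    """Decompose all nodes of a level concurrently; levels remain sequential
    because each level depends on the previous one."""
    nodes = [list(signal)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(levels):
            pairs = list(pool.map(split, nodes))  # map preserves node order
            nodes = [child for pair in pairs for child in pair]
    return nodes


print(len(parallel_wpt([float(i) for i in range(64)], 4)))  # 16
```

At the last of 4 levels there are 8 parent nodes producing 16 leaves, so 16 workers are enough to process every node of any level at once.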

Conclusions
With the development of computer technology, speaker recognition has been widely used in speech-based access systems. In a real environment, the quality of the speech may be low and the noise in the transmission channel cannot be controlled. Therefore, it is necessary to find a speaker recognition model that is not sensitive to factors such as noise and low-quality speech.
This paper proposes a new speaker recognition model by employing wavelet packet entropy (WPE), i-vector, and CDS, and we name the model WPE-I-CDS. WPE uses a local analysis tool named WPT rather than the DFT to decompose the signal. Because WPT decomposes the signal into many independent frequency bands that limit the ill effect of noise, the WPE is robust in noisy environment. The i-vector is a type of robust feature vector. Because it considers the background information, the i-vector can improve the accuracy of recognition. CDS uses a kernel function to deal with the curse of dimensionality, so it is suitable for high-dimension feature vectors such as the i-vector. The results of the experiments in this paper show that the proposed speaker recognition model improves the performance of recognition compared with conventional models such as MFCC-GMM, FMFCC-GMM, and WPE-GMM in clean environment. Moreover, WPE-I-CDS obtains higher accuracy than other i-vector-based models such as MFCC-I-CDS and FMFCC-I-CDS in noisy environment. However, the time cost of the proposed model is much higher. To reduce the time cost, we employ a parallel algorithm to implement the WPE and i-vector extraction methods.
In the future, we will combine audio and visual features to improve the performance of the speaker recognition system.

Figure 1: The structure of the speaker recognition using i-vector.

Figure 4: The mean accuracy of 4 models in clean environment.

Figure 5: The accuracy of the 3 models in noisy environment.

Table 1: The average ESER of the mother wavelets.

Table 2: The time cost of the different speaker recognition models.
Therefore, it is very important to find a way to reduce the time cost of the WPE.