A Multiscale Chaotic Feature Extraction Method for Speaker Recognition



Introduction
Speaker recognition is a biometric recognition technique, which identifies a speaker according to the speaker personality information carried by the speech signal. Among existing biometric techniques, speaker recognition is one of the most convenient and accessible ones due to the abundance of mobile devices with microphones, which allow users to be authenticated across multiple environments and devices [1].
Research in speaker recognition has focused increasingly on enhancing robustness under adverse conditions induced by background noise. Many approaches have been proposed to address these challenges, one of the most successful being i-vector technology [2] used jointly with the probabilistic linear discriminant analysis (PLDA) back-end [3,4]. In addition to new utterance-level features and back-ends, robust acoustic features have been developed to improve the performance of speaker recognition systems. The cepstral features of speech (such as MFCC) are the most discriminative and were the first used [5] in speaker recognition. However, under the influence of channel distortion and background noise, the cepstral feature distribution of speech changes arbitrarily, which weakens its discriminative ability. Therefore, in the early 1990s, a series of feature compensation techniques was proposed to enhance the generalization ability of speech features in recognition [6][7][8].
Existing feature compensation methods mainly include filter compensation, noise model compensation, and empirical compensation.
The main purpose of filter compensation is to reduce noise or relieve its influence on features. This method is based on the fact that channel and environmental distortions are superimposed on the logarithmic spectrum and the cepstrum domain. Furui believed that channel variation appears as an offset of individual coefficients in the cepstrum vector. Therefore, the cepstral mean subtraction (CMS) method is used to relieve the influence of the channel [9].
This method can reduce channel noise to a certain extent, but it also impairs the information in the cepstral coefficients. Unlike the CMS method, the relative spectral feature was proposed to compensate for rapidly changing channel distortions; it uses moving average filtering to simulate the exponential decay of the mean subtraction [10]. However, this method was later shown to yield limited improvements under channel mismatch and additive background noise. Noise model compensation uses prior knowledge of the noise spectrum to estimate the parameters of the clean speech through a noise model or through the influence of noise on the speech. It mainly uses spectral equalization and spectral subtraction to relieve the influence of noise on features. Hansen et al. proposed a multidimensional equalization method to reduce the sensitivity of speech features to noise, thereby improving their discriminative ability [11]. Bharti et al. utilized interframe features to estimate the continuous noise spectrum, which alleviates the problem of noise spectrum changes caused by single-frame estimation in the original spectral subtraction, and then used spectral subtraction to enhance speech features and improve the robustness of speaker features [12]. The noise model compensation method relies mainly on a mathematical model of noise estimation; because of the uncertainty of noise changes, it is difficult to find a mathematical model with good performance.
Empirical compensation is a data-driven method, which is inherently stochastic. Studies have shown that this method outperforms the previous two [13]. It directly compares spectra based on experience. In the training phase, to estimate the change between clean and noisy speech, the difference between the corresponding feature vectors is calculated, and the probability distribution is modeled by adding a bias term for this difference. In the evaluation stage, minimum mean square error prediction is adopted, and the bias vector is used to convert the noisy test feature vector into an equivalent clean speech feature vector. Afify et al. proposed a stochastic mapping method [14], which uses the joint distribution of clean and noisy speech feature vectors to build a Gaussian mixture model and then uses this joint distribution to predict the clean speech. This prediction method is a significant improvement over the earlier minimum mean square error approach.
In summary, feature compensation improves the discriminative ability of speech features by reducing the influence of noise on them. However, as long as noise exists, it remains difficult to avoid its impact on recognition performance.
In this paper, a novel multiscale chaotic feature is proposed for speaker recognition. The proposed multiscale chaotic feature is evaluated using a nonlinear dynamic model based on wavelet decomposition (multiresolution analysis (MRA)). In our method, the MRA technique is used to capture finer spectral information. Among speech features, harmonic structure is an important factor in distinguishing different speakers: harmonics represent the speaker's tone, and tone information is usually distributed over different frequency components. Wavelet decomposition is a suitable method for capturing these frequency components. Moreover, we also take into account the chaotic characteristics of the speech signal. Speech is a nonlinear system over long time series, and this nonlinear characteristic should be reflected in speech features for speaker recognition. Chaotic features based on the nonlinear dynamic model are widely used in speech application systems; the nonlinear dynamic model has been applied in various fields of speech processing, such as speech steganalysis [15], speech synthesis [16], speech recognition [17], and speech encryption [18]. The proposed feature represents the chaotic characteristic of the signal in different frequency bands.

Proposed Speaker Recognition System
To improve the performance of speaker recognition under adverse conditions induced by background noise, we propose a multiscale chaotic feature extraction method to enhance the robustness of the recognition system. The proposed speaker recognition system is illustrated in Figure 1. In our proposed system, the time-domain speech signal is split into short-time frames of N samples by windowing each frame with, e.g., a Hamming window. To relieve the influence of noise, we extract a multiscale chaotic feature (MCF) from each frame, which comprises multiresolution analysis and chaotic features. The multiresolution analysis is implemented by wavelet decomposition, and the chaotic features include nonlinear dynamic model parameters and acoustic features. More details are described in Section 3. A Gaussian mixture model (GMM) is used to identify each speaker; here, we introduce a universal background model (UBM) for training the distribution of features that are not related to a particular speaker. The GMM-UBM model is widely used as a classifier for speaker recognition [19][20][21]; it is a generalization of the GMM model. The GMM-UBM model first performs pretraining for the current speaker by collecting feature data from other speakers, which relieves the decline in recognition performance caused by insufficient training data for the current speaker. Then, the pretrained model is adapted to the target speaker model by a maximum a posteriori (MAP) adaptation algorithm [22].
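The MAP adaptation step above can be sketched as follows. This is a minimal one-dimensional illustration of mean-only MAP adaptation of a UBM toward a target speaker's data, assuming a standard relevance factor; the function names and the relevance value are illustrative, not taken from the paper.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """MAP-adapt the UBM component means toward the speaker data.
    Weights and variances are kept fixed; only the means move."""
    n = [0.0] * len(means)   # soft occupancy counts per component
    s = [0.0] * len(means)   # first-order statistics per component
    for x in data:
        # responsibilities gamma_c(x) under the UBM
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for c, l in enumerate(likes):
            g = l / total
            n[c] += g
            s[c] += g * x
    adapted = []
    for c, m in enumerate(means):
        alpha = n[c] / (n[c] + relevance)      # data-dependent adaptation weight
        e_c = s[c] / n[c] if n[c] > 0 else m   # posterior mean of the data
        adapted.append(alpha * e_c + (1.0 - alpha) * m)
    return adapted
```

Components with little speaker data keep their UBM means (alpha near 0), while well-populated components move toward the speaker's statistics, which is the behavior that makes MAP adaptation robust to short enrollment utterances.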

Multiscale Chaotic Feature
In speaker recognition systems, the speech feature is vital to recognition performance; strong feature discrimination allows speakers to be distinguished accurately. Acoustic features (such as MFCC and LPC) have strong discrimination for speaker recognition on clean speech. However, classification performance declines sharply if the speech signal is disturbed by noise. In order to reduce the disturbance caused by environment noise, we introduce wavelet decomposition and reconstruction to enhance the resolution of speech features in the frequency domain. Taking into account that the speech signal is a nonlinear system, we extract chaotic features using the nonlinear dynamic model to improve the recognition rate. In our proposed speaker recognition system, the feature extraction consists of two parts: multiresolution analysis and chaotic feature extraction.

Multiresolution Analysis.
Multiresolution analysis (MRA) is a technique which forms a set of basis functions through stretching and scaling of a mother wavelet. On large scales, the expanded basis functions search for significant features, and on smaller scales, they find finer details. In our system, following the method of the literature [23], the speech signal is decomposed into a number of subband signals by MRA. The MRA is carried out by passing a speech signal s(n) through a series of high pass and low pass filter banks. The speech signal is simultaneously passed through high pass and low pass filters with impulse responses h(n) and g(n), respectively. The resulting outputs are the convolutions of s(n) with h(n) and g(n), downsampled by 2:

y_high(k) = Σ_n s(n) h(2k − n),   y_low(k) = Σ_n s(n) g(2k − n).

As in the literature [23], we selected Daubechies 4 (db4) as the basis function for MRA in this paper. We also tested other basis functions and found that the recognition rate achieved with multiscale chaotic features using the db4 basis function is higher than with the others. Therefore, the db4 basis function is selected for MRA. The filter coefficients h_k and g_k, corresponding to the high pass and low pass filters, respectively, enter the following MRA equations [24]:

D_{j,k} = Σ_n C_{j−1,n} h(n − 2k),   C_{j,k} = Σ_n C_{j−1,n} g(n − 2k),   with C_{0,n} = s(n),

where j represents the decomposition scale (j = 1, 2, ..., j_o) and k denotes the coefficient index at each decomposition scale. At the first level of decomposition (j = 1), the detail coefficients D_{j,k} are the output of the high pass filter, and the approximation coefficients C_{j,k} are the output of the low pass filter. The detail and approximation coefficients capture the high frequency and low frequency information, respectively. The approximation band is further decomposed into detail and approximation bands at the next level of decomposition. Repeated decomposition yields multiple levels with better frequency resolution. In our system, 3-level decomposition is used (j_o = 3), as shown in Figure 1.
In order to analyse the speech signal at different resolutions, subband signals are reconstructed from each of the approximation and detail coefficient sets by applying the inverse discrete wavelet transform (IDWT) [24]. When reconstructing a subband signal, the other subband coefficients are set to 0. In this paper, we obtain 4 subband signals: S_d1, S_d2, S_d3, and S_c3; these subband signals are used to extract chaotic features.
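The decomposition described above can be sketched as a cascaded analysis filter bank. The following minimal pure-Python illustration uses the short 4-tap Daubechies filters to keep the sketch compact (the paper uses db4, which has 8 taps), with circular convolution and downsampling by 2; the function names are illustrative.

```python
import math

# 4-tap Daubechies analysis filters (the paper uses db4; these shorter
# Daubechies filters are used here only to keep the sketch small).
_s3 = math.sqrt(3.0)
_norm = 4.0 * math.sqrt(2.0)
LOW = [(1 + _s3) / _norm, (3 + _s3) / _norm,
       (3 - _s3) / _norm, (1 - _s3) / _norm]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]   # quadrature mirror of LOW

def analysis_step(signal, low=LOW, high=HIGH):
    """One DWT level: circular convolution with each filter, then
    downsampling by 2.  Returns (approximation, detail) lists."""
    n = len(signal)
    approx, detail = [], []
    for k in range(0, n, 2):
        a = sum(low[i] * signal[(k + i) % n] for i in range(len(low)))
        d = sum(high[i] * signal[(k + i) % n] for i in range(len(high)))
        approx.append(a)
        detail.append(d)
    return approx, detail

def wavedec(signal, levels=3):
    """Repeatedly split the approximation band, as in the 3-level MRA.
    Returns [D1, D2, D3, C3]-style bands: the details per level plus
    the final approximation."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = analysis_step(approx)
        details.append(detail)
    return details + [approx]
```

For a constant signal, the detail bands vanish (the high pass filter has a vanishing moment) and the approximation scales by √2 per level, which is a quick sanity check on any orthogonal wavelet implementation.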

Chaotic Feature.
In our feature extraction method, acoustic features and a nonlinear feature are extracted. The acoustic features mainly focus on speech spectral characteristics (such as MFCC and LPC). The nonlinear feature represents the chaotic characteristics of speech using a nonlinear dynamic model.

Acoustic Feature.
In acoustic features, we extract Mel frequency cepstrum coefficients (MFCC) and linear prediction coefficients (LPC). MFCC is computed based on the perception characteristics of the human auditory system, in which Mel frequency has a nonlinear relationship to the Hz frequency. We obtain the Mel spectral feature by the nonlinear relationship as follows:

Mel(f) = 2595 log10(1 + f/700),

where f denotes the frequency in Hz. LPC represents the frequency envelope of the voice; its computation is based on the digital model of the speech signal. The vocal tract model is a key factor in distinguishing different speakers. Therefore, LPC is usually used to represent the vocal tract envelope in various speech recognition tasks [25,26]. According to the digital model of the speech signal, a speech frame can be modeled as a unit pulse sequence exciting the vocal tract. The process is a linear time-invariant system and can be represented as a difference equation:

x(n) = Σ_{i=1}^{p} α_i x(n − i) + e(n),

Figure 1: The proposed speaker recognition system using a multiscale chaotic feature.
where x(n) is the real signal, the weighted sum Σ α_i x(n − i) represents the prediction, and e(n) denotes the prediction error. The filter coefficients α_i are calculated according to the minimum mean square error (MSE) criterion applied to e(n).
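The two acoustic features can be sketched as follows: the Mel mapping from the formula above, and LPC coefficients computed by the autocorrelation method with the Levinson-Durbin recursion (a standard way to solve the MSE criterion; the paper does not state which solver it uses, and the function names are illustrative).

```python
import math

def hz_to_mel(f):
    """Hz -> Mel mapping used when building the MFCC filter bank."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def lpc(signal, order):
    """Linear prediction coefficients via autocorrelation + Levinson-Durbin.
    Returns a such that x(n) ~ sum_i a[i] * x(n-1-i) in the MSE sense."""
    n = len(signal)
    # biased autocorrelation r(0..order)
    r = [sum(signal[t] * signal[t + k] for t in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                     # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)              # prediction error update
    return a
```

On a signal dominated by a single decaying mode, a first-order LPC analysis recovers the decay factor, which is the sense in which LPC summarizes the spectral envelope of each frame.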

Nonlinear Feature.
The nonlinear dynamic model is an effective analysis method for studying the chaotic characteristics of speech signals. According to this model, the nonlinear characteristics of the speech signal are obtained by processing the speaker signal as a one-dimensional time series. By the Takens embedding theorem, the one-dimensional speaker signal (x(1), x(2), ..., x(N)) can be mapped to a high-dimensional space by selecting two appropriate parameters, a minimum delay time τ and an embedding dimension m, and the reconstructed high-dimensional space is equivalent to the original space [27]. The reconstructed speaker speech signal becomes X_i = (x(i), x(i + τ), ..., x(i + (m − 1)τ)), where i = 1, 2, ..., N − (m − 1)τ. The key points of chaotic feature extraction are the analysis of the speaker speech signal in a high-dimensional space and the extraction of nonlinear feature parameters under the voice dynamic model.
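The phase-space reconstruction can be sketched directly from the definition above; the function name is illustrative.

```python
def embed(x, m, tau):
    """Takens delay embedding: map a 1-D series to m-dimensional points
    X_i = (x(i), x(i + tau), ..., x(i + (m - 1) * tau)),
    for i = 0 .. N - (m - 1) * tau - 1 (0-based indexing)."""
    n = len(x)
    count = n - (m - 1) * tau
    return [[x[i + j * tau] for j in range(m)] for i in range(count)]
```

Every nonlinear quantity below (maximum Lyapunov exponent, correlation dimension, Kolmogorov entropy) is computed on the point set this function returns.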
(1) The minimum delay time: the minimum delay time describes the correlation between neighboring components of the speaker speech signal (x(1), x(2), ..., x(N)). In order to reconstruct the phase space of the one-dimensional speaker speech signal, we calculate the minimum delay time τ and the embedding dimension m by the C-C method [28].
(2) Maximum Lyapunov exponent: the Lyapunov exponent represents the average rate of local convergence or divergence of adjacent orbits in the phase space. The maximum Lyapunov exponent λ1 denotes the speed of orbit convergence or divergence. When λ1 > 0, the larger the value of λ1, the greater the rate of orbital divergence and the greater the degree of chaos. We use the small data size method to compute the Lyapunov exponent [28]. The calculation method is as follows: reconstruct the phase space of (x(1), x(2), ..., x(N)) and denote its points as X_i = (x(i), x(i + τ), ..., x(i + (m − 1)τ)), i = 1, 2, ..., N − (m − 1)τ. Then, find the nearest neighbor X_i′ of each point X_i in the phase space while excluding temporally close points. Define the distance d_i(0) from the i-th point to its nearest point X_i′:

d_i(0) = min_{X_i′} ‖X_i − X_i′‖.

For each point X_i in the phase space, calculate the distance of the adjacent point pair after the next n unit times:

d_i(n) = ‖X_{i+n} − X_{i′+n}‖.

If the orbit through the nearest neighbor diverges at an exponential rate λ1, then

d_i(n) ≈ d_i(0) e^{λ1 n T_s},

where T_s is the sampling period. Taking the logarithm of both sides of the equation, we get

ln d_i(n) ≈ ln d_i(0) + λ1 n T_s.

Taking the average of the logarithmic distance over all adjacent point pairs gives

y(n) = (1/(q T_s)) Σ_j ln d_j(n),

where q is the number of nonzero d_j(i). Last, we use the least squares method to fit a straight line to y(n); its slope is λ1.

(3) Correlation dimension and Kolmogorov entropy: correlation dimension and Kolmogorov entropy are both nonlinear quantities under the nonlinear dynamic model. The correlation dimension describes the self-similar structure of the system, and the Kolmogorov entropy describes the degree of disorder of the distribution probability of the time series. We use the G-P algorithm [29] to calculate the correlation dimension and Kolmogorov entropy at the same time. The algorithm is as follows: (1) firstly, we calculate the correlation integral C(r, m) and the C(r, m) − r curve. Reconstruct the m-dimensional phase space.
Then, given a critical distance r, search for the phase-point pairs whose distance is less than r, and compute their ratio to all phase-point pairs. We get the correlation integral function as follows:

C(r, m) = (2/(M(M − 1))) Σ_{1 ≤ i < j ≤ M} θ(r − ‖X_i − X_j‖),

where m is the embedding dimension, M is the total number of phase points, M = N − (m − 1)τ, and θ is the Heaviside function, which satisfies θ(x) = 1 for x ≥ 0 and θ(x) = 0 for x < 0. (2) The correlation dimension D(m) is derived by the G-P algorithm as follows:

D(m) = lim_{r→0} ln C(r, m) / ln r.

Draw the ln C(r, m) − ln r curve and take the slope of the approximately straight part; this slope is the correlation dimension D. (3) The Kolmogorov entropy is estimated by the G-P algorithm as follows:

K ≈ (1/τ) ln [C(r, m) / C(r, m + 1)],

evaluated in the same linear region of r.
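The correlation integral and the slope read off the ln C(r, m) − ln r curve can be sketched as follows. This is a brute-force O(M²) illustration; the function names and the two-radius slope estimate are illustrative simplifications of fitting the linear region.

```python
import math

def correlation_integral(points, r):
    """Grassberger-Procaccia correlation integral: fraction of distinct
    phase-point pairs closer than r (Euclidean distance)."""
    m = len(points)
    count = 0
    for i in range(m):
        for j in range(i + 1, m):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(points[i], points[j])))
            if d < r:
                count += 1
    return 2.0 * count / (m * (m - 1))

def correlation_dimension(points, r_small, r_large):
    """Slope of ln C(r) versus ln r between two radii -- a two-point
    stand-in for the linear-region slope the text reads off the curve."""
    c1 = correlation_integral(points, r_small)
    c2 = correlation_integral(points, r_large)
    return (math.log(c2) - math.log(c1)) / (math.log(r_large) - math.log(r_small))
```

For points sampled along a line the estimated slope is close to 1, matching the intuition that the correlation dimension recovers the dimensionality of the attractor.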

Experiment Setting.
In order to evaluate the performance of the proposed speaker recognition system, we carried out 2 experiments, under clean speech and noisy speech, respectively. For feature selection, we set 3 combinations: (1) acoustic features, (2) nonlinear feature, and (3) chaotic features (acoustic + nonlinear feature). We selected the i-vector speaker recognition model in the literature [31] as the baseline system. In the i-vector model, the MFCC feature is extracted, and the supervector is further obtained as the speaker feature; it is currently a state-of-the-art speaker feature. The details of the experiment settings are listed in Table 1. In our experiments, the hardware environment is an Intel i7 CPU with 8 GB memory. The software environment is 64-bit Windows 10, with Matlab 2016a and a speech tool package as development tools.

Corpus.
The TIMIT corpus is used to evaluate our system. The corpus is composed of 630 speakers from different regions, each with 10 sentences. The length of each sentence is 3-5 s, and the sampling frequency is 16 kHz with 16 sampling bits. In the TIMIT database, the 630 speakers are divided into groups of 462 and 168 according to a 3 : 1 ratio, used to train the background model and test the recognition system, respectively. Among the voice data of the 168 test speakers, 9 sentences of each speaker were randomly selected as training data and 1 sentence as test data. The noise library is the NOISEX-92 noise library; white noise and babble noise are selected as experimental objects. Since the two selected noises are associated with daily life scenes and also occur in real application scenes, this noise experiment is reasonably representative and feasible.

Preprocessing.
Speech is a nonstationary, time-varying signal, so preprocessing must be performed before speech analysis and feature extraction. The preprocessing usually includes endpoint detection, pre-emphasis, windowing, and framing. In this paper, endpoint detection adopts a double-threshold method based on zero-crossing rate and energy. The pre-emphasis is carried out by a first-order FIR high pass filter, and the pre-emphasis coefficient is set to 0.97. The frame length is set to 20 ms with 50% overlap.
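The pre-emphasis filter and framing described above can be sketched as follows; the handling of the first sample and the function names are our own choices, not specified by the paper.

```python
def preemphasis(x, alpha=0.97):
    """First-order FIR high pass filter: y(n) = x(n) - alpha * x(n - 1).
    The first sample is passed through unchanged."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, fs, frame_ms=20, overlap=0.5):
    """Split a signal into fixed-length frames: 20 ms frames with 50%
    overlap as in the text.  Trailing samples that do not fill a full
    frame are dropped."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, hop)]
```

At 16 kHz, a 20 ms frame is 320 samples and the 50% overlap gives a 160-sample hop, so one second of audio yields 99 full frames.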

Feature Extraction.
In the feature extraction phase, we extracted the acoustic features and the nonlinear feature for each subband (4 subbands per frame). The acoustic features consist of 12-order MFCC coefficients and 4-order LPC coefficients; statistical features of each coefficient, including skewness, kurtosis, mean, variance, and median, are then calculated for classification. The nonlinear feature comprises the minimum delay time, correlation dimension, Kolmogorov entropy, maximum Lyapunov exponent, and Hurst exponent; we also calculate its statistical characteristics, such as maximum value, minimum value, mean, median, and variance. The speaker features are listed in Table 2.
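The per-coefficient statistics listed above can be sketched as follows, using population moments (the paper does not specify its bias conventions, so this is one reasonable choice):

```python
import math

def stats_features(xs):
    """Summary statistics of one coefficient trajectory across frames:
    mean, variance, skewness, kurtosis, and median."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((v - mean) ** 2 for v in xs) / n
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in xs) / n if std > 0 else 0.0
    kurt = sum(((v - mean) / std) ** 4 for v in xs) / n if std > 0 else 0.0
    srt = sorted(xs)
    median = srt[n // 2] if n % 2 else 0.5 * (srt[n // 2 - 1] + srt[n // 2])
    return {"mean": mean, "variance": var, "skewness": skew,
            "kurtosis": kurt, "median": median}
```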
In order to eliminate the problem of internal dependence of speech features due to different dimensionality, a mean normalization is carried out on the features as follows:

x̂ = (x − μ)/σ,

where μ and σ denote the mean and standard deviation, respectively.
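A minimal sketch of this normalization, applied per feature dimension:

```python
import math

def mean_normalize(features):
    """Zero-mean, unit-variance normalization: (x - mu) / sigma,
    computed over one feature dimension."""
    n = len(features)
    mu = sum(features) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in features) / n)
    return [(v - mu) / sigma for v in features]
```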

Results and Analysis.
In order to evaluate the discriminability of the proposed feature, we carried out 2 groups of experiments according to Table 1. Taking into account the validity of the equal error rate (EER) for speaker recognition evaluation, we selected EER as the metric to evaluate the performance of chaotic features; the smaller the EER value, the better the performance. In our experiments, the number of GMM-UBM mixtures is set to 512, the EM training algorithm runs for 10 iterations, and the dimension of the T matrix of the i-vector model is set to 100. All parameters are obtained by iterative optimization.

Results and Analysis under Clean Speech Condition.
The results under the clean speech condition are listed in Table 3. It can be seen from Table 3 that, using the acoustic feature alone, the EER is 2.562%, similar to the performance of the i-vector model. This shows that the LPC feature is helpful for identifying different speakers because LPC can represent the vocal tract envelope. The EER of the nonlinear feature is 2.833%, the worst among the compared features. However, the chaotic feature performs best: its EER value is reduced by 14% compared with the i-vector, which shows that the speech chaotic characteristic has good discrimination.

Results and Analysis under Noisy Speech Condition.
The purpose of this group of experiments is to evaluate the robustness of chaotic features on noisy speech. We selected stationary noise (white noise) and nonstationary noise (babble noise) as the disturbing signals and set different disturbance degrees. The SNR is set to 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, and 30 dB. Table 4 and Figure 2 show the EER values of speaker recognition under the white noise condition. From the results, the acoustic feature, the nonlinear feature, and the i-vector model have similar average EER values. However, the chaotic feature obtained better recognition performance than the other features: its EER value is reduced by 27.53% compared with the i-vector model. This good performance is attributed to the chaotic characteristic, and it shows that the speech nonlinear features can relieve the disturbance of noise and improve the robustness of speaker recognition. Table 5 and Figure 3 give the EER results under the nonstationary babble noise condition. Similar to the white noise case, the chaotic feature again shows good robustness. Table 6 shows the average EER values for all experiments. Compared with the i-vector model, the EER value is reduced by 13.94% and 26.5% under the clean speech and noisy speech conditions, respectively. Therefore, we believe the following:

(1) The acoustic feature performs well on clean speech; MFCC and LPC discriminate well between different speakers. However, recognition performance declines if the speech signal is disturbed by environment noise. (2) The speech chaotic characteristic based on the nonlinear dynamic model compensates well for the acoustic feature under noisy conditions. That is, the nonlinear feature can better distinguish different speakers in a noisy environment. These benefits arise because the multiresolution analysis technique better captures the frequency information of speakers.
(3) In the proposed method, we only extracted 5 nonlinear parameters, and the feature combination is not optimized. We suggest that feature optimization may improve the robustness of recognition.

Conclusion
In this paper, we proposed a novel multiscale chaotic feature for speaker recognition. The MRA technique is used to capture the frequency information of a speaker under environment noise conditions, and we extracted the nonlinear feature based on speech chaotic characteristics to improve the robustness of recognition. The experiment results show that this method is valid. Therefore, we believe the speech chaotic characteristic is a robust feature for various speech application systems, such as speech recognition and speech emotion recognition. In this paper, the proposed feature is not optimized; feature optimization will be our next work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.