Data-Driven Decision-Support System for Speaker Identification Using E-Vector System

Recently, biometric authorizations using fingerprints, voiceprints, and facial features have garnered considerable attention from the public with the development of recognition techniques and the popularization of the smartphone. Among such biometrics, the voiceprint carries a personal identity as distinctive as that of the fingerprint and, like face recognition, operates in a noncontact mode. Speech signal processing is one of the keys to accuracy in voice recognition. Most voice-identification systems still employ the mel-scale frequency cepstrum coefficient (MFCC) as the key vocal feature. The quality and accuracy of the MFCC depend on the prepared phrase, which belongs to text-dependent speaker identification. In contrast, several new features, such as the d-vector, provide a black-box process in vocal feature learning. To address these aspects, a novel data-driven approach for vocal feature extraction based on a decision-support system (DSS) is proposed in this study. Each speech signal can be transformed into a vector representing the vocal features using this DSS. The establishment of this DSS involves three steps: (i) voice data preprocessing, (ii) hierarchical cluster analysis of the inverse discrete cosine transform cepstrum coefficient, and (iii) learning the E-vector through minimization of the Euclidean metric. We conducted comparative experiments to verify the E-vectors extracted by this DSS against other vocal feature measures and applied them to both text-dependent and text-independent datasets. In the experiments containing one utterance per speaker, the average accuracy of the E-vector is improved by approximately 1.5% over the MFCC. In the experiments containing multiple utterances per speaker, the average micro-F1 score of the E-vector is also improved by approximately 2.1% over the MFCC. The results of the E-vector show remarkable advantages when applied to both the Texas Instruments/Massachusetts Institute of Technology corpus and the LibriSpeech corpus.
These improvements of the E-vector contribute to the capabilities of speaker identification and also enhance its usability for more real-world identification tasks.


Introduction
Over the last decades, recognition technologies based on biometrics such as fingerprints, facial features, voiceprints, and iris scans have been widely used in target identification for access security, system identity, private confirmation, etc. In terms of technical and practical usage, recognition through iris scans is the most secure and accurate and is applied to meet the requirements of military standards [1]. For mass requirements, fingerprint recognition is one of the most popular and mature identity-recognition technologies [2]. As fingerprint acquisition and recognition need specific devices [3], fingerprint recognition has been increasingly replaced by face recognition in recent years, often preferred for its noncontact mode [4]. Facial data can be collected more easily than iris and fingerprint data, as most smartphones already have a built-in camera. However, the accuracy of face recognition depends on the recognition conditions, such as environmental brightness and camera angle [5]. Similar to face recognition, voice recognition is also a noncontact technique. The voiceprint can be easily collected using a microphone or other voice receivers, and its quality requirements are less dependent on environmental factors than those of face recognition [6]. Similar to fingerprints, voiceprints contain unique biometric features and are superior to facial features in recognition accuracy. Although voice recognition has many merits compared with other recognition technologies, feature extraction from voiceprint data is its main technical bottleneck, and its real-world applications in daily life are fewer than those of fingerprint and face recognition. In contrast to face recognition, which employs image-processing methods, a voiceprint is composed of classic mechanical waves that need signal-processing methods to transform voice signals from a time-domain representation into a frequency-domain representation.
Such vocal features are difficult to implement but quite efficient for speaker identification owing to the biometric variances and personal characteristics between different voiceprints. Most methods of speaker identification (SI) involve two main processes: one extracts the vocal features, and the other learns the identification model based on these features. The vocal features should not only adequately represent the common properties of the same speaker but also separate different speakers as far apart as possible. Therefore, an effective extraction of vocal features can determine the performance of SI models, such that these models can reliably identify the target speaker from multiple speakers' utterances. The general process of SI can be regarded as a decision-making support process that decides the identity of the corresponding speakers from their utterances. In the field of automatic speech recognition (ASR), most methods of SI are constructed by extracting the vocal features, which is also one of the most important applications of the decision-support system (DSS) for SI tasks. The linear prediction coefficient (LPC), the first proposed speech feature (1967), was extracted by a linear combination of the existing speech [7]. Since the vocal feature named mel-scale frequency cepstrum coefficient (MFCC) was proposed in 1980 [8], it has been extensively applied in SI systems. The perceptive linear prediction coefficient was extracted by putting speech signals into an auditory model based on LPC in 2011 [9]. In 2012, a histogram frequency-domain transformation on the discrete cosine transform (DCT) cepstrum coefficient (HDCC) was carried out based on the idea of MFCC feature extraction [10].
Subsequently, Kim and Stern applied a power-law nonlinear transformation instead of the traditional log nonlinear transformation of MFCC by auditory processing, and they proposed a new feature called power-normalized cepstral coefficients (PNCC) in 2016 [11]. In comparison with the traditional vocal features, an SI model based on the identity vector (i-vector) was also proposed using a data-driven approach [12], which is a popular topic in the field of ASR. Furthermore, Variani et al. applied a deep neural network to generate the d-vector, a feature similar to the i-vector [13]. Based on this d-vector, an end-to-end SI approach was also proposed [14]. The existing methods for the extraction of vocal features mostly use model-based approaches, such as MFCC, PNCC, and LPC. In contrast, several new vocal features such as the d-vector are based on data-driven approaches. However, these approaches are "black box" in vocal feature learning [14]. Consequently, in this study, a novel method using a data-driven approach of hierarchical cluster analysis for SI is proposed. There are three main contributions: (1) a novel vocal feature extraction method is proposed based on a data-driven approach of hierarchical cluster analysis; (2) the Euclidean metric is used as a measure to generate an adaptive feature vector called the "E-vector"; (3) a DSS is established based on the E-vector to provide decision-making support services for SI tasks. In the data-driven hierarchical clustering approach, various personal phonetic features are considered to learn and extract the vocal feature vector. The distances between different cepstral coefficients of the same speaker are measured using the Euclidean metric, and the E-vector is generated through a hierarchical clustering approach by minimizing the Euclidean metric.
In the comparative experiments of single-utterance SI, the E-vector method improves the identification accuracy by approximately 3% over the MFCC and 5% over the HDCC, where the DSS of SI is based on the Gaussian mixture model (GMM). In the comparative experiments of multiple-utterance SI, the micro-F1 score of the E-vector is better than those of the MFCC and HDCC, where the DSS of SI is based on both the GMM and hidden Markov model (HMM). The remainder of this paper is organized as follows: the problem statement and E-vector are introduced in Section 2. The comparative experiments conducted to evaluate the performance of the E-vector are described in Section 3. Finally, a conclusion is provided in Section 4.

Model-Based Extraction of Vocal Features.
The existing vocal features used in SI are mostly based on model-based approaches, such as MFCC, HDCC, and PNCC. MFCC is a widely used speech feature first proposed in the 1980s. MFCC applies a discrete Fourier transform to convert the time-domain signal into a frequency-domain signal. In an MFCC transformation, the following equation translates the frequency-domain signal into mel-frequency:

mel(f) = 2595 log10(1 + f/700),

where f is the original frequency and mel(f) represents the mel-frequency. Subsequently, the amplitude based on the mel-frequency is calculated by a series of triangular filters, as in Figure 1(a). Finally, the MFCC is obtained by performing a cepstrum analysis of the signal using the mel-frequency and triangular filters [8]. HDCC is a newer feature proposed under the influence of MFCC. The HDCC creates two spans of histogram bins: 50-500 Hz with a span of 50 Hz each and 600-1000 Hz with a span of 100 Hz each, as shown in Figure 1(b). After the DCT cepstrum coefficients of each bin are obtained from histogram analysis, the HDCC can be extracted for each bin [10]. PNCC shares the first two steps of MFCC in its initial process. Next, PNCC obtains the short-time spectral power using the squared gammatone summation. As shown in Figure 1(c), gammatone filters use power-law nonlinear transformations, different from the traditional log nonlinearity used in the MFCC. Finally, smoothing-weight processing is applied to each frame, and spectral subtraction is used to realize noise suppression [11].
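As an illustration of this front end, the mel conversion and a triangular filterbank can be sketched in a few lines of Python (a minimal sketch; the function names and the 26-filter default are our own choices, not from the paper):

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale conversion used by MFCC front ends."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def triangular_filterbank(n_filters, n_fft, sample_rate):
    """Build a bank of triangular filters spaced evenly on the mel scale.

    Returns an (n_filters, n_fft // 2 + 1) weight matrix that maps a
    power spectrum onto mel-band energies.
    """
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # Filter edge frequencies, evenly spaced in mel, converted back to Hz
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):        # rising edge of triangle
            fbank[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):       # falling edge of triangle
            fbank[i - 1, b] = (right - b) / max(right - center, 1)
    return fbank
```

Multiplying a frame's power spectrum by this matrix yields the mel-band amplitudes that the cepstrum analysis then operates on.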

Data-Driven Extraction of Vocal Features.
The existing research using data-driven approaches for SI has mostly focused on clustering different speakers via their feature similarity. For instance, the Alibaba group proposed a speech recognition method based on a clustering method in 2017 [15]. They obtained the feature vector based on cluster analysis of training data. Then, the feature vector model was established for speech recognition [15]. Nevertheless, there is little research on applying data-driven approaches to extract vocal features. The i-vector, d-vector, and end-to-end SI approaches were proposed based on data-driven approaches; however, they are black boxes (methods without a transparent working process) [12][13][14]. Actually, vocal data has relevant regularities accounting for the speaker's personal phonetic features. Therefore, in this paper, a novel vocal feature, the E-vector, extracted using a data-driven approach is proposed. The method learns the SI models based on the E-vector, realized as a DSS.

Determining the Decision Objective.
In a decision-making process, there are generally four steps, as shown in Figure 2(a). At the beginning, the decision objective (DO) should be determined after identifying the problem. Then, a scheme is designed based on the decision environment. Next, the scheme is evaluated in order to carry it out. An SI task can be regarded as a multilabel classification task: the number of speakers is the number of classes, and the labels are the utterances of each speaker. The DO is achieving the classification of all speakers by identifying all speakers' identities based on vocal features, as shown in Figure 2(b).

E-Vector System for Speaker Identification.
In this section, we introduce the E-vector system for the SI-DSS. As shown in Figure 3(a), the SI-DSS based on the E-vector system is established in three steps: (i) data preprocessing, (ii) cluster analysis, and (iii) learning models. When a continuous speech signal is put into the E-vector system, data preprocessing is applied to obtain the inverse discrete cosine transform (IDCT) cepstrum coefficient; then the clustering method is used to analyze the IDCT cepstrum coefficient; and finally, the GMM and HMM are applied to classify the speakers. The following subsections describe the E-vector system in detail.

Step 1: Data Preprocessing.
The purpose of data preprocessing is to store the speech data in the form of the IDCT cepstrum coefficient. The original speech data is a continuous signal wave, and the spectrogram is generally used to describe it. In this study, the spectrogram is extracted by the following three steps:

(1) The first step aims at making the speech signal wave more significant. A high-pass filtering process is used to preemphasize the input signal wave [7]:

H(z) = 1 − μz⁻¹,

where z is the input speech signal, H(z) is the output preemphasized speech signal, and the value of μ is 0.97 in this study. (2) In the second step, the preweighted speech signal is segmented into small blocks to obtain a frame signal (frames of 20 ms in this study). (3) The third step adds a Hamming window, W, to the framed signal. The Hamming window function is defined as

W(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1,

where N is the number of samples in each frame. It makes the voice data more periodic for analyzing each frame signal. Then, homomorphic signal processing is applied to obtain the IDCT cepstrum coefficient. The processing involves three steps: (1) In the first step, the DCT is applied to all frames of the speech signal to obtain the multiplicative signal. The process of DCT is defined as

C(a) = Σ_{b=0}^{M−1} S(b) cos(π(2b + 1)a / (2M)),

where S(b) is the input speech signal, M is its number of points, C(a) is the output signal, and a is the index of the transformation. The multiplicative signal is

S(b) = S_h(b) · S_l(b),

where S_h(b) is the high-frequency signal and S_l(b) represents the low-frequency signal. (2) The second step calculates the logarithmic energy of the output signal, E(a) = log|C(a)|, which converts the multiplicative signal into an additive one. (3) The third step applies the IDCT to obtain the cepstrum coefficient:

c(a) = IDCT(E(a)) = s_h(b) + s_l(b),

where c(a) is the IDCT cepstrum coefficient, s_h(b) is the output high-frequency signal, and s_l(b) is the output low-frequency signal.
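The preprocessing pipeline of Step 1 (pre-emphasis, framing, Hamming window, then DCT → log → IDCT) can be sketched as follows (a minimal sketch assuming a 16 kHz mono signal; the function name, parameter names, and the 20-coefficient truncation are illustrative, not from the paper):

```python
import numpy as np
from scipy.fftpack import dct, idct

def preprocess(signal, sample_rate=16000, mu=0.97,
               frame_ms=20, step_ms=10, n_ceps=20):
    """Per-frame IDCT cepstrum coefficients from a raw speech signal."""
    # (1) Pre-emphasis high-pass filter: y[t] = x[t] - mu * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # (2) Split into overlapping frames (20 ms frames, 10 ms step)
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // step)
    frames = np.stack([emphasized[i * step:i * step + frame_len]
                       for i in range(n_frames)])

    # (3) Hamming window: w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1))
    frames = frames * np.hamming(frame_len)

    # Homomorphic processing: DCT, log energy, then IDCT
    spectrum = dct(frames, type=2, norm='ortho', axis=1)
    log_energy = np.log(np.abs(spectrum) + 1e-10)   # avoid log(0)
    cepstrum = idct(log_energy, type=2, norm='ortho', axis=1)
    return cepstrum[:, :n_ceps]
```

The returned matrix has one row per frame, so its row count grows with the duration of the input signal, as described in Step 2 below.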

Step 2: Cluster Analysis for IDCT Cepstrum Coefficient.
The obtained IDCT cepstrum coefficient is a data matrix; the length of a row is proportional to the time duration of the input vocal signal, and the length of a column is proportional to the number of speakers in the input signal. The analysis process of the IDCT cepstrum coefficients consists of five steps: (1) If the input vocal signal contains m speakers' speech, the IDCT cepstrum coefficient can be described by a set A. The speech of speaker p (p = 1, 2, . . . , m) can be described as in (9), where n is the number of columns of A:

A_p = (a_1^p, a_2^p, . . . , a_n^p).     (9)

For such a data matrix, a cluster method can be applied for analysis using the data-driven approach. An improved hierarchical cluster method, in which only adjacent columns can be grouped, is proposed to analyze the IDCT cepstrum coefficient.
(2) Each column of set A is regarded as a class, so there are n classes, as shown in Figure 3(b). (3) Calculate the distances of adjacent classes and collect the similarity values into a set. Here, the Euclidean distance measure [16] is used to calculate the distance; the smaller the distance, the greater the similarity. The Euclidean distance l_i (l_i ∈ S) of a_i and a_{i+1} can be described as in (10), and thus the set S is composed of n − 1 distance values, as in (11):

l_i = ‖a_i − a_{i+1}‖_2,     (10)

S = (l_1, l_2, . . . , l_{n−1}).     (11)
(4) Compare all values in S; if the Euclidean distance l_i of a_i and a_{i+1} is the minimum, group a_i and a_{i+1} into one class and update the classes of set A. (5) Iterate steps (3) to (4) until the number n of classes in set A is equal to X (X is determined by identification accuracy), as shown in Figure 3(c). Then, the classes in set A constitute the E-vector.
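The adjacency-constrained clustering of steps (1)-(5) can be sketched as follows (a minimal sketch; representing a merged class by the mean of its members is our assumption, as the paper does not specify the class representative):

```python
import numpy as np

def e_vector(ceps, X):
    """Merge adjacent columns with minimal Euclidean distance
    until X classes remain; returns the resulting class matrix."""
    # Each column of the cepstrum matrix starts as its own class
    classes = [ceps[:, i] for i in range(ceps.shape[1])]
    while len(classes) > X:
        # Euclidean distances between adjacent classes only
        dists = [np.linalg.norm(classes[i] - classes[i + 1])
                 for i in range(len(classes) - 1)]
        i = int(np.argmin(dists))
        # Group the closest adjacent pair into one class (mean vector)
        merged = (classes[i] + classes[i + 1]) / 2.0
        classes[i:i + 2] = [merged]
    return np.stack(classes, axis=1)
```

Restricting merges to adjacent columns preserves the temporal ordering of the cepstral frames, which a standard unconstrained agglomerative clustering would not.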

2.3.3. Step 3: Learning SI Models.
The identification process matches the input feature with a model feature set by the degree of similarity. In this study, the model feature set is established using the GMM and HMM based on the E-vector feature (Algorithm 1). The HMM achieves the identification task by searching for the implicit state sequence most likely to produce a particular output sequence; the process consists of six steps: (1) Define a vocal feature set A = (a_1, a_2, . . . , a_X) for the model, where a_X is the X-th class of the vocal feature set A.
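A GMM-based realization of this step can be sketched with scikit-learn (a minimal sketch; the class name, the one-GMM-per-speaker layout, and the diagonal-covariance choice are our assumptions, not details from the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMSpeakerID:
    """One GMM per speaker, fitted on that speaker's feature frames;
    identification picks the speaker whose GMM scores highest."""
    def __init__(self, n_components=4, seed=0):
        self.n_components = n_components
        self.seed = seed
        self.models = {}

    def fit(self, features_by_speaker):
        # features_by_speaker: {speaker_id: (n_frames, dim) array}
        for spk, feats in features_by_speaker.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type='diag',
                                  random_state=self.seed)
            self.models[spk] = gmm.fit(feats)
        return self

    def identify(self, feats):
        # Average per-frame log-likelihood under each speaker model
        scores = {spk: m.score(feats) for spk, m in self.models.items()}
        return max(scores, key=scores.get)
```

Usage: fit on each enrolled speaker's E-vector frames, then call `identify` on an unknown utterance's frames to obtain the most likely speaker label.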

Datasets of Vocal Corpus.
In the experiments, we used two vocal datasets: the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus and the LibriSpeech corpus. The TIMIT corpus, which contains 6300 sentences spoken by 630 persons [17], served as the representative of a text-dependent experiment. The LibriSpeech corpus is a collection of audio datasets consisting of text and voice; thus, it was used as the representative of a text-independent experiment [18]. Two groups of experiments were conducted with the TIMIT corpus and the LibriSpeech corpus. Table 1 shows the number of speakers used in the experiments with the TIMIT and LibriSpeech corpora.

Evaluation Indicators.
In research on identity identification, several evaluation indicators are applied to evaluate an algorithm's performance. The false rejection rate (FRR) is the proportion of cases mistaking a matched voiceprint for an unmatched voiceprint; that is, the proportion of cases in which the same voiceprint is mistakenly considered a different voiceprint when testing voiceprint recognition on a standard voiceprint database:

FRR = FN / (TP + FN).

In this study, we applied the measures accuracy, precision, recall, and micro-F1 score to evaluate the performance of the E-vector against other features. If the number of speakers is m, then for speaker i (0 ≤ i ≤ m), TP_i is the number of true positives, FP_i the number of false positives, TN_i the number of true negatives, and FN_i the number of false negatives. The accuracy is calculated as follows:

Accuracy = Σ_i (TP_i + TN_i) / Σ_i (TP_i + TN_i + FP_i + FN_i).

The precision is calculated as follows:

Precision = Σ_i TP_i / Σ_i (TP_i + FP_i).

The recall represents the proportion of actual positives that are correctly identified, and it is described as follows:

Recall = Σ_i TP_i / Σ_i (TP_i + FN_i).

The formula of micro-F1 is described as follows:

micro-F1 = 2 · Precision · Recall / (Precision + Recall).
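These micro-averaged measures can be computed directly from per-speaker counts (a minimal sketch using the standard micro-averaging definitions; the function name is our own):

```python
def micro_scores(tp, fp, fn, tn):
    """Micro-averaged metrics from per-speaker counts.

    tp, fp, fn, tn: lists of true/false positive/negative counts,
    one entry per speaker. Counts are pooled across speakers before
    computing each metric (micro-averaging).
    """
    TP, FP, FN, TN = sum(tp), sum(fp), sum(fn), sum(tn)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    micro_f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, micro_f1
```

Because counts are pooled before averaging, micro-F1 weights every test utterance equally rather than every speaker, which suits experiments where speakers contribute different numbers of utterances.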

Optimization of E-Vector Dimension.
To decide the optimal dimension of the E-vector based on hierarchical cluster analysis, we selected 630 people of the TIMIT corpus (i.e., T4) and 40 people of the LibriSpeech corpus (i.e., L4) and measured the training accuracy. The proposed E-vector features with dimensions of 15, 25, and 35 were used with the GMM and HMM for the SI task. Table 2 shows that the E-vector obtains the same highest training accuracy when it consists of 15 and 35 dimensions. We therefore chose the smaller dimension, 15, for the E-vector.

Single-Utterance Comparison Experiments.
We first tested the 15-dimensional E-vector, 13-dimensional MFCC, and 15-dimensional HDCC for SI with an input speech signal containing one utterance of each speaker. Different numbers of speakers in the TIMIT corpus and LibriSpeech corpus were identified using the GMM and HMM. The accuracy results are shown in Table 3, with the best performance for each test on each corpus shown in boldface. The following can be observed: (1) In the TIMIT corpus, the E-vector performs best with an accuracy of 1.000 when using the GMM, and the results of MFCC are relatively worse than those of the E-vector; HDCC is inferior to MFCC and the E-vector by a gap of approximately 10%, as shown in Figure 4(a). When the recognition model is the HMM, as in Figure 4(c), the results of MFCC and the E-vector are both approximately 0.850. All these characteristic parameters perform well, with a recognition accuracy of over 0.75. (2) In the LibriSpeech corpus, the E-vector also performs best with an accuracy of 1 when using the GMM; MFCC and HDCC are inferior to the E-vector, as shown in Figure 4(b). The MFCC is almost similar to the HDCC, as shown in Figure 4(d), where the accuracy is approximately 0.93 using the HMM. The identification results of the E-vector show remarkable advantages in the single-utterance comparison experiments.

Input: z (continuous speech signal); frame 20 ms; step 10 ms; n = X
Output: E-vector A = (a_1, a_2, . . . , a_X); accuracy; micro-F1
(1) Initialization: c(a) ⟶ IDCT cepstrum coefficient
(2) Data preprocessing: c(a) = data preprocessing (z)
(3) Cluster: set S (S ← ∅); name c(a) ⟶ A = (a_1, a_2, . . . , a_i, . . . , a_n)
. . .
(9) if n = X
(10) break;
(11) else
(12) continue;
(13) end
(14) end
(15) end
(16) Learn models: put A into GMM, HMM ⟶ accuracy, micro-F1
ALGORITHM 1: E-vector system for speaker identification.

Multiple-Utterance Comparison Experiments.
To further verify the effectiveness of the E-vector, we conducted experiments with input speech signals containing multiple utterances, using the GMM and HMM. We first used signals containing three utterances from each speaker; the results are shown in Table 4. Subsequently, we increased the number of utterances and used signals with five utterances of each speaker; the results are shown in Table 5.
Figure 5 shows the identification results with the input signal containing three utterances. (1) When the GMM is used as the SI model, the micro-F1 scores of the E-vector are slightly higher than those of MFCC, by approximately 1%, in both the TIMIT and LibriSpeech corpora. The micro-F1 score of HDCC is less than those of MFCC and the E-vector by approximately 20% (see Figure 5(a)). (2) When the HMM is the SI model, the micro-F1 score of the E-vector is almost equal to that of MFCC. HDCC is inferior to the others by approximately 10% (see Figure 5(c)) and is almost equal to MFCC and the E-vector in the LibriSpeech corpus (see Figures 5(b) and 5(d)).
Figure 6 shows the identification results with the input signal containing five utterances. When using the GMM as the SI model, the micro-F1 scores of the E-vector and MFCC are almost equal, as shown in Figures 6(a) and 6(b). When using the HMM as the SI model, the results of the MFCC and E-vector are at almost the same level, as shown in Figures 6(c) and 6(d).
The micro-F1 score of HDCC is less than those of MFCC and the E-vector by approximately 20% (see Figure 6(a)), and it is slightly inferior to the others (see Figure 6(b)). In the LibriSpeech corpus, Figures 6(b) and 6(d) show that both MFCC and the E-vector perform well, with micro-F1 scores over 0.96.
Since the micro-F1 score is a collaborative measure of precision and recall, it better denotes identification performance and stability. Therefore, we calculated the average micro-F1 score (Avg. micro-F1) and the standard deviation of the micro-F1 score (Std. Dev. micro-F1) in the multiple-utterance experiments (three and five utterances) and compared the results of the E-vector with MFCC and HDCC based on different models and corpus databases (see Table 6 and Figure 7).
In Figure 7(a), we can see that both the Avg. micro-F1 and Std. Dev. micro-F1 of the E-vector were superior to those of MFCC and much better than those of HDCC. However, Figure 7(b) shows that the E-vector obtained a similar level of Std. Dev. micro-F1 as MFCC, while still outperforming MFCC and HDCC in Avg. micro-F1. The same results can be observed in Figures 7(c) and 7(d). In particular, in the case of the TIMIT corpus (see Figure 7(c)), the Avg. micro-F1 of the E-vector is improved by 0.65% and 21.40% against MFCC and HDCC, respectively, and the Std. Dev. micro-F1 of the E-vector is improved by 5.41% and 21.40% against the other two vocal features. In short, the above investigations reveal that the average proportion of mistaking a matched utterance for an unmatched utterance is lower for the E-vector than for MFCC and HDCC; namely, the FRR of the E-vector is lower in multiple-utterance SI tasks.

Conclusions
In this paper, we proposed a novel data-driven approach for vocal feature extraction based on a DSS. Our method learns the E-vector through minimization of the Euclidean metric, using hierarchical cluster analysis of the IDCT cepstrum coefficient obtained by voice data preprocessing. The experimental results illustrate the effectiveness of our method in challenging SI tasks. As a vocal feature extraction method, the generalization of the E-vector is also significant. Our results show that the E-vector achieves excellent identification performance in both one- and multiple-utterance experiments on different corpus databases, with an approximately 1.5% superiority over MFCC at best. Our method is also suitable for both the GMM and HMM, with an approximately 2.1% superiority in average micro-F1 score over MFCC at best. These advantages of the proposed method contribute to the capabilities of voice-feature extraction and enhance its usability for more real-world identification tasks. In future work, we plan to investigate cosine similarity and correlation coefficient calculation methods to extract more optimized feature vectors.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors' Contributions
He Ma and Yi Zuo contributed equally to this work.