Voice is a physiological biometric characteristic that differs from one individual to another. Owing to this uniqueness, voice classification has found useful applications in classifying speakers’ gender, mother tongue or ethnicity (accent), and emotional state, as well as in identity verification, verbal command control, and so forth. In this paper, we adopt a new preprocessing method named Statistical Feature Extraction (SFX) for extracting important features for training a classification model, based on a piecewise transformation that treats an audio waveform as a time-series. Using SFX we can faithfully remodel the statistical characteristics of the time-series; together with spectral analysis, a substantial number of features are extracted in combination. An ensemble is then used to select only the influential features for classification model induction. We focus on comparing the effects of various popular data mining algorithms on multiple datasets. Our experiment consists of classification tests over four typical categories of human voice data, namely, Female and Male, Emotional Speech, Speaker Identification, and Language Recognition. The experiments yield encouraging results supporting the claim that heuristically choosing significant features from both the time and frequency domains indeed produces better performance in voice classification than traditional signal processing techniques alone, such as wavelets and LPC-to-CC.
Unlike fingerprints, iris, retina, and facial features, the voice is a bodily characteristic that is useful in speaker identification but remains relatively unexplored. Compared to other bodily features, voice is dynamic and complex, in the sense that speech can be delivered in different languages, different tones, and different emotions. Voice biometrics plays a central role in many biometric applications such as speaker verification, authentication, and access control management. Furthermore, voice classification can potentially be applied to interactive voice response systems for detecting the moods and tones of customers, thereby inferring whether the calls are, for example, complaints or compliments. More examples of voice classification have been described in our previous work [
Voice classification has been studied intensively in the biometrics research community using digital signal processing methods. The signatures of the voice are expressed as numeric values in the frequency domain. Considerable challenges lie in attaining high accuracy in voice classification given the dynamic nature of speech data, not only the contents within but also the diversity of human vocals and speaking styles. In this paper we tackle these classification challenges by modeling human voices as time-series in the form of stochastic signals. In contrast to deterministic signals, which are rigidly periodic, stochastic signals are difficult to model precisely with mathematical functions owing to uncertainty in the parameters of the computational equations. Time-series of voice data are nonstationary, with their statistical characteristics changing over time as words are spoken. As far as human voice is concerned, almost all signals are stochastic and nonstationary, meaning that their statistics are time dependent or time varying.
Given such temporal data properties, human voice acquired continually in the time domain takes the form of a random time-series that often has a single variable (amplitude, or loudness) over time. It is believed that the statistical characteristics change over time during a speech but may form specific patterns, so inherent information useful for classification can be derived from the time-series. Specifically we adopt a recent preprocessing methodology, called Statistical Feature Extraction (SFX) [
Simulation experiments are carried out over four representative types of digitized voice data, or speeches, to validate the efficacy of our proposed voice classification approach based on SFX and metaheuristic feature selection. This type of feature selection finds the optimal subset of features for inducing the classification model with the highest accuracy. The four types of testing data are deliberately chosen to cover a wide range of possible voice classification applications, namely, Female and Male (FM), Emotional Speech (ES), Speaker Identification (SI), and Language Recognition (LR). Given the multiple attributes derived from the original time-series via the preprocessing step, feature selection (FS) techniques can be applied prior to training a classification model. Our results indicate that superior performance can be achieved by using SFX and FS together over the original time-series for voice classification. The improvements are consistent over the four testing datasets with respect to the major performance indicators.
The rest of the paper is structured as follows: The previous works on classifying voice data are reviewed in Section
Human voice is stochastic, nonstationary, and bounded in frequency spectrum; hence some suitable features could be quantitatively extracted from the voice data for further processing and analysis. Over the years, different attempts have been made by previous researchers who used a variety of time-series preprocessing techniques as well as the core classification algorithms for extracting acoustic features from the raw time-series data. Their performances, however, vary.
Some useful features selected for the targeted acoustic surveillance are [
In the research community of signal processing, the most widely used methods for voice/speech feature extraction are Linear Prediction Coding or Linear Prediction Coefficient (LPC), Cepstral Coefficient or Cepstrum Coefficient (CC), and Mel Frequency Cepstral Coefficient (MFCC). LPC consists of finding a time-based series of
As a general practice in pattern recognition, the final predictor coefficients are seldom applied directly because of their high variance. Instead, cepstral coefficients [
Similar to LPC and MFCC, PLP modifies the short-term spectrum of the speech by several psychophysically based transformations. The basic steps of PLP contain spectral analysis, critical-band spectral resolution, equal-loudness preemphasis, intensity-loudness power law, autoregressive modeling, and practical considerations [
Tsuneo Nitta used multiple mapping operators to extract topological structures hidden in time spectrum patterns. Linear algebra is the main technique; Karhunen-Loeve transformation and linear discriminant analysis were the feature extraction methods [
Our proposed method uses both statistical and spectral analysis to extract all the possible features. Subsequently it selects the useful features via a metaheuristic search. The qualified features are then used to reduce the vector dimensionality of training instances for building a classification model. The features from the temporal domain contain richer statistical information than only local maxima and local minima. Our method follows the current trend of fusing information from both the time and frequency domains. The merit is that a nonlinear relationship is represented by the spectrum of a spectrum, so only the useful features from the frequency domain, in addition to other strong statistical features from the time domain, are encoded into the multidimensional vector, which is necessarily limited in size. In addition, residual and volatility features are introduced into voice classification to produce superior classification results.
Some recent research has tapped the power of data mining algorithms for voice classification in various applications. For instance, a new method was proposed by the research team of Lee et al. [
As a contribution to telemedicine in home telemonitoring, Maunder et al. [
For biomedical applications, Chenausky et al. made an important contribution in acoustic analysis of Parkinson’s disease (PD) speech [
In our previous work in [
In all the methods aforementioned, encoding techniques from the frequency domain are used as the sole features for modeling the voice samples, and a single classification algorithm was used for conducting the validation experiment in the literature. In this paper, we advocate combining features from both the time and frequency domains, for thorough coverage of all the voice data characteristics. Feature selection is then used to reduce the dimensionality of the training samples. This way, a minimal subset of relevant features is ensured, and these features can be applied to most types of classification models without being tied to any specific type.
The SFX preprocessing methodology that is adopted in our research is efficient. Its main merit lies in its ability to transform voice data from one-dimensional to multidimensional features. The SFX technique could possibly fit into a standard data mining process, like the one shown in Figure
Preprocessing methodology as a part of the classification model learning process.
The model construction process is standard classification model learning in data mining; for example, a decision tree is built by creating decision paths that map the conditions of the attribute values, as seen in the training samples, to the predicted classes. Once a classifier is trained over the whole training dataset, it is ready to classify new unseen testing samples, and its performance can be measured. The feature selection process is generalized as an ensemble where the winner takes all. During calibration, several feature selection algorithms are put to the test, and the best performing one in our case is Feature Selection with Wolf Search Algorithm (FS-WSA) [
The overall process about SFX.
The detailed illustration about SFX with Ensemble FS.
In a nutshell, the preprocessing methodology SFX transforms a two-dimensional time-series (amplitude versus time) into a multidimensional feature vector that has all the essential attributes sufficient to characterize the original time-series voice data. Information is taken from two domains, frequency and time, based on the original time-series. Thus two groups of preprocessing techniques are used here, namely, LPC-to-CC encoding (from the frequency domain) and Descriptive Statistics, of both the whole series and its pieces, together with Dynamic Time Warping (from the time domain). It is believed that features obtained from both domains yield improved accuracy in the trained classification model, owing to thorough consideration of the characteristics, and hence the representative features, of both domains.
Effectively the preprocessing methodology SFX transforms a matrix of original time-series to a set of training instances which have specific attribute values for building a classification model. Assume
The attributes
Linear Prediction Coefficients to Cepstral Coefficients, or Linear Prediction Coding to Cepstrum Coefficients (LPC-to-CC), is selected as the main feature extraction method from the frequency domain in our case. The common production process of human voice contains the following steps: the lungs expel air upwards, acting as the initial step of voice production. The air then goes into the trachea, passing through the larynx. The larynx is a box-like organ with two membranes named the vocal folds. The voice is actually produced by the vibration of those vocal folds [
Linear prediction calculates future values of a signal in discrete time as a linear function of previous samples. It is commonly called linear prediction coding, a tool widely used in speech processing for representing the spectral envelope of a signal in compressed form [
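As an illustration of this idea, the prediction coefficients can be estimated from the signal's autocorrelation with the Levinson-Durbin recursion. The sketch below is a generic implementation of that standard method, not the exact code used in our experiments:

```python
import numpy as np

def lpc(x, order):
    """Estimate linear prediction coefficients via the autocorrelation
    method and the Levinson-Durbin recursion. Returns (a, err) where
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order and err is the
    residual prediction error energy."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)            # shrink residual error
    return a, err
```

For a first-order autoregressive signal x[t] = 0.9 x[t-1] + noise, the recovered coefficient a[1] should be close to -0.9 under this sign convention.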
The original time-series voice data
A sample time-series voice data represented in LPC coefficients.
The problem of value setting of prediction order
The prediction error generated by this estimate method is the difference between the actual and the predicted values:
The autocorrelation sequence can then be represented as a matrix in the format of
When a windowed frame is applied on voice data
Cepstral Coefficients computation steps.
The cepstrum has many advantages, such as orthogonality, compactness, and source-filter separation; meanwhile the LPC coefficients are much more susceptible to numerical precision, making them less robust than cepstrum coefficients [
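The conversion from predictor coefficients to cepstral coefficients follows a well-known recursion; a minimal sketch is given below, using the convention that x[n] is predicted as a weighted sum a[1]x[n-1] + ... + a[p]x[n-p]:

```python
def lpc_to_cc(a, n_cc):
    """LPC-to-cepstrum recursion (a sketch of the standard formula):
        c_1 = a_1
        c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}
    where a_m = 0 for m > p. Returns [c_1, ..., c_{n_cc}]."""
    p = len(a)
    c = [0.0] * (n_cc + 1)              # c[1..n_cc]; c[0] unused
    for m in range(1, n_cc + 1):
        cm = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if 1 <= m - k <= p:
                cm += (k / m) * c[k] * a[m - k - 1]
        c[m] = cm
    return c[1:]
```

A quick sanity check: for a single-pole model with a_1 = g, the recursion reproduces the closed form c_m = g^m / m.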
Here we have a feature set
The extracted statistical features include the following statistics: Mean, Standard Deviation, 1st Quartile, 2nd Quartile, 3rd Quartile, Kurtosis, Interquartile Range, Skewness, RSS (residual sum of squares), Standard Deviation of Residuals, Mean Value of Volatilities, and Standard Deviation of Volatilities. Suppose
Quartile.
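The whole-series statistics listed above can be assembled into a single feature routine. In the sketch below, fitting residuals against a least-squares linear trend and measuring volatility as absolute first differences are our own illustrative assumptions, not necessarily the paper's exact definitions:

```python
import numpy as np

def whole_series_stats(x):
    """Summary statistics over a whole voice time-series."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    mu, sd = x.mean(), x.std(ddof=1)
    z = (x - mu) / sd
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0          # excess kurtosis
    # Residuals against a least-squares linear trend (assumption)
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    res = x - (slope * t + intercept)
    vol = np.abs(np.diff(x))              # volatility proxy (assumption)
    return {
        "mean": mu, "std": sd,
        "q1": q1, "q2": q2, "q3": q3, "iqr": q3 - q1,
        "skewness": skew, "kurtosis": kurt,
        "rss": float(np.sum(res ** 2)), "res_std": res.std(ddof=1),
        "vol_mean": vol.mean(), "vol_std": vol.std(ddof=1),
    }
```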
Now we introduce another model for characterizing and modeling observed time-series: the autoregressive conditional heteroskedasticity (ARCH) model, under which each time point in the sequence has its own characteristic variance.
If an ARMA model is assumed for the error variance, then the model is a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model [
We set the parameters of the GARCH model to standard values as follows: Distribution = “Gaussian”; Variance Model = “GARCH”;
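For illustration, the heart of a GARCH(1,1) model is its conditional-variance recursion. The sketch below uses illustrative parameter values (omega, alpha, beta) rather than values fitted to our data:

```python
import numpy as np

def garch11_variance(eps, omega, alpha, beta):
    """Conditional variance path of a GARCH(1,1) model:
        sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1],
    started from the unconditional variance omega / (1 - alpha - beta)."""
    sigma2 = np.empty(len(eps))
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```

With a zero innovation sequence, the variance path decays geometrically toward omega / (1 - beta), which is a convenient correctness check.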
Though descriptive statistics may give an overall summary of time-series data and characterize its general shape, they may not capture the precise trend movements, also known as the patterns of the evolving lines. In particular we are interested in distinguishing the time-series belonging to one specific class from those belonging to another. The difference in trend movements can be estimated by a technique called Dynamic Time Warping.
Dynamic Time Warping (DTW) is an algorithm for measuring similarity between two time-series that have similar shapes but vary in time step or speed. DTW has been applied to many data objects such as video, voice, audio, and graphics. In fact, DTW can deal with any ordered set of data points expressed in the form of a linear combination [
In theory, DTW is well suited to voice wave patterns because exact matching for such patterns often does not occur, and voice patterns may vary slightly in the time domain. DTW finds an optimal match between two sequences that allows for compressed sections of the sequences; in other words, it allows some flexibility in matching two sequences that vary slightly in speed or time. The sequences are “warped” nonlinearly in the time dimension to determine a measure of their similarity independent of certain nonlinear variations in the time dimension. DTW is particularly suitable for matching sequences that may have missing information or different lengths, on condition that the sequences are long enough for matching.
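The classic dynamic-programming formulation of DTW can be sketched as follows (absolute difference as the local cost is an illustrative choice):

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two 1-D sequences via the standard
    dynamic-programming recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of insertion, deletion, and match moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Note that a time-warped copy of a sequence (e.g., one with a repeated sample) still yields a distance of zero, which is exactly the flexibility described above.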
Suppose that
Illustration of DTW calculation.
So far, along the time domain, statistics have been extracted from the whole time-series, as well as the similarity, in terms of distance, between the test time-series and the mean of its peer group. For a finer level of information, a piecewise transformation called the Piecewise Linear Function (PLF) is applied. A continuous time-series is converted into a collection of linear segments when PLF is applied to it. The purpose of this compressed representation is to approximate a polynomial curve by a vector of finite
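A piecewise linear transformation of this kind might be sketched as below; equal-width segments and least-squares line fits are our illustrative assumptions:

```python
import numpy as np

def piecewise_linear(x, n_segments):
    """Approximate a series by least-squares line fits over equal-width
    segments, returning one (slope, intercept) pair per segment."""
    x = np.asarray(x, dtype=float)
    fits, offset = [], 0
    for seg in np.array_split(x, n_segments):
        t = np.arange(offset, offset + len(seg))
        slope, intercept = np.polyfit(t, seg, 1)   # degree-1 fit
        fits.append((slope, intercept))
        offset += len(seg)
    return fits
```

On an exactly linear series, every segment recovers the global slope, which gives a simple sanity check.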
This is the key part of the research work because it contains our new contribution. Inspired by financial analysis of the stock market, residual and volatility are introduced for the first time into the application field of voice classification. Just as historical volatility characterizes one or more stocks over specified trading days, we believe that certain patterns of a person’s speech are reflected in residual and volatility.
Each sentence is read by
In our experiment, we try to keep the length of every spoken sentence the same, at almost 10 k points after sampling. The number of segments is 20, so each piece contains about 500 sampling points. For each segment of the time-series, certain statistics that describe the trend and dynamics of the movement are extracted into the feature vector, that is,
(a) An example of sampled time-series voice data and its partition. (b) The amplified view of piecewise linear regression (partly).
Using this piecewise method, the features that are being extracted are statistics of each partition of the time-series. Table
The piecewise segment statistics feature extraction.
Attribute | 1 | 2 | 3 | … | 20
---|---|---|---|---|---
Gradient | Grad 1 | Grad 2 | Grad 3 | … | Grad 20
RSS | RSS 1 | RSS 2 | RSS 3 | … | RSS 20
Resstd | Resstd 1 | Resstd 2 | Resstd 3 | … | Resstd 20
Volmean | Volmean 1 | Volmean 2 | Volmean 3 | … | Volmean 20
Volstd | Volstd 1 | Volstd 2 | Volstd 3 | … | Volstd 20
For each segment
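For illustration, the five per-segment statistics listed in the table could be computed as follows; treating volatility as the absolute first differences of the fit residuals is our own assumption:

```python
import numpy as np

def segment_features(x, n_segments=20):
    """Per-segment features: gradient, RSS, residual std, and the
    mean/std of volatility (abs first differences of residuals)."""
    rows = {"grad": [], "rss": [], "resstd": [], "volmean": [], "volstd": []}
    for seg in np.array_split(np.asarray(x, dtype=float), n_segments):
        t = np.arange(len(seg))
        slope, intercept = np.polyfit(t, seg, 1)
        res = seg - (slope * t + intercept)     # residuals of the line fit
        vol = np.abs(np.diff(res))              # volatility proxy (assumption)
        rows["grad"].append(slope)
        rows["rss"].append(float(np.sum(res ** 2)))
        rows["resstd"].append(res.std(ddof=1))
        rows["volmean"].append(vol.mean())
        rows["volstd"].append(vol.std(ddof=1))
    return rows
```

With 20 segments this produces the 5 × 20 = 100 values arranged as in the table, which are then flattened into the feature vector.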
A calibration test is used to determine the optimal choice of the length of each piece (interval) such that the highest classification accuracy can be obtained. Different numbers of intervals have been tried continually for piecewise transformation, extracting the corresponding attributes and running the classifiers. As the results shown in Figure
Calibration curve for segmentation selection.
In order to compare the effectiveness of the proposed time-series preprocessing method with the other existing methods, we test them on four different voice/speech datasets using nearly twenty popular and traditional classification algorithms in data mining.
Four representative types of voice data are tested by the simulation experiments; they are Female and Male (FM) Dataset, Emotional Speech (ES) Dataset, Speaker Identification (SI) Dataset, and Language Recognition (LR) Dataset.
The voice data is in the format of two-dimensional time-series, with an amplitude value in sound that varies over time; examples are given in Figures
Distributions of classes in different datasets.
Dataset name | No. of classes or labels | Notes
---|---|---
FM | 2 | Female and male
ES | 4 | Happiness, anger, sadness, and neutral
SI | 16 | 16 different speakers
LR | 3 | Cantonese, English, and Mandarin
The numbers of attributes associated with datasets and instances for training and testing by various preprocessing methods.
Preprocessing method | FM | ES | SI | LR |
---|---|---|---|---|
Wavelet | 50 | 50 | 50 | 50 |
LPC-to-CC | 10 | 10 | 10 | 10 |
SFX | 74 | 68 | 88 | 75 |
SFX + FS | 20 | 53 | 20 | 32 |
No. of instances for training | 258 | 179 | 564 | 600 |
No. of instances for testing | 172 | 160 | 272 | 150 |
Visualizations of parts of each group of the datasets, FM, ES, SI, and LR, are displayed in Figures
(a) Visualization of FM dataset that belongs to the “Female” group. (b) Visualization of FM dataset that belongs to the “Male” group. (c) Visualization of ES dataset that belongs to the “Anger” group. (d) Visualization of ES dataset that belongs to the “Happiness” group. (e) Visualization of ES dataset that belongs to the “Neutral” group. (f) Visualization of ES dataset that belongs to the “Sadness” group. (g) Visualization of SI dataset that belongs to the “Speaker 1” group. (h) Visualization of SI dataset that belongs to the “Speaker 2” group. (i) Visualization of SI dataset that belongs to the “Speaker 3” group. (j) Visualization of LR dataset that belongs to the “Cantonese” group. (k) Visualization of LR dataset that belongs to the “English” group. (l) Visualization of LR dataset that belongs to the “Mandarin” group.
Multidimensional (MD) visualization of each group of those datasets is shown in Figures
(a) MD visualization of FM. (b) MD visualization of ES. (c) MD visualization of SI. (d) MD visualization of LR.
Our experiments are performed using popular and standard classification algorithms (with their default parameters) over the four sets of the above-mentioned voice data, each handled by four preprocessing methods. A total of 20 classification algorithms are used. The justification is that we test the generality of our voice classification model without attachment to any specific classification algorithm. In other words, the design of the voice classification model should be generic enough that its efficacy is independent of the choice of classifier. While the focus of the voice classification model is centered on the preprocessing steps, which leverage features from both the time and frequency domains followed by feature selection to reduce the feature space dimension, classification algorithms become flexible plug-and-play components in our model design. The standard classification algorithms used in our experiments are well known in the data mining research community and available in Weka (
List of standard classification algorithms used in our experiment.
Standard classification algorithm type | Algorithm
---|---
Bayes | NaiveBayes
Functions | LibSVM, Multilayer perceptron, SMO
Meta | Bagging
Rules | Conjunctive rule, Decision table, FURIA, JRip/RIPPER, NNge, OneR, PART
Decision Trees | BF tree, FT, J48/C4.5, LMT, NB tree, Random forest, Random tree, REP tree
The four preprocessing methods used for comparison are as follows.
Objective function Initialize the population of wolves, Define and initialize parameters:
WHILE (
END-WHILE
Suppose
Max-Relevance and Min-Redundancy are
Optimal FS methods for each dataset.
FS accuracy % | FM | ES | SI | LR | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Feature selection method | CFS | ChiS | MRMR | WSA | CFS | ChiS | MRMR | WSA | CFS | ChiS | MRMR | WSA | CFS | ChiS | MRMR | WSA |
Original no. of attributes | 74 | 74 | 74 | 74 | 68 | 68 | 68 | 68 | 88 | 88 | 88 | 88 | 75 | 75 | 75 | 75 |
No. of attributes after FS | 8 | 55 | 30 | 20 | 17 | 53 | 30 | 21 | 20 | 71 | 30 | 29 | 32 | 32 | 30 | 31 |
Classification algorithm | ||||||||||||||||
J48 | 63.1783 | 64.7287 | 63.5659 | 67.4419 | 61.25 | 59.375 | 57.5 | 64.375 | 73.1618 | 66.1765 | 74.6324 | 64.7059 | 94 | 92.3333 | 70 | 94 |
BFTree | 71.7054 | 68.9922 | 62.0155 | 76.3566 | 58.125 | 56.25 | 65 | 65.625 | 83.0882 | 79.7794 | 85.6618 | 81.25 | 93.3333 | 91.6666 | 68 | 93.3333 |
FT | 74.031 | 79.0698 | 72.8682 | 82.5581 | 64.375 | 78.125 | 65.625 | 88.75 | 97.4265 | 88.9706 | 93.75 | 88.2353 | 72.6667 | 71 | 69.3333 | 76 |
LMT | 63.5659 | 92.4419 | 73.6434 | 97.6744 | 60.625 | 83.125 | 70 | 88.125 | 96.3235 | 92.4465 | 91.9118 | 87.8676 | 86.6667 | 85 | 66.6667 | 86.6667 |
NBTree | 71.7054 | 63.1783 | 63.1783 | 70.155 | 51.25 | 59.375 | 67.5 | 66.25 | n/a | n/a | n/a | n/a | 67.3333 | 65.6666 | 66.6667 | 68 |
RandomForest | 73.6434 | 70.5426 | 67.8295 | 72.8682 | 61.875 | 74.375 | 71.25 | 73.125 | 90.4412 | 74.2647 | 81.9853 | 81.9853 | 90.6667 | 89 | 70.6667 | 90.6667 |
RandomTree | 64.7287 | 59.6899 | 55.814 | 70.155 | 45 | 66.25 | 57.5 | 67.5 | 69.1176 | 61.7647 | 69.4853 | 63.9706 | 85.3333 | 83.6666 | 84.6667 | 85.3333 |
REPTree | 72.093 | 72.093 | 67.4419 | 73.6434 | 56.875 | 61.25 | 60 | 64.375 | 84.9265 | 84.1912 | 79.4118 | 83.4559 | 94.6667 | 93 | 66.6667 | 94.6667 |
ConjunctiveRule | 55.814 | 55.814 | 48.8372 | 65.8915 | 54.375 | 55 | 55 | 56.25 | n/a | n/a | n/a | n/a | 66.6667 | 65 | 65.3333 | 66.6667 |
DecisionTable | 70.155 | 70.155 | 53.876 | 68.6047 | 57.5 | 56.25 | 61.875 | 52.5 | 67.2794 | 58.8235 | 63.2353 | 59.5588 | 93.3333 | 91.6666 | 77.3333 | 93.3333 |
FURIA | 71.7054 | 77.1318 | 63.9535 | 74.8062 | 62.5 | 68.75 | 64.375 | 53.125 | 80.5147 | 62.5 | 78.6765 | 86.3971 | 84 | 82.3333 | 70 | 84 |
JRip | 72.8682 | 71.7054 | 67.8295 | 73.6434 | 66.25 | 54.375 | 58.125 | 55 | 74.2647 | 34.9265 | 72.0588 | 65.4412 | 98 | 96.3333 | 66.6667 | 98 |
NNge | 68.9922 | 66.6667 | 55.814 | 63.5659 | 53.125 | 46.25 | 52.5 | 50.625 | 92.6471 | 91.1765 | 92.2794 | 81.25 | 87.3333 | 85.6666 | 69.3333 | 87.3333 |
OneR | 50.3876 | 50.3876 | 54.2636 | 54.2636 | 54.375 | 54.375 | 54.375 | 61.25 | 55.1471 | 55.1471 | 49.2647 | 49.2647 | 93.3333 | 91.6666 | 54 | 93.3333 |
PART | 64.3411 | 68.6047 | 62.4031 | 68.6047 | 58.75 | 56.875 | 58.125 | 70.625 | 73.5294 | 67.2794 | 80.5147 | 71.3235 | 94 | 92.3333 | 69.3333 | 94 |
NaiveBayes | 70.9302 | 64.7287 | 67.4419 | 67.0543 | 64.375 | 68.75 | 58.75 | 58.75 | 95.2206 | 76.8382 | 86.0294 | 72.0588 | 78.6667 | 77 | 64 | 78.6667 |
Bagging | 73.2558 | 75.1938 | 70.155 | 75.969 | 63.75 | 64.375 | 68.75 | 53.125 | 89.3382 | 85.6618 | 84.9265 | 86.3971 | 93.3333 | 91.6666 | 66.6667 | 94.6667 |
LibSVM | 66.6667 | 87.5969 | 63.5659 | 89.9225 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | 57.3333 | 55.6666 | 46.6667 | 57.3333 |
MultilayerPerceptron | 68.6047 | 79.0698 | 72.4806 | 89.5349 | 63.75 | 70.625 | 61.875 | 66.875 | 91.9118 | 93.75 | 93.75 | 86.3971 | 100 | 98.3333 | 70 | 100 |
SMO | 72.4806 | 77.907 | 73.6434 | 79.4574 | 76.875 | 75.625 | 61.875 | 67.5 | 93.0147 | 93.75 | 93.3824 | 80.5147 | 98 | 96.3333 | 76 | 99.3333 |
Mean accuracy % | 68.04263 | 70.78489 | 64.03102 | | 59.73684 | 63.65132 | 61.57895 | | 82.78547 | 74.55568 | 80.64448 | | 86.43333 | 84.76663 | 67.90001 |
Time (s) | 0.78 | 2.867 | 3.56 | | 1.03 | 3.328 | 1.439 | | 1.91 | 3.815 | 3.26 | | 1.39 | 2.17 | 4.8 |
The objective of our experiments is to compare the performance of those four preprocessing methods on four kinds of voice datasets when a collection of data mining classifiers are applied. Our performance evaluation covers four main aspects:
Twenty popular classification algorithms were used on the FM and LR datasets, regarded as a representative set of commonly used classifiers. However, the LibSVM classifier could not be applied to ES and SI because of their data formats: some attribute values are infinitesimally small. Results from some other classifiers are not available because of time limitations: building a classification model takes too long when the number of attributes gets very large. As such, LibSVM is excluded from the experiments involving ES and SI, and NBTree and Conjunctive Rule are excluded from the experiments over the SI dataset. For feature selection, the candidate algorithm that yields the highest accuracy is used in the subsequent experiments.
The accuracy of the classification result is the most significant criterion for evaluating performance. It is defined as the percentage of correctly classified instances over the total number of instances. This section shows the total accuracies of the four preprocessing methods on each voice dataset. Four sets of accuracy results and box plots for the different datasets are presented in Figures
(a) FM boxplot and accuracy table. (b) ES boxplot and accuracy table. (c) SI boxplot and accuracy table. (d) LR boxplot and accuracy table.
From the aforementioned figures we find that the first two preprocessing methods, wavelet and LPC-to-CC, yielded relatively unstable accuracy results on all four datasets. For the LR dataset, the wavelet method generated better results than LPC-to-CC; conversely, LPC-to-CC was better for FM, ES, and SI. Recalling from Section
Meanwhile, SFX and SFX + FS showed relatively more stable results than the first two methods and improved the accuracy substantially. Contrasting SFX with SFX + FS, after feature selection the main range (
A more evident comparison is given when the accuracies are averaged and placed side by side in a bar chart in Figure
Comparison of average accuracy for different voice datasets and different preprocessing methods.
An interesting phenomenon is observed from Figure
Another notable observation is derived from Figure
Considering the number of classes in each dataset together with the accuracy results, we find that the accuracy of binary classification (FM) is higher than that of multiclass classification (ES and SI) for the frequency-domain encoding methods. For the time-domain methods, SFX and SFX + FS, good accuracy can still be attained in multiclass classification, as in SI, where the frequency-domain methods underperform.
Multiclass classification categorizes instances into more than two classes, whereby a hypothesis is constructed that discriminates between a fixed set of classes. Two assumptions are made beforehand: a closed set of classes and a good distribution of instances. If all possible instances fall into one of the classes, and each class contains statistically representative instances, then the classification performance will be good. For now, the boundary of every emotion in the ES dataset is not clear (as already shown in Figure
This section shows the accuracies of four datasets when every preprocessing method is applied on them, respectively. Four sets of accuracy results and radar charts by different preprocessing methods are shown in Figures
(a) Accuracy comparison of Wavelet preprocessing method. (b) Accuracy comparison of LPC-to-CC preprocessing method. (c) Accuracy comparison of SFX preprocessing method. (d) Accuracy comparison of SFX + FS preprocessing method.
It can be seen that, in general, the classification algorithms produce consistent results when the wavelet and LPC-to-CC preprocessing methods are used. These fairly uniform accuracy results are displayed in Figures
The classifier model generated by LMT is a single tree whose shape depends on the type of training data. If the data type is numeric, a binary tree is built with splits on those attributes; if the type is nominal, a multiway-split tree results. In both cases, each leaf is a logistic regression model, which is well suited to analyzing datasets with dependent features and bounded time-series magnitudes. The algorithm guarantees that only relevant attributes are selected [
Multilayer Perceptron is a standard algorithm for supervised learning tasks in data mining. Its results are better than those of the other classifiers, achieving almost 100% accuracy, but its time cost is higher and sometimes unacceptable. Some classifiers, however, produce low accuracy, for instance, Naïve Bayes. Based on Bayes’ theorem with strong independence assumptions, Naïve Bayes is quite a simple classifier and has been widely adopted in many classification situations. But when pairs of attributes are in fact dependent and the distribution of features is unknown in advance, the performance of such a simple probabilistic classifier is poor and unstable.
For a thorough performance evaluation, other indicators are considered as well; these include Kappa, Precision, Recall, F1, and ROC, which are commonly used in assessing the quality of classification models in data mining. These performance indicators are briefly described as follows. The performance results pertaining to these indicators are averaged over all four datasets and all 20 classification algorithms. They are then shown in Section
The Kappa statistic is widely used to measure variability between multiple observers; it expresses how often multiple observers agree in their interpretations. When two or more evaluators are checking the same data, the Kappa statistic shows the extent to which the evaluators assign the same data categories in agreement. Simple yes/no agreement is a poor measure because agreement can arise by chance or arbitrarily; that is why the Kappa statistic was introduced and is preferred [
Strength of agreement of Kappa statistic.
Kappa | Agreement | Interpretation |
---|---|---|
<0 | Less than chance agreement | Poor |
0.01–0.20 | Slight agreement | Slight |
0.21–0.40 | Fair agreement | Fair |
0.41–0.60 | Moderate agreement | Moderate |
0.61–0.80 | Substantial agreement | Substantial |
0.81–1.00 | Almost perfect agreement | Almost perfect |
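The Kappa computation and the agreement bands in the table above can be sketched with scikit-learn's `cohen_kappa_score`; the two raters' label sequences are illustrative made-up data.

```python
# Hedged sketch: Cohen's Kappa for two raters, mapped to the agreement
# bands from the table above.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
rater_b = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

# Observed agreement is 8/10; chance agreement is 0.52, so
# kappa = (0.8 - 0.52) / (1 - 0.52) = 7/12 ~= 0.583.
kappa = cohen_kappa_score(rater_a, rater_b)

def interpret(k):
    # Bands taken from the strength-of-agreement table.
    if k < 0:
        return "Poor"
    for upper, name in [(0.20, "Slight"), (0.40, "Fair"),
                        (0.60, "Moderate"), (0.80, "Substantial"),
                        (1.00, "Almost perfect")]:
        if k <= upper:
            return name

band = interpret(kappa)  # "Moderate"
```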
Comparison of average Kappa statistic for different voice datasets and different preprocessing methods.
In pattern recognition and data mining, precision is the fraction of retrieved instances that are relevant. In classification, the terms positive and negative describe the classifier's predictions, while the terms true and false indicate whether those predictions correspond to the facts [
Definitions of precision and recall terms.
| Predicted Class | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP (True Positive) | FP (False Positive) |
| Predicted Negative | FN (False Negative) | TN (True Negative) |
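The four cells of this table can be recovered programmatically from predictions; a minimal sketch with scikit-learn's `confusion_matrix` on illustrative labels (not the paper's data):

```python
# Hedged sketch: extracting TP/FP/FN/TN from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]],
# so ravel() yields the cells in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```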
Precision is defined as TP / (TP + FP).
Comparison of average precision for different voice datasets and different preprocessing methods.
In pattern recognition and data mining, recall is the fraction of relevant instances that are retrieved. Precision and recall thus both measure relevance, and they are usually not discussed in isolation: the relationship between them is typically inverse, so that as one increases the other decreases. Recall is defined as TP / (TP + FN).
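Both definitions, together with the F-measure that combines them, can be sketched directly with scikit-learn on illustrative labels (the same made-up predictions as in the confusion-matrix example style; not the paper's data):

```python
# Hedged sketch: precision, recall, and F1 on toy binary predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Here TP = 3, FP = 1, FN = 1.
precision = precision_score(y_true, y_pred)  # TP/(TP+FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP/(TP+FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```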
Comparison of average recall for different voice datasets and different preprocessing methods.
Comparison of average F-measure for different voice datasets and different preprocessing methods.
A Receiver Operating Characteristic (ROC) curve is generated by plotting the True Positive Rate (TPR) versus the False Positive Rate (FPR) at many threshold settings. It is a graphical plot that illustrates the trade-off between sensitivity and specificity. TPR is also known as sensitivity, and FPR is one minus the specificity (the true negative rate). A ROC space is defined with FPR and TPR as its x- and y-axes, respectively.
ROC analysis is useful for gaining insight into the decision-making ability of the model: how likely is the classification model to predict the respective classes correctly? The Area Under the Curve (AUC) measures the discriminating ability of a classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for datasets with an unbalanced target distribution (one target class dominates the other). A comparison in terms of ROC AUC which is normalized to
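The ROC curve and its AUC can be sketched with scikit-learn; the scores below are illustrative classifier outputs, not results from the paper.

```python
# Hedged sketch: ROC AUC as the probability that a random positive
# instance scores higher than a random negative one.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Of the 3x3 = 9 positive/negative pairs, 8 are correctly ordered,
# so AUC = 8/9.
auc = roc_auc_score(y_true, y_score)

# roc_curve returns the FPR/TPR points traced as the threshold sweeps.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```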
Comparison of average ROC AUC for different voice datasets and different preprocessing methods.
The final results, averaged and aggregated from the individual results obtained with the different datasets and classification algorithms, are shown below. In particular, we compare the various preprocessing methods against a collection of performance indicators, as in Table
Overall Averaged Performance Comparison of Pre-processing Methods.
| Average | Wavelet | LPC-2-CC | SFX | SFX + FS |
|---|---|---|---|---|
| Accuracy % | 52.67789 | 63.70274 | 72.72099 | |
| Kappa statistic | 0.335301 | 0.490773 | 0.58568 | |
| Precision | 0.515225 | 0.652195 | 0.730832 | |
| Recall | 0.519617 | 0.638896 | 0.721978 | |
| F-measure | 0.496758 | 0.610196 | 0.701144 | |
| ROC AUC | 0.717222 | 0.787528 | 0.836521 | |
From Table
Accuracy and CPU time are also evaluated across the different feature selection algorithms; the averaged results, together with the number of attributes before and after FS, are shown in Table
Overall averaged performance comparison of ensemble feature selections.
FS | No. attributes from frequency domain | No. attributes from time domain | Total no. attributes | No. attributes after FS | Average CPU time (s) | Av. Acc. % |
---|---|---|---|---|---|---|
CFS | 10 | 66 | 76 | 19 | 1.28 | 74.25 |
ChiSq | 10 | 66 | 76 | 52 | 3.05 | 73.44 |
MRMR | 10 | 66 | 76 | 30 | 3.26 | 68.54 |
WSA | 10 | 66 | 76 | 25 | 1240 (min. 31) | 75.29 |
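The attribute-count reduction in the table can be sketched with a chi-square filter in scikit-learn, mirroring the ChiSq row's 76-to-52 reduction; the dataset is synthetic and the choice of `SelectKBest` with a fixed `k` is an assumption for illustration, not the paper's exact ranking procedure.

```python
# Hedged sketch: chi-square feature selection reducing 76 attributes to 52.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

X, y = make_classification(n_samples=300, n_features=76, n_informative=10,
                           random_state=7)
X = X - X.min(axis=0)  # chi2 requires non-negative feature values

selector = SelectKBest(chi2, k=52).fit(X, y)
X_sel = selector.transform(X)  # only the 52 highest-scoring attributes remain
```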
In Table
Table
Overall averaged time cost comparison.
| Dataset | LPC2CC | DS | DTW | Piecewise | FS method | FS time | Build model | Total |
|---|---|---|---|---|---|---|---|---|
| FM | 10 s | 5 m 23 s | 15 m 3 s | 32 m | CFS | 0.78 s | 1.13 s | 52 m 37.9 s |
| | | | | | ChiSq | 2.867 s | | 52 m 40 s |
| | | | | | MRMR | 3.56 s | | 52 m 40.7 s |
| | | | | | WSA | 31.275 s | | 53 m 18.4 s |
| ES | 9.5 s | 9 m 35 s | 21 m 38 s | 1 h 13 m | CFS | 1.03 s | 1.25 s | 1 h 44 m 24.8 s |
| | | | | | ChiSq | 3.328 s | | 1 h 44 m 27.1 s |
| | | | | | MRMR | 1.439 s | | 1 h 44 m 25.2 s |
| | | | | | WSA | 441.476 s | | 1 h 51 m 45.2 s |
| SI | 15.8 s | 25 m 6 s | 38 m 23 s | 2 h 14 m | CFS | 1.91 s | 1.7 s | 3 h 17 m 48.4 s |
| | | | | | ChiSq | 3.815 s | | 3 h 17 m 50.3 s |
| | | | | | MRMR | 3.26 s | | 3 h 17 m 49.8 s |
| | | | | | WSA | 3585 s | | 4 h 17 m 31.5 s |
| LR | 13.4 s | 16 m 48 s | 42 m 45 s | 1 h 57 m | CFS | 1.39 s | 1.56 s | 2 h 56 m 49.4 s |
| | | | | | ChiSq | 2.17 s | | 2 h 56 m 50.1 s |
| | | | | | MRMR | 4.8 s | | 2 h 56 m 52.8 s |
| | | | | | WSA | 906 s | | 3 h 11 m 54 s |
Human voice is one of the bodily vital signs that can be measured, recorded, and analyzed as fluctuations in the amplitude of sound loudness. Voice classification underlies a number of biometric techniques whose theories have been formulated, studied, and implemented in practical applications. Traditional classification algorithms from the data mining domain, however, require the training data to be formatted as a data matrix whose columns represent the features/attributes characterizing the voice data and whose rows are the instances of the voice data. For training, each record must carry a verdict known as the predicted class. In the literature, the characteristics of voice data are mainly acquired from the frequency domain, for example, LPC, cepstral coefficients, and MFCC. These popular preprocessing methods have demonstrated significant advantages in transforming voice data, which takes the form of a time-series, into signatures in the frequency domain. Yet useful attributes can also be harvested from the time domain, since the temporal patterns of voice data are supposedly distinctive from one another. A challenge to overcome is the expensive computational cost in time and the large search space of the time domain.
Considering the stochastic and nonstationary nature of human voice, this paper adopts a hybrid data preprocessing methodology for voice classification that combines analysis from both the frequency and time domains. In particular, a time-domain feature extraction technique called Statistical Feature Extraction (SFX) is presented. SFX applies a piecewise transformation that partitions a whole time-series into segments, from each of which statistical features are subsequently extracted. Simulation experiments classified four types of voice data, namely, Female and Male, Emotional Speech, Speaker Identification, and Language Recognition, using SFX and its variant SFX with Feature Selection (SFX + FS). The results showed that SFX achieves higher accuracy in the classification models for all four types of voice data.
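The piecewise transformation at the heart of SFX can be sketched as follows; the particular statistics (mean, standard deviation, min, max) and the segment count are illustrative assumptions, not the paper's exact feature set.

```python
# Hedged sketch of SFX-style piecewise statistical feature extraction:
# split a waveform into equal-length segments and summarize each segment
# with simple statistics, yielding a fixed-length feature vector.
import numpy as np

def sfx_features(signal, n_segments=8):
    segments = np.array_split(np.asarray(signal, dtype=float), n_segments)
    feats = []
    for seg in segments:
        # Four illustrative statistics per segment.
        feats.extend([seg.mean(), seg.std(), seg.min(), seg.max()])
    return np.array(feats)

# A pure tone as a stand-in for a voice recording.
t = np.linspace(0.0, 1.0, 8000)
waveform = np.sin(2 * np.pi * 440 * t)

features = sfx_features(waveform, n_segments=8)  # 8 segments x 4 stats = 32
```

These time-domain features would then be concatenated with spectral features before feature selection, following the hybrid pipeline described above.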
The contribution is significant, as the new preprocessing methodology can be adopted by fellow researchers to build more accurate voice classification models. Moreover, the feature selection results show that a metaheuristic feature selection algorithm called Wolf Search Algorithm (WSA) can reach a globally optimal feature subset for the highest possible classification accuracy. Since there is no free lunch, however, WSA incurs a considerable amount of computational time.
The granularity of the piecewise transformation segmentation is a topic for future work. If the segments are too long (low resolution in time-series modeling), the accuracy of feature extraction suffers; if the window is too small (very fine resolution), much more computational cost is incurred. Although calibration was done beforehand to determine the ideal segment length for subsequent processing, this adds extra processing time, and the calibrated result may need to be refreshed should the nature of the voice data evolve. Dynamic and incremental methods could address this calibration problem by estimating the correct segment length on the fly. Furthermore, the segment lengths could be made variable, dynamically adapting to the level of fluctuation of the voice data.
The authors are thankful for the financial support from the research grant “Adaptive OVFDT with Incremental Pruning and ROC Corrective Learning for Data Stream Mining,” Grant no. MYRG073(Y2-L2)-FST12-FCC, offered by the University of Macau, FST, and RDAO.