Current machine-based techniques for vocal emotion recognition only consider a finite number of clearly labeled emotional classes, whereas the kinds of emotional classes and their number are typically application dependent. Previous studies have shown that a multistage classification scheme, because of the ambiguous nature of affect classes, helps to improve emotion classification accuracy. However, these multistage classification schemes were manually elaborated by taking into account the underlying emotional classes to be discriminated. In this paper, we propose an automatically elaborated hierarchical classification scheme (ACS), driven by an evidence theory-based embedded feature selection scheme (ESFS), for application-dependent emotion recognition. Experimented on the Berlin dataset with 68 features and six emotion states, this automatically elaborated hierarchical classifier (ACS) showed its effectiveness, displaying a 71.38% classification accuracy rate compared to the 71.52% rate achieved by our previous dimensional-model-driven, but still manually elaborated, multistage classifier (DEC). Using the DES dataset with five emotion states, our ACS achieved a 76.74% recognition rate compared to the 81.22% accuracy rate displayed by a manually elaborated multistage classification scheme (DEC).
Speech emotion analysis has attracted growing interest within the context of increasing awareness of the wide application potential of affective computing [
Most works in the literature make use of acoustic correlates of vocal emotion expressions and explore prosodic features, including pitch, energy, formants, cepstral coefficients, and voice quality [
In this paper, we propose an automatically elaborated hierarchical classification scheme, driven by an evidence theory-based feature selection scheme (ESFS), for application-dependent emotion recognition. Experimented on the Berlin dataset with 68 features and six emotion states, this automatically elaborated hierarchical classifier (ACS) showed its effectiveness, displaying a 71.38% accuracy rate compared to the 71.52% classification rate achieved by our previous dimensional-model-driven, but still manually elaborated, multistage classifier (DEC). Using the DES dataset with five emotion states, our ACS achieved a 76.74% recognition rate compared to the 81.22% accuracy rate displayed by the manually elaborated multistage classification scheme (DEC). As far as we know, the best classification rates reported in the literature are, respectively, 66% [
The remainder of this paper is organized as follows. Section
Feature subset selection is an important issue when training classifiers in Machine Learning (ML) problems. Practical ML algorithms are known to degrade in prediction accuracy when faced with many features that are not necessary [
In our feature selection scheme, the term “belief mass” from evidence theory is introduced into the processing of features. In the 1970s, Dempster and Shafer sought to compute a general uncertainty level from Bayesian theory. They developed the concept of “uncertainty mapping” to measure the uncertainty between a lower limit and an upper limit [
The evidence theory was completed and presented by Shafer in [
The doubt function is defined as $\mathrm{Dou}(A) = \mathrm{Bel}(\bar{A}) = 1 - \mathrm{Pl}(A)$, that is, the belief committed to the complement of $A$.
Dempster’s combination rule can combine two or more independent mass functions into a single one, as in the sketch below.
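As an illustration, here is a minimal Python sketch of Dempster’s rule for two mass functions over a small frame of discernment; the two-class frame and the mass values are hypothetical, chosen only for the example.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts: frozenset of classes -> mass)
    with Dempster's rule: multiply the masses of every pair of focal
    elements, accumulate the product on their intersection, and
    renormalize by 1 - K, where K is the mass on empty intersections."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Hypothetical two-class frame {anger, joy} with made-up masses:
anger, joy = frozenset({"anger"}), frozenset({"joy"})
m1 = {anger: 0.6, joy: 0.1, anger | joy: 0.3}
m2 = {anger: 0.4, joy: 0.4, anger | joy: 0.2}
print(dempster_combine(m1, m2))
```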
An exhaustive search for the best subset of features, which would require exploring a space of $2^n$ subsets, is impractical; we thus turn to a heuristic approach for feature selection, as SFS does. However, unlike SFS, which is a wrapper-based approach, our evidence theory-based feature selection technique, ESFS, makes use of the concept of belief mass from evidence theory as a classifier, together with its combination rules, to fuse various audio features, leading to an embedded feature selection method. Moreover, compared to the original SFS, the range of subsets evaluated in the forward process of ESFS is extended to multiple subsets for each size, and the feature set is reduced according to a certain threshold before the selection in order to decrease the computational burden caused by this extension.
A heuristic feature selection algorithm can be characterized by its stance on four basic issues that determine the nature of the heuristic search process [
As illustrated in Figure, ESFS consists of four steps:
1. computation of the belief masses of the single features from the training set;
2. evaluation and ordering of the single features to decide the initial set of potential best features;
3. combination of features for the generation of the feature subsets, making use of combination operators;
4. selection of the best feature subset.
ESFS feature selection scheme.
Before the feature selection starts, all features are normalized into [0,1]. For each feature,
By definition of the belief masses, a mass can be obtained in different ways, each representing the chance for a statement to be true. In our work, the probability density functions (PDFs) of the features of the training data are used for calculating the masses of the single features.
The curves of PDFs of the features are obtained by applying polynomial interpolation to the statistics of the distribution of the feature values from the training data.
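A minimal sketch of this step, assuming features already normalized to [0, 1]; the histogram bin count and polynomial degree below are illustrative choices, not values fixed by the paper.

```python
import numpy as np

def fit_class_pdf(values, bins=20, degree=5):
    """Approximate the class-conditional PDF of a feature by fitting a
    polynomial to its normalized histogram (feature values assumed
    already scaled to [0, 1])."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0), density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return np.poly1d(np.polyfit(centers, hist, degree))

def single_feature_masses(x, class_pdfs):
    """Belief masses of a feature value x over the classes: evaluate each
    class PDF at x, clip negatives (an artifact of the polynomial fit),
    and normalize so the masses sum to 1."""
    densities = np.array([max(float(pdf(x)), 0.0) for pdf in class_pdfs])
    total = densities.sum()
    return densities / total if total > 0 else np.full(len(class_pdfs), 1.0 / len(class_pdfs))
```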
Taking the case of a 2-class classifier as an example, the two classes are defined as subset
In the case of
Once the belief masses associated with each single feature have been computed from the training data, they are used to build a simple classifier. Indeed, given the belief masses associated with a single feature, a data sample can simply be assigned to the class having the highest belief mass. Using the classification accuracy rates of these single-feature classifiers, all the features can then be sorted in descending order as
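A sketch of this ranking step, reusing the hypothetical `fit_class_pdf` and `single_feature_masses` helpers from the previous sketch.

```python
import numpy as np

def rank_single_features(X_train, y_train, X_test, y_test):
    """Rank features by the accuracy of their single-feature classifier,
    which assigns a sample to the class with the highest belief mass.
    Returns (feature_index, accuracy) pairs in descending accuracy order."""
    classes = np.unique(y_train)
    ranked = []
    for j in range(X_train.shape[1]):
        pdfs = [fit_class_pdf(X_train[y_train == c, j]) for c in classes]
        preds = np.array([classes[np.argmax(single_feature_masses(x, pdfs))]
                          for x in X_test[:, j]])
        ranked.append((j, float(np.mean(preds == y_test))))
    return sorted(ranked, key=lambda t: t[1], reverse=True)
```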
The threshold of the classification rates is decided according to the best classification rate as
An example of ordered single features in
Only the features selected in the set FSini take part in the later steps of the feature selection process. The elements (features) in FSini are at the same time regarded as feature subsets of size 1.
Having the best feature subsets with size
We denote the set of all the feature subsets in the evaluation with size
The creation of a new composite feature from two other features is achieved by combining the belief masses of the two features, using a combination operator. The fusion process works as follows.
Assume that
The performance of the new composite feature may be better than both its two base features used in the combination, as illustrated in Figure
Example of masses of a new composite feature in the case of 2 classes.
All these new composite features can also be sorted in descending order according to their classification accuracy on the training dataset, as was done for the single original audio features
The best composite feature having size
The stop criterion of ESFS is met when the best classification rate begins to decrease as the size of the feature subsets increases. In our work, in order to avoid missing the real peak of the classification performance, the forward selection stops only when the classification performance decreases for two consecutive steps,
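A schematic sketch of the forward search with this two-step stop criterion; `evaluate`, `combine`, and the beam width are placeholders for the mechanisms described above, not the paper’s exact procedure.

```python
def esfs_forward_selection(initial_features, evaluate, combine, beam_width=10):
    """Schematic forward search of ESFS. `initial_features` is the
    thresholded set of retained single features (subsets of size 1),
    `evaluate(subset)` returns its classification rate on the training
    data, and `combine(subset, feature)` builds a composite feature of
    size k+1. The search stops only after the best rate has decreased
    for two consecutive sizes, to avoid stopping at a local dip."""
    current = sorted(initial_features, key=evaluate, reverse=True)[:beam_width]
    best_subset = current[0]
    best_rate = evaluate(best_subset)
    decreases = 0
    while decreases < 2:
        candidates = [combine(s, f) for s in current for f in initial_features]
        if not candidates:
            break
        current = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
        rate = evaluate(current[0])
        if rate > best_rate:
            best_subset, best_rate, decreases = current[0], rate, 0
        else:
            decreases += 1
    return best_subset, best_rate
```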
Aggregation and fusion of different information sources are basic concerns in many systems and applications. There exist different fusion approaches, including evidence theory, possibility theory, and fuzzy set theory, but all of these approaches can be summarized as the application of numerical aggregation operators. Generally speaking, aggregation operators are mathematical functions that reduce a set of numbers to a single representative number [
Since the combination of the masses of the features in our feature selection scheme amounts to combining two features, the commonly used operators for two elements, the triangular norms, are used in our work to fuse the features as in (
The triangular norm (abbreviated t-norm) is a kind of binary operation used in the framework of probabilistic metric spaces and in multivalued logic; it was first introduced by Menger [
The definitions of a t-norm and a t-conorm are as follows.
A t-norm is a function $T : [0,1] \times [0,1] \to [0,1]$ that is commutative, associative, nondecreasing in both arguments, and admits 1 as neutral element ($T(x, 1) = x$).
A well-known property of t-norms is $T(x, y) \le \min(x, y)$.
Formally, a t-conorm is a function $S : [0,1] \times [0,1] \to [0,1]$ that is commutative, associative, nondecreasing in both arguments, and admits 0 as neutral element ($S(x, 0) = x$).
A well-known property of t-conorms is $S(x, y) \ge \max(x, y)$.
We say that a t-norm and a t-conorm are dual (or associated) if they satisfy the De Morgan law.
Six parameterized t-norms, namely Lukasiewicz, Hamacher, Yager, Weber-Sugeno, Schweizer and Sklar, and Frank, which are frequently proposed in the literature [
Lukasiewicz
Hamacher
Yager
Weber-Sugeno
Schweizer and Sklar
Frank
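For reference, a sketch of the standard parameterizations of these six families as commonly given in the t-norm literature; the exact parameter conventions adopted in the paper may differ.

```python
import math

def lukasiewicz(x, y):
    # Lukasiewicz t-norm (the non-parameterized member of the family)
    return max(x + y - 1.0, 0.0)

def hamacher(x, y, gamma=1.0):
    # Hamacher family, gamma > 0 (gamma = 1 gives the product t-norm)
    return x * y / (gamma + (1.0 - gamma) * (x + y - x * y))

def yager(x, y, p=2.0):
    # Yager family, p > 0
    return max(0.0, 1.0 - ((1.0 - x) ** p + (1.0 - y) ** p) ** (1.0 / p))

def weber_sugeno(x, y, lam=0.5):
    # Weber-Sugeno family, lambda > -1
    return max(0.0, (x + y - 1.0 + lam * x * y) / (1.0 + lam))

def schweizer_sklar(x, y, p=2.0):
    # Schweizer-Sklar family, written here for p > 0
    return max(0.0, x ** p + y ** p - 1.0) ** (1.0 / p)

def frank(x, y, s=2.0):
    # Frank family, s > 0 and s != 1
    return math.log1p((s ** x - 1.0) * (s ** y - 1.0) / (s - 1.0)) / math.log(s)
```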
Figure
The property curve surfaces of the operators.
In addition to these t-norm operators, the average and the geometric average of the features are also used for the combination of the features. Average: $A(x, y) = (x + y)/2$.
Geometric average: $G(x, y) = \sqrt{xy}$.
The property curve surfaces of average and geometric average are displayed in Figure
The property curve surfaces of average and geometric average.
It should be noted that, since a normalization step is applied in calculating the masses of the combined new features in (
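A sketch of this fusion-plus-normalization step for the per-class masses of two features, with the combination operator passed in as a function (for instance, one of the t-norms sketched above).

```python
import numpy as np

def combine_feature_masses(masses1, masses2, op):
    """Build the belief masses of a composite feature: apply the chosen
    combination operator (a t-norm or an average) to the per-class
    masses of the two base features, then renormalize so the resulting
    masses sum to 1 (the normalization step mentioned above)."""
    fused = np.array([op(a, b) for a, b in zip(masses1, masses2)])
    total = fused.sum()
    return fused / total if total > 0 else np.full(len(fused), 1.0 / len(fused))

# e.g., combine_feature_masses([0.7, 0.3], [0.6, 0.4], hamacher)
```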
ESFS can be used either as an embedded feature selection method, as we do in the next section when building the hierarchical classification scheme for emotion analysis, or as a simple filter method for selecting relevant features which can then be fed into classifiers. Used as a filter method, we carried out experiments comparing the behavior of our ESFS with other filter feature selection techniques, including the Fisher filter method, PCA, and SFS. Using the Berlin dataset for emotional speech recognition and the Simplicity dataset for visual object recognition, our ESFS displayed better performance, showing its effectiveness in selecting relevant features [
The fuzzy neighborhood relationship between some emotional classes, for instance between sadness and boredom, as evidenced by studies on acoustic correlates, leads to unnecessary confusion between emotion states when a single global classifier is applied with the same set of features. While several previous works have shown the effectiveness of multistage classification schemes for vocal emotion analysis, the elaboration of these hierarchical classification schemes was intuitive and manual. On the other hand, the number of emotions and their types to be recognized are typically dataset or application dependent. The empirically built hierarchical classification structure thus needs to be adjusted whenever the emotional space changes. In this section, we propose an automatically elaborated hierarchical classification scheme (ACS) driven by our evidence theory-based feature selection technique, ESFS. While keeping at least similar performance, the main goal here is to avoid the unnecessary repeated work of manually building a new multistage classification scheme each time the vocal emotions to be analyzed change.
Basically, when applied as an embedded feature selection technique to an application-specific vocal emotion recognition problem, our ESFS automatically divides, in an optimal way, the set of emotional states to be recognized into two disjoint subsets of emotional states, leading to a hierarchical classifier represented by a binary tree whose root is the union of all emotion classes, whose leaves are single emotion classes, and whose intermediate nodes are composite emotional classes discriminated by a subclassifier. Each of these subclassifiers is based on our ESFS introduced in the previous section, thus extracting the features that best discriminate two composite emotional classes.
The generation process of an ACS is shown in Figure
Generation of a hierarchical classifier.
The hierarchical classifier is thus expressed by a binary tree. The initial frame of discernment is set as the root node of this binary tree. The main steps for generating an ACS are listed as follows.
The hierarchical structure is composed of several binary subclassifiers. The
All the possible pairs of complements are evaluated using ESFS to decide which partitioning of the initial affect space is the best from the viewpoint of classification accuracy. In order to avoid repeated partitioning, the pairs are defined to ensure that the number of elements in subset
List of pairs of subsets for 4 classes (denoted e1 to e4):

Index of pairs | Subset 1 | Subset 2
---|---|---
1 | {e1} | {e2, e3, e4}
2 | {e2} | {e1, e3, e4}
3 | {e3} | {e1, e2, e4}
4 | {e4} | {e1, e2, e3}
5 | {e1, e2} | {e3, e4}
6 | {e1, e3} | {e2, e4}
7 | {e1, e4} | {e2, e3}
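A sketch of this enumeration; for $n$ classes it yields $2^{n-1} - 1$ complementary pairs, matching the 7 pairs listed above for 4 classes.

```python
from itertools import combinations

def complementary_pairs(classes):
    """Enumerate all partitions of a class set into two non-empty
    complementary subsets with |subset1| <= |subset2|, so that each
    partition appears exactly once (2**(n-1) - 1 pairs for n classes)."""
    classes = list(classes)
    n = len(classes)
    for size in range(1, n // 2 + 1):
        for subset in combinations(classes, size):
            # At the half size, keep only pairs whose first subset
            # contains the first class, to avoid mirrored duplicates.
            if 2 * size == n and subset[0] != classes[0]:
                continue
            rest = tuple(c for c in classes if c not in subset)
            yield subset, rest

print(len(list(complementary_pairs(["e1", "e2", "e3", "e4"]))))  # prints 7
```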
Our feature combination and selection process (ESFS) is applied to each pair of the subsets and the belief masses of the training samples in the subsets can be obtained. All these pairs can then be sorted by their classification accuracy rates.
The two subsets in the pair with the highest classification rate (assuming it is the
The two child nodes of the current node are set to the two subsets of the selected pair. If a subset contains more than one emotional class, it becomes a new node to be further partitioned. If a subset contains a single emotional class, it becomes a leaf node.
When the number of leaf nodes equals the number of emotional classes, the generation process of the binary tree stops. The information about the binary tree is stored in the model of the classifier.
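Putting the pieces together, a schematic sketch of the tree generation, reusing `complementary_pairs` from the sketch above; `esfs_rate` is a placeholder standing in for running ESFS on a pair of subsets and returning its classification rate.

```python
def build_acs(classes, esfs_rate):
    """Schematic generation of the binary tree. At each node, every
    complementary pair of subsets is scored with ESFS (here the
    placeholder esfs_rate(subset1, subset2) returns the two-way
    classification rate) and the best pair becomes the split; a subset
    with a single emotional class becomes a leaf."""
    classes = tuple(classes)
    if len(classes) == 1:
        return {"leaf": classes[0]}
    s1, s2 = max(complementary_pairs(classes), key=lambda p: esfs_rate(*p))
    return {"split": (s1, s2),
            "left": build_acs(s1, esfs_rate),
            "right": build_acs(s2, esfs_rate)}
```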
In practice, we want the ACS resulting from the previous scheme to be as balanced as possible. Indeed, the overall classification accuracy rate of a multistage hierarchical classifier is approximately the product of the classification rates at each stage. Assuming the different stages in the classifier have classification accuracy rates close to each other, as $r$, a sample classified through $d$ stages is recognized with a rate of approximately $r^d$, so a shallower, more balanced tree yields a higher overall rate.
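A small worked example of this effect, with an assumed per-stage rate of $r = 0.9$ and $K = 6$ classes:

```latex
r_{\text{overall}} \approx r^{\,d}, \qquad
d_{\text{balanced}} = \lceil \log_2 K \rceil = 3
\;\Rightarrow\; 0.9^3 \approx 0.73,
\qquad
d_{\text{chain}} = K - 1 = 5
\;\Rightarrow\; 0.9^5 \approx 0.59 .
```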
In our work, the notion of a balanced pair of subsets is put forward. For each pair of subsets
Because the subsets
If the pair of subsets with the highest classification rate (assuming that it is the
Thresholds according to the highest classification rate (%).
Rn1 | 95.8 | 94.8 | 93.6 | 92.6 | 87.8 | 86.6
---|---|---|---|---|---|---
thre_diff | 2 | 2.5 | 3 | 3.5 | 5.5 | 6
If
The common structure of the ACS generated by this approach is shown in Figure
Hierarchical classifier and recognition route.
If the number of affect classes varies from 3 to 7, as is the case for most affect recognition problems currently studied, Figure
Typical balanced ACS classifiers for 4, 5, 6 classes.
The effectiveness of our approach is experimented both on the Berlin and DES datasets. In the following, we first introduce the audio features. Then, our experimental results are presented and discussed.
We consider the same set of 68 features as in [
1–20. Mean, maximum, minimum, and median value and the variance of F0 and the first 3 formants.
21–23. Mean, maximum, and minimum value of energy.
24. Energy ratio of the signal below 250 Hz.
25–32. Mean, maximum, median, and variance of the values and durations of energy plateaus.
33–40, 42–49. Statistics of gradient and durations of rising and falling slopes of the energy contour.
41, 50. Number of rising and falling slopes of the energy contour per second.
51–63. Mean, maximum, variance, and normalized variance of the 4 areas.
64–66. The ratio of the mean values of areas 2–4 to area 1.
67. Entropy feature of Inverse Zipf of frequency coding.
68. Resampled polynomial estimation Zipf feature of UFD (Up-Flat-Down) coding.
Hold-out cross-validation with 10 iterations was carried out on the Berlin dataset. In each iteration, 50% of the samples were used as the training set and the other 50% as the test set. Out of the seven basic emotions in the Berlin dataset, we excluded “disgust”, as there are only 8 “disgust” samples among the male samples, much fewer than for the other emotional classes. Moreover, the acoustic features for this emotion were shown to be inconsistent [
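A sketch of this evaluation protocol using scikit-learn’s ShuffleSplit; `train_and_score` is a placeholder for fitting the classifier on a training split and returning its test accuracy.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def holdout_cv(X, y, train_and_score, n_splits=10, test_size=0.5, seed=0):
    """Repeated holdout evaluation as described above: 10 random 50/50
    train/test splits. Returns the mean and standard deviation of the
    per-split accuracy rates."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=seed)
    rates = [train_and_score(X[tr], y[tr], X[te], y[te])
             for tr, te in splitter.split(X)]
    return float(np.mean(rates)), float(np.std(rates))
```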
Hierarchical classifiers, respectively, for male and female for the Berlin dataset.
Figure
Classification rate with ACS on the Berlin dataset. The indexes on the
The best classification accuracy is
Our ACS described in Section
ACS classification scheme for the DES dataset.
Holdout cross-validation with 10 iterations is used in our experiments on the DES dataset. In order to compare with previous works [
Classification rate with ACS on the DES dataset.
The indexes on the
The best result is
Experimented on the same DES dataset with the same 90% of data for training and 10% for testing in cross-validation, the best result obtained in the literature by Ververidis and Kotropoulos [
In Table
Comparison between the DEC and ACS on the two datasets (%).
Dataset | Scheme | Female samples | Male samples | Mixed without gender information | Mixed with gender information
---|---|---|---|---|---
Berlin dataset | DEC | 71.89 ± 2.97 | 75.75 ± 3.15 | 68.60 ± 3.36 | 71.52 ± 3.85
Berlin dataset | ACS | 71.75 ± 3.10 | 73.77 ± 2.33 | 57.95 ± 2.87 | 71.38 ± 2.33
DES dataset | DEC | 85.14 ± 2.02 | 87.02 ± 1.44 | 57.34 ± 1.79 | 81.22 ± 1.27
DES dataset | ACS | 79.54 ± 1.95 | 81.96 ± 1.27 | 53.75 ± 1.71 | 76.74 ± 0.83
As we can see from the table, the automatically derived ACS offers performance very close to that of the empirical DEC, while providing the advantage of avoiding repeated empirical work when the emotion classification problem changes.
In this paper, we have introduced a new embedded feature selection scheme, ESFS, which is then used as the basis for automatically deriving hierarchical classification schemes, called ACS in this paper. Such a hierarchical classifier is represented by a binary tree whose root is the union of all emotion classes, whose leaves are single emotion classes, and whose nodes are subsets containing several emotion classes discriminated by a subclassifier. Each of these subclassifiers is based on ESFS, which allows classifiers to be easily represented by their mass function, obtained as the combination of the information given by an appropriate feature subset, each subclassifier having its own. Benchmarked on the Berlin and DES datasets, our approach has shown its effectiveness for vocal emotion analysis, yielding performance closely similar to that of our previous empirical dimensional-emotion-model-driven hierarchical classification scheme (DEC).
Many issues need to be further studied. For instance, from a machine learning point of view, the automatically derived ACS consists of successively dividing the initial set of class labels into two disjoint subsets of class labels by the optimal binary classifier according to ESFS. Unfortunately, the number of such disjoint subset pairs increases exponentially with the number of classes. While this is feasible with a set of 4 or 6 class labels, as was the case with the Berlin and DES datasets, it is no longer practical when the cardinality of the class label set is much larger. Therefore, heuristic rules also need to be found in order to be able to automatically derive the ACS that we proposed in this paper.
Another issue in machine recognition of vocal emotions is the fuzzy and subjective character of vocal emotion. The judgment of the emotional state conveyed by an utterance may lie between emotional states, or even be multiple, depending on the person. Thus, ambiguous or multiple judgments also need to be addressed. A preliminary attempt at this issue has been studied in [