Stressed Speech Emotion Recognition Using Teager Energy and Spectral Feature Fusion with Feature Optimization

The objective of speech emotion recognition (SER) is to enhance the man–machine interface. It can also be used to monitor the physiological state of a person in critical situations. In recent times, speech emotion recognition has also found applications in medicine and forensics. A new feature extraction technique based on the Teager energy operator (TEO), the Teager energy-autocorrelation envelope (TEO-Auto-Env), is proposed for the detection of stressed emotions. TEO is designed to increase the energies of stressed speech signals, whose energies are reduced during the speech production process, and is therefore used in this analysis. A stressed speech emotion recognition (SSER) system is developed using a combination of TEO-Auto-Env and spectral features for detecting the emotions. The spectral features considered are Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), and relative spectra–perceptual linear prediction (RASTA-PLP). The EMO-DB (German), EMOVO (Italian), IITKGP (Telugu), and EMA (English) databases are used in this analysis. The emotions are classified using the k-nearest neighbor (k-NN) classifier for gender-dependent (GD) and speaker-independent (SI) cases. The proposed SSER system provides improved accuracy compared to existing ones. Average recall is used for performance evaluation. The highest classification accuracy is achieved using the feature combination of TEO-Auto-Env, MFCC, and LPCC, with 91.4% (SI), 91.4% (GD-male), and 93.1% (GD-female) for EMO-DB; 68.5% (SI), 68.5% (GD-male), and 74.6% (GD-female) for EMOVO; 90.6% (SI), 91% (GD-male), and 92.3% (GD-female) for EMA; and 95.1% (GD-female) for the IITKGP female database.


Introduction
Speech emotion recognition (SER) is the task of recognizing the emotional aspects of speech irrespective of the semantic content. Emotion recognition benefits numerous institutions and aspects of life. It is useful for healthcare and security purposes, and it allows human feelings at a specific moment to be detected without actually asking the person. The ability to recognize and interpret the speaker's emotional state is crucial for human-computer interaction. SER is used in a wide range of applications, including call centres, automobiles, medical services, e-tutoring, and storytelling. Depending on the context and the speaker, human speech encompasses a wide spectrum of emotions. Emotions can be categorized into six fundamental archetypal emotions based on their intensity [1,2]: anger, joy, surprise, contempt, fear, and sadness.

When building a speech recognition system, one must account for how the speaker's emotional state is conveyed by extracting features from the voice signal. Speech features influenced by emotions can be classified as qualitative, spectral, continuous, and Teager energy operator-based features. The emotional content of a speech utterance has a significant impact on its pitch, zero crossing rate, and energy. These quantities, together with articulation rate, spectral information, and fundamental frequency (f0), fall under the continuous features. Perceived emotion and voice quality are closely linked; the qualitative features are grouped into voice level, pitch, temporal, and feature boundary structures. Spectral analysis of a voice signal yields a short-time representation of the signal's characteristics. The emotional content of a spoken utterance affects the spectral energy distribution of the utterance. For instance, high-arousal emotions, such as happiness or anger, are associated with high energy in the higher frequencies, while low-arousal emotions, such as sadness, are associated with lower energy in the same frequency range.

Speech is produced by nonlinear airflow in the vocal tract system. When the speaker is stressed, muscle tension affects the vocal tract system that produces the sound. As a result, nonlinear speech features are essential for accurately identifying such speech in recorded audio. In this work, a fusion of the Teager energy operator (TEO)-based TEO-Auto-Env feature and spectral features, used with a k-NN classifier in an SSER system, improves accuracy compared to existing systems. Combining TEO-Auto-Env with spectral features captures characteristics that the existing features cannot extract.
1.1. Related Work. Beginning in the early 1990s with the purpose of detecting displeasure or annoyance in a speaker's speech, the field of speech emotion recognition has since grown to include a wide range of applications. A variety of real-time applications require the ability to detect stressed emotions in speech. These applications include preventing car accidents, providing appropriate counselling to students, and assisting parents and others in close relationships by training speech recognition systems on stressed speech [3]. Many disasters can be averted if people are aware of their own high-stress emotions and can recognize them in advance. Until recently, the majority of research has focused on recognizing all of the distinct emotions, with little emphasis on the more severe ones, such as anger, fear, boredom, sadness, disgust, and impatience, which can be painful.
So far, MFCC is the spectral feature that has provided the most promising results for speech emotion recognition. For depression detection, however, the main focus must be on stressed emotions such as anger and sadness. Hence, stressed emotion detection began with the modification and fusion of different speech features. For stressed or depressed emotion recognition, existing features were improved: MFCC as modified MFCC (M-MFCC) and the ExpoLog-based scale, and LPC as one-sided autocorrelation LPC (OSALPC) [4]; a new technique fused MFCC and short-time energy features with their velocity (Δ) and acceleration (ΔΔ) coefficients [5]. These feature extraction techniques performed better for stressed emotion recognition than the existing MFCC and LPCC methods. Later, specifically for anger recognition, acoustic (pitch, loudness, spectral features) and linguistic (probabilistic and entropy-based words and phrases) cues [6] were introduced. Other feature extraction techniques used for depressed emotion recognition include a sinusoidal model-based technique with frequency, magnitude, and phase features [7]; the empirical mode decomposition method; a feature optimization method [8] that selects particular frames of the speech signal by choosing a proper filter bank; and hybrid biogeography-based optimization and particle swarm optimization (BBO_PSO) [9] for selecting higher-order spectral features.

Yang and Lugger [10] proposed the combination of qualitative and voice quality features; Wen et al. [11] proposed weighted spectral local Hu parameters to overcome the disadvantage of the MFCC feature; Wang et al. [12] proposed Fourier parameters; Setayeshi et al. [13] proposed a bioinspired ANFIS technique combined with MLP for SER of anger, happy, and sad emotions; and Ying and Xue-Ying [14] proposed glottal compensation to zero crossings with maximal Teager energy operator (GCZCMT) for speech emotion recognition, which performed well compared to the MFCC feature. The Glottal Function Index (GFI) is a validated, reliable, and easily self-administered four-item battery aimed specifically at identifying the presence and degree of vocal cord dysfunction in adults. The GCZCMT feature can effectively distinguish emotional states; it has high practical value and is well suited to real, complex language environments. However, some stressed emotions, such as anger, disgust, and sadness, were still not accurately detected using these features.
Teager and Kaiser [15,16] developed the Teager energy operator (TEO), based on the concept that hearing is a process of energy detection; it was the first feature used to recognize stressed emotions. Emotional stress can be detected with TEO. An energy profile-driven pitch contour [17] was created on this basis for recognizing Lombard and angry speech. Neutral and stressed speech can be distinguished using TEO-FM-Var, the normalized autocorrelation envelope area, and the critical band autocorrelation envelope area [18,19]. For clinically depressed or stressed speech detection, TEO-CB-Auto-Env was employed in conjunction with several low-level descriptors (LLDs), such as pitch, formants, energy, and their delta and delta-delta coefficients, as well as spectral features (spectral flux, entropy, and centroid) and their delta coefficients. This procedure, however, is complicated because of the large number of parameters involved. Multiple feature fusion techniques [20][21][22] combining glottal, prosodic, spectral, and TEO-based features were later presented for the detection of stressed emotions.

The work in [19] reported the influence on classification accuracy, in speech analysis of a clinical dataset, of appending acoustic LLDs belonging to prosodic (i.e., pitch, formants, energy, jitter, shimmer) and spectral features (i.e., spectral flux, centroid, entropy, and roll-off), along with their delta (Δ) and delta-delta (ΔΔ) coefficients, to the two baseline features of the Teager energy critical band-based autocorrelation envelope and Mel-frequency cepstral coefficients. The work in [23] collected and analyzed movement data of the jaw, the tongue tip, and the lower lip, along with speech, to study differences in speech articulation among four emotion types: neutral, anger, happiness, and sadness. The effectiveness of the articulatory parameters in emotion classification was also investigated.
The highest accuracy achieved by existing work is 91%, which the proposed technique exceeds.

Speech Emotion Recognition System.
The basic speech emotion recognition (SER) system, illustrated in Figure 1, consists of preprocessing, feature extraction, and classifier blocks. After the speech signal has been processed in the preprocessing stage, it is delivered to the feature extraction block, which extracts physical quantities from it. The features F1, F2, ..., Fn are then presented to the classifier. Finally, the classifier distinguishes between different emotional states.
Preprocessing allows the feature extraction module to process the speech signal more quickly and more accurately. It consists of three steps before feature extraction: filtering, framing, and windowing. After preprocessing, physical properties including pitch, energy, and formants are extracted from the speech signal [26]. Filtering reduces noise introduced into the speech signal during recording or by disturbances in the recording environment. A pre-emphasis filter is used to boost the energy of the speech signal at higher frequencies. Speech is a nonstationary signal and is therefore difficult to analyze as a whole. It is divided into frames with an equal number of samples so that each frame can be analyzed independently. The frame size is governed by the feature extraction method used. To avoid mismatch between frames, some overlap between adjacent frames is permitted. Dividing the signal into frames introduces discontinuities at the frame boundaries. These discontinuities are reduced by applying a tapered window to each frame.
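The three preprocessing steps can be illustrated with a short NumPy sketch. The pre-emphasis coefficient (0.97), the 25 ms frame length, the 10 ms hop, and the Hamming window are common choices assumed here for illustration only; the paper does not state the values it used, and the function names are hypothetical.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames; trailing samples that do not fill a frame are dropped."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def window_frames(frames):
    """Taper every frame with a Hamming window to soften the frame-edge discontinuities."""
    return frames * np.hamming(frames.shape[1])

# Synthetic stand-in for speech: 1 s of a 200 Hz tone sampled at 16 kHz.
fs = 16000
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
frames = window_frames(frame_signal(preemphasize(x), frame_len=400, hop_len=160))
print(frames.shape)  # (98, 400)
```

With a hop shorter than the frame length, adjacent frames share samples, which is the overlap described above.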
It is important in speech emotion recognition to identify a speech feature extraction technique that can classify emotions from speech efficiently. Many speech features have been investigated for speech emotion recognition so far, but the best set of features has not yet been identified. Figure 2 shows examples of speech feature categories [1,2]. However, combining speech features to represent the speech signal is the most common practice in speech emotion recognition. In this work, a new TEO-based feature is proposed and combined with the spectral features. The SSER system developed using the proposed feature fusion is compared with SSER systems based on the existing spectral features MFCC, LPCC, and RASTA-PLP.
In this paper, a new feature based on TEO, i.e., TEO-Auto-Env, is combined with MFCC, LPCC, and RASTA-PLP, and the combined features are further optimized using principal component analysis. The emotions are then classified using a k-NN classifier in the proposed stressed speech emotion recognition system. The proposed methods provide better performance and are also simpler to design than previous methods.
The structure of this paper is as follows: Section 2 covers the databases used in this study, and Section 3 discusses the proposed feature extraction approach based on TEO. Section 4 presents the simulation results of the proposed stressed speech emotion recognition system, which is based on feature fusion and optimization. Section 5 concludes the paper by summarizing its findings.

Database Description
The speech emotion database is also one of the challenges in the analysis of speech emotion recognition. The classification accuracy varies across datasets. So far, there is no standard database in all languages that is accepted by all emotion recognition researchers. EMO-DB, the German database, is the one most widely used by researchers and is hence used in this paper. In addition, the EMOVO (Italian), IITKGP (Telugu), and EMA (English) databases are also considered. The databases are described in the following sections.

EMO-DB Database.
This is the Berlin Database of Emotional German Speech, compiled at the Technical University of Berlin. Five male and five female performers between the ages of 25 and 35 were recorded in an anechoic room. The database covers seven emotions: anger, boredom, disgust, fear, happiness, sadness, and neutral. Each performer delivered 10 different sentences. Table 1 shows the distribution of the database.

EMOVO Database.
It is the first emotional database in the Italian language [27]. The recordings were made in the Fondazione Ugo Bordoni laboratories. Six actors, three male and three female, each spoke 14 sentences in six different emotions, namely, disgust (D), fear (F), anger (A), joy (J), surprise (Su), and sadness (S), in addition to neutral (N) speech. Table 2 shows the distribution of the EMOVO database. Among these, disgust, fear, anger, joy/happiness, sadness, and neutral are considered for the analysis in this paper.

IITKGP Telugu Database.
This is the first Telugu database, created by IIT Kharagpur [28]. The recordings were made by radio artists, with 15 sentences spoken by 10 different speakers. The emotions recorded were anger, compassion, disgust, fear, happy, neutral, sarcastic, and surprise. Of these, only the data of five female speakers for anger, compassion, happy, and neutral are used in this analysis. The distribution of these emotions is shown in Table 3.

EMA Database.
The EMA database [29] comprises one male and two female native speakers of American English. A total of 14 sentences for the male speaker and 10 sentences for the female speakers, with five repetitions of each sentence in four different emotions, namely, neutral, anger, sadness, and happiness, were recorded. All the emotions are considered in this analysis. The distribution of these emotions is shown in Table 4.

Stressed Speech Emotion Recognition (SSER) System Using the Feature Fusion of Proposed TEO-Auto-Env and Spectral Features
The proposed stressed speech emotion recognition (SSER) system, shown in Figure 3, is designed to classify stressed emotions such as anger, fear, disgust, and sadness more effectively than existing methods. In this system, the speech signal is given to the feature extraction block, which fuses the Teager energy-based feature, i.e., TEO-Auto-Env, with a spectral feature.

Teager Energy Operator (TEO).
The Teager energy operator (TEO) was developed by Teager [17] as a measure of speech signal energy. TEO reflects both the frequency and the instantaneous changes of the signal amplitude and is highly sensitive to subtle changes. Although TEO was first proposed for modelling nonlinear speech signals, it was later applied extensively in audio signal processing, where it gives high reliability and accuracy. Teager's research showed that the airflow in the vocal tract is separated and follows the walls of the vocal tract (Figure 4). A large amount of air passes close to the lips during propagation as it follows the walls of the vocal tract. One of the most important aspects of this model is the vortex. The traditional speech production model [23] assumes an unconstructed vocal tract with sound produced just at the glottis. Teager, however, contends that the voice signal is modulated as a result of the active sound production of vortices around the false vocal folds.
Teager later conducted a series of studies on the hearing process and arrived at a measure of energy that could be used to show that modulation patterns exist in speech. Kaiser [16] was the first to derive this energy operator, which is defined as

$$\Psi[x(t)] = \left(\frac{dx(t)}{dt}\right)^{2} - x(t)\,\frac{d^{2}x(t)}{dt^{2}}$$

or

$$\Psi[x[n]] = x^{2}[n] - x[n+1]\,x[n-1],$$

where x(t) and x[n] are the continuous and discrete speech signals.
When speech is produced under stress, the nonlinear flow of air in the vocal tract system is disrupted, which changes the voice quality. Such nonlinear characteristics are important for recognizing stressed speech. TEO is mostly used to discriminate Lombard, angry, and loud speech, as well as neutral from negative emotions. As shown in Figure 5, the speech signal is not pre-emphasis filtered but is passed directly through the TEO block [30]. When stressed speech passes through the TEO block, the energy of the stressed emotions is increased. The energy-enhanced signal is then passed through framing and windowing blocks. The resulting frames are passed to the autocorrelation function.
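A minimal sketch of the discrete operator, Ψ[x[n]] = x²[n] − x[n+1]·x[n−1], is given below. The function name `teager_energy` is hypothetical and the test tone is synthetic. For a pure tone A·cos(Ωn + φ) the output is the constant A²·sin²(Ω), which shows that the operator responds to both amplitude and frequency.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n+1] * x[n-1].

    The two end samples have no neighbour on one side, so the
    output is two samples shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[2:] * x[:-2]

# Synthetic check signal: amplitude 2, digital frequency 0.3 rad/sample.
n = np.arange(100)
tone = 2.0 * np.cos(0.3 * n + 0.1)
psi = teager_energy(tone)

# For a pure tone the operator output equals A^2 * sin^2(Omega) at every sample.
print(np.allclose(psi, 4.0 * np.sin(0.3) ** 2))  # True
```

Doubling either the amplitude or (for small Ω) the frequency roughly quadruples the output, which is why the operator emphasizes energetic, rapidly varying segments.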

Autocorrelation Function (ACF).
The ACF is the correlation between a signal and a delayed copy of itself as a function of the delay. It measures how similar two observations are as a function of the time lag between them.
For a signal x(t), the ACF at lag τ is

$$R_{xx}(\tau) = \int_{-\infty}^{\infty} x(t)\,x(t+\tau)\,dt,$$

where τ is the delay applied to the signal. The autocorrelation function is applied to the frames of the Teager-energized signal. When the correlation between adjacent frames is high, detecting it allows the energy of the speech signal to be enhanced further. The autocorrelated sequence is then passed to the envelope extraction block for further processing.
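For a finite frame, the autocorrelation can be sketched in discrete form as R[k] = Σ x[n]·x[n+k]. The sketch below is an illustration under assumptions: the function name `autocorr` is hypothetical, only non-negative lags are kept, and the sequence is normalized by its lag-zero value, which the paper does not specify.

```python
import numpy as np

def autocorr(frame):
    """Autocorrelation of one frame for lags 0 .. len(frame)-1.

    The result is divided by the lag-0 value (an assumed normalization),
    so a non-silent frame always starts at 1.0.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    full = np.correlate(frame, frame, mode="full")  # lags -(n-1) .. (n-1)
    acf = full[n - 1:]                              # keep lags 0 .. n-1
    return acf / acf[0] if acf[0] > 0 else acf

# Synthetic frame: a windowed tone stands in for a Teager-energy frame.
frame = np.hamming(200) * np.cos(0.4 * np.arange(200))
acf = autocorr(frame)
print(acf.shape, acf[0])
```

A periodic frame produces an ACF that oscillates with the same period while decaying with lag, and that oscillating sequence is what an envelope is later fitted to.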

Area under the Envelope of TEO Autocorrelation.
Thereafter, the area under the envelope of the TEO autocorrelation sequence is calculated using trapezoidal numerical integration. The resultant values obtained are the TEO-Auto-Env features.
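The envelope-area step can be sketched as follows. The paper does not say how the envelope is obtained, so the sketch assumes a simple one: linear interpolation through the local maxima of the absolute value of the sequence. The function names and the synthetic input are likewise illustrative assumptions; only the trapezoidal rule itself comes from the text.

```python
import numpy as np

def upper_envelope(seq):
    """Assumed envelope: interpolate linearly through local maxima of |seq|."""
    mag = np.abs(np.asarray(seq, dtype=float))
    n = len(mag)
    interior = [i for i in range(1, n - 1)
                if mag[i] >= mag[i - 1] and mag[i] >= mag[i + 1]]
    # Include both end points so the envelope spans the whole sequence.
    knots = np.array([0] + interior + [n - 1])
    return np.interp(np.arange(n), knots, mag[knots])

def trapezoid_area(y, dx=1.0):
    """Trapezoidal rule: sum of dx * (y[i] + y[i+1]) / 2 over adjacent samples."""
    y = np.asarray(y, dtype=float)
    return dx * np.sum(y[:-1] + y[1:]) / 2.0

def envelope_area(acf_seq):
    """Area under the envelope of an autocorrelation sequence (one value per frame)."""
    return trapezoid_area(upper_envelope(acf_seq))

# Synthetic decaying oscillation standing in for a TEO autocorrelation sequence.
lags = np.arange(64)
acf_like = np.exp(-0.05 * lags) * np.cos(0.5 * lags)
print(envelope_area(acf_like))
```

Each frame yields one area value, so an utterance produces a sequence of such values that can be pooled or concatenated with other features.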

Spectral Features.
Spectral features have provided good accuracy so far in emotion classification. Their drawback is that they treat all emotions alike. For stressed emotions, however, the energy of the speech signal is reduced during the speech production process, so spectral features alone do not fully capture the characteristics of these emotions. Hence, a feature is needed that increases the energies of these stressed emotions [31], and a TEO-based feature can serve this purpose. Therefore, a new feature based on TEO is proposed in this work and combined with the spectral features in order to capture the characteristics of the stressed emotions that spectral features alone did not extract effectively. TEO-Auto-Env is combined with each of the spectral features, i.e., MFCC, LPCC, and RASTA-PLP, as shown in Figure 6, and these feature combinations are given to the classifier to detect the emotions.

Spectral Features
(1) Mel-frequency cepstral coefficients (MFCC): MFCCs are among the most often used spectral representations in speech recognition and emotion recognition. Using cepstral analysis on a Mel frequency scale, they imitate the human ear's perception. As mentioned in [32], computing MFCCs requires the speech to be divided into frames.
(2) Linear prediction cepstral coefficients (LPCC): speech is produced by an excitation source combined with a time-varying vocal tract system. Linear prediction (LP) analysis is used to separate the source and system components in the time domain so that they can be analyzed independently.
(3) Cepstrum analysis: it separates the source and system parameters of the speech production process without a priori information about either. The cepstrum is defined as the inverse discrete Fourier transform (IDFT) of the log magnitude of the DFT of a signal, $$c[n] = \mathrm{IDFT}\{\log|\mathrm{DFT}\{x[n]\}|\},$$ where x[n] is the input signal.
(4) RASTA-PLP: it applies RASTA filtering within perceptual linear prediction (PLP). Perceptual processing, such as critical band analysis, equal loudness pre-emphasis, and intensity-loudness conversion, is performed before autoregressive (AR) modelling. PLP coefficients [33] are created from LP coefficients before performing AR modelling. RASTA filtering [34,35], i.e., bandpass filtering in the log spectral domain, was developed alongside PLP. It suppresses slow variations in the channel. A general RASTA filter is defined by a transfer function whose numerator is a regression filter of Nth order and whose denominator is an integrator.
Each of these spectral features is combined with the TEO-Auto-Env feature and used for stressed speech emotion recognition.
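The cepstrum definition in item (3), the IDFT of the log magnitude of the DFT, can be sketched directly with NumPy's FFT. The function name `real_cepstrum` and the small constant `eps`, added to avoid taking the logarithm of zero, are assumptions of this sketch rather than details from the paper.

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    """Real cepstrum: IDFT of log|DFT(x)|.

    eps is an assumed guard against log(0) for spectral nulls.
    """
    spectrum = np.fft.fft(np.asarray(x, dtype=float))
    log_mag = np.log(np.abs(spectrum) + eps)
    # The log-magnitude spectrum of a real signal is even, so the IDFT is real
    # up to rounding error; .real discards the negligible imaginary part.
    return np.fft.ifft(log_mag).real

# A scaled unit impulse has a flat spectrum, so only c[0] is non-zero.
impulse = np.zeros(8)
impulse[0] = 2.0
c = real_cepstrum(impulse)
print(np.round(c, 6))
```

Because the logarithm turns the product of source and vocal tract spectra into a sum, the slowly varying vocal tract part concentrates in the low-index coefficients, which is what makes the separation described above possible.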

Feature Optimization Using PCA.
The curse of dimensionality can be alleviated before classifying emotions by selecting features through optimization. A higher-dimensional feature set tends to promote overfitting, which degrades training and classification performance. Additional advantages of feature optimization include reduced data collection effort, improved classifier clarity, and increased classification accuracy [36][37][38]. Following feature extraction, this work uses principal component analysis (PCA) to select the most important characteristics and improve the performance of the classification algorithm. Given N samples in an n-dimensional space, PCA searches for d n-dimensional orthonormal vectors that best represent the data, where d ≤ n. Projecting a sample onto these vectors gives the transformed k × 1-dimensional sample in the new subspace.
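As a hedged illustration of the projection, the sketch below computes the d orthonormal directions from a singular value decomposition of the mean-centred data and projects the samples onto them. The function names, the random stand-in data, and the choice d = 2 are assumptions for this example; the paper does not report how many components it retained.

```python
import numpy as np

def pca_fit(X, d):
    """Return the data mean and an n x d matrix W with orthonormal columns.

    The columns of W are the top-d principal directions, taken from the
    right singular vectors of the mean-centred data.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:d].T

def pca_transform(X, mean, W):
    """Project mean-centred samples onto the d principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ W

# Stand-in feature matrix: N = 50 samples, n = 6 features (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
mean, W = pca_fit(X, d=2)
Z = pca_transform(X, mean, W)
print(Z.shape)  # (50, 2)
```

The same `mean` and `W` fitted on training data would be reused to transform test samples, so that both live in the same subspace.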

Emotion Classification Using k-NN.
The features selected using PCA are used by the k-NN classifier for the classification of the emotions. Instead of finding the probability density function as in a Gaussian mixture model, k-NN implicitly finds the decision boundaries between the different emotional classes [39,40]. The Euclidean distance measure is used to find the nearest neighbors of a given emotional feature vector. The training data are labelled with their emotional class labels. Given a point x to be classified, the k nearest neighbors of x are selected, and x is assigned the majority label of those k neighbors.
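The decision rule can be sketched in a few lines. The function name `knn_predict`, the toy two-dimensional points, and k = 3 are assumptions for illustration; the paper does not state the value of k it used.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x with the majority class among its k nearest training points.

    Nearness is measured with the Euclidean distance.
    """
    train_X = np.asarray(train_X, dtype=float)
    dists = np.linalg.norm(train_X - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy, hand-made feature vectors with emotion labels (not real data).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["neutral", "neutral", "neutral", "anger", "anger", "anger"]

print(knn_predict(train_X, train_y, [0.2, 0.1], k=3))  # neutral
print(knn_predict(train_X, train_y, [5.5, 5.2], k=3))  # anger
```

An odd k avoids ties in two-class problems; with more classes a tie-breaking rule would be needed, which this sketch leaves to the order in which votes are counted.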

Simulation Results
The SSER system is developed for four databases, namely, EMO-DB, EMOVO, IITKGP, and EMA, for gender-dependent (GD) and speaker-independent (SI) cases. From the results depicted in Figures 7-9, it is clear that the classification accuracies of the stressed emotions increase when the spectral features are combined with the TEO-Auto-Env feature for both GD and SI cases. Among all the combinations, the TEO-Auto-Env feature combined with MFCC and LPCC gave the highest classification accuracy for all emotions compared to the rest of the feature extraction techniques.
Only female speech of the IITKGP database is considered in this work; hence, only the GD-female case is considered. From Figure 6, the stressed emotion anger is detected with an accuracy of 95% using the TEO-Auto-Env + MFCC + LPCC feature extraction technique.

Comparison of the Classification Accuracy of the Proposed SSER System Using TEO-Auto-Env + MFCC + LPCC Feature Extraction Technique with Previous Related Work for EMO-DB Data.
In [11], the weighted spectral local Hu (HuWSF) moments were proposed for feature extraction in SER. Hu moments are commonly used in image feature extraction, whereas in speech, the first absolute orthogonal invariant of the Hu moments is used for speech feature description, as it can measure how strongly the energy is concentrated around the center of energy gravity of 2D data. The SER system built with these HuWSF features provided a weighted average recall (classification accuracy) of 74.71%; these features were later combined with prosodic (PROS), zero crossing with peak amplitudes (ZCPA), LPCC, and MFCC features to reach an accuracy of 81.74% for all the emotions of the EMO-DB database. By comparison, the proposed method, i.e., TEO-Auto-Env combined with MFCC and LPCC, gave the SSER system a higher accuracy of 91.4% for EMO-DB data.
In [10,12,29], only six emotions of EMO-DB are considered, excluding disgust. In [10], harmony features combined with standard (qualitative), voice quality (VQ), and pitch interval (INT) features were used for SER and provided an overall accuracy of 71.01%. In [29], discriminative band wavelet packet power coefficient (db-WPPC) features were used for SER, which provided an accuracy of 75.51%. These accuracies are lower than that of the proposed SSER system, i.e., 90.95%. In [12], Fourier parameters were used for feature extraction in SER. These parameters gave high accuracy in the detection of all the emotions of EMO-DB except boredom, which was detected with 71.48% accuracy, whereas the proposed SSER system detects boredom with 91.5% accuracy.
In [13], a bioinspired ANFIS technique combined with MLP is used for SER, which gave an accuracy of 67.5% for anger and 52.5% for happiness emotions, and in [14], the correlation between glottal and auditory features of speech was considered, and based upon this, a GCZCMT feature is proposed for SER, which gave an accuracy of 83.25% for anger, 75.28% for happy emotions, and 86.96% for neutral, whereas the SSER system proposed using TEO-Auto-Env + MFCC + LPCC in this paper gave highest accuracy compared to these features with 93.8% for anger, 91.6% for happy emotions, and 89% for neutral.
From these comparisons, it is clear that the SSER system using the proposed feature fusion of TEO-Auto-Env combined with MFCC and LPCC gave the highest accuracy among all the other features.

Conclusion
The SSER system developed based on the feature fusion of TEO-Auto-Env and spectral features using a k-NN classifier provided improved accuracy for all the databases compared to the SSER system built using individual spectral features. The TEO-Auto-Env feature is based on TEO, which is designed to increase the energies of the stressed emotions. For this reason, when TEO-Auto-Env is combined with spectral features, characteristics of the stressed emotions that could not otherwise be extracted are captured, and the combination of the TEO-based feature with spectral features yields better performance. The SSER system is developed for four databases, namely, EMO-DB, EMOVO, IITKGP, and EMA, for gender-dependent (GD) and speaker-independent (SI) cases. Among all the combinations, the SSER system developed using the feature fusion of TEO-Auto-Env, MFCC, and LPCC gave the highest accuracy in the classification of stressed emotions, with 91.4% (SI), 91.4% (GD-male), and 93.1% (GD-female) for EMO-DB; 68.5% (SI), 68.5% (GD-male), and 74.6% (GD-female) for EMOVO; 90.6% (SI), 91% (GD-male), and 92.3% (GD-female) for EMA; and 95.1% (GD-female) for the IITKGP female database, compared to other feature fusions, which shows favorable recognition performance in the independent speech emotion recognition experiments. Also, the SSER system with the proposed feature showed higher classification accuracy than the features discussed in the literature.

Figure 4: Nonlinear model of sound propagation along the vocal tract.

Figure 6: Comparisons of classification accuracies of the SSER system for different emotions using the individual spectral features and the feature fusion of the proposed (TEO-Auto-Env) + spectral feature extraction techniques for the IITKGP Telugu database for the GD-female case.

Figure 7: Comparisons of classification accuracies of the SSER system for different emotions using the individual spectral features and the feature fusion of the proposed (TEO-Auto-Env) + spectral feature extraction techniques for the EMO-DB German database for (a) GD-male, (b) GD-female, and (c) SI case.

Figure 8: Comparisons of classification accuracies of the SSER system for different emotions using the individual spectral features and the feature fusion of the proposed (TEO-Auto-Env) + spectral feature extraction techniques for the EMOVO Italian database for (a) GD-male, (b) GD-female, and (c) SI case.

Figure 9: Comparisons of classification accuracies of the SSER system for different emotions using the individual spectral features and the feature fusion of the proposed (TEO-Auto-Env) + spectral feature extraction techniques for the EMA English database for (a) GD-male, (b) GD-female, and (c) SI case.

Figure 10: Comparison of the classification accuracies of the SSER system using the individual spectral feature and proposed feature extraction techniques for the EMO-DB, EMOVO, IITKGP, and EMA databases for (a) GD-male, (b) GD-female, and (c) SI case.