Deep Learning-Based Classification of Spoken English Digits

Classification of isolated digits is the basic challenge for many speech classification systems. While a lot of work has been carried out on spoken languages, only limited research on spoken English digit data has been reported in the literature. This paper proposes an intelligent system based on a deep feedforward neural network (DFNN) with hyperparameter optimization techniques, an ensemble method, random forest (RF), and a regression method, gradient boosting (GB), for the classification of spoken digit data. The paper investigates different machine learning (ML) algorithms to determine the best method for the classification of spoken English digit data. The DFNN classifier outperformed the RF and GB classifiers on the public benchmark spoken English digit data and achieved 99.65% validation accuracy. The proposed model performs better than existing models that use only traditional classifiers.


Introduction
Speech is a means of communicating information from one or more speakers to one or more listeners. The speech produced by a speaker carries data in the form of signals, which are transported from the mouth of the speaker to the ear of the listener. Speech is made up of sequences of phonemes, which are uttered at an average rate of approximately 12 phonemes per second [1]. Speech communication has become a predominant mode of information exchange and social interaction among humans. Speech recognition is an emerging technology in the area of natural language processing (NLP), as described by Jurafsky and Martin [2].
Classification of speech is one of the most essential issues in speech processing [3]. Classification is the procedure of labeling a given set of data into classes. The process is conducted on organized data as well as on unorganized data. It starts with predicting the class of given data points, which are known as targets, labels, or categories. The essence of classification predictive modeling is to map input values, x, to output categories, y [4], using a mathematical function. Classification of isolated digits is the basic challenge for many speech classification systems. Limited studies have been conducted on the classification of English digit data. The challenge with spoken digit recognition is a result of the following: (1) the spoken digits are of short acoustic duration, normally a few seconds of speech; (2) some digits are acoustically very similar to each other [5] (see also Kopparapu and Rao [6]). The importance of this challenge has led several authors to research how to enhance digit recognition for different languages, including English [7], Portuguese [8], Arabic [9], and Mandarin [10]. This work proposes an intelligent system that uses a deep feedforward neural network (DFNN) with hyperparameter optimization techniques, an ensemble method, random forest (RF), and a regression method, gradient boosting (GB), for the classification of spoken English digit data. The performance of the proposed DFNN was evaluated using hyperparameter optimization techniques, namely the adaptive moment estimation (Adam) optimization algorithm and the stochastic gradient descent (SGD) optimization algorithm. The Adam optimization algorithm showed a better result than the classical SGD optimization algorithm. Optimization is a method of finding the best value of some function or model. Optimization for test cases, for example, aims at minimizing the number of test cases while delivering the best fault coverage [11].
The short-time Fourier transform (STFT) was used to extract features from the audio data, and one-hot encoding was then performed to produce the class labels.
In our previous work, the proposed model used only a DFNN with optimization techniques for the classification of spoken English digit data [12]. In this work, the classification approach has been extended to use RF, GB, and DFNN. The essence is to compare the performances of three different machine learning (ML) algorithms and to determine the best approach amongst them for classification purposes.
The result of our experiment shows that DFNN is the best classification method compared to RF and GB.
The contributions of this research are as follows: (1) A brief review of RF, GB, and DFNN methods and techniques is presented. (2) The paper investigates three (3) different ML algorithms to determine the best approach for the classification of spoken English digit data. The rest of the paper is arranged as follows: Section 2 considers recent speech classification techniques and their achievements. Section 3 gives a comprehensive description of the proposed ML algorithms, methods, and techniques for the proposed model. Section 4 discusses the experiments, model training, and experimental results and presents a relative analysis of the proposed model. Finally, Section 5 presents the conclusion.

Related Works
Speech classification in recent times has left several authors with the challenge of investigating the best method for achieving optimum accuracy. The model in [13] proposes a novel approach for classifying Bengali spoken digits using a convolutional neural network (CNN). The voice recordings of ten (10) individuals were classified considering gender, dialect, and age group. The classification accuracy is 98.37%, which shows the credibility of the proposed approach. The result, however, is limited to Bengali spoken digits. The work in [14] proposes a speech pathology recognition system that automatically analyzes the voice of patients. Neural network (NN) and deep learning methods were used for the classification of speech signals, to distinguish between normal and pathological voice signals. The Levenberg-Marquardt algorithm was used for classifying voice signals, whereas the restricted Boltzmann machine algorithm was used to implement the deep learning classification of the voice signals. The restricted Boltzmann machine algorithm shows an accuracy of 98.00% compared to the Levenberg-Marquardt algorithm with 92.00% accuracy. The accuracy of the proposed model could be improved by testing other ML algorithms during network training. The model in [15] combines lexicon-based and machine learning methods for the prediction of hate speech, based on sentiment analysis. The emotional information found in the text helped improve the accuracy of hate speech detection, from 41.00% in the previous work to 80.64% on the test result. The proposed model could use deep learning optimization techniques alongside the lexicon-based and ML methods to improve hate speech detection accuracy.
An implementation of a real-time speech emotion recognition application using the AlexNet image classification network was proposed in [16]. The baseline approach achieves 82.00% accuracy on the Berlin emotional speech (EMO-DB) data. The proposed model could not attain a high accuracy even with the pretrained AlexNet network.
The model proposed in [17] introduces a new multimodal deep learning framework that automatically extracts features from textual-acoustic data for speech intention classification.
The proposed system was tested in a real medical setting to serve as a reference for future research. The model achieved an average accuracy of 83.10% when 6 different intentions were detected. It performed better than existing models that used handcrafted features, but its accuracy is not very high. The study in [18] examined a good number of speech classification algorithms. A comparative analysis of five classification algorithms was conducted. Based on the results of the investigation, a multilayer perceptron with 93.00% accuracy using the Robust scaler method was proposed. Achieving such accuracy is restricted to using the Robust scaler method with the multilayer perceptron. A deep feedforward multilayer perceptron is proposed in this work, and its accuracy is 99.65%.
In [19], a deep CNN was used to advance Pashto isolated digit recognition. Mel frequency cepstral coefficients (MFCC) were used to extract features from the speech signal. The result shows a testing accuracy of 84.17%, which is a 7.32% improvement over existing works. The proposed approach is limited to Pashto isolated digit recognition.
In a recent study on Dari one-word speech recognition, a CNN was used to recognize isolated words in Dari speech [20]. MFCC was used for feature extraction during training. The test result shows 88.2% accuracy, which indicates that the proposed method recognizes the words with high accuracy. The use of other deep learning techniques for the analysis of Dari speech could improve the model accuracy.
Marcolla et al. [21] proposed a new approach to "lie detection" for speech classification using a voice stress analysis method. The authors employed a long short-term memory (LSTM) network to analyze and classify a person's speech as authentic or not. The best neural network model in the proposed method showed a precision of 72.5%. The result is scientifically remarkable for problems such as voice stress analysis, and it implies that it is possible to find patterns in the voice of people who are under stress. The precision is not particularly high, however, and could be improved.
A multiclass classification was conducted on the spoken English digit dataset using support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF) [22]. RF performed better than SVM and KNN. With 10% testing data, 97.50% accuracy was obtained. Using ML methods with hyperparameter optimization techniques as proposed in this work yielded a high accuracy of 99.65% on the same dataset.
A speech classification module that identifies the appropriate speech for generating a medical report was developed in [23]. The proposed model was evaluated using CNNs and LSTMs. Several parameters were tested, and the performance of the model on different speaker features was examined. CNNs show 92.41% validation accuracy on 2709 speech segments and are more successful than LSTM networks. The proposed model could be evaluated using different types of machine learning algorithms to obtain optimal validation accuracy, as shown in this work.
Sánchez-Hevia et al. [24] analyzed the performances of various deep neural networks (DNNs) for age estimation and differentiating gender from speech in interactive voice response (IVR) systems.
The results of their experiment are good for all the types of networks for gender classification, but combining CNNs and temporal convolutional networks (CTCN) gives a better result for age group classification. Their best systems showed about 80% precision and 70% recall. Precision and recall are relatively high. The research in [25] used RF and SVM classifiers on 200 images of the standard Odia database for simulation. The simulation result shows 96.3%, 98.2%, 88.9%, and 93.6% accuracy on the Odia character and the Odia numerical database, respectively. The result is limited to the Odia database. The study in [26] presented an automated recognition system that accurately classifies authentic and forged signatures for offline signature verification. The proposed model was compared with six pretrained CNN architectures based on transfer learning (TL) across a collection of publicly available signature samples. The outcome of their experiment shows 88% accuracy for the proposed model compared to other related networks, and it can serve as a prototype for offline signature verification. The proposed model's accuracy could be improved using different machine learning algorithms. Sethy et al. [27] proposed an automated hybrid system for handwritten character recognition. The proposed model was tested on three benchmark datasets: Odia characters, Bangla numerals, and Odia numerals. The overall performance on handwritten Odia characters is 99.01% and 98.1%, on Odia numerals 98.6% and 97.6%, and on Bangla numerals 97.6% and 96.3%, respectively. The result analysis shows the best performance for least-square (LS)-SVM compared to RF. The performance of the proposed system is high but can be improved.

Methods and Techniques
This section describes the steps taken in developing the proposed model. The proposed method for classifying the English digit data using the ML algorithms DFNN, RF, and GB includes the steps depicted in Figure 1. Data preprocessing is a step in which the raw data is transformed, or encoded, to bring it to a state that is appropriate for a machine learning or deep learning model, and it is the first and pivotal step in creating a model.

Reading Dataset.
The audio data is read using the library "Librosa." STFT features were extracted from the audio data [28]. The main idea behind STFT feature extraction is to break up the longer time signal into smaller fragments of the same length and then compute the Fourier transform independently on each of the smaller fragments. The continuous form of the STFT is expressed as
$$X(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j\omega t}\, dt. \tag{1}$$
The discrete form of the STFT is expressed as
$$X(m, \omega) = \sum_{n=-\infty}^{\infty} x(n)\, w(n - m)\, e^{-j\omega n}, \tag{2}$$
where x is the input signal and w(n) stands for the analysis window [29], which is assumed to be non-zero. Figure 2 represents the extracted STFT features.
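The fragment-and-transform idea described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Librosa routine used in the paper: the Hann window, the frame length `n_fft`, and the hop size `hop` are assumptions chosen for a toy signal. A one-sided spectrum has `1 + n_fft // 2` bins, so the 1025-dimensional input reported later would be consistent with an FFT size of 2048, although the paper does not state its window parameters.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann analysis window w(n) of length n_fft (an assumed choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft_magnitude(signal, n_fft, hop):
    """Magnitude STFT: one row per frame, 1 + n_fft // 2 frequency bins per row."""
    window = hann(n_fft)
    n_bins = 1 + n_fft // 2
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        # Cut one fixed-length fragment and apply the analysis window.
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        # Discrete Fourier transform of the windowed fragment (naive O(n^2) form).
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


# Toy input: a sinusoid that completes 4 cycles per 32-sample frame.
n_fft, hop = 32, 16
tone = [math.sin(2 * math.pi * 4 * n / n_fft) for n in range(128)]
features = stft_magnitude(tone, n_fft, hop)
peak_bins = [row.index(max(row)) for row in features]
```

Each row of `features` is the spectrum of one fragment, and for this toy tone the strongest bin in every frame is the one matching the tone's frequency.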

One-Hot Encoding Method.
The proposed method in this work used one-hot encoding as a preprocessing step. The method is applied to the categorical data variables to convert them to a form that is appropriate for ML algorithms to perform an improved task of classification. This involves first mapping the categorical variables to integer values. Then each of the integer values is represented as a binary vector, i.e., all entries are 0 except for the entry at the integer's index, which is 1. The conversion to this form is necessary because many ML algorithms cannot work with categorical data directly; it must first be converted to numbers. Figure 3 shows how each category value is transformed into a new column and assigned a "1" or "0" value, which is a notation for true/false. One-hot encoding of the audio data generates a target class label, which was used as input into the proposed model.
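As a minimal sketch of the encoding just described, the function below maps integer class labels to binary vectors. The function name `one_hot` and the example labels are illustrative assumptions, not taken from the paper's code.

```python
def one_hot(labels, num_classes):
    """Map integer class labels to binary vectors with a single 1."""
    vectors = []
    for label in labels:
        vector = [0] * num_classes
        vector[label] = 1  # only the position of the class index is set
        vectors.append(vector)
    return vectors


# Three spoken digits encoded over the 10 digit classes.
digits = [3, 0, 9]
encoded = one_hot(digits, num_classes=10)
```

Every encoded row has exactly one non-zero entry, and the position of that entry recovers the original digit.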

Stochastic Gradient Descent.
Gradient descent is a method for minimizing a function J(θ), where J is the loss function and θ ∈ R^n is the model's parameter vector. To minimize J(θ), one calculates the gradient ∇J(θ) with respect to the parameter θ. Then the parameter θ is updated as follows:
$$\theta = \theta - \eta \nabla J(\theta), \tag{3}$$
where the learning rate η controls the size of the steps taken to reach a local minimum. Formula (3) represents the steepest descent (batch-gradient descent) algorithm for minimizing J(θ).

Computational Intelligence and Neuroscience 3
For each training example x(i) and label y(i), SGD [30] executes an update of the parameter as
$$\theta = \theta - \eta \nabla J(\theta; x^{(i)}, y^{(i)}). \tag{4}$$
SGD was developed to overcome the pitfalls of batch-gradient descent. SGD is faced with the challenge of having to choose an appropriate learning rate, to avoid fluctuations around the point of convergence.
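The difference between the batch update and the per-example SGD update can be illustrated on a one-parameter least-squares problem. The toy data, the learning rate, and the function names below are assumptions made for this sketch; they are not the settings used in the paper.

```python
import random


def gradient(theta, x, y):
    """Gradient of the squared error (theta * x - y)^2 / 2 for one example."""
    return (theta * x - y) * x


def batch_gd(data, theta, eta, steps):
    """Batch gradient descent: average the gradient over all examples, then step."""
    for _ in range(steps):
        grad = sum(gradient(theta, x, y) for x, y in data) / len(data)
        theta -= eta * grad
    return theta


def sgd(data, theta, eta, epochs, seed=0):
    """Stochastic gradient descent: update after every single example."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            theta -= eta * gradient(theta, x, y)
    return theta


# Noise-free toy data generated by y = 2x, so the minimizer is theta = 2.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
theta_gd = batch_gd(data, theta=0.0, eta=0.1, steps=200)
theta_sgd = sgd(data, theta=0.0, eta=0.1, epochs=50)
```

Both variants approach the same minimizer here; on noisy data the per-example updates would fluctuate around it, which is why the choice of learning rate matters for SGD.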

Adam Optimization Algorithms.
Adam [31] is a method for stochastic optimization that requires only first-order gradients and has a minimal memory requirement. Adam shows an edge over SGD by combining two other extensions of SGD: adaptive gradient (AdaGrad) [32] and RMSProp [33].
Suppose we want to solve an optimization problem of the form min_θ J(θ). Adam maintains exponential moving averages of the gradient (the first-order moment) and of the squared gradient (the second-order moment), both of which are bias-corrected. β1, β2 ∈ [0, 1], as in equations (6) and (7), are the hyperparameters that control the exponential decay rates of these moving averages. For equations (6) and (7), β1 = 0.9 and β2 = 0.999. The Adam optimization algorithm used for this study permits the network to achieve high accuracy by regulating the network's weights through an adaptive moment gradient change.
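For illustration, the Adam update can be written out for a single scalar parameter. The sketch follows the standard published form of the algorithm with β1 = 0.9 and β2 = 0.999 as stated above; the toy objective, the step size `eta`, the constant `eps`, and the number of steps are assumptions for this example and are not the training settings of the paper.

```python
import math


def adam_minimize(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a scalar function of one parameter with the Adam update rule."""
    m = 0.0  # moving average of the gradient (first moment)
    v = 0.0  # moving average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction of the first moment
        v_hat = v / (1 - beta2 ** t)  # bias correction of the second moment
        theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0,
                           eta=0.05, steps=2000)
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size, which is the adaptive behaviour the text refers to.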

Random Forest Classifier.
RF is a supervised ML algorithm that is widely used in classification and regression tasks. It is derived from the concept of ensemble learning, which involves combining various classifiers to resolve complicated problems and improve model performance.
RF and other ensemble methods do not need as much preprocessing as some other methods. RF consists of multiple decision trees, each of which outputs a prediction. It is often said that, in a given forest, more trees make for more robustness. RF creates decision trees from randomly selected data samples, gets a prediction from each of the trees, and then arrives at the best result by voting [34].
In RF, each decision tree, otherwise known as the base learner, can benefit from a random subset of feature vectors [35]. Consequently, the feature vector is described in the following formula:
$$x = (x_1, x_2, \ldots, x_n),$$
which is an n-dimensional vector. Let L(Y, f(x)) be the loss function. The main objective is to find the function f(x) that predicts the parameter Y. The goal is to minimize the expected value of the loss. Squared error loss and zero-one loss are common choices in regression and classification applications. They are defined in (9) and (10), respectively [36]:
$$L(Y, f(x)) = (Y - f(x))^2, \tag{9}$$
$$L(Y, f(x)) = I(Y \neq f(x)) = \begin{cases} 0, & Y = f(x), \\ 1, & \text{otherwise}. \end{cases} \tag{10}$$
The steps in implementing the RF algorithm are as follows:
(i) Step 1: First, choose random samples from a given dataset.
(ii) Step 2: The algorithm builds a decision tree for each sample. A prediction outcome is computed from each of the decision trees.
(iii) Step 3: Voting is performed for each of the predicted results.
(iv) Step 4: Lastly, the most voted prediction is selected as the final predicted result.
RF was implemented in the proposed model using an RF classifier represented as 'CLF'. Here we set the number of trees in the forest to 100, which is the default of n_estimators, while max_depth is set to 5. The 'CLF' is then fit to X_train and y_train to train the model on the data. When trained with the RF classifier, the model showed a validation accuracy of 73.67%. The proposed RF algorithm and the corresponding flowchart are described in Algorithm 1 and Figure 4, respectively. Figure 5 explains the working of the RF algorithm.
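The sample, build, and vote steps of RF can be illustrated with a deliberately small sketch. It is not the classifier used in the paper: each tree is reduced to a one-split stump on a single numeric feature, the forest has 25 trees rather than 100, and the toy data and all function names are assumptions made for this example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) examples at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Step 2 (simplified): a one-split tree on a single numeric feature.

    Returns (threshold, left_label, right_label) with the fewest training errors.
    """
    labels = sorted({label for _, label in rows})
    best = None
    for threshold, _ in rows:
        for left in labels:
            for right in labels:
                errors = sum(
                    (left if x <= threshold else right) != label
                    for x, label in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    return best[1:]


def predict_stump(stump, x):
    """Prediction of a single tree."""
    threshold, left, right = stump
    return left if x <= threshold else right


def forest_predict(stumps, x):
    """Steps 3 and 4: collect one vote per tree and return the most voted class."""
    votes = Counter(predict_stump(stump, x) for stump in stumps)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Toy data: feature values below 5 belong to class 0, the rest to class 1.
rows = [(float(x), int(x >= 5)) for x in range(10)]
forest = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
predictions = [forest_predict(forest, x) for x in (1.0, 8.0)]
```

Individual stumps may disagree because each sees a different bootstrap sample, but the majority vote smooths out those differences.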

Gradient Boosting Classifier.
Gradient boosting (GB) [37] classifiers are a group of ML algorithms that merge numerous weak models to produce a stronger predictive model. GB is a concept from ensemble learning for solving regression and classification problems. It combines several decision trees built on subparts of the same dataset to form a stronger predictive model.
It integrates multiple machine learning models (mainly decision trees), and every decision tree model gives a prediction. Decision trees are used as the weak learners in GB. Decision trees solve the ML problem by converting the data into a tree representation. If we arrange all the decision trees in successive order, each subsequent model minimizes the errors of the prior decision tree model. Figure 6 illustrates this. The first step in GB is to create an initial constant prediction value F_0, given by
$$F_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c),$$
where L is the loss function and c is the predicted value. Since the target column is continuous, our loss function will be the squared error
$$L(y_i, c) = (y_i - c)^2.$$
Here y_i is the observed value, and c is the predicted value.
There is a need to find the value of c that minimizes the loss function. The proposed GB algorithm is defined in Algorithm 2, with the corresponding flowchart in Figure 7. The parameters that were used for the GB classification in this work are defined in Table 1.
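A small numeric sketch may help with the first boosting steps. Assuming a squared-error loss, the constant that minimizes the summed loss is the mean of the targets, and each later round adds a fraction (the learning rate) of the remaining residuals. For brevity the weak learner below reproduces the residuals exactly, which is a simplifying assumption; a real implementation would fit a shallow decision tree to them. The toy targets and function names are illustrative only.

```python
def initial_prediction(targets):
    """F_0: the constant c minimizing the summed squared error, i.e. the mean."""
    return sum(targets) / len(targets)


def squared_loss(targets, predictions):
    """Summed squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions))


def boost(targets, learning_rate, rounds):
    """Repeatedly correct the current predictions by a fraction of the residuals."""
    predictions = [initial_prediction(targets)] * len(targets)
    history = [squared_loss(targets, predictions)]
    for _ in range(rounds):
        # Residuals are what the ensemble built so far still gets wrong.
        residuals = [y - p for y, p in zip(targets, predictions)]
        predictions = [p + learning_rate * r for p, r in zip(predictions, residuals)]
        history.append(squared_loss(targets, predictions))
    return predictions, history


targets = [1.0, 2.0, 4.0, 9.0]
f0 = initial_prediction(targets)
_, losses = boost(targets, learning_rate=0.5, rounds=5)
```

With a learning rate of 0.5 the residuals halve in every round, so the training loss shrinks steadily; larger learning rates fit the training data faster, which is also how boosting can come to overfit.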

The Proposed Deep Feedforward Neural Network Architecture.
Deep neural networks (DNNs) have become a fundamental part of state-of-the-art automatic speech recognition (ASR) systems [38]. The DNN-based classification proposed in this study uses acoustic features that are extracted from the raw speech data [39]. In a feedforward neural network, information always travels in one direction [40]. There are no feedback connections and no cycles or loops in the network. The proposed DFNN model uses a sequential stack of dense (fully connected) layers with three hidden layers of 256, 128, and 128 dimensions, respectively. The input and output layers have 1025 and 10 dimensions, respectively. The first layer receives the 1025-dimensional input, whereas the input to the second dense layer is the output of the first layer, which has 256 dimensions. The later layers behave likewise: the model automatically takes the input dimension of each layer to be the output dimension of the layer before it. The last layer, also known as the output layer, has 10 dimensions that represent the 10 classes.
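The way each dense layer takes its input size from the previous layer's output can be checked with a short calculation. The layer sizes are those stated above; the assumption that every dense layer has one bias per unit is ours, since the paper does not say so explicitly.

```python
layer_sizes = [1025, 256, 128, 128, 10]  # input, three hidden layers, output

# Each dense layer maps the previous layer's output size to its own size.
connections = list(zip(layer_sizes[:-1], layer_sizes[1:]))

# A dense layer with n_in inputs and n_out units has n_in * n_out weights
# plus n_out biases (assuming a bias term is used).
params_per_layer = [n_in * n_out + n_out for n_in, n_out in connections]
total_params = sum(params_per_layer)
```

Under this assumption the network has 313,354 trainable parameters, most of them in the first layer because of the 1025-dimensional STFT input.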

Hyperbolic tangent (tanh) activation was used at every layer of the network except the output layer, which used the softmax activation function. The softmax activation function is often used in deep learning models as the last activation function of the neural network (NN) to normalize the network's output into a probability distribution over the predicted output classes. The tanh activation function was chosen due to its nonlinearity. The output of tanh lies in the range −1 to +1. Like the sigmoid, tanh also suffers from the vanishing gradient problem. Tanh is also zero-centred, which enables modeling of inputs with strongly negative, neutral, and positive values. The DNN's top layer is made up of nodes that employ the softmax activation function [39]. The function permits the DNN to produce class probabilities for each node, which sum to 1.
Y represents the target mixtures, while W and b represent the weight matrix and bias vector, respectively.
The model proposed here used the categorical cross-entropy loss, which is a loss function for multiclass classification tasks. It is used for optimizing classification models during training by reducing the value of the loss. The cross-entropy loss is a key factor in deciding how many epochs will be used for a particular model.
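The activation and loss functions named in this section can be written out directly. The sketch below uses the standard definitions of tanh, softmax, and categorical cross-entropy; the example scores and the three-class setting are assumptions chosen to keep the numbers small.

```python
import math


def softmax(logits):
    """Turn raw output scores into class probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t)


hidden = [math.tanh(z) for z in (-2.0, 0.0, 2.0)]  # tanh outputs lie in (-1, 1)
probs = softmax([2.0, 1.0, 0.1])
loss_correct = categorical_cross_entropy([1, 0, 0], probs)
loss_wrong = categorical_cross_entropy([0, 0, 1], probs)
```

The loss is small when the network puts most of its probability on the true class and grows when the true class receives little probability, which is what the optimizer exploits during training.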
Adam and SGD are the optimization algorithms used for minimizing errors in the proposed DFNN. The results of the two algorithms were compared, and Adam showed better accuracy than SGD. The proposed DFNN structure is represented in Figure 8, whereas the algorithm for the proposed DFNN model is given in Algorithm 3. The flowchart for the proposed DFNN model is illustrated in Figure 9.
Figure 6: The architecture of gradient boosting.
Algorithm 1: The proposed RF algorithm.
(1) procedure RandomForestClassifier(X, Y) ▷ X contains the STFT features of each audio sample, while Y contains the target audio class label
(2) Read the dataset using the library "Librosa"
(3) Extract STFT features from the audio
(4) One-hot encode the audio data to produce the class label
(5) Split the dataset into training and testing sets with STFT features as the input and audio class as the target label
(6) Start the random forest model
(7) Set up the hyperparameter tuning: n_estimators, max_depth
(8) RandomForestClassifier(Hyperparameters)
(9) Fit the training and the testing dataset
(10) Evaluate the model
(11)

Dataset.
The dataset used for the model's training and validation is a well-established, freely accessible dataset from a collaboration that works with Pannous [41] to improve speech recognition. The dataset is from the Librosa library [42]. It consists of isolated digits, with a total of 2400 different audio files in WAV format for model training. For the training of each of the proposed models, a training-validation data split of 75%-25% was used. The validation accuracy is reported in the model's training output. STFT features were used as input and the audio class as the target label in the proposed model. The RF technique was the first to be implemented, using an RF classifier represented as 'CLF'. The hyperparameters were set with the number of decision trees = 100, which is the default of n_estimators, while the maximum depth was set to 5. The 'CLF' was fitted to X_train and y_train to train the model on the data. After training, the RF classifier showed a validation accuracy of 73.67%. The model was trained next using the GB classifier with the parameters n_estimators, max_features, max_depth, learning_rate, and random_state set to 20, 2, 5, 0.05, and 0, respectively. The learning rate was adjusted to 0.075, 0.1, 0.25, 0.5, 0.75, and 1 during training. The best accuracy, 81.80% on the training set and 49.00% on the validation set, was achieved for a learning rate of 0.75. A sample screenshot showing each digit's precision, recall, and f1-score is shown in Figure 10.
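To make the 75%-25% split concrete, the sketch below shuffles 2400 file indices and cuts them into training and validation parts. The helper name, the fixed random seed, and the use of plain indices in place of audio files are assumptions made for illustration; the paper does not describe how its split was drawn.

```python
import random


def train_validation_split(items, train_fraction, seed=0):
    """Shuffle the items and cut them into a training and a validation part."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


file_ids = range(2400)  # one index per audio file
train_ids, val_ids = train_validation_split(file_ids, 0.75)
```

With 2400 files this yields 1800 training examples and 600 validation examples, and no file appears in both parts.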
For DFNN training, the epoch size was set initially to 20 epochs using the Adam optimization algorithm and was later increased to 30 and up to 100 epochs, as demonstrated in Figures 11 and 12. Figure 11 shows the accuracy and loss curves of the model for 100 epochs using the Adam optimization algorithm. The model achieved a validation accuracy of 99.65% and a minimal validation loss of 0.25%. Figure 12 shows the accuracy and loss curves of the model for 100 epochs using the SGD optimization algorithm. The SGD result shows a validation accuracy of 98.42% and a validation loss of 0.54%. Table 2 shows the accuracy comparison of the different ML classification methods used in training on the dataset. Table 2 shows that the DFNN technique exhibited a validation accuracy of 99.65%, higher than the other classification methods. The model's performance was also compared with some traditional classifiers [22], such as SVM, KNN, and RF, on the same dataset. The model proposed in this work attained 99.65% accuracy compared to the conventional approach, as demonstrated in Table 3.
Algorithm 2: The proposed GB algorithm.
(1) procedure GradientBoostingClassifier(X, Y) ▷ X contains the STFT features of each audio sample, while Y contains the target audio class label
(2) Read the dataset using the library "Librosa"
(3) Extract STFT features from the audio
(4) One-hot encode the audio data to produce the class label
(5) Split the dataset into training and testing sets with STFT features as the input and audio class as the target label
Algorithm 3: The proposed DFNN algorithm.
(1) procedure DeepFeedforwardNeuralNetworkClassifier(X, Y) ▷ X contains the STFT features of each audio sample, while Y contains the target audio class label
(2) Read the dataset using the library "Librosa"
(3) Extract STFT features from the audio
(4) One-hot encode the audio data to produce the class label
(5) Split the dataset into training and testing sets with STFT features as the input and audio class as the target label
(6) Start the DFNN model
(7) Epoch = N; audio = first audio
(8)

Discussion
Research on speech classification is still an open issue as a result of the limitations of ASR systems. Recognition of isolated words and digits is practically arduous. A classification approach that involved three techniques, namely a DFNN with hyperparameter optimization techniques, an ensemble method (RF), and a regression method (GB), was proposed in this study for the classification of spoken English digit data, with the primary objective of determining the best method among them. Figure 5 shows the working of the RF algorithm. For a better understanding of the RF algorithm, knowledge of the decision tree algorithm is vital. The RF classifier reached a validation accuracy of 73.67% after training, which is not high. This suggests that more decision trees should be created, since a greater number of trees in the forest results in greater accuracy while overfitting is avoided. Figure 6 shows the architecture of GB. The GB's performance was compared for different learning rates, lr_list = (0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1), where 'lr' represents the learning rate. Training the proposed model with these learning rates achieved 81.80% on the training set and 49.00% on the validation set for a learning rate of 0.75. This is an indication that the GB was overfitting the training dataset, which affects the accuracy. Figure 8 shows the architectural diagram of the proposed DFNN model. The model was trained first using the Adam optimization algorithm and then retrained using the SGD optimization algorithm, with variable epoch sizes. Increasing the epoch size helps to enhance the model's accuracy. The number of epochs plays an essential role in the training of a model [43]. The total number of epochs applied in network training helps to determine whether the model is overtraining or not. The performance evaluation in Table 3 suggests that the proposed deep feedforward method is the best of the compared methods for spoken digit classification.
A summary of the performance of the ML algorithms used for this work is depicted as a bar chart in Figure 13.

Conclusion
Classification of spoken English digit data was conducted using ML methods: an ensemble method, a regression method, and a DFNN method with hyperparameter optimization algorithms. STFT feature extraction and one-hot encoding were applied to the spoken digit data to produce the STFT features as input and the audio class as the target label in the proposed model. The classification results show that the DFNN model outperformed the RF and GB models, with a validation accuracy of 99.65% compared to the 73.67% and 79.70% accuracy of RF and GB, respectively. Hence, the DFNN model is an efficient approach for the classification of spoken English digit data.

Data Availability
The data used are available from the corresponding author upon request.