An Effective Artificial Intelligence-Enabled Error Detection and Accuracy Estimation Technique for English Speech Recognition System

Error detection and accuracy estimation in automatic speech recognition (ASR) systems play a vital role in the design of human-computer spoken dialogue systems, since recognition errors can prevent a system from accurately understanding the end user's intentions. The major aim is to identify the errors in an utterance so that the dialogue manager can request proper clarifications from the user. Therefore, the design of accurate error detection and accuracy estimation techniques becomes essential in the ASR system. With this motivation, this paper presents a novel artificial intelligence-enabled accuracy estimation and error detection technique for the English speech recognition system (AIEDAE-ESRS). The AIEDAE-ESRS technique performs three actions: confidence estimation, out-of-vocabulary (OOV) word identification, and error type categorization. In addition, the AIEDAE-ESRS technique performs different levels of preprocessing, including sampling of the input speech signal, bandpass filtering, and noise removal. Besides, a new deep neural network with hidden Markov model- (DNN-HMM-) based speech recognition model is designed, which also estimates accuracy and detects errors. Finally, the hyperparameters of the DNN-HMM model are optimally chosen by the use of the flower pollination algorithm (FPA), thereby accomplishing improved recognition performance. In order to demonstrate the better performance of the AIEDAE-ESRS technique, a series of simulations were conducted and the results were examined under varying aspects. The simulation results reported the enhanced performance of the AIEDAE-ESRS methodology over current advanced approaches.
The improved experimental results indicated that the AIEDAE-ESRS technique produced superior results across a variety of measures.


Introduction
The speech signal is one of the most essential and common ways of communicating between people. In this communication, the speaker's emotion plays an important role in conveying meaning, such that a change in emotion may result in distinct interpretations of the same speech [1]. Therefore, to enable effective communication between man and machine, speech emotion recognition (SER) has become a hot research topic. Together with an accurate SER system, an effective way to decrease the data dimension is needed for the selection of important features [2]. With the continued growth of science and technology, the global village is shrinking, and the usage of English has become increasingly widespread. The development of artificial intelligence computers that can understand English speech will significantly enrich human life and work [3]. SER systems have been built on CNNs and RNNs trained on databases of emotional speech, without relying on typical hand-crafted features. The literature on SER has employed a variety of approaches to extract emotions from signals, including numerous well-known speech analysis and classification techniques; recently, deep learning approaches have been presented as a possible replacement for classic SER techniques. Language interaction and intelligent English speech recognition systems (SRS) affect people's study and work life, and they have promotion significance and extensive application in areas such as language promotion, the military, and education. Today, there are multiple implementation methods and system designs for SRS.
There are different kinds of classification, primarily separated into specific- and nonspecific-person SRS, continuous and isolated word SRS, embedded/server SRS, and small- and large-vocabulary SRS. In everyday life, people's natural speech depends on the speaker's need to pause at the end of a sentence or add punctuation, while other parts may be pronounced continuously [4].
In earlier SRS, isolated-word phonetic systems were based primarily on single words or characters [5]. Depending on how the acoustic model is developed, SRS can be separated into specific- and nonspecific-person recognition. Specific-person recognition implies that the user must input a large number of pronunciations and train the recognizer in advance. In nonspecific-person recognition, once the scheme is developed, the user does not need to input training data beforehand and can be recognized directly [6]. The deep learning (DL) method has been applied in many different areas, and several achievements have been reported. Another area where DL is effectively used is automated SRS, in which improved language and acoustic models are integrated [7]. SRS problems involve time-series data. In several fields, such as read continuous speech, where the speech is usually recorded under clean conditions, the outcomes are satisfactory with an error rate under 5%. In other fields with high speech variability, such as distant conversational speech (meetings) or video speech, the outcomes are still not satisfactory, exhibiting error rates around 50% [8].
To handle these problems and improve the performance of inaccurate ASR systems, automated correction and detection of transcript errors may be the only choice in some cases [9], especially when tuning the ASR system itself is impossible (for example, when the system is purchased as a black box) or when manual correction is inconvenient or even impractical, as when the transcriptions are not the ultimate objective of the system (for example, question answering, machine translation, and information retrieval systems). In that respect, ASR classification and error detection are also called confidence estimation [10]. The most commonly studied method is feature-based, where a classifier is constructed from features generated from distinct sources (that is, decoder and nondecoder characteristics) to differentiate accurately from inaccurately recognized words. This paper presents a novel artificial intelligence-enabled accuracy estimation and error detection technique for the English speech recognition system (AIEDAE-ESRS). The AIEDAE-ESRS technique intends to accomplish three actions: OOV word identification, confidence estimation, and error type categorization. Furthermore, the AIEDAE-ESRS technique's architecture incorporates a deep neural network with hidden Markov model- (DNN-HMM-) based speech recognition model, and the flower pollination algorithm (FPA), a nature-inspired metaheuristic that replicates the pollination behavior of flowering plants, is used to fine-tune the DNN-HMM model's hyperparameters. The design of FPA for hyperparameter optimization of the DNN-HMM model shows the novelty of the work.
The experimental result analysis of the AIEDAE-ESRS technique is carried out using a benchmark dataset, and the results are investigated under several aspects.

Literature Review
In Alhamada et al. [11], the usage of DL in SRS was examined and an appropriate DL framework was identified; a CNN-based technique was employed to improve the efficiency of SRS. Han et al. [12] examined the efficacy of different DL-based acoustic models for conversational telephone speech, especially CNN-bLSTM, bLSTM, and TDNN systems, and evaluated these models on research test sets such as Switchboard and CallHome recordings from real-time call-center applications. As noted by Blaise O. Yenke et al., automatic speech recognition (ASR) is a very active research subject owing to the large variety of applications, interfaces, and computing equipment that can enable speech processing. Well-resourced languages outnumber underresourced languages in most applications, yet ASR may be used to support the languages of illiterate people. Starting with a small vocabulary is one way to construct an ASR system for underresourced languages: a limited-vocabulary ASR system recognizes words or sentences from small groups.
Grozdić et al. [13] extended a method for whispered SRS, one of the most difficult challenges in ASR. Specifically, because of the profound variance between the acoustic features of whispered and neutral speech, the efficiency of a conventional ASR system trained on neutral speech is greatly reduced once whisper is used. Misbullah et al. [14] investigated the efficiency of SRS for dysarthric speakers using a time-delay DNN and examined system performance by integrating dysarthric and normal speech corpora. A well-tuned hyperparameter configuration of the DNN structure gave promising outcomes on English dysarthric and Mandarin speech.
Ogawa and Hori [15] explored three kinds of ASR error detection processes, that is, OOV word recognition, error type classification (ETC), and confidence estimation, and also evaluated the detection rate from the ETC result. The simulation results show that the DBRNN considerably outperforms the conditional random field (CRF). Ogawa et al. [16] presented a detection accuracy estimation method based on ETC, which is an extension of confidence estimation.

Wireless Communications and Mobile Computing
In ETC, all the words in the detection outcome (the detected word sequence) for the target speech are categorized into three classes: insertion error (I), substitution error (S), and correct recognition (C).
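The C/S/I categorization above can be illustrated with a small sketch (an illustrative alignment, not the cited papers' implementation) that labels each recognized word by Levenshtein alignment against the reference transcript; deletion errors have no hypothesis word to label, so they do not appear in the output:

```python
def csi_labels(reference, hypothesis):
    """Label each hypothesis word as C (correct), S (substitution), or
    I (insertion) via Levenshtein alignment against the reference."""
    R, H = len(reference), len(hypothesis)
    # d[i][j]: edit distance between reference[:i] and hypothesis[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    # Backtrace, collecting one label per hypothesis word.
    labels, i, j = [], R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (
                0 if reference[i - 1] == hypothesis[j - 1] else 1):
            labels.append("C" if reference[i - 1] == hypothesis[j - 1] else "S")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            labels.append("I")
            j -= 1
        else:
            i -= 1  # deletion: no hypothesis word to label
    return labels[::-1]
```

For example, aligning the hypothesis "the cat has sat" against the reference "the cat sat" labels the inserted word "has" as I and the rest as C.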

Materials and Methods
In this study, an effective AIEDAE-ESRS technique has been developed for error detection and accuracy evaluation in SRS. The AIEDAE-ESRS technique involves three major processes, namely, preprocessing, DNN-HMM-based speech signal recognition, and FPA-based hyperparameter tuning. The utilization of the FPA helps to properly tune the hyperparameters of the DNN-HMM model, which assists in significantly boosting the detection performance. Figure 1 demonstrates the overall working procedure of the proposed AIEDAE-ESRS technique.

Level I: Speech Signal Preprocessing.
The speech input is the original voice signal gathered by the voice tool. The preprocessing method chiefly consists of three steps: antialiasing bandpass filtering, eliminating the noise effect, and sampling the original input voice signal. The feature extraction method extracts the acoustic parameters reflecting the speaker's key features, which primarily include the short-term average zero-crossing rate, cepstrum, short-term energy, and linear prediction coefficients. In the training phase, the feature parameters are processed to establish a reference model, and the best reference template is obtained. In the recognition phase, the speech feature parameters are obtained and a test template is built; the test template is then matched against the reference templates according to discriminative rules (i.e., semantic and grammar rules), and the best-matching reference template is returned as the detection outcome. Good matching results are closely associated with the matching template, the quality of the speech feature parameters, and the speech technique.
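As a rough illustration of two of the features mentioned above, the following sketch (function names, frame length, and hop size are our own illustrative choices) frames a sampled signal and computes the short-term energy and zero-crossing rate per frame:

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a sampled signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_term_energy(frame):
    """Sum of squared samples within one frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# Toy input: a 100 Hz sine "sampled" at 8 kHz for 400 samples.
fs = 8000
signal = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
frames = frame_signal(signal, frame_len=80, hop=40)
energies = [short_term_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

A voiced, low-frequency signal like this sine yields high frame energy and a low zero-crossing rate; noise-like segments would show the opposite pattern, which is why these two features are useful for characterizing the speaker's signal.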

Level II: Design of DNN-HMM-Based Speech Signal Recognition.
In traditional GMM-HMM-based recognition, the observation probability is modelled by a GMM under the maximum likelihood condition. Such models are constrained because GMMs are statistically inefficient at modelling information that lies on or near a nonlinear manifold in the data space. To overcome this limitation, we propose a DNN-HMM method for recognizing speech, in which the outputs of the DNN are given to the HMM as a substitute for the GMM. The GMM models the observation probability distribution of a feature vector in the presence of a phone; it establishes a sound foundation for determining the "distance" between a phone and the audio frame being observed.

An HMM is characterized by the parameter set λ = {A, B, π}, where:

(1) π = {π_i}, the initial state probability distribution
(2) A = {a_ij}, the state transition probability distribution
(3) B = {b_i(O_t)}, the observation probabilities, in which b_i(O_t) signifies the likelihood of observing O_t in state s_i. B is denoted as a finite mixture: letting c_im be the mixture coefficient of the mth mixture in state s_i, each b_i(O_t) is a weighted sum of elliptically symmetric or log-concave densities with covariance matrix U_im and mean vector μ_im for the mth mixture component in state i

In order to apply an HMM, two issues must be resolved:

(i) Learning issue: given a collection of ground-truth observations X (the training set), the learning process finds the parameter set λ* = {A*, B*, π*} such that λ* = arg max_λ P(X | λ), i.e., the model parameters that best fit the training set. The forward-backward method is utilized for calculating P(X | λ) [17]
(ii) Recognition issue: in the case of speech recognition, C HMMs {λ_c, (c = 1, ⋯, C)} are trained for the C discrete classes. For a novel speech input O, P(O | λ_c) is estimated with the Viterbi algorithm
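The forward pass used to compute P(O | λ) can be sketched for a discrete-observation HMM as follows (a textbook illustration of the forward algorithm; the variable names are ours):

```python
def forward_likelihood(obs, pi, A, B):
    """Forward algorithm: P(O | lambda) for a discrete-observation HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][o] : probability of emitting symbol o in state i
    """
    n_states = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
                 for j in range(n_states)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)
```

A quick sanity check of the implementation: summing P(O | λ) over every possible observation sequence of a fixed length must give exactly 1.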

Structure of DNN-HMM Model.
The main variation between GMM-HMM and DNN-HMM is the use of a DNN (rather than a GMM) to evaluate the observation probability. We employ the DNN for modelling p(q_t | o_t), the posterior probability of the state given the feature vector o_t, while p(q_t) is easy to estimate from a first frame-level state alignment of the training set. Figure 2 depicts the framework of the DNN technique. In this process, the prior probability of each state p(q_t) is estimated from its occurrence frequency in the training set, and p(o_t) is treated as a constant because the feature vectors are considered independent of one another [19].
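The conversion from DNN posteriors to HMM observation likelihoods described above can be sketched as follows (the flooring constant is our own assumption, added only to avoid log(0)):

```python
import math

def posteriors_to_scaled_loglikes(posteriors, priors, floor=1e-8):
    """Convert DNN state posteriors p(q|o) into scaled log-likelihoods.

    By Bayes' rule, p(o|q) = p(q|o) * p(o) / p(q); since p(o) is the same
    for every state at a given frame, log p(q|o) - log p(q) can replace
    log p(o|q) in HMM decoding without changing the best path.
    """
    return [[math.log(max(p, floor)) - math.log(max(pr, floor))
             for p, pr in zip(frame, priors)]
            for frame in posteriors]
```

Dividing by the prior prevents frequently occurring states (e.g., silence) from dominating the decode purely because the DNN sees them more often in training.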
(c) For each model λ_c, the Viterbi algorithm is executed to estimate the probability p(O | λ_c), but the likelihood b_{q_t}(O_t) is substituted with p(o_t | q_t) estimated by Equation (5).

The FPA provides a balance between diversification and intensification of solutions through Levy flights (random walks punctuated by larger leaps) and a switch condition, which is used to transition between intensive local search and global search. Flower constancy is treated as the persistence of a particular solution that can be differentiated from others. In global pollination, a pollinator transports pollen over a great distance toward a fitter solution; global pollination is selected with a probability known as the switch probability. Otherwise, local pollination is carried out within a small neighborhood of a single flower [20]. Pollinators such as bees are vital to the sexual reproduction of about ninety percent of wildflowers; ecosystems depend on these plants to function, as they provide food, shelter, and other resources for many animal species, including humans.
In the FPA technique, the following four rules are used (also shown in Algorithm 1):

(1) Biotic and cross-pollination are treated as global pollination, and the pollen-carrying pollinators move according to Levy flights (LF)
(2) Abiotic and self-pollination are treated as local pollination
(3) Flower constancy is regarded as a reproduction probability proportional to the similarity of the two flowers involved
(4) The switching between local and global pollination is controlled by a switch probability p ∈ [0, 1]

Therefore, the global pollination step is given by

x_p^(t+1) = x_p^t + γ L(λ)(g* − x_p^t),

in which x_p^t is the pollen (solution) vector p at iteration t; g* indicates the current best solution among the present generation; γ is a scaling factor controlling the step size; and L denotes the pollination strength, which corresponds to the step size of the Levy distribution. The LF is a sequence of random steps whose lengths follow a Levy probability distribution with infinite variance. L is therefore drawn from a Levy distribution:

L ~ (λ Γ(λ) sin(πλ/2) / π) · 1 / s^(1+λ), (s ≫ 0),

in which Γ(λ) is the standard gamma function.

In the event of local pollination, the second and third rules are formulated as

x_p^(t+1) = x_p^t + ε (x_q^t − x_k^t),

in which x_q^t and x_k^t are two pollens from different flowers of the same plant. If x_q^t and x_k^t come from the same species and are chosen from a homogeneous population, this becomes a local random walk, with ε drawn from a uniform distribution in [0, 1] [21].
The fitness function (FF) plays an important part in the optimization problem: it assigns a positive value representing how good a candidate solution is. In this work, the classification error rate is used as the FF to be minimized, so poorer solutions have high fitness scores and better solutions have low fitness scores.
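The FPA loop described in this section, minimizing a fitness function, can be sketched as follows. This is an illustrative implementation: the population size, search bounds, γ, and switch probability are our own choices, and Mantegna's algorithm for sampling the Levy step is an assumption, since the paper does not specify how the step is drawn.

```python
import math
import random

def levy_step(lam=1.5):
    """Draw a Levy-distributed step via Mantegna's algorithm (assumed)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2) /
             (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = random.gauss(0, sigma)
    v = random.gauss(0, 1)
    return u / abs(v) ** (1 / lam)

def fpa_minimize(fitness, dim, n_flowers=20, iters=200, p_switch=0.8, gamma=0.1):
    """Basic flower pollination algorithm minimizing `fitness`."""
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_flowers)]
    best = min(pop, key=fitness)
    for _ in range(iters):
        for i, x in enumerate(pop):
            if random.random() < p_switch:
                # Global pollination: Levy flight toward the current best g*.
                cand = [xj + gamma * levy_step() * (bj - xj)
                        for xj, bj in zip(x, best)]
            else:
                # Local pollination: random walk between two other pollens.
                a, b = random.sample(pop, 2)
                eps = random.random()
                cand = [xj + eps * (aj - bj) for xj, aj, bj in zip(x, a, b)]
            if fitness(cand) < fitness(x):  # greedy acceptance
                pop[i] = cand
        best = min(pop + [best], key=fitness)
    return best
```

In the paper's setting the fitness would be the classification error rate of a DNN-HMM trained with the candidate hyperparameters; here a simple sphere function can stand in to show that the loop converges toward the minimizer.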

Experimental Validation
In this study, the experimental result analysis of the AIEDAE-ESRS technique takes place using the MIT lecture English speech corpus, called the MIT dataset [22]. The MIT corpus includes speech from invited talks and regular university classes; the length of each lecture is between 45 and 90 minutes. First, the error detection result analysis of the AIEDAE-ESRS technique takes place under deletion error detection, confidence estimation, OOV word detection, and CSI classification in Table 1. Figure 4 displays the comparative result analysis of the AIEDAE-ESRS system with existing methodologies under OOV word detection. The figure shows that the AIEDAE-ESRS method is capable of achieving efficient outcomes with increased values of accuracy, NCE, and AFS. It is noted that the CRF and DNN models show minimum performance with minimal values of accuracy, NCE, and AFS, while the DURNN, DULSTM, and DBRNN systems result in moderately closer values. However, the AIEDAE-ESRS method outperforms the other systems with accuracy, NCE, and AFS of 0.9708, 0.3720, and 0.7497, respectively. Figure 5 displays the comparative analysis of the AIEDAE-ESRS procedure with current methods under CSI classification. Again, the CRF and DNN models show minimum performance, the DURNN, DULSTM, and DBRNN systems result in moderately closer values, and the AIEDAE-ESRS system outperforms the other techniques with accuracy, NCE, and AFS of 0.8579, 0.4120, and 0.6796, respectively. Figure 7 exhibits the accuracy graph analysis of the AIEDAE-ESRS technique on the test MIT speech recognition dataset.
The figure portrays that the AIEDAE-ESRS technique reaches improved training and validation accuracy with an increasing number of epochs; the training accuracy is observed to be lower than the validation accuracy. Figure 8 demonstrates the loss graph analysis of the AIEDAE-ESRS technique on the test MIT speech recognition dataset. The figure depicts that the AIEDAE-ESRS technique attains decreasing training and validation loss with a rise in the number of epochs, and the training loss appears to be higher than the validation loss.
Finally, a brief RMSE analysis of the AIEDAE-ESRS technique under distinct sizes of training data is given in Figure 9. By looking into the abovementioned figures and tables, it is ensured that the AIEDAE-ESRS methodology gains maximal performance over the existing techniques.

Conclusion
In this study, an effective AIEDAE-ESRS technique has been developed for accuracy estimation and error detection in a speech recognition model. The AIEDAE-ESRS technique involves three major processes, namely, preprocessing, DNN-HMM-based speech signal recognition, and FPA-based hyperparameter tuning. The utilization of the FPA helps to properly adjust the hyperparameters of the DNN-HMM model, which greatly increases the detection performance. The experimental result analysis of the AIEDAE-ESRS technique takes place using a benchmark dataset, and the results are investigated under several aspects. The simulation results reported the outstanding efficiency of the AIEDAE-ESRS methodology over recent approaches based on various measures. With accuracy, NCE, and AFS values of 0.9921, 0.2640, and 0.6909, respectively, the AIEDAE-ESRS system outperformed the other techniques. With a TS of 100%, the AIEDAE-ESRS technique achieved a reduced root mean square error of 1.16%, whereas the CRF and DBRNN systems achieved higher root mean square errors of 2.01% and 1.73%, respectively. In the future, the performance of the AIEDAE-ESRS technique can be further improved by advanced DL models for speech recognition.

Data Availability
No data were used to support this study.