Retracted: Hybrid Algorithm for English Translation Speech Recognition Based on Deep Learning Model and Clustering

Speech recognition is one of the most important research directions in human-computer interaction. It is the key link between human beings and machines and an expression of intelligence and automation in the information society. Taking English as the research object and drawing on related knowledge of speech recognition, this paper builds on hidden Markov model technology based on deep learning and a clustering analysis algorithm, and evaluates a cross-language English phoneme recognition system based on the sparse autoencoder (SA) method. By studying the speech recognition algorithm for English translation, the influence of the speech recognition environment on recognition accuracy is confirmed. This provides a direction for studying speech recognition at a deeper level. Based on a Transformer language model and a Seq2Seq language model, different vocabularies are set, the data are collected in the laboratory and outdoors, respectively, and a test template library is formed after collection. In the task of restoring phonetic symbols to English characters, the error rate is lowest when phonemes are the modeling units: the error rate on the test set reached 9.54%, which is 6.97 percentage points lower than that of the syllable modeling unit.


Introduction
The research of speech recognition technology has a history of nearly 100 years since the initial prototype of speech recognition [1, 2]. Speech recognition not only enables the computer to receive and understand information expressed by human beings more directly but also supports person-to-person and human-computer communication through automated equipment and intelligent operation. It is an important bridge between people and machines. The research of speech recognition occupies an important position in the field of scientific and technological development, and HMM recognition technology has become the core technology of modern speech recognition.
The vast majority of existing speaker-independent, large-vocabulary, continuous speech recognition systems are based on HMM models.
With the interest of the modern artificial intelligence industry and the development of deep learning theory and technology, computer scientists have applied various computing methods to the study of speech recognition. Therefore, speech recognition technology has made great achievements in both theory and application. Speech recognition has moved from theoretical knowledge and the laboratory into people's daily life, providing great convenience.
This paper describes the relevant theories of speech recognition technology, covering the principles of speech recognition and the theoretical basis of deep learning. Based on hidden Markov model technology and a cluster analysis algorithm, and taking account of the differences between syllables and phonemes as modeling units, the acoustic model and language model of English speech recognition are designed with deep learning.
The Transformer-based language model has a slightly lower error rate than the Seq2Seq-based language model in the task of restoring phonemes and syllables to English characters.

Related Work
Speech recognition technology is widely used in the commercial market and has high commercial value. How to use this technology in a low-cost and reliable way in daily life is its future development direction. Zhu et al. analyzed remote sensing data through deep learning, reviewed recent advances, and provided resources that make deep learning in remote sensing seem ridiculously easy. They encouraged remote sensing scientists to bring their expertise to deep learning and to use it as an implicit universal model to address unprecedented, large-scale, and influential challenges such as climate change and urbanization [3]. Montazeri Ghahjaverestan et al. proposed a method for detecting apnea and bradycardia in premature infants based on a coupled hidden semi-Markov model (CHSMM). For simulated data, the proposed algorithm was able to detect the desired dynamics with 96.67% sensitivity and 98.98% specificity. The results show that the CHSMM-based algorithm is a robust tool for monitoring apnea remission in preterm infants [4]. Zhang started from the influence of cultural context on Chinese-English translation and discussed context in Chinese-English translation, combining practical work experience with an understanding and practice of translation activities from the perspective of cultural translation. Between the two languages, due to the profound influence of culture, translators gradually form their own unique and personalized cultural understanding and translation concepts in translation practice [5]. Bharathi and Selvarani analyzed the occurrence, propagation, and transition of errors across the execution cycle through the hidden Markov model (HMM) technique. Such design-level analysis can help design engineers improve the quality of their systems in a cost-effective manner [6].
Zhihao adapted the TMS320DM365 series multimedia processors based on the current development and application direction of DaVinci digital image processing technology. The hardware circuit design of the system mainly includes a power management module, a serial port fault diagnosis module, and an Ethernet communication module; the paper also studies and discusses the accuracy of trade English [7]. Jang and Hitchcock applied model-based cluster analysis to data on types of democracies, creating tools for typology [8]. Pakoci et al. used the largest existing Serbian speech database and the best general-purpose n-gram-based language model, changing the parameters of the system to achieve the best word error rate (WER) and character error rate (CER). In addition to tuning the neural network itself, its layers, complexity, and layer concatenation, other language-specific optimizations were explored [9]. These studies are instructive to a certain extent, but each is too narrow in scope and can be further improved.

Theories Related to Speech Recognition
In the process of speech recognition, the original signal of the speech data is mainly collected by the machine [10-12]. When the collected raw speech samples are processed by the speech recognition system, more than half of the recognition errors are due to endpoint detection. Therefore, the noisiness of the real environment in which the original signal is collected directly affects the difficulty of accurate endpoint detection of the speech signal [13]. During speech recognition, it is therefore necessary to cut off the silence at the beginning and end of the collected speech samples so as not to affect later stages of recognition.
Speech recognition technology is a type of pattern recognition. The basic principle is that the machine processes, analyzes, recognizes, and understands the speech signal and converts it into text. Speech recognition technology involves many fields. In addition to basic applications, it can be combined with other natural language processing technologies, such as spoken language recognition, speech synthesis, and machine translation, to build more complex and intelligent applications. The first step in speech recognition is speech signal preprocessing. Preprocessing is the premise and foundation of speech recognition and a critical step for feature extraction. Only when characteristic parameters that represent the essence of the speech are extracted in the preprocessing stage can the best match be obtained by comparing the input speech with the standard speech.

Relevant Principles of Speech Recognition.
The speech recognition system mainly includes two models: an acoustic model and a language model. Acoustic models classify mainly according to the acoustic characteristics of speech signals. The language model performs semantic-level scoring on the feature discrimination results of the acoustic model. Together, the two models are the key to the performance of the entire speech recognition system [14].
In the development of speech recognition technology, although different researchers have proposed many different solutions, the basic principles are the same. In the processing of the speech signal, any speech recognition system can use Figure 1 to represent its general recognition principle. The most important modules of the speech recognition system are speech feature extraction and speech pattern matching [15].
Speech recognition processes the collected raw speech data samples and performs calculations using the signal data and the acoustic parameters in the computer speech database. Then, the relevant characteristic parameters of the original speech samples are derived. Whether in the recognition stage or the training stage of the speech recognition system, it is necessary to analyze and extract the characteristic parameters of the speech after preprocessing the speech signal [16]. The parameters are then compared with the training template library according to the matching rules for the audio signal features, and the recognition result is obtained through the recognition algorithm. The quality of the speech recognition results is directly related to the parameters in the template library and how they are selected. The basic structure is shown in Figure 2. The preprocessing stage in the block diagram of the speech recognition system mainly pre-emphasizes and segments the collected speech signal and removes noise and interference signals. It detects the front and back endpoints of speech and retains the valid speech fragments [17]. Similarity matching is performed on the features extracted from the speech segment to reduce the dimensionality and the computational load of subsequent processing. Matching is computed on amplitude, zero-crossing rate, short-time energy, and frequency-based linear prediction coefficients.
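The pre-emphasis and framing steps of this preprocessing stage can be sketched as follows. This is a minimal illustration, not the paper's implementation; the pre-emphasis coefficient 0.97 and the frame sizes are common defaults assumed here:

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len, frame_shift):
    """Split a 1-D signal into overlapping frames, zero-padding the tail."""
    n_frames = 1 + max(0, int(np.ceil((len(signal) - frame_len) / frame_shift)))
    pad = n_frames * frame_shift + frame_len - len(signal)
    padded = np.append(signal, np.zeros(max(0, pad)))
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return padded[idx]

# e.g., 25-sample frames shifted by 10 samples
frames = frame_signal(pre_emphasize(np.arange(100.0)), 25, 10)
```

Each resulting frame would then be windowed and passed to MFCC feature extraction.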

Dynamic Time Warping (DTW) Technology.
Dynamic time warping uses the principle of dynamic programming to align the reference template and the test template in time to achieve the best match. It warps two speech sequences of different durations along the time axis so that the two utterances can be better matched. In a speech evaluation system, the similarity between the user's speech and the reference speech can be calculated by comparing the difference between the two sets of characteristic parameters. However, since the speech to be evaluated and the reference speech differ noticeably in length and speaking rate, a direct comparison is bound to be inaccurate. DTW is a very classic algorithm in speech recognition. The idea of the algorithm is to stretch or shorten the unknown utterance until it is the same length as the reference template.
In speech recognition, the most frequently used discriminant algorithm is the dynamic time warping (DTW) algorithm [18]. DTW combines dynamic programming with time alignment to obtain the distance between sequences of feature vectors, and it is a classic method in speech recognition [19]. The approach is to define a reference template for each word and then find the warping path that minimizes the cumulative vector distance between the templates; the cumulative distance is minimal when the sequences are best aligned. Measuring the differences between test features and template features in this way addresses the time-varying nature of speech signals. The parameters involved in DTW mainly include the speech feature vectors, the frame distortion, and the frame matching distance.
Assume that the reference template has M frame vectors, where R(m) is the speech feature vector of the m-th frame, and that the test template has N frame vectors, where T(n) is the speech feature vector of the n-th frame. A match comparison point is denoted by (n, m) at each intersection of the test template and the reference template, and the frame distortion at this point is d[T(n), R(m)]. The main purpose of the DTW algorithm is to map the measured feature template nonlinearly onto the reference template by determining an optimal time warping function m = ϕ(n). This minimizes the cumulative distortion D, which satisfies the following formula:

D = min_{ϕ(n)} Σ_{n=1}^{N} d[T(n), R(ϕ(n))]. (1)

The DTW search for the minimum distortion is shown in Figure 3.
After the DTW algorithm, the MFCC parameters computed for the test speech and the standard speech are used, respectively, as the actual arguments t and r of the function DTW(t, r). By calculating the similarity between the two, the voice most similar to the test voice can be found in the standard voice library, and the program output achieves the purpose of speech recognition. DTW algorithms cannot fully exploit the temporal and dynamic properties of speech signals, so they are suitable for relatively simple speech recognition tasks such as isolated words and small vocabularies. DTW optimizes the overall alignment without considering local optimization, which makes it easy to use [19].
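As a sketch of the idea, a straightforward DTW implementation over MFCC-like feature sequences might look like the following (a minimal illustration under the usual step constraints, not the system's actual code):

```python
import numpy as np

def dtw_distance(t, r):
    """Cumulative DTW distance between test template t (N x d) and reference r (M x d)."""
    n, m = len(t), len(r)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative distortion table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(t[i - 1] - r[j - 1])  # frame distortion d[T(i), R(j)]
            # extend the cheapest of the three allowed predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Comparing the test utterance against every template in the library and picking the smallest cumulative distance yields the recognized word.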

Hidden Markov Model Technology for Deep Learning.
Deep learning (DL) originates from the deep understanding of knowledge and is a new direction in the field of machine learning [20]. Deep learning is often applied to supervised tasks such as speech recognition, natural language modeling, and image recognition. The traditional HMM model is currently the most widely used model; it is based mainly on statistical signal processing and is mainly used in the modeling of speech recognition systems. HMM is widely used in various fields of speech processing, such as endpoint detection, speech compression, speech enhancement, and speech recognition, and this method is now the mainstream of speech recognition technology. Three classic problems arise for an HMM λ = {π, A, B}:
(1) Given the model λ and an observation sequence O = o_1, o_2, ..., o_t, how to compute the probability P(O | λ)? This is the evaluation problem.
(2) Given that the model λ = {π, A, B} and the observation sequence are known, how to choose the corresponding optimal state sequence Q = q_1, q_2, ..., q_t? The problem is to find, among the paths the model can take to generate this observation sequence, the path with the highest probability. In practical applications of this decoding problem, an optimization criterion is usually chosen to solve it.
(3) How to adjust the model parameters λ = {π, A, B} to maximize P(O | λ)? This is the training process. It is used to train the HMM model parameters so that recognition performance under the model is best.
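For the decoding problem, the standard optimization criterion leads to the Viterbi algorithm; a minimal NumPy sketch follows, where the two-state model in the usage lines is an illustrative assumption, not data from this paper:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for observation indices `obs` under HMM (pi, A, B)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):      # follow backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# illustrative two-state model: state 0 emits symbol 0, state 1 emits symbol 1
pi = np.array([1.0, 0.0])
A = np.array([[0.9, 0.1], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
```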

Transformer Model.
A fully connected neural network is fixed in its input and output layers, and a recurrent neural network requires the input and output sequences to have the same length. In order to better handle tasks where the input and output sequence lengths differ, such as machine translation and speech recognition, the Seq2Seq model was proposed, which consists of an encoder and a decoder [21]. Early Seq2Seq models used LSTMs or RNNs to map one input sequence to another output sequence. The Transformer is also structurally a Seq2Seq model, originally used in the field of machine translation. Unlike RNN and CNN designs, before a sequence is input to the Transformer's encoder and decoder, positional encoding is required to give the sequence its timing information. The multi-head self-attention mechanism relates each word to its context, including distant words, while processing all the words in parallel. The entire model framework is built from multi-head attention and feed-forward networks, and its speed and training effectiveness are better than those of the RNN-based Seq2Seq model.
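The positional encoding mentioned above is, in the original Transformer, a fixed sinusoidal function of position; a sketch following that standard formulation (not code from this paper):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    # wavelength grows geometrically with the dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe
```

The encoding is added element-wise to the input embeddings before the first encoder or decoder layer.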
Although RNN-based speech recognition systems have ample room for development, they still have shortcomings such as poor handling of dynamic deformation, long training times, and difficulty of implementation. Their recognition rate is not necessarily better than that of cognition-based speech recognition. Therefore, within statistical modeling, this algorithm is still at the experimental research stage.

English Translation Speech Recognition System
A speech recognition system is essentially a pattern recognition system [22]. Speech recognition is a process of matching models according to similarity measurement rules against data in a database; it matches the acquired speech with existing data models. The successful application of HMM technology in speech recognition lies in its powerful ability to model temporal sequence structure, but it still has certain limitations. In the matching process, an English translation speech recognition system needs to compute over a large amount of speech data, and the HMM algorithm involves many parameters, which makes HMM training time-consuming. Nevertheless, the HMM model has high recognition accuracy and can meet the needs of real-time speech recognition in daily life.

Speech Recognition System Based on HMM Model.
The successful application of HMM in speech recognition has completely changed the history of speech recognition and has far-reaching effects [23]. As a statistical model, HMM was introduced into speech recognition in the 1970s, and in recent years it has successfully realized the modeling of complex problems such as speech recognition and biological sequence analysis. The emergence of HMM brought a substantial breakthrough to speech recognition systems. To understand the HMM model, the concept of Markov chains must first be introduced. A Markov chain describes the changes of N states in a finite-state machine over time T. Let S denote the finite state set, S = {s_1, s_2, ..., s_N}; then the state X_t of the state machine at a time t can only be equal to one of the states s_i in the finite state set S, where t = 1, 2, ..., T and i = 1, 2, ..., N. The states of the state machine over time T constitute, in chronological order, a state chain X = X_1, X_2, ..., X_T, whose probability satisfies the following formula:

P(X_1, X_2, ..., X_T) = P(X_1) P(X_2 | X_1) ... P(X_T | X_1, X_2, ..., X_{T-1}). (2)

A state chain X that satisfies this formula is called a Markov chain. Further, if the state chain X satisfies the "Markov assumption," the probability that the state X_t of the chain at a time t belongs to the finite set S is associated only with its previous state X_{t-1} and is independent of any state before time t − 1. Then the state chain X = X_1, X_2, ..., X_T satisfies the following probability formula:

P(X_t | X_{t-1}, ..., X_1) = P(X_t | X_{t-1}). (3)

When this transition probability also does not depend on t, the Markov chain composed of the state sequence X is called a "homogeneous Markov chain." Since there is no time t = 0, the state X_1 of the state machine at t = 1 is determined by the matrix π = [π_1, π_2, ..., π_N]. The matrix π is the initial state probability distribution matrix. The components π_i, i = 1, 2, ..., N of π represent the probability that the initial state X_1 of the homogeneous Markov chain equals the i-th state s_i of the finite state set, namely

π_i = P(X_1 = s_i). (4)

In addition to the initial state probability distribution matrix π, a square matrix A = (a_ij) of order N is defined. The element a_ij represents the probability of a one-step transition from s_i to s_j, so the matrix A is the state transition matrix. The formula for calculating the element a_ij is as follows:

a_ij = P(X_{t+1} = s_j | X_t = s_i). (5)

To sum up, a first-order Markov chain λ can be represented as λ = {π, A} by the initial state probability distribution matrix π and the state transition matrix A. HMM uses a Markov chain to simulate the change process of the signal and then indirectly describes this change through an observation sequence. Therefore, it is a doubly stochastic process, which can well describe the overall nonstationarity and short-term stationarity of speech signals.
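The chain probability defined by the matrices π and A can be computed directly; a small sketch with an illustrative two-state chain (the numbers are assumptions chosen for demonstration):

```python
import numpy as np

def chain_probability(pi, A, x):
    """P(X) = pi[x_1] * a_{x_1 x_2} * ... * a_{x_{T-1} x_T} for a homogeneous Markov chain."""
    p = pi[x[0]]
    for a, b in zip(x, x[1:]):  # multiply one-step transition probabilities
        p *= A[a, b]
    return p

pi = np.array([1.0, 0.0])               # the chain always starts in state 0
A = np.array([[0.5, 0.5], [0.0, 1.0]])  # state 1 is absorbing
```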

HMM Topology and Classification.
HMM uses the states in the Markov chain to represent the pronunciation process of speech. During word generation, the system transitions from one state to another, producing an output in each state until the word is output. According to the different state transition methods, HMM models have different topological structures. The first common type, the ergodic HMM that can visit every state, is shown in Figure 4. The second, the two-transition HMM, is shown in Figure 5. The third, the three-transition HMM, is shown in Figure 6. As can be seen from the figures, the states in the latter two topologies can only stay in the current state or transition from left to right. Therefore, this topology is also called the left-right model; that is, the transition must start from the first state [24]. This left-to-right HMM model is commonly used in speech recognition because it reflects the temporal structure of speech.
The forward-backward algorithm is exactly the method used to solve the first problem of HMM. For a given HMM model λ and an observation sequence O = o_1, o_2, ..., o_T, the direct method of calculating the output probability P(O | λ) is as follows. If the state sequence is Q = q_1, q_2, ..., q_T, then

P(O | Q, λ) = b_{q_1}(o_1) b_{q_2}(o_2) ... b_{q_T}(o_T), (6)

that is, the product over time of the probabilities of emitting o_t in state q_t. For the given HMM model, the conditional probability of generating a state sequence Q = q_1, q_2, ..., q_T is P(Q | λ), and its calculation formula is

P(Q | λ) = π_{q_1} a_{q_1 q_2} ... a_{q_{T-1} q_T}. (7)

Then, under the given HMM model, the probability of outputting the observation sequence is P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ). Performing the (2T − 1)N^T multiplications and N^T − 1 additions of this direct computation is too computationally intensive. Therefore, the forward algorithm is used instead. Step 1: initialize; for 1 ≤ i ≤ N, α_1(i) = π_i b_i(o_1). Step 2: recursive calculation; for all 1 ≤ t ≤ T − 1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}). Step 3: terminate the calculation; P(O | λ) = Σ_{i=1}^{N} α_T(i).
The schematic diagram of the forward algorithm of the HMM model is shown in Figure 7. The backward algorithm is analogous. Step 1: initialize; for 1 ≤ i ≤ N, β_T(i) = 1. Step 2: recursive calculation; for all t = T − 1, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j). Step 3: terminate the calculation; P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i).
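The steps of the forward algorithm translate directly into code; a minimal sketch with an assumed toy model, checked against the brute-force summation over all state paths:

```python
import numpy as np
from itertools import product

def forward(pi, A, B, obs):
    """Forward algorithm: P(O | lambda) in O(N^2 T) time instead of O(T N^T)."""
    alpha = pi * B[:, obs[0]]              # step 1: initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # step 2: recursion
    return alpha.sum()                     # step 3: termination

def brute_force(pi, A, B, obs):
    """Direct summation over every state sequence Q (exponential cost)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for q in product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total

# illustrative two-state, two-symbol model (assumed values)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
```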
The recursive process of the backward algorithm of the HMM model is shown in Figure 8. The parameter reestimation problem and the training problem are usually solved by the Baum-Welch algorithm.
According to the HMM model definition, ξ_t(i, j) represents the probability of being in state i at time t and in state j at time t + 1, given the specific model and the training sequence O, and its expression is

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ).

Then, the probability that the Markov chain of the model is in state i at time t is

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) = α_t(i) β_t(i) / P(O | λ).

In these formulas, α_t(i) and β_t(i) are the forward probability and the backward probability, respectively. From this, the reestimation formulas of the HMM parameters can be deduced:

π̄_i = γ_1(i), ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i), b̄_j(k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j).

It can be seen from these formulas that the training process of the HMM model is a process of finding the extreme value of a functional. At present, there is no analytical method for this kind of problem [25]. Because the given training sequence O is finite, there is no globally optimal way to estimate the parameters λ. The Baum-Welch algorithm uses recursion to find parameters that make P(O | λ) a local maximum, which is the result of the reestimation optimization. At the same time, HMM can also be used in isolated word speech recognition, where the recognition rate is higher than with the DTW method, and it has a wide range of applicability.

Density-Based Clustering Algorithm Steps

As far as the principle of clustering is concerned, both the hierarchical method and the partitioning method are based on a distance measurement standard [26]. The main idea of density-based clustering is to continue clustering as long as the number of objects or data points in the neighboring region exceeds a certain threshold. The accuracy of grid-based clustering depends on the size of the unit cell. Unlike distance-based methods, density-based classification is not measured using explicit distances alone; instead, data objects are assigned to clusters according to whether they belong to a connected density region [27].
(1) Input the data and calculate the Euclidean distances. (2) The result obtained in step (1) is input and preprocessed, and the parameter t used to calculate the cutoff distance d_c is given. The algorithm calculates the distances d_ij, with d_ij = d_ji for i ≠ j ∈ S, and determines the cutoff distance d_c.
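The distance and cutoff computation in steps (1)-(2) can be sketched as follows. Choosing d_c as the t-quantile of the pairwise distances is one common convention, assumed here rather than taken from the paper:

```python
import numpy as np

def cutoff_and_density(X, t=0.02):
    """Pairwise distances d_ij, cutoff distance d_c from parameter t, local densities rho_i."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))    # symmetric: d_ij = d_ji
    upper = d[np.triu_indices(len(X), k=1)]  # distances over distinct pairs i < j
    d_c = np.percentile(upper, 100 * t)      # cutoff at the t-quantile of distances
    rho = (d < d_c).sum(axis=1) - 1          # neighbors within d_c, excluding self
    return d_c, rho

# two nearby points and one far-away point
X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0]])
d_c, rho = cutoff_and_density(X, t=0.5)
```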

Experiments and Results.
The objective methods commonly used to evaluate the performance of speech recognition models include the word error rate, sentence error rate, and character error rate [28]. The experiments test model performance with the word error rate, i.e., the error rate for phoneme recognition in English speech recognition. The cross-language English phoneme recognition system is evaluated according to the sparse autoencoder (SA) method [29]. First, the performance of SA and a single-hidden-layer MLP in AF-based speech attribute detection is compared. Then, their performance in cross-lingual English phoneme recognition is compared. The TIMIT English data set is used as the training data for the source language: 70% of the 1000 English continuous speech sentences are extracted as pretraining data, and the remaining data are used as test data [30].
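The word error rate mentioned above is conventionally computed from the Levenshtein edit distance between the reference and hypothesis word sequences; a minimal sketch of that standard computation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    r, h = ref.split(), hyp.split()
    # D[i][j] = edit distance between the first i reference and first j hypothesis words
    D = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        D[i][0] = i
    for j in range(len(h) + 1):
        D[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = D[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[len(r)][len(h)] / len(r)
```

Sentence and character error rates follow the same pattern with sentences or characters as the units.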
These English continuous speech data sets can be downloaded from the Internet. The sampling rate of all original voice data is 8 kHz; each detection uses a 10 ms window function with a 3 ms window overlap to extract 39-dimensional MFCC features. The input layers of the SA and MLP models have 39 nodes. From the TIMIT data set, including silence, 34 English phonemes can be obtained from the English sentences. The 20 MLPs are all set to a structure of 18 hidden-layer nodes and 2 output-layer nodes and then trained into 20 speech attribute detectors using the TIMIT data set. Likewise, 20 SA models are trained using the TIMIT and English data sets. To evaluate the results of the AF-based speech attribute detector and phoneme recognition, the evaluation criterion used was a frame-by-frame comparison of speech attributes and phoneme units [31]. For each frame, if the recognition results of the speech attribute detector and the phoneme checker agree with the reference value, the score S is incremented. The recognition accuracy (RA) is RA = S / r, where r represents the total number of detected frames over the whole speech set. This evaluation criterion takes into account both the phoneme recognition results and the temporal information.
The accuracy and recognition rates of English phonetic attributes using the SA and MLP methods are shown in Figure 9. "English" and "English + Tibetan" in Figure 9 denote the languages used to train the models. As can be seen from Figure 9, the SAs trained in a semisupervised way detect AF speech attributes better than the other two methods and can recognize 14 Mandarin phonemes, whereas the MLPs and the SAs trained in a supervised manner on English data can only recognize 10 and 12 Chinese phonemes, respectively. Furthermore, Figure 9 shows that the SAs trained on English and Mandarin data have higher phoneme recognition accuracy than the other two models. These results demonstrate that a sparse autoencoder trained in a semisupervised pretraining manner can learn phonetic properties shared between English and Mandarin, and the learned shared speech attributes can effectively improve the accuracy of speech recognition.
In this test, 5 groups of different vocabulary sizes were set (10, 30, 50, 100, and 200 isolated words), and 100 random tests were performed on each group. The test voices came from classmates, and 5 sets of data were collected in the laboratory and outdoors. After collection, the test template library was composed. The recognition rate test results are shown in Figure 10.

Although the recognition rate in the outdoor environment is lower than in the laboratory due to outdoor noise, the recognition results still meet the needs of practical applications. In addition, according to the test results, the larger the test vocabulary, the lower the recognition rate; however, the recognition rate still reaches more than 90%. In general, the recognition rate of the system meets practical requirements.
According to its structure, the Transformer model belongs to the Seq2Seq family, so the Seq2Seq model is selected as the baseline, and the model performance on different modeling units is compared. The experimental results are shown in Table 1.
Both the Transformer-based and Seq2Seq-based language models have their lowest error rates in the task of restoring phonetic symbols to English characters when phonemes are the modeling units. The error rates on the test set reach 9.54% and 11.21%, respectively, which are 6.97 and 6.1 percentage points lower than with syllables as the modeling units.
Under the same experimental conditions, the Transformer-based language model has a slightly lower error rate than the Seq2Seq-based language model in the task of restoring phonemes and syllables to English characters, and its test running speed is faster than that of the Seq2Seq model. This is because the Transformer model has the advantage of parallel computing.
In order to test the performance of the speech recognition system combining the language model and the acoustic model, the CNN-CTC-based acoustic model was combined with the Transformer language model and compared with a speech recognition system using only the CNN-CTC-based acoustic model. Table 2 shows the test results of the speech recognition system with different modeling units.
The experimental results show that the English speech recognition performance of the combined acoustic and language model is better than that of speech recognition with words as the modeling unit. In addition, between syllables and phonemes as modeling units, it is better to train with phonemes as the recognition unit. Although the language model computation is saved and recognition is faster when the word is the modeling unit, the recognition effect with phonemes as the modeling unit is better than with words or syllables as the modeling unit. The error rate of the recognition system reached 42.53%, which is 12.66 percentage points lower than that of the word modeling unit and 4.68 percentage points lower than that of the syllable modeling unit. HMM technology mainly needs to make prior assumptions about the state sequence distribution in speech recognition; its ability to model high-level acoustic phonemes is weak, which makes acoustically similar words easy to confuse.

Discussion
Language is the main way for human thought and emotion to communicate. It is a behavioral expression of information and intelligence and the crystallization of human civilization and wisdom. Although deep learning models perform better on speech recognition than traditional models, the HMM model has a strong demand for and dependence on data, and the size and quality of the data directly affect the model's effectiveness. To realize the potential of the model, training on a large amount of data is required. The implementation details and processing difficulty of speech recognition systems vary, but the basic technical routes are similar: whichever speech recognition system is used, appropriate techniques for modeling units, speech signal preprocessing, feature extraction, system modeling, and pattern matching must be selected. As a statistical model of the speech signal, HMM can reasonably imitate the human speech process. In the future, humans will be able to accomplish tasks more conveniently and quickly through voice interaction and enjoy more modern services.

Conclusion
In order to improve the performance of English speech recognition, this paper mainly studies the design of the acoustic model and the language model for English speech recognition. A speech recognition system with syllables and phonemes as modeling units is realized by combining a CNN-CTC-based acoustic model and a Transformer-based language model.
In this paper, there is no other preprocessing algorithm design apart from the anti-interference of Markov adaptive learning in noisy environments. Subsequent work can apply data augmentation methods such as time warping, frequency masking, and time masking in the time or frequency domain of the speech. In this way, meaningful speech signals can be extracted from the noise background and the performance of the model can be improved.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The author declares no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.