Machine Learning for Predictive Analytics in the Improvement of English Speech Feature Recognition

The application of deep learning to English speech has developed rapidly in recent years. To increase the accuracy of English speech feature recognition, this study evaluates the noise present in the English speech environment, uses a bidirectional search method to select the optimum feature set, and applies a fast correlation filter to remove redundant features. It designs a low-pass filter in the complex cepstrum domain to filter out the room impulse response and obtain an estimate of the complex cepstrum of the original speech signal; this estimate is then transformed back into the time domain to obtain an estimate of the original speech signal. The paper also proposes a corresponding noise elimination model for English speech in a reverberant environment and uses the complex cepstrum domain filter to conduct simulation research on the different characteristics of the reverberation signal and the pure speech signal in the complex cepstrum domain. Finally, this study develops an English speech feature recognition model based on a deep neural network and validates the model through experimental research.


Introduction
English speech enhancement based on a regression DNN has been proposed, and experiments show that it outperforms traditional English speech enhancement algorithms. However, although deep-learning-based English speech enhancement uses many noise types and large training corpora in the data preparation stage, its generalization ability on real data still has many problems, such as distortion of English speech at low signal-to-noise ratios and unstable results on mismatched noise types and mismatched speaking styles [1].
In a system environment disturbed by noise, the accuracy of English speech recognition drops significantly, so the ideal effect cannot be achieved in practical applications, and the disturbance is even greater at low signal-to-noise ratios. For an English speech signal detection system to work normally, it is necessary to extract as much pure English speech as possible from the noise-contaminated signal when the noise source is unknown. That is, noise is suppressed while the perceived quality of the English speech is improved and protected. This kind of English speech processing technology has great research significance and application value for the related fields of English speech signal processing. With current English speech signal processing technology, English speech detection in a weak-noise environment is relatively effective. However, detection performance drops sharply in a strongly noisy environment. Therefore, the detection of English speech signals at low signal-to-noise ratios is still a subject to be studied in depth [2].
The English speech signal is originally an analog signal. However, because of the cut-off frequency, the English speech receiver only stores it as a band-limited digital signal. Processing therefore starts by digitizing the analog English speech, which typically entails amplification and gain control, prefiltering, sampling, quantization, and coding [3].
At present, English speech signal processing technology is developing rapidly in the field of information research. Its research scope involves cutting-edge scientific projects, and it has important research and application value. Moreover, informatization has become a basic requirement of modern society. In the civilian field, microphone array English speech signal processing is widely used in large multimedia exhibition halls and in the hearing aid market. Microphone array processing can adaptively steer the beam direction, suppress interference signals arriving from multiple unknown directions, and achieve higher resolution. Therefore, adaptive processing technology has developed rapidly in recent years and has also been used in other fields. However, the algorithms for microphone array English speech signal processing require a large number of floating-point operations. In current applications, most systems use DSP processors to operate on the collected signals. Although a DSP has strong floating-point capability, it has disadvantages such as poor real-time performance due to serial operation and susceptibility to interference. Therefore, it is not suited to more demanding processing systems. To address this, this paper employs an FPGA-based English speech signal processing design. Its benefits are that the processor chip is inexpensive, compact, and capable of multichannel synchronous high-speed operation. FPGA-based English speech signal processing can thereby address the inadequacies of current processing systems and has significant implications for a wide range of applications.
In view of this, based on the deep neural network, this paper studies English speech feature recognition technology and proposes a reliable English speech feature recognition algorithm to provide a reference for subsequent English speech feature recognition.

Related Work
Research on endpoint detection and speech enhancement of noisy speech signals has been conducted for more than 50 years, and significant progress has been made during this period. Voice endpoint detection technology was proposed in [4], where it was mainly applied to the time allocation of communication channels in a communication transmission system. The literature [5] proposed a system for reducing noise in the communication environment. The system introduces the concept that the noisy input voice signal is the superposition of a pure voice signal and a noise signal, and it divides the sampled voice signal into multiple subbands for processing and analysis. The system is in effect an early spectral subtraction technique, but it is implemented only in the analog domain. Thanks to the rapid development of digital signal processing algorithms and DSP (digital signal processing) hardware, speech signal detection methods based on spectral modification have developed greatly, and speech noise reduction technology has made great progress. The literature [6] proposed a "spectrum shaping" method, which uses amplitude clipping in the filter bank of the speech signal preprocessing stage to remove low-level excitations. This low-level excitation is considered a noise signal. The literature [7] proposed spectral subtraction implemented in the digital domain. Spectral subtraction was applied to statistical spectrum estimation in [8]. At nearly the same time, a technique that combines noise reduction and speech enhancement was suggested in [9]. The literature [10] proposed a voice endpoint recognition technique that establishes distinct thresholds to identify the starting point and ending point of the signal by combining the short-term energy of the speech signal with the short-term zero-crossing rate.
The literature [11] explored endpoint detection performance in greater detail and developed algorithms for performance comparisons using several energy characteristics of the signal, including square energy, logarithmic energy, and absolute value energy. Optimum spectral amplitude estimation and optimum spectral phase estimation were suggested in [12] using statistical estimation theory. These findings are frequently referenced in noise reduction studies; at the same time, the primary approach to noise reduction has shifted to the problem of estimating the spectral amplitude of pure speech signals. Researchers have since created more statistical spectrum estimation approaches, such as the minimum mean square error (MMSE) logarithmic spectral amplitude estimation method, the maximum likelihood (ML) spectral amplitude estimation method, and the maximum a posteriori (MAP) method. The Linear Predictive Coding (LPC) model and the Kalman filter were used in [13] to reduce noise and raise the signal-to-noise ratio of speech signals. The literature [14] provided more endpoint detection algorithms based on frequency domain spectrum analysis, using the Fourier transform to obtain the frequency domain information of the voice signal. The literature [15] argued for the short-term stationarity of the speech signal, holding that its parameters remain approximately constant over a brief period of time.
Segmentation methods based on LPC coefficients, methods based on speech parameters, and segmentation algorithms based on parameter filtering have been proposed in succession. The literature [16] proposed an algorithm based on an artificial neural network that determines the different weights of the signal through fast convergence; its detection performance is significantly better than that of the early statistical decision algorithms. The literature [17] proposed applying wavelet transform technology to speech signal detection, which greatly reduces the computational complexity of the algorithm.
The literature [18] studied the least squares method. This blind system identification method performs eigenvalue decomposition within each frequency band. The literature [19] developed an adaptive filtering method that combines the Least Mean Square (LMS) algorithm with adaptive filtering. However, its disadvantage is that it has many restrictive conditions: common zeros between channels hinder the method, and the correlation matrix of the sound source signal is required to have full rank. The literature [20] studied multichannel linear prediction, which diagonalizes the covariance matrix of the speech signal to obtain the correlation characteristics of the signal. The literature [21] proposed using a virtual model to simulate the impulse response of the room. This method relies on the stability of the channel. In practice, however, the environment changes randomly and this requirement is difficult to meet, so the method is difficult to implement.

English Speech Feature Recognition Algorithm Based on Deep Learning
This section introduces the data set, data preprocessing, and extracted features, and applies two effective feature selection methods. In addition, three different classifiers are used and their classification performance is compared.
We normalized all the data, as shown in the following formula:
$$\hat{a}(n) = \frac{a(n) - \mu(n)}{\sigma(n)},$$
where $a(n)$ is the original sample, $\mu(n)$ and $\sigma(n)$ are the sample mean and standard deviation of the nth segment of data, each segment is 1 minute long, and $\hat{a}(n)$ is the normalized sample.
After preprocessing, each piece of data is equally segmented, and each segment is 1 minute long, and then features are extracted from each segment of the data. In this paper, 16 features are extracted from the single-channel ECG signal.
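The segmentation and per-segment normalization can be illustrated with a minimal sketch. The function name `normalize_segments` and the sampling rate parameter `fs` are assumptions introduced for illustration; the paper does not state how incomplete trailing segments or constant segments are handled, so the choices below are only one reasonable option.

```python
import numpy as np

def normalize_segments(signal, fs, segment_seconds=60):
    """Split a 1-D signal into equal segments and z-score each segment.

    fs is the sampling rate in Hz (hypothetical parameter, not given in the text).
    """
    seg_len = int(fs * segment_seconds)
    n_segments = len(signal) // seg_len              # assumption: drop an incomplete tail
    segments = np.asarray(signal[: n_segments * seg_len], dtype=float)
    segments = segments.reshape(n_segments, seg_len)
    mu = segments.mean(axis=1, keepdims=True)        # per-segment mean
    sigma = segments.std(axis=1, keepdims=True)      # per-segment standard deviation
    sigma[sigma == 0] = 1.0                          # assumption: leave flat segments at zero
    return (segments - mu) / sigma
```

Each row of the returned array is one normalized 1-minute segment, ready for feature extraction.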

Time Domain Characteristics.
The time-domain features extracted in this study include the mean value of the RR interval without detrending, the mean value of the detrended RR interval, the standard deviation of the RR interval, the maximum value of the RR interval, the minimum value of the RR interval, the range of the RR interval, the fraction of adjacent RR intervals that differ by more than 50 ms, the root mean square of the differences between adjacent RR intervals, and the standard deviation of the differences between adjacent RR intervals.
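As a rough illustration, most of these time-domain quantities can be computed from a sequence of RR intervals in a few lines. The function name `rr_time_features` and the dictionary keys are hypothetical, RR intervals are assumed to be given in milliseconds, and the detrended mean is omitted because the detrending method is not specified in the text.

```python
import numpy as np

def rr_time_features(rr_ms):
    """Time-domain features of an RR-interval sequence given in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)                               # differences between adjacent RR intervals
    return {
        "mean_rr": rr.mean(),
        "sd_rr": rr.std(ddof=1),
        "max_rr": rr.max(),
        "min_rr": rr.min(),
        "range_rr": rr.max() - rr.min(),
        "pnn50": np.mean(np.abs(diff) > 50.0),       # fraction of adjacent differences > 50 ms
        "rmssd": np.sqrt(np.mean(diff ** 2)),        # root mean square of adjacent differences
        "sdsd": diff.std(ddof=1),                    # standard deviation of adjacent differences
    }
```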

Frequency Domain Characteristics.
In addition to the time domain, this paper also extracts a set of important frequency domain features. To extract the spectral characteristics of the RR signal, the RR sequence is processed with the fast Fourier transform (FFT), yielding four frequency domain features, including the power in the very-low-frequency band, the power in the low-frequency band, and the power in the high-frequency band.
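A possible way to obtain band powers from an RR sequence with the FFT is sketched below. Several details are assumptions not stated in the paper: the band edges in `BANDS_HZ` follow commonly used heart rate variability conventions, the RR series is linearly interpolated onto an even 4 Hz grid before the FFT, and the function name `rr_band_powers` is hypothetical. The powers are meaningful only relative to one another.

```python
import numpy as np

# Assumed band edges in Hz (common HRV conventions, not taken from the paper).
BANDS_HZ = {"vlf": (0.003, 0.04), "lf": (0.04, 0.15), "hf": (0.15, 0.40)}

def rr_band_powers(rr_ms, fs_resample=4.0):
    """Approximate band powers of an RR-interval sequence (milliseconds)."""
    rr = np.asarray(rr_ms, dtype=float)
    t = np.cumsum(rr) / 1000.0                        # beat times in seconds
    grid = np.arange(t[0], t[-1], 1.0 / fs_resample)  # even time grid
    rr_even = np.interp(grid, t, rr)                  # evenly sampled RR series
    rr_even = rr_even - rr_even.mean()                # remove the DC component
    spectrum = np.abs(np.fft.rfft(rr_even)) ** 2 / (fs_resample * len(rr_even))
    freqs = np.fft.rfftfreq(len(rr_even), d=1.0 / fs_resample)
    df = freqs[1] - freqs[0]                          # frequency resolution
    return {
        name: spectrum[(freqs >= lo) & (freqs < hi)].sum() * df
        for name, (lo, hi) in BANDS_HZ.items()
    }
```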

Nonlinear Characteristics.
In addition to time domain features and frequency domain features, this paper also extracts two nonlinear features: sample entropy and spectral entropy.
Multiscale entropy (MSE) is used to describe the structural complexity of time series. Many kinds of entropy can be used to calculate multiscale entropy, such as approximate entropy and fuzzy entropy under various time granularities. Multiscale entropy is increasingly used in sleep analysis. In this paper, sample entropy (SampEn) is used as the core of multiscale entropy calculation.
Given a signal $x_i$, $i = 1, \ldots, N$, of N data points, a coarse-grained time series $y^{(t)}$ is first generated, where t is the scale factor. The ECG signal is divided into nonoverlapping windows of length t, and the average value within each window is calculated:
$$y_j^{(t)} = \frac{1}{t} \sum_{i=(j-1)t+1}^{jt} x_i, \quad 1 \le j \le \lfloor N/t \rfloor.$$
Therefore, $y^{(1)}$ is the original signal, and $y^{(t)}$ is the coarse-grained sequence obtained by dividing the original sequence into windows of length t.
The calculation steps of sample entropy (SampEn) are as follows. First, the coarse-grained time series y (N here denotes the length of that series) forms a set of m-dimensional vectors in order (m is the pattern dimension, set to 2 in this paper):
$$x(i) = [y(i), y(i+1), \ldots, y(i+m-1)], \quad i = 1, \ldots, N-m+1.$$
We define the distance between x(i) and x(j) as d[x(i), x(j)], the largest difference between corresponding elements; namely,
$$d[x(i), x(j)] = \max_{0 \le k \le m-1} |y(i+k) - y(j+k)|.$$
For each value of i, we count the number of vectors x(j), $j \ne i$, whose distance to x(i) is below the tolerance r and calculate its ratio to the total number of distances N − m, denoted by
$$C_i^m(r) = \frac{1}{N-m} \, \#\{\, j \ne i : d[x(i), x(j)] < r \,\}.$$
Then, the average value of $C_i^m(r)$ is
$$C^m(r) = \frac{1}{N-m+1} \sum_{i=1}^{N-m+1} C_i^m(r).$$
The algorithm then increases the dimension to m + 1 and repeats the previous steps to obtain $C^{m+1}(r)$.
Finally, the sample entropy is
$$\mathrm{SampEn}(m, r) = -\ln \frac{C^{m+1}(r)}{C^m(r)}.$$
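The coarse-graining and sample entropy steps can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the default tolerance `r = 0.2 * std` is an assumed common choice that the paper does not state, the same number of templates (N − m) is used for both dimensions as in the widely used Richman and Moorman formulation, and the function names are hypothetical.

```python
import numpy as np

def coarse_grain(x, scale):
    """Average nonoverlapping windows of length `scale` (the scale factor t)."""
    x = np.asarray(x, dtype=float)
    n = len(x) // scale
    return x[: n * scale].reshape(n, scale).mean(axis=1)

def sample_entropy(y, m=2, r=None):
    """Sample entropy of a 1-D series with pattern dimension m and tolerance r."""
    y = np.asarray(y, dtype=float)
    if r is None:
        r = 0.2 * y.std()                      # assumed default tolerance
    n_templates = len(y) - m                   # same template count for m and m + 1

    def match_count(dim):
        templates = np.array([y[i:i + dim] for i in range(n_templates)])
        # Chebyshev distance between every pair of templates
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        return np.sum(dist <= r) - n_templates  # exclude self-matches on the diagonal

    b = match_count(m)                         # matches of length m
    a = match_count(m + 1)                     # matches of length m + 1
    if a == 0 or b == 0:
        return np.inf                          # entropy undefined; report as infinite
    return -np.log(a / b)
```

Multiscale entropy at scale t is then `sample_entropy(coarse_grain(x, t))`. The pairwise distance matrix uses memory quadratic in the series length, which is acceptable for short 1-minute segments but not for long recordings.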

Mobile Information Systems
Spectral entropy (SpecEn) describes the flatness of the power spectral density (PSD) and indirectly reflects the irregularity of the time series. The larger the value of SpecEn, the flatter the PSD and, accordingly, the more irregular the signal in the time domain. Conversely, the smaller the value of SpecEn, the more concentrated the spectrum and the more regular the signal in the time domain. Spectral entropy is therefore also extracted as a feature.
In the sample training process, as the number of features increases, the time needed to evaluate the features and train the model and the model's complexity both increase, while its generalization ability decreases. By removing irrelevant and redundant features, feature selection can lower the computational complexity.
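For illustration, spectral entropy can be computed as the Shannon entropy of the normalized power spectrum. The sketch below makes assumptions the paper does not specify: the PSD is estimated with a plain FFT periodogram, the mean is removed first, and the result is optionally divided by the logarithm of the number of frequency bins so that it lies between 0 and 1. The function name `spectral_entropy` is hypothetical.

```python
import numpy as np

def spectral_entropy(x, normalize=True):
    """Shannon entropy of the normalized periodogram of a non-constant 1-D signal."""
    x = np.asarray(x, dtype=float)
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2   # periodogram estimate of the PSD
    p = psd / psd.sum()                            # treat the PSD as a probability distribution
    p = p[p > 0]                                   # skip empty bins (0 * log 0 is taken as 0)
    h = -np.sum(p * np.log(p))
    if normalize:
        h /= np.log(len(psd))                      # scale to the range [0, 1]
    return h
```

A pure tone concentrates its power in one bin and gives a value near 0, while white noise spreads power evenly and gives a value near 1.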
This study divides the feature selection process into two phases. The optimum feature set for classification is first selected using the bidirectional search (BDS) algorithm, and the redundant features are then removed using the quick correlation filter.
The first phase, the bidirectional search (BDS) method, combines sequential forward selection (SFS) and sequential backward selection (SBS).

Bidirectional Search (BDS) Algorithm.
Sequential forward selection (SFS): add each feature in turn to an initially empty set A. Each time a feature is added, the classification accuracy obtained with the features in A is calculated. If the accuracy is higher than before the addition, the feature is valid and is kept in A; otherwise, the feature is invalid and is removed from A.
Sequential backward selection (SBS): remove each feature in turn from the full set S and calculate the classification accuracy obtained with the features remaining in S. If the accuracy is higher than before the removal, the feature stays removed; otherwise, the feature is put back into S.
Bidirectional search (BDS): run the forward and backward selection searches at the same time. When the two searches arrive at the same feature subset, the search stops.
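The idea of running both searches until they meet can be sketched as follows. This is the common textbook form of bidirectional search, in which the forward step adds the best remaining feature and the backward step removes the least useful one; it is not necessarily the exact accept-or-reject rule described above. The `score` callable stands in for the classification accuracy of a feature subset and, like the function name, is a hypothetical placeholder.

```python
def bidirectional_search(features, score):
    """Textbook bidirectional search over a collection of feature names.

    score(subset) must return a number (for example, cross-validated accuracy)
    for any set of features; it is supplied by the caller.
    """
    selected = set()               # grown by the forward search (SFS)
    remaining = set(features)      # shrunk by the backward search (SBS)
    while selected != remaining:
        # Forward step: add the undecided feature that helps the selected set most.
        candidates = sorted(remaining - selected)
        best = max(candidates, key=lambda f: score(selected | {f}))
        selected.add(best)
        if selected == remaining:
            break
        # Backward step: drop the undecided feature whose removal hurts least.
        candidates = sorted(remaining - selected)
        worst = max(candidates, key=lambda f: score(remaining - {f}))
        remaining.remove(worst)
    return selected
```

Because the forward search never adds a feature the backward search has removed, and the backward search never removes a feature the forward search has added, the two sets are guaranteed to converge to the same subset.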

mRMR Algorithm.
In the second stage, in order to evaluate the synergy between features and construct a set of optimal features, this paper adopts a filtering method based on mutual information and the minimum redundancy maximum relevance (mRMR) criterion.
The mRMR algorithm is based on mutual information. Given two random variables x and y with probability density functions p(x), p(y), and p(x, y), their mutual information is
$$I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy.$$
The goal of the algorithm is to find a feature subset S containing m features $\{x_i\}$.
The maximum relevance criterion is
$$\max D(S, C), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; C),$$
where $x_i$ is the i-th feature, C is the categorical variable, and S is the feature subset. The minimum redundancy criterion is
$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j).$$
The two criteria are combined additively into a single objective function:
$$\max \Phi(D, R), \quad \Phi = D - R.$$
That is, the next feature is chosen incrementally as
$$\max_{x_j \in X - S} \left[ I(x_j; C) - \frac{1}{m} \sum_{x_i \in S} I(x_j; x_i) \right],$$
where X is the complete set of features $x_j$, S is the set of already selected features $x_i$ (of size m), C is the class, and I is the mutual information defined above. The probability density functions p(x), p(y), and p(x, y) are estimated by a kernel density estimator based on adaptive diffusion.
This paper uses three classifiers, support vector machine (SVM), AdaBoost, and random forest, to classify English speech features.
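The greedy mRMR selection can be sketched in a few lines. This sketch is not the implementation used in this paper: it assumes the features have already been discretized and replaces the adaptive-diffusion kernel density estimator with a simple plug-in (count-based) estimate of mutual information. The function names `mutual_information` and `mrmr` are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete 1-D arrays."""
    _, xi = np.unique(np.ravel(x), return_inverse=True)
    _, yi = np.unique(np.ravel(y), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)              # joint histogram of value pairs
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)        # marginal of x
    py = joint.sum(axis=0, keepdims=True)        # marginal of y
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mrmr(X, y, k):
    """Greedily pick k column indices of X by relevance minus mean redundancy."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]       # start with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, i]) for i in selected]
            )
            score = relevance[j] - redundancy    # additive mRMR criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

In the usage below, an exact copy of an already selected feature is skipped in favour of a less relevant but less redundant one, which is the behaviour the criterion is designed to produce.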
AdaBoost Method.
In addition to SVM, this paper also uses the AdaBoost (AB) method. Boosting algorithms classify well. Boosting is an iterative algorithm whose purpose is to combine several classification models into one. The combination is based on weighted voting among classifiers of the same type.
AdaBoost (AB) is a widely used boosting algorithm, first proposed by Freund and Schapire. AB can be used with other classifiers, but if AB is applied to a complex classifier, the prediction performance on new data is greatly affected; that is, generalization ability is lost. AB therefore works better with weak classifiers.
In the m-th iteration, the AB algorithm assigns a weight $w_k^m$ to each feature vector $x_k$ in the training set, and the m-th weak classifier is trained with these weights. Its classification performance is then estimated by the error $\varepsilon_m$, which determines the voting weight of the m-th weak classifier in the final decision.
Therefore, the smaller the error $\varepsilon_m$ of a classifier, the greater its contribution to the final classification. At the end of the iteration, the weights of the misclassified samples are updated to $w_k^{m+1}$, and the weights of all samples are then normalized so that they still form a distribution.
In this algorithm, the error $\varepsilon_m$ of the m-th iteration is defined as the sum of the weights of the misclassified samples divided by the sum of the weights of all the samples in the current iteration.
Random Forest.
Random forest (RF) is a combination of multiple decision tree classifiers, each of which depends on an independently sampled random vector. Every decision tree in a random forest has the same distribution. As the number of decision trees in the random forest increases, the generalization error of the forest gradually converges. This error depends on the strength of each individual decision tree in the forest and on the correlation between the trees.
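The weighting scheme can be made concrete with a small sketch of discrete AdaBoost. The choices below are assumptions for illustration only: the weak classifier is a one-feature threshold stump, labels are coded as +1 and −1, and the voting weight is the standard $\alpha_m = \tfrac{1}{2}\ln\frac{1-\varepsilon_m}{\varepsilon_m}$, which the paper does not state explicitly. All function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Threshold stump on feature j: +1 on one side of thr, -1 on the other."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Return (error, feature, threshold, polarity) of the best weighted stump."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = stump_predict(X, j, thr, pol)
                # weighted error: misclassified weight over total weight
                err = np.sum(w[pred != y]) / np.sum(w)
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost_fit(X, y, n_rounds=10):
    """Train an ensemble of weighted stumps; y must contain +1 / -1 labels."""
    w = np.full(len(y), 1.0 / len(y))            # start from uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # smaller error gives a larger vote
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * y * pred)        # raise weights of misclassified samples
        w = w / w.sum()                          # renormalize to a distribution
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(X, ensemble):
    """Weighted vote of all stumps in the ensemble."""
    votes = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(votes >= 0, 1, -1)
```

On a one-dimensional problem that no single stump can solve (positive labels on both ends, negative in the middle), three boosting rounds are enough for the weighted vote to classify every training point correctly.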

English Speech Feature Recognition System Based on Deep Neural Network
When performing English speech recognition in a classroom or in a relatively closed place, some of the sound waves emitted by the sound source are received directly by the microphone, and the rest are reflected and absorbed after reaching the indoor walls, ceiling, ground, and other obstacles [22]. The attenuation of the sound signal after reflection is relatively small. Because the obstacles are made of different materials, their reflection coefficients differ. In addition, the sound energy reaching each obstacle differs, so the signals received by the microphone differ considerably in amplitude from the original signal, and their phase also differs. From the reverberation process shown in Figure 1, it can be seen that reverberation is different from unrelated external interference signals such as noise. The reverberation signal originates from the sound source signal and is a regular interference signal [23]. According to research on the complex cepstrum of the speech signal, the complex cepstra of the sound source signal and of the room impulse response occupy different positions when the reverberant speech signal is transformed into the complex cepstrum domain. While the latter is concentrated at both ends, the former is mostly concentrated closer to the midpoint [24]. The estimated value of the complex cepstrum of the original speech signal can therefore be obtained by designing a low-pass filter in the complex cepstrum domain to filter out the room impulse response, and this estimated value is then transformed into the time domain to obtain the estimated value of the original speech signal. Figure 2 depicts the complex cepstrum dereverberation procedure used in this work. Designing a complex cepstrum domain filter is an important part of the process of speech signal dereverberation. The complex cepstrum domain filter is a low-pass filter in a broad sense.
The filter consists of three parts, namely, the passband, the transition band, and the stopband, and its parameters determine the dereverberation performance. Figure 3 shows the filter schematic diagram.
Here, L is the length of the filter, M is the cut-off point of the passband, h is the length of the transition band, and h(n) is the transition band function. When M is 1/16 of h and h is 1/8 of L, the best dereverberation evaluation index is obtained, and the dereverberation effect is the best.
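A filter with a passband, a transition band, and a stopband can be sketched as a weighting window over the cepstral index. The shape of the transition function h(n) is given only in Figure 3, so the raised-cosine transition below is an assumption, as are the function name `cepstral_lowpass` and the convention that the zero-quefrency point sits at the centre of the window.

```python
import numpy as np

def cepstral_lowpass(L, M, h):
    """Low-pass weighting window of length L for a centred complex cepstrum.

    M is the passband cut-off point and h the transition band length, both
    measured in samples from the centre of the window.
    """
    n = np.abs(np.arange(L) - L // 2)          # distance from the centre (zero quefrency)
    w = np.zeros(L)                            # stopband: weight 0
    w[n <= M] = 1.0                            # passband: weight 1
    trans = (n > M) & (n < M + h)
    # assumed raised-cosine transition from 1 down to 0
    w[trans] = 0.5 * (1.0 + np.cos(np.pi * (n[trans] - M) / h))
    return w
```

With the ratios given above for a length of 1024, the call is `cepstral_lowpass(1024, 8, 128)`. The window is multiplied element-wise with a complex cepstrum whose origin has been moved to the centre (for example with `np.fft.fftshift`), which keeps the region near the midpoint and suppresses both ends.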
An English speech recording is downloaded from an officially recognized speech library. The sampling frequency is 44100 Hz, and the length, width, and height of the room used in the experiment are 5 m, 4 m, and 3 m, respectively. Moreover, this paper uses the image method to simulate the room impulse response, and the room impulse response function is shown in Figure 4. The collected speech is convolved with the simulated impulse response function to obtain the reverberant speech, which is then divided into frames and multiplied by a Hamming window. The frame length is 1024 samples, and the frame shift is 1/4 of the frame length.
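The convolution and framing steps can be illustrated with a short sketch. The room impulse response is taken as a given array here, since simulating it with the image method is outside the scope of the example; the function names `reverberate` and `frame_signal` are hypothetical, and the input is assumed to be at least one frame long.

```python
import numpy as np

def reverberate(speech, rir):
    """Convolve clean speech with a room impulse response."""
    return np.convolve(speech, rir)

def frame_signal(x, frame_len=1024, hop=None):
    """Cut x into overlapping frames and apply a Hamming window to each."""
    hop = frame_len // 4 if hop is None else hop      # frame shift: 1/4 of the frame length
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop        # trailing partial frame is dropped
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)             # shape: (n_frames, frame_len)
```

For example, `frame_signal(reverberate(speech, rir))` yields one windowed frame per row, each of which can then be transformed into the complex cepstrum domain.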
As seen in Figure 5, this filter is a low-pass filter suited to the cepstrum domain. Good evaluation results for the dereverberated speech signal are obtained when the highest cut-off point of the filter is 1/256 of the frame length and the bandwidth of the transition band is 1/16 of the frame length.
Depending on the distance from the sound source to the microphone array, the array is described by either a near-field model or a far-field model. When the signal source is far from the array, the wave path difference of the signal reaching each element is relatively small, and the signal can be treated as a plane wave. In contrast, when the signal source is close to the microphone array, the signal reaches the individual array elements with larger amplitude differences, and the wave arriving at the array should be modeled as a spherical wave. Figure 6 shows the near-field and far-field models of the microphone array.

Figure 6: The near-field and far-field models of the microphone array (panel labels: sound source, near field, far field).

The overall implementation scheme of the FPGA-based microphone array signal processing system is shown in Figure 7. First, a microphone array is designed as the voice signal collection terminal. This paper uses 4 low-cost omnidirectional electret microphones as the elements of the microphone array to convert the voice signal into an analog signal output. Then, a signal acquisition system with signal acquisition and AD conversion functions is designed.
The model in this paper is built on a deep neural network. The results of the deep neural network in this paper are shown in Figure 8.

Performance Verification of English Speech Feature Recognition Model Based on Deep Neural Network
This study uses deep neural networks to construct a model for English speech feature recognition. The model performs English speech denoising with a neural network approach so that English speech features can be recognized even when there is classroom reverberation. In the system performance test, this work therefore first assesses the effect of English speech denoising and then evaluates the effect of English speech feature recognition. To determine the denoising effect, this study collects numerous sets of English speech data from the Internet and runs tests with the system that it has built, as shown in Table 1 and Figure 9.
The results in Table 1 and Figure 9 show that the English speech feature recognition model based on the deep neural network constructed in this paper performs well. After that, this paper evaluates the English speech feature recognition effect of the constructed system. The results obtained are shown in Table 2 and Figure 10. These experimental results show that the English speech feature recognition system constructed in this paper is effective to a certain degree.

Conclusion
This paper studies an English speech detection algorithm for nonstationary strong-noise environments. Windowing the English speech signal makes speech signal processing easier, and different window functions have different effects. Linear predictive analysis includes the autocorrelation method and the covariance method. The autocorrelation method is more reliable than the covariance method and is better suited to analysing English speech signals. In this study, the filter bank summation and overlap-add methods are introduced for the short-time synthesis of English speech signals. After the complexity of the two methods is evaluated, the overlap-add method is chosen to process the speech signal because of its simplicity. This work also conducts simulation research on the different properties of the reverberation signal and the pure speech signal in the complex cepstrum domain, examines the basic idea of complex cepstrum domain filtering, and builds a complex cepstrum domain filter. Finally, this paper constructs an English speech feature recognition model based on a deep neural network and verifies the reliability of the algorithm model through experimental research [25, 26].

Data Availability
The data used to support the findings of this study are included within the article.