Correction Method of Spoken Pronunciation Accuracy of AI Virtual English Reading

In order to improve the pronunciation accuracy of spoken English reading, this paper combines artificial intelligence technology to construct a correction model of the spoken pronunciation accuracy of AI virtual English reading. It analyzes the process of speech synthesis with intelligent speech technology, proposes statistical parametric speech synthesis based on hidden Markov chains, and improves the system algorithm into an intelligent algorithm that meets the requirements of the correction system of spoken pronunciation accuracy of AI virtual English reading. Finally, the paper uses simulation research to analyze English reading, spoken pronunciation, and pronunciation correction in the intelligent system. The experimental results show that the proposed correction system of spoken pronunciation accuracy of AI virtual English reading meets the basic requirements of the system designed in this paper.


Introduction
The virtual spoken English system has become an important English communication tool. With the continuous development of artificial intelligence technology, AI virtual spoken English tools have gradually moved from theoretical research to real-world applications and are increasingly used for pronunciation correction in AI virtual English teaching.
Most English speech synthesis models based on pronunciation mechanisms contain three main modules. Among them, the pronunciation movement model simulates the morphological structure of the pronunciation organs, the coarticulation model simulates the dynamic characteristics of the pronunciation organs, and the acoustic model simulates the aerodynamic process to generate the corresponding English speech signal. Any inappropriate approximation in these three main modules will affect the English speech quality. We try to build a more accurate pronunciation movement model to approximate the morphological characteristics of the articulation organs, so as to obtain a better pronunciation synthesis system. At present, there are two mainstream modeling strategies: physiological models and geometric models. The physiological pronunciation model uses the finite element method to simulate the biomechanical properties of soft tissue and embeds the muscle structure to drive the model. However, the physiological pronunciation model faces the high computational load of the finite element module and the complicated distribution of the pronunciation organs and related muscles, which makes its control extremely complicated. The geometric pronunciation model models the contours of the vocal organs, and the shapes of the vocal organs and vocal tract can be directly controlled by a predefined parameter set [1].
This parameter set is obtained through statistical analysis. Compared with the physiological pronunciation model, the geometric pronunciation model does not consider the biomechanical properties of soft tissues or the detailed function of the related muscles, so the calculation cost is greatly reduced and controlling the vocal tract shape becomes simple.
Therefore, the geometric model is more suitable for English speech animation applications where there is no need to understand and analyze the internal structure of the pronunciation organs [2].
Although in most cases clear sound is sufficient for basic communication, visual information provides a more effective and vivid communication effect. In addition, when the voice is missing or unclear, visual information can help people guess and understand what the speaker wants to express. For example, for people with hearing impairment, effective lip reading, or inference based on changes in the speaker's facial expressions, can help them understand the speaker's meaning accurately [3].
Based on the above analysis, this paper combines intelligent voice technology to construct the correction system of spoken pronunciation accuracy of AI virtual English reading, explores the effectiveness of the model, and improves the correction effect of spoken English reading.

Related Work
On the basis of speech visualization, speech-driven face modeling and animation technology is of great significance for improving the teaching effect of the multimodal Mandarin pronunciation teaching system [4]. In recent years, many 3D speaker simulation technologies have been proposed, which can basically be divided into the following six categories: based on vector graphics animation, based on raster graphics systems for animation rendering, based on data-driven synthesis, based on anatomical modeling of the head, based on deformation algorithms, and based on machine learning [5]. 3D speaker modeling based on vector graphics animation uses simple vector graphic animation to show the outline of the main facial articulation organs (mouth, tongue, teeth, soft palate, etc.) [6]. 3D speaker modeling for animation rendering based on a raster image system uses complex polygons to form a human head model. The advantage is that the raster image system can provide a high rendering level and a more realistic head model; the disadvantages are that the time-varying motion parameters are difficult to calculate, the raster image system is very expensive, and animation rendering takes a long time [7]. 3D speaker modeling based on data-driven synthesis uses digital image processing technology to extract features from digital images [8]. Literature [9] established an acoustic-to-articulatory inversion model based on generalized variable parameter hidden Markov models (GVP-HMM) to achieve 3D speaker modeling. Literature [10] modeled a 3D speaker from the anatomy of the head: based on the physiological structure of the face, a muscle model is proposed, and muscle vectors are used to simulate the movement of the muscles to generate facial expression animation.
The disadvantage of this method is that the muscle parameter derivation mechanism is very indirect, the measurement is also very complicated, and the control parameters of muscle characteristics are only partially visible. 3D speaker modeling based on deformation algorithms calculates the positions of the deformation points of the entire face by capturing a small number of facial control point displacements [11]. This method puts the face into a regular control grid, such as an N × N × N cube, and establishes the correspondence between the cubic control grid and the object to be deformed. The control grid can then be moved to deform the object, so that the local and global deformation of the object follow the local and global movement of the control grid. First, the method calculates the coordinates of each point to be deformed relative to the neighboring control points and obtains the position of the point from the displacement of the control points [12]. 3D speaker modeling based on machine learning uses artificial intelligence techniques to learn the correspondence between speech or text and the movement of the articulators and facial expressions [13], so that any speech or text can drive the 3D head model; this approach avoids large-scale real-person data collection. This method is currently in the research stage.
There have been many studies on 3D speakers abroad. Literature [14] developed a FAP-driven Italian talking-head facial animation model based on the MPEG-4 standard, which is automatically trained on real data; three-dimensional kinematics information is used to create a lip articulation model and directly drive the talking-head model. The virtual speaker ARTUR developed in [15] shows the movement of the tongue and teeth, the pronunciation organs in the oral cavity. The visual speaker developed in [16] uses an electromagnetic articulography device to collect five control points on the tongue, two control points on the soft palate, and six control points on the mouth to simulate articulation; the model of the pronunciation organs is obtained by three-dimensional reconstruction of magnetic resonance images. Literature [17] developed a visual pronunciation system based on a physiological model, which simulates the movement of the articulation organs through the deformation of the biological characteristics of each muscle in the face and vocal tract. Literature [18] developed a face animation system, which uses text/speech as the driving data and uses the hidden Markov model to extract features of the speech signal; the speech is represented by Mel Frequency Cepstral Coefficients (MFCC). The keyframe sequence of the audio-viseme mapping is obtained through MFCC training, and a real-time synchronized face animation system is obtained according to the mapping relationship. Literature [19] developed a text-driven 3D Chinese pronunciation system: a pronunciation corpus was collected with EMA equipment, the articulation model and acoustic model were trained based on the hidden semi-Markov model (HSMM), and a 3D mesh model of the pronunciation organs was obtained through MRI, realizing a Chinese pronunciation system with synchronized articulation.
Literature [20] realized correct pronunciation animation of a 3D human model, choosing EMA data as support and the Dirichlet free-form deformation (DFFD) algorithm to drive the 3D talking-head model.

Statistical Parametric Speech Synthesis Based on Hidden Markov Chain
Generally speaking, unit splicing technology often does not involve processing of the speech signal itself, and the quality of the synthesized speech largely depends on the database produced.
Since this paper focuses only on parametric speech synthesis technology, nonparametric speech synthesis methods are outside its scope. Parametric speech synthesis technology mainly uses data to train a model, so that the model can learn the mapping from text to acoustic parameters from the data set. Compared with nonparametric speech synthesis, in the prediction stage it no longer depends on the data set, and the model directly synthesizes the text into speech. Among parametric speech synthesis models, statistical parametric speech synthesis based on hidden Markov chains is the most popular technology; it is generally divided into a text analysis module, an acoustic module, and a vocoder synthesizer. The process of the statistical parametric speech synthesis model based on the hidden Markov chain is shown in Figure 1.
Character-to-phoneme conversion converts words into phonetic representations, which are generally described by phonemes.
The prosodic unit is composed of adjacent phonemes and can generally reflect information such as the speaker's mood and the mood of the sentence (declarative, interrogative, imperative, etc.). Prosodic features embody the pitch, length, and intensity of the speech. Adding prosodic information to the input helps to enhance the naturalness of the synthesized voice. Since independent phonemes cannot model context information, they are not conducive to synthesized speech quality. Therefore, after the input is converted into phonemes, contextual information is often added to the phoneme information, mainly including phoneme-related, stress-related, and location-related factors. The common context information in English is shown in Table 1.
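To make the idea of context-dependent phoneme labels concrete, the sketch below packs a phoneme, its neighbours, and a few of the positional and stress factors from Table 1 into a single label string. The delimiter scheme and field names are assumptions for illustration, not the exact label format used by the system.

```python
# Hypothetical full-context label builder: the current phoneme plus its
# left/right neighbours and a few prosodic position/stress factors are
# packed into one string, in the spirit of the context factors in Table 1.
# The "-", "+", and "/" delimiters are an invented convention.

def context_label(prev_ph, cur_ph, next_ph, pos_in_syl, n_ph_in_syl, stressed):
    return (f"{prev_ph}-{cur_ph}+{next_ph}"
            f"/pos:{pos_in_syl}of{n_ph_in_syl}"
            f"/stress:{int(stressed)}")

# Label for the "e" in "hel...", second of three phonemes in a stressed syllable.
label = context_label("h", "e", "l", pos_in_syl=2, n_ph_in_syl=3, stressed=True)
print(label)   # → h-e+l/pos:2of3/stress:1
```

Each distinct label string would correspond to one context-dependent model in the training described below.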
In the statistical parametric speech synthesis model based on the hidden Markov chain, the function of the acoustic module is to convert the phoneme-level context sequence output by the text analysis module into the corresponding acoustic parameters, which often include Mel cepstrum coefficients, the fundamental frequency, and a voicing flag. The acoustic model is modeled by hidden Markov chains.
In the training stage, acoustic parameters such as the cepstral coefficient sequence, the fundamental frequency sequence, and the voicing sequence are extracted from the audio through the corresponding signal processing algorithms. Each different context corresponds to a different state (hidden variable) in the hidden Markov chain, and beginning and end substates are introduced in each state. The state is used to describe the prosodic and linguistic context. The acoustic parameter sequence corresponds to the observation values of the states in the hidden Markov chain, and the observation distribution of each state is a multidimensional Gaussian mixture distribution. At the same time, the fundamental frequency information contains the fundamental frequency value and the voicing flag. Among them, the fundamental frequency is continuous and the voicing flag is discrete; therefore, a multispace mixed distribution needs to be adopted. It is worth noting that, according to the hidden Markov model, the probability of the duration of the state sequence is

p(d | λ) = ∏_{k=1}^{K} p_k(d_k). (1)

Among them, K is the total number of states of the hidden Markov model obtained from the input context sequence, λ is the parameter set of the hidden Markov model, and p_k(d_k) is the probability that state k lasts for d_k basic time slots, whose expression is

p_k(d_k) = a_kk^{d_k − 1} (1 − a_kk). (2)

Among them, a_kk is the self-transition probability of state k. This paper is based on the maximum conditional probability criterion. When maximizing formula (2), the probability of the duration of each state decreases as the number of continuous time slots increases; therefore, the maximizing duration of each state is always 1 time slot. A hidden semi-Markov model is therefore introduced to model the state duration, which makes the duration of each state obey a Gaussian distribution.
The mean and variance of the duration distribution of each state are the results after the last iteration of the forward-backward algorithm.
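The geometric duration probability in formulae (1) and (2) can be sketched numerically; the code below uses toy transition probabilities and shows why, under a plain HMM, the most probable duration of every state is a single time slot.

```python
# Sketch of the state-duration probability in formula (2): under a plain
# HMM, the chance that state k persists for d slots is geometric in the
# self-transition probability a_kk. The probabilities used are toy values.

def duration_prob(a_kk: float, d: int) -> float:
    """p_k(d) = a_kk^(d-1) * (1 - a_kk): stay d-1 times, then leave."""
    return a_kk ** (d - 1) * (1.0 - a_kk)

def sequence_duration_prob(self_transitions, durations):
    """Product over states, as in formula (1)."""
    p = 1.0
    for a_kk, d in zip(self_transitions, durations):
        p *= duration_prob(a_kk, d)
    return p

# The geometric form decays monotonically in d, which is why the text notes
# that maximizing it always gives a duration of 1 slot per state — the
# motivation for the hidden semi-Markov duration model.
probs = [duration_prob(0.8, d) for d in (1, 2, 3)]
assert probs[0] > probs[1] > probs[2]
```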
At the same time, the context sequence has different effects on acoustic parameters such as the cepstral coefficients, the fundamental frequency, and the duration of each state. Therefore, a separate decision tree and corresponding question set need to be established for each parameter. The training is based on the maximum conditional probability criterion.
This causes the output acoustic parameters of the model in a given state to be the mean value of the Gaussian mixture distribution, which results in large steps between different states and deteriorates the coherence of the synthesized speech. In order to improve the coherence of synthesized speech, the first-order and second-order differences of the acoustic parameters are introduced into the observations of the hidden Markov chains.
As the number of context factors in Table 1 increases, the number of possible factor combinations grows exponentially. This leads to an exponential increase in the number of states in the hidden Markov chain, and training such a model often requires a larger training data set. At the same time, the increase in the number of states more easily leads to the problem of uneven data distribution: some combinations have a large amount of training data, while others have very little, which ultimately leads to insufficient training of the model. When a context unseen in training appears in the test phase, the incorrect acoustic parameters predicted by the model will degrade the quality of the synthesized speech. In order to improve the generalization performance of the model and alleviate data sparseness, decision trees are often introduced into the model. Each leaf in the decision tree corresponds to a state in the hidden Markov chain. In the process of training the decision tree, a pruning strategy is adopted, and some leaves in the decision tree will correspond to multiple contexts, so that the final number of states is reduced, the data distribution space shrinks, and model training becomes more adequate. In the prediction phase, when encountering a context not seen in training, the model can still determine the state corresponding to that context according to the decision tree. The parameters of the hidden Markov chain model are trained on the training data set by maximizing the conditional probability of the observations:

λ_max = argmax_λ p(O | w, λ) = argmax_λ Σ_q ∏_t a_{q_{t−1} q_t} b_{q_t}(o_t). (3)
Among them, O, w, λ, and q are, respectively, the acoustic feature (Mel cepstrum coefficient and fundamental frequency) sequence, the context feature sequence, the parameters of the hidden Markov chain, and the state sequence of the hidden Markov chain; a_ij is the state transition probability, and b_q is the observation probability of state q.
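The decision-tree state tying described above can be sketched as a tiny tree of yes/no context questions whose leaves are tied HMM states; the questions and contexts below are invented for illustration only.

```python
# Minimal sketch of decision-tree state tying: each internal node asks a
# yes/no question about the phoneme context; each leaf stands for one tied
# HMM state shared by every context routed to it. Questions and contexts
# here are hypothetical.

class Node:
    def __init__(self, question=None, yes=None, no=None, leaf_id=None):
        self.question, self.yes, self.no, self.leaf_id = question, yes, no, leaf_id

def tied_state(node, context):
    """Route a context dict down the tree to its tied-state id."""
    while node.leaf_id is None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_id

tree = Node(
    question=lambda c: c["phone"] in {"a", "e", "i", "o", "u"},  # "is it a vowel?"
    yes=Node(leaf_id=0),                                         # all vowel contexts share state 0
    no=Node(
        question=lambda c: c["stressed"],                        # "is the syllable stressed?"
        yes=Node(leaf_id=1),
        no=Node(leaf_id=2),
    ),
)

# Even a context absent from training lands in some leaf, which is how the
# tied model handles unseen contexts at prediction time.
print(tied_state(tree, {"phone": "k", "stressed": True}))   # → 1
```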
During training, the number of frames T of the observation values and the state sequence of the hidden Markov chain are known, and the forward-backward algorithm and the EM algorithm are used to obtain the parameters λ. In the testing phase, this paper first analyzes the input text, extracts the context sequence, and obtains the hidden Markov model sequence according to the context sequence and the duration of the state corresponding to each context. Subsequently, the state durations are used to expand the hidden Markov model sequence into a frame-level sequence, and the speech parameter generation algorithm is applied to obtain smooth acoustic parameters. Finally, a vocoder is used to synthesize the speech.

Table 1: Common context information in English.
The position of the current phoneme in the current syllable
The number of phonemes in the current syllable, the previous syllable, and the next syllable
The type of accent in the current syllable, the previous syllable, and the next syllable
Whether the current syllable, the previous syllable, and the next syllable are stressed
The position of the current syllable in the current word and current phrase
The number of syllables in the current phrase, the previous phrase, and the next phrase
The number of accented syllables in the current phrase, the previous phrase, and the next phrase
The number of syllables from the current position to the previous and next stressed syllable
The part of speech of the current word, the previous word, and the next word
The number of syllables in the current word, the previous word, and the next word
The position of the current word in the current phrase
The number of words before and after the current position in the current phrase
The number of words in the previous phrase and the next phrase
The number of syllables in the current, previous, and next phrase
The position of the current phrase in the main sentence
The distance from the current position to the stressed syllable
The number of phonemes, syllables, words, and phrases in the current sentence

Given a text input, the context sequence obtained from the input text is w, and the observations generated by the model (the observations correspond to the acoustic parameters) maximize

o_max = argmax_o max_q p(o, q | w, λ_max, T′). (4)

Among them, q_max is the maximizing state sequence, T′ is preset and determines the duration of the synthesized speech, and λ_max is the trained acoustic model based on the hidden Markov chain.
The vocoder synthesizer restores the frame-level acoustic features (fundamental frequency and Mel cepstrum coefficients) predicted by the acoustic model to a time-domain signal, that is, the final speech, through a digital filter. Its mathematical expression is

x(n) = h(n) ∗ e(n). (5)

Among them, x(n) is the synthesized speech signal, h(n) is the formant filter, whose parameters are determined by acoustic parameters such as the Mel cepstrum coefficients, e(n) is the excitation signal, which corresponds to the output of the acoustic model, and ∗ denotes convolution. Models based on deep learning have gradually emerged in many fields such as image classification and segmentation, video understanding, machine translation and understanding, and speech recognition and synthesis, and their performance records have been constantly refreshed. It is worth noting that, after the size of the data set increases to a certain extent, the performance of traditional machine learning algorithms no longer increases significantly as the data set grows, as shown in Figure 2.
Due to the great similarity between adjacent points of the speech signal, letting the model implicitly learn the alignment of text and audio would bring great redundancy. Therefore, 80 power values in the specified frequency range on the Mel scale are used to represent the 1024 points in each frame. This paper uses T to represent the number of slots of the decoder; the decoder will then eventually generate a prediction value of 80 × T dimensions. The output of the decoder is processed by the postprocessing module to synthesize the final speech. The rest of this section covers the encoder, decoder, and postprocessing network of the Tacotron model. The overall structure of Tacotron is shown in Figure 3. The decoder of the Tacotron model is composed of a recurrent neural network and a preprocessing module. It mainly uses the output of the encoder in the Tacotron model as its input to predict the acoustic parameters at the next moment. The recurrent part is composed of a recurrent neural network combined with the attention mechanism and a two-layer recurrent neural network, as shown in Figure 4.
The Tacotron model splices the context obtained by the attention mechanism with the decoder output value at the previous moment and uses it as the decoder's input at the next moment (the spliced value may need to be projected to the specified dimension using a fully connected neural network). The structure of the attention mechanism in Tacotron's decoder is shown in Figure 5, and the specific mathematical operations in the attention mechanism are shown in formulae (7) to (11).
The attention mechanism computes, at each decoder step i,

e_{i,j} = score(s_{i−1}, h_j), (7)
α_{i,j} = exp(e_{i,j}) / Σ_k exp(e_{i,k}), (8)
c_i = Σ_j α_{i,j} h_j, (9)
s_i = f(s_{i−1}, y_{i−1}, c_i), (10)
y_i = g(s_i, c_i). (11)

In formula (7), score(s_{i−1}, h_j) is a score function, which calculates the similarity between the state value s of the decoder at each time and the output value h of the encoder at each time; a common additive form is score(s_{i−1}, h_j) = v⊤ tanh(W s_{i−1} + U h_j). In formula (9), c_i is the context information at time i. In formula (10), s_{i−1} is the state of the decoder at time i − 1, c_i is the context input of the decoder at time i, and f is the recurrent neural network. In formula (11), g is a fully connected neural network, which takes the state value s_i and the context c_i as input to obtain the output of the decoder at the current time. The Seq2Seq model combined with the attention mechanism has stronger long-sequence modeling capabilities, but different attention mechanisms perform differently under different tasks. Formulae (7) to (11) correspond to the attention mechanism in the Tacotron model, that is, the output value y_i of the decoder at each moment. Its solution process is as follows. First, according to the state value s_{i−1} of the decoder at the previous time i − 1 and the output h of the encoder, the attention mechanism is used to obtain the context information c_i; then c_i and s_{i−1} are used as the input of the recurrent neural network to obtain s_i. Finally, s_i and c_i are used as the input of the fully connected neural network to predict the output value of the decoder at time i. In addition to the attention mechanism used in Tacotron, researchers have also designed other forms of attention mechanism, whose decoder output at each moment is solved as follows: first determine the state value s of the decoder at the current moment and the output h of the encoder, then obtain the context variable c_i according to the attention mechanism, and finally use s_i and c_i to obtain y_i.
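The core of formulae (7) to (9) can be sketched with plain Python on toy vectors: score every encoder output against the previous decoder state, softmax the scores into weights, and take the weighted sum as the context. Dot-product scoring is used here for brevity, whereas Tacotron itself uses an additive score.

```python
# Numeric sketch of attention: score(s_{i-1}, h_j) -> softmax -> context c_i.
# Toy 2-dimensional vectors; dot-product score chosen for simplicity.
import math

def attention_context(s_prev, H):
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in H]  # formula (7)
    m = max(scores)                                              # stabilize softmax
    exps = [math.exp(v - m) for v in scores]
    alphas = [v / sum(exps) for v in exps]                       # formula (8)
    dim = len(H[0])
    c = [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(dim)]  # formula (9)
    return c, alphas

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder outputs h_1..h_3
s_prev = [1.0, 0.0]                         # toy previous decoder state
c, alphas = attention_context(s_prev, H)
# h_1 and h_3 align equally well with s_prev, so they receive equal weight.
assert abs(sum(alphas) - 1.0) < 1e-9        # the weights form a distribution
```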
In addition, a location-based attention mechanism has been proposed, in which the alignment weights are computed from the decoder state alone, α_i = softmax(W s_i). A local attention mechanism has also been proposed; that is, at each moment the decoder only pays attention to the local information within a specific window of the input sequence. The width of this window is generally much smaller than the length of the input, which saves a large amount of computation. Strategies for calculating local attention include the following. (1) The size of the window is empirically specified as 2D + 1. At each moment, the model predicts the position p_i of the center point of interest of the decoder in the encoder output sequence; the context is then calculated over the area [max(p_i − D, 0), min(p_i + D, L_encoder)], where L_encoder is the length of the sequence output by the encoder. (2) The size of the window is empirically specified as 2D + 1. We assume that the correspondence between input and output increases monotonically; then, at each output moment i of the decoder, the center point position in the encoder is p_i = a × i, where a = L_encoder / L_decoder and L_decoder is the length of the time sequence of the decoder. The context is then calculated over the area [max(p_i − D, 0), min(p_i + D, L_encoder)]. At the same time, in order to reflect that the closer a point is to the center point, the greater its impact on the output, the weight α at each position is assumed to obey a Gaussian distribution with mean p_i and standard deviation σ.
This relationship is

α'_{i,s} = α_{i,s} exp(−(s − p_i)² / (2σ²)). (12)

Among them, s is the position index in the encoder, σ = D/2, and α_{i,s} is obtained using formula (8).
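The Gaussian re-weighting above can be sketched as follows: inside a window of width 2D + 1 around the predicted center p_i, each base weight is scaled by exp(−(s − p_i)² / (2σ²)) with σ = D/2, and positions outside the window get zero. The uniform base weights are illustrative.

```python
# Sketch of Gaussian-weighted local attention: weights inside the window
# [p_i - D, p_i + D] are scaled by a Gaussian centered at p_i with
# sigma = D / 2; weights outside the window are zeroed, then renormalized.
import math

def local_weights(base_alpha, p_i, D):
    sigma = D / 2.0
    out = []
    for s, a in enumerate(base_alpha):
        if abs(s - p_i) <= D:                    # inside the local window
            out.append(a * math.exp(-((s - p_i) ** 2) / (2 * sigma ** 2)))
        else:                                    # outside: no attention at all
            out.append(0.0)
    z = sum(out)
    return [v / z for v in out]                  # renormalize to a distribution

# Uniform base weights over 9 encoder positions, window centered at p_i = 4.
w = local_weights([1.0] * 9, p_i=4, D=2)
assert w[4] == max(w) and w[0] == 0.0            # peak at the center, zero outside
```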
A monotonic attention mechanism further constrains the focus position: if the main focus position at the current moment is p_i and the focus point at the previous moment i − 1 is p_{i−1}, then p_i ≥ p_{i−1}. The specific process is as follows: we assume that the focus point at the previous time i − 1 is p_{i−1}; then, the range of the focus position at the current time i is p_{i−1}, . . . , L_encoder. We assume that the position of the attention point at the current time i satisfies a Bernoulli distribution, and Bernoulli trials are carried out starting from p_{i−1}. If the output at p_j is 1, where p_j ∈ {p_{i−1}, . . . , L_encoder}, then the position of the attention point at the current time i is considered to be p_j, and the context information at the current moment is h_{p_j}, as shown in Figure 6. Historical alignment information can also be taken into account when calculating the context information at the current moment; the added historical alignment information helps to further strengthen the model's ability to model long sequences. That is, score(s_{i−1}, h_j) in formula (7) becomes score(s_{i−1}, h_j, f_{i,j}), where f_i = F ∗ α_{i−1} and F is a convolution that adjusts the α dimension to the specified dimension.

Research on Correction Method of Spoken Pronunciation Accuracy of AI Virtual English Reading

The correction system of spoken pronunciation accuracy of AI virtual English reading is constructed on the basis of the algorithm improvements above. The core framework of the AI virtual English reading system is shown in Figure 7.
The system can directly create a subprocess to call and realize data transfer through an anonymous pipe. The process of AI virtual English reading is shown in Figure 8.
When the main process needs to call other subroutines according to the system logic, the system first creates an anonymous pipe, sets its read handle to A and its write handle to B, and sets the startup information of the subprocess according to these handle pointers. After this preparatory work is completed, the child process can be created. After the child process starts, it first sets its read handle to B and its write handle to A according to the startup information, which is just the opposite of the main process. After the interface setting is completed, the system can run the main program and wait for input commands. The main process waits for the child process to start, then writes commands through the B handle and reads the output results through the A handle. When the child process gets a command, it calls the corresponding function according to the logic and writes the output result to the A handle; if it is the end command, it exits the process. In this way, the subroutine call is completed. After the above model system is constructed, the performance of the AI virtual English reading system is verified with experiments, and English reading and spoken pronunciation are analyzed and evaluated through simulation research.
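The parent/child pipe exchange described above can be sketched with Python's subprocess module: the parent writes a command into the child's stdin (playing the role of the B write handle) and reads the result from the child's stdout (the A read handle). The child program and the `score`/`exit` command names are stand-ins, not the real evaluation engine's protocol.

```python
# Sketch of the pipe-based subroutine call: a parent process sends line
# commands to a child over one pipe and reads replies over another. The
# child here is a trivial echo-style stand-in for the evaluation engine.
import subprocess, sys

child_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    cmd = line.strip()\n"
    "    if cmd == 'exit':\n"      # the end command terminates the child
    "        break\n"
    "    print('result:' + cmd, flush=True)\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write("score utterance_01\n")   # parent writes a command (B handle)
proc.stdin.flush()
reply = proc.stdout.readline().strip()     # parent reads the result (A handle)
proc.stdin.write("exit\n")                 # end command: child exits cleanly
proc.stdin.flush()
proc.wait()
print(reply)   # → result:score utterance_01
```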
The scoring results are shown in Table 2 and Figure 9.
On the basis of the above research, the effect of pronunciation correction in English reading is evaluated, and the results are shown in Table 3 and Figure 10.

Figure 8: Calling of subroutines of the spoken pronunciation evaluation engine for AI virtual English reading.

From the above research, we can see that the correction system of spoken pronunciation accuracy of AI virtual English reading has good practical effects.

Conclusion
Traditional speech synthesis technology is often divided into nonparametric and parametric speech synthesis technology. Nonparametric speech synthesis is mainly based on unit selection. The main idea is that speech is spliced from speech unit fragments, and a speech unit database with sufficient coverage is produced. In the prediction stage, the text is transformed into a phoneme sequence marked with prosodic features (fundamental frequency, duration, etc.). Using a set loss function as the evaluation criterion, the optimal speech units are selected from the database, and the selected speech unit sequence is spliced into the final speech. This paper combines intelligent voice technology to build the correction model of spoken pronunciation accuracy of AI virtual English reading, verifies the performance of the AI virtual English reading system, and analyzes English reading, spoken pronunciation, and pronunciation correction. From the experimental research, the correction system of spoken pronunciation accuracy of AI virtual English reading proposed in this paper meets the basic requirements of the system designed in this paper.

Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.