Study about Chinese Speech Synthesis Algorithm and Acoustic Model Based on Wireless Communication Network

Chinese speech synthesis refers to the technology by which machines transform human speech signals into corresponding texts or commands through recognition and understanding. This paper simulates the classic VAD and GSM VAD1 algorithms, improves on both to recognize and collect speech, and amplifies the signal through a filter in order to analyze the speakers' Chinese proficiency. Adult Southeast Asian students at Zhengzhou University (whose mother tongues are Indonesian and Thai) serve as the research subjects, with the aim of exploring the relationship between their Chinese phonetic proficiency and their acquisition motivation. The article thus combines algorithmic and linguistic disciplines. According to the Praat and SPSS results, scores of 55-80 points account for 70% of subjects, scores below 55 points for 20%, and scores above 80 points for 10%. We find that intrinsic motivation plays a vital role in CSL acquisition: it helps mature learners from Southeast Asia acquire Chinese more effectively. The earlier learners begin Chinese, the higher their motivation and the easier it is for them to set learning goals. The greater the enthusiasm for learning Chinese, the better the Chinese scores (such as HSK test scores and Chinese phonetic test scores). Therefore, the Chinese proficiency of international students is closely related to their interest in the language: the greater the interest in Chinese, the stronger the motivation to learn and the higher the resulting proficiency.


Introduction
Language is an important tool for humans to acquire knowledge, learn, and express thoughts and emotions, and it is one of the most important and effective forms of human information transmission. Language and voice are also related to people's cognitive activities and symbolize the culture of a society and country. Therefore, the study of language and phonetics is of great significance to the progress of science and the development of society. In the information age, modern technologies and methods have emerged at a rapid pace, making it possible to receive, store, and process audio data more efficiently and accelerating research on audio processing technology. Text-to-speech technology processes a text-like input sequence through a dedicated synthesizer to create natural, high-quality audio output. Speech synthesis thus serves as the output side of human-machine voice interaction. The field draws on linguistics, computer science, neuroscience, psychology, and many other subjects.
Voice and text search engines are now widely used, and image extraction applications are being tested in different search engines, but audio extraction is still in its infancy.
With the widespread adoption of voice assistants, such as Apple's Siri and Microsoft's Xiaona, and the success of iFLYTEK's voice input method, speech recognition has developed steadily and is now widely used in business. However, traditional approaches have many problems. In keyword recognition, all audio data are first converted into text, and keyword recognition is then performed on the text. As the amount of audio data grows, the workload becomes large and memory efficiency falls. At present, the problem of multilanguage integration is not solved, out-of-vocabulary (OOV) words and low-resource languages are excluded, and recognition ability is poor. More powerful multimedia recognition technology is therefore the foundation for the development of the mobile Internet. In speech recognition, the flexible and variable structure of the Chinese language acoustic model and its statistical training framework can reflect the random characteristics of speech, so it has become the mainstream approach, especially for large-vocabulary, continuous, nonspecific-speaker speech recognition. Establishing the Chinese language acoustic model mainly involves solving for the state transition matrix and the characteristic observation matrix.
Research in this field at home and abroad began in the early 1950s. The Audry system developed by Zhou and Yu in the laboratory could automatically recognize the 10 English digits, which marked the beginning of formal speech recognition research. In the 1960s, improvements in computer hardware promoted the rapid development of speech recognition research, but the level of algorithmic processing at that time was relatively low [1]. Li et al.'s invention provides a method for compressing a neural network acoustic model. The method divides the row vectors of the weight matrix W of the output layer of the neural network acoustic model into subvectors of a specified dimension, quantizes the subvectors to obtain a first-level codebook, and replaces the subvectors of W with vectors from the first-level codebook to obtain the matrix R*. Finally, the matrices W* and R* are used to represent the weight matrix W. This neural network is too complex and difficult to use [2]. Because these algorithms are computationally complicated, subsequent work has sought improvements and optimizations. Zappi et al. proposed a text-to-speech algorithm based on unit selection with a statistical acoustic model. During model training, it first extracts acoustic parameters such as the frequency spectrum and fundamental frequency from the voice data in the corpus. The data are then segmented and annotated in the corpus to estimate the corresponding context-dependent phoneme models. However, a statistical acoustic model is prone to loss of sound accuracy [3]. Another invention discloses a model learning method, a text-to-speech method, and a device for a separate multispeech model for text-to-speech. The model learning method of this embodiment reduces the amount of target-user voice data required when learning the target user's voice model.
Only a small amount of user voice data is required to form multiple personal voice models that include the voice characteristics of the target user. However, the speech recognition standard of this research is significantly reduced [4]. Koguchi et al. focus on the variability of speech signals caused by coarticulation (cooperative pronunciation) in Chinese speech recognition and propose a method for constructing a syllable acoustic model. To alleviate the problem of sparse training data, an intrasyllable overtone syllable model is proposed to initialize the syllable parameters. An intersyllable transition model is then introduced to handle coarticulation between syllables, but the applicability of this research is not wide enough [5]. Zhuokun et al. take advantage of the computing power of the GPU to speed up matrix and vector calculations during network training. The optimized network can process multiple data streams at the same time and train on several example sentences in parallel to speed up the training process; however, although training is fast, the quality of the algorithm is not up to standard [6].
This article focuses on the application of neural networks to voice activation detection (VAD). Although voice activation detection methods already exist in many modern communication systems, voice communication methods are becoming more diversified as technology advances, application scenarios are multiplying, and the application environment is becoming more complex. Existing voice activation detection methods are not well suited to these complex signal-to-noise ratio conditions. Drawing on experience with neural networks in image processing and on existing sound detection methods, this paper proposes a neural network-based voice activation detection method through framework design, speech processing, algorithm development, and simulation experiments, and compares its performance in real-time and non-real-time test environments. Southeast Asian adult students at Zhengzhou University (native speakers of Indonesian and Thai) were selected as subjects to explore the relationship between Chinese phonetic level and acquisition motivation. The results indicate that the voice activation detection algorithm proposed in this paper has good accuracy and speed.

Chinese Speech Synthesis Algorithm Theory
2.1. Classic VAD Algorithm. The voice activation algorithm that uses the word as the modeling unit trains a keyword model for each word together with several filler models. All keyword models and filler models together form a recognition network. The advantages are that the search network is small, recognition is fast, no language model support is required, and the recall rate is higher than that of the subword model. G.729 is a low-bit-rate speech coding standard in the ITU-T specifications. It uses conjugate-structure algebraic-code-excited linear prediction (CS-ACELP) coding to transmit at 8 kbit/s. Annex B of this specification [7] defines the compressed transmission format of silent frames, and its voice activation detection algorithm is regarded as a standard voice activation detection algorithm. The G.729 discontinuous transmission (DTX) mode supports the 8 kbit/s G.729 encoder. The voice activation detection algorithm is applied first to detect the audio and nonaudio segments of the input audio signal. The audio part is transmitted at the normal encoding rate, the nonaudio part is sent with 15-bit encoding per frame, and at the receiver the nonvoice part is replaced with comfort noise [8,9]. All the recorded sounds are in WAV format, monophonic, with a 16 kHz sampling frequency and 16-bit linear quantization. Recordings of a single keyword (word) are 1-2 s long, and recordings of two keywords (phrases) are 5-10 s long. The sound statistics are shown in Table 1.
The function of the voice activation detection algorithm in G.729 is to distinguish the audio and nonaudio parts of the signal so that different data rates can be used for each during a call; whether the VAD decision is correct directly affects call quality. The parameters calculated for the VAD in Annex B of G.729 are the linear spectrum frequency parameters, the full-band power, the low-band power, and the zero-crossing rate.

The autocorrelation coefficients $\{R(i)\}_{i=0}^{q}$ of the input speech signal are calculated first, where $q = 12$. The Levinson-Durbin algorithm then converts the autocorrelation coefficients into reflection coefficients, which are in turn converted into the linear spectrum frequency parameters $\{\mathrm{LSF}_i\}_{i=1}^{p}$, where $p = 10$. The full-band energy $E_f$ and the low-band energy $E_l$ are calculated as follows:

$$E_f = 10 \log_{10}\left[\frac{1}{N} R(0)\right], \tag{1}$$

$$E_l = 10 \log_{10}\left[\frac{1}{N} h^{T} R\, h\right]. \tag{2}$$

In formula (1), $N = 240$ is the width of the LPC window. In formula (2), $h$ is the impulse response of a finite impulse response filter with a cut-off frequency of 1000 Hz, and $R$ is the Toeplitz autocorrelation matrix, whose diagonals hold the autocorrelation coefficients. The zero-crossing rate ZC is calculated as shown in formula (3):

$$ZC = \frac{1}{2M} \sum_{i=0}^{M-1} \left| \operatorname{sgn}[x(i)] - \operatorname{sgn}[x(i-1)] \right|. \tag{3}$$

In formula (3), $x(i)$ is the input signal before processing, $M = 80$, and $\operatorname{sgn}(x)$ is the sign function: $\operatorname{sgn}(x) = -1$ when $x < 0$, $\operatorname{sgn}(x) = 0$ when $x = 0$, and $\operatorname{sgn}(x) = 1$ when $x > 0$. Once the required parameters have been obtained, the decision does not depend directly on their values. Instead, the long-term average of these extracted parameters is used to track changes in background noise characteristics.
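As a rough illustration of two of the frame-level parameters described above, the following Python sketch computes a full-band energy in decibels and a zero-crossing rate for one frame. It is not the G.729 Annex B reference code: the function names are hypothetical, the zero-lag autocorrelation is taken without the LPC analysis window, and the small floor that guards against the logarithm of zero is an assumption.

```python
import math


def full_band_energy_db(frame):
    """Return 10*log10(R(0)/N), where R(0) is the zero-lag autocorrelation."""
    n = len(frame)
    r0 = sum(s * s for s in frame)  # zero-lag autocorrelation R(0)
    # The 1e-12 floor is an assumption that avoids log10(0) on silent frames.
    return 10.0 * math.log10(max(r0 / n, 1e-12))


def sgn(x):
    """Three-valued sign function: -1, 0, or 1."""
    return (x > 0) - (x < 0)


def zero_crossing_rate(frame, prev_sample=0.0):
    """Return (1/(2M)) * sum(|sgn(x(i)) - sgn(x(i-1))|) over the frame.

    prev_sample stands in for x(-1), the last sample of the previous frame.
    """
    total = 0
    prev = prev_sample
    for s in frame:
        total += abs(sgn(s) - sgn(prev))
        prev = s
    return total / (2.0 * len(frame))
```

A frame that alternates in sign on every sample gives a rate close to 1, while a frame that never changes sign gives 0, which is why the rate helps separate noisy or unvoiced segments from voiced speech.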
The change of each parameter in the current frame relative to its long-term average is then calculated, and the input frame is classified as audio or nonaudio according to the set thresholds. The four difference measures are as follows.

Spectral distortion measurement:

$$\Delta S = \sum_{i=1}^{p} \left(\mathrm{LSF}_i - \overline{\mathrm{LSF}}_i\right)^2. \tag{4}$$

Full-band energy change measurement:

$$\Delta E_f = \overline{E}_f - E_f. \tag{5}$$

Low-band energy change measurement:

$$\Delta E_l = \overline{E}_l - E_l. \tag{6}$$

Zero-crossing rate change measurement:

$$\Delta ZC = \overline{ZC} - ZC. \tag{7}$$

In Equation (4), $\overline{\mathrm{LSF}}$ represents the average value of the linear spectrum frequency parameters; in Equation (5), $\overline{E}_f$ represents the average value of the full-band energy; in Equation (6), $\overline{E}_l$ is the average value of the low-band energy; and in Equation (7), $\overline{ZC}$ represents the average value of the zero-crossing rate. The analysis process of the algorithm research program in this paper is shown in Figure 1.
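The comparison of each frame against long-term averages can be sketched as follows. This is a minimal illustration under stated assumptions, not the decision logic of the standard: the dictionary layout, the smoothing factor `alpha`, and the single energy threshold in `is_speech` are all hypothetical placeholders, whereas the standardized algorithm combines the four measures in a multi-boundary decision.

```python
def update_running_mean(mean, value, alpha=0.9):
    """Exponentially smoothed long-term average (alpha is an assumed value)."""
    return alpha * mean + (1.0 - alpha) * value


def difference_measures(frame_params, noise_means):
    """Return the four difference measures between a frame and the averages.

    Both arguments are dicts with keys "lsf" (list), "ef", "el", and "zc".
    """
    delta_s = sum(
        (cur - avg) ** 2
        for cur, avg in zip(frame_params["lsf"], noise_means["lsf"])
    )
    return {
        "delta_s": delta_s,                                   # spectral distortion
        "delta_ef": noise_means["ef"] - frame_params["ef"],   # full-band energy
        "delta_el": noise_means["el"] - frame_params["el"],   # low-band energy
        "delta_zc": noise_means["zc"] - frame_params["zc"],   # zero-crossing rate
    }


def is_speech(deltas, energy_threshold_db=-6.0):
    """Hypothetical one-threshold rule: speech if the frame is well above the
    average energy, i.e. the energy difference is strongly negative."""
    return deltas["delta_ef"] < energy_threshold_db
```

Because each difference is "average minus current value", a frame that is louder than the tracked background yields a negative energy difference, which is what the placeholder rule tests.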

GSM VAD1 Algorithm.
The classic VAD algorithm introduced above still has deficiencies in the algorithm itself and in response time, so an improved and optimized algorithm is introduced below. The voice activation detection algorithm recommended by the European Telecommunications Standards Institute (ETSI) for GSM AMR is another classic voice activation detection algorithm. It bases its decision on the joint evaluation of multiple parameters, and it is also implemented in the 3GPP standard. GSM AMR voice activation detection comprises two options: one is the ENS VAD algorithm provided by Europe, called VAD1, and the other is the VAD algorithm provided by Motorola, called VAD2. Here, we mainly introduce VAD1, the voice activation detection algorithm of the adaptive multirate codec. The algorithm structure diagram is shown in Figure 2. From Figure 2, it can be seen that the calculation of the VAD1 parameters starts with the filter bank.
The signal level in each subband is then calculated separately, and the frame power of the input signal is obtained from these subband levels [10,11]. The Motorola VAD algorithm, by contrast, estimates the parameters of the domain model. Here, a decision-tree parameter merging algorithm based on the random segment model is used to tie the parameters of the transition model and improve training. The open-loop pitch gain obtained from the correlation calculation is passed to tone detection, whose indicator signals the presence of a tone. In complex signal analysis, the signal complexity is obtained by analyzing the pitch correlation vector, and the complex signal indicator then signals the presence of a complex signal (such as a music signal).
Subband division and level calculation: a filter bank divides the input audio signal into 9 subbands according to frequency. The structure diagram of the filter bank is shown in Figure 3. In Figure 3, the filter bank is composed of 5th-order and 3rd-order filter blocks; each block splits its input into a high-frequency and a low-frequency component and reduces the sampling rate by 2 : 1. As a result, the subbands are not uniform: the lower the frequency, the narrower the band [12].
Pitch detection: the purpose of the pitch detection function is to detect vowels and other periodic signals. It is implemented by comparing the open-loop lags of the subframes calculated by the voice encoder. If the difference between the open-loop lags of consecutive subframes is less than the threshold, the lag counter is incremented; if the sum of the lag counters of two consecutive frames of input speech is large enough, the pitch indicator is set to 1, indicating that pitch is present [13].
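The counting rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the two thresholds (`lag_threshold` and `count_threshold`) are assumed example values rather than constants taken from the standard.

```python
def count_stable_lags(lags, prev_lag, lag_threshold=2):
    """Count lags in one frame that stay close to the preceding lag.

    Returns the count and the last lag, which seeds the next frame.
    """
    count = 0
    for lag in lags:
        if abs(lag - prev_lag) < lag_threshold:
            count += 1
        prev_lag = lag
    return count, prev_lag


def pitch_flags(frames_of_lags, lag_threshold=2, count_threshold=4):
    """Return one 0/1 pitch flag per frame.

    The flag is 1 when the stable-lag counts of the current and previous
    frame together reach count_threshold (an assumed value).
    """
    flags = []
    prev_lag = 0
    prev_count = 0
    for lags in frames_of_lags:
        count, prev_lag = count_stable_lags(lags, prev_lag, lag_threshold)
        flags.append(1 if count + prev_count >= count_threshold else 0)
        prev_count = count
    return flags
```

For example, `pitch_flags([[40, 40], [41, 41], [41, 40], [80, 20]])` flags only the third frame, because that is the first point at which two consecutive frames both contain steadily repeating lags.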
Tone detection: the pitch detection function cannot always detect tones in the audio signal, so the purpose of tone detection is to detect tones in the input audio signal. It can also detect other signals with a strong periodic component. Tone detection is realized by comparing the open-loop pitch gain of the input with a constant threshold. If the open-loop pitch gain is greater than the set threshold, the tone flag is set to 1, indicating that a tone is detected.
Complex signal analysis: complex signal analysis is used to detect correlated signals after high-pass filtering. Pitch detection and tone detection cannot detect such signals accurately, and if comfort noise were used to replace them, the sound would not be natural enough. If the highest normalized correlation obtained from the high-pass filtered speech signal is large enough, the complex signal flag is set to 1, indicating that a complex signal is present [14]. Background noise estimation: the background noise estimate is updated using the input amplitude levels of the previous speech signal frame. The purpose of this one-frame delay in the update is to prevent an undetected speech onset from corrupting the noise estimate. When a pitch or tone signal is detected, the noise estimate is not updated.

Table fragment recovered from the extraction: existing problem: language matching with few resources, multi-language integration, and OOV cannot be solved; method: unsupervised recognition algorithm based on segment features and syllable-based DTW; benefit: ensures system performance and improves system identification efficiency.

Description of Sound Attributes.
The most commonly used method of handling coarticulation in speech recognition is to establish a context-dependent acoustic model, typically a triphone model, which takes into account the influence of the preceding and following pronunciation units on the current unit. The triphone model achieves good results on the coarticulation problem. The Mel-frequency cepstral coefficients (MFCC) take into account the characteristics of human hearing, namely the masking effect, in which weak frequency components are masked by neighboring stronger components. MFCC features have good perceptual properties and good noise robustness. The function diagram is shown in Figure 4.
Pre-emphasis is shown in formula (8). In essence, a high-pass filter is applied to the audio signal. It has two functions. The first is to boost the high-frequency part of the signal and flatten its spectrum, so that the spectrum can be calculated with the same signal-to-noise ratio across the whole band from low to high frequency. The second is to compensate for the high-frequency part of the signal that is attenuated during speech production and to highlight the high-frequency formants [15,16].
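A first-order high-pass filter of this kind is commonly written as y(n) = x(n) - a·x(n-1). The sketch below assumes that common form; the coefficient 0.97 is a typical textbook choice and is not a value stated in this paper.

```python
def pre_emphasis(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through.

    coeff = 0.97 is an assumed, commonly used value.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out
```

With `coeff=0.5`, the constant input `[1.0, 1.0, 1.0]` becomes `[1.0, 0.5, 0.5]`, whereas the rapidly alternating input `[1.0, -1.0, 1.0]` becomes `[1.0, -1.5, 1.5]`: slowly varying (low-frequency) content is attenuated and rapidly varying (high-frequency) content is boosted.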
The audio signal can be regarded as quasi-stationary: after it is divided into frames, the sound within each frame can be treated as a steady-state signal. The frame length is usually about 10 ms to 25 ms. Adjacent frames overlap, and the overlap is 1/2 or 1/3 of the frame length. After framing, a Hamming window is usually applied to improve the accuracy of the analysis. If the framed signal is $S(n)$, $n = 0, 1, \ldots, N-1$, where $N$ is the frame size, then the windowed signal $S'(n)$ is given by formulas (9) and (10):

$$S'(n) = S(n) \times W(n), \tag{9}$$

$$W(n) = (1 - a) - a \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1. \tag{10}$$

Here, $a$ is the Hamming window parameter, which is generally 0.46. After pre-emphasis, framing, and windowing, a fast Fourier transform (FFT) is performed on each frame of the signal, because the characteristics of the speech signal must be observed in the frequency domain. This gives the frequency spectrum of each frame, and the power spectrum is obtained by taking the squared modulus of the spectrum. The discrete Fourier transform (DFT) of the speech signal is expressed as Equation (11):

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N - 1. \tag{11}$$
Here, $x(n)$ is the input audio signal, and $N$ is the number of Fourier transform points. The Mel filter bank reflects the perceptual characteristics of the human ear. The conversion between linear frequency $f$ and Mel frequency is shown in Equation (12):

$$B(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \tag{12}$$

The Mel filter bank consists of several bandpass filters in the frequency spectrum, as shown in Equations (13) and (14).
Then, the Mel filter bank is composed of a group of triangular filters, as shown in Figure 5.
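Before the filter bank is applied, each frame has to be cut out, windowed, and transformed, as described earlier in this section. The following sketch illustrates those three steps using only the Python standard library. The function names are hypothetical, the hop length is left to the caller (half or one third of the frame length would match the overlap mentioned in the text), and a direct O(N²) DFT is used for clarity where a real system would use an FFT.

```python
import cmath
import math


def split_frames(signal, frame_len, hop_len):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop_len
    return frames


def hamming_window(n_points, a=0.46):
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)) with a = 0.46."""
    if n_points == 1:
        return [1.0]
    return [
        (1.0 - a) - a * math.cos(2.0 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def power_spectrum(frame):
    """Squared modulus of the DFT for bins 0 .. N/2 (direct DFT, not an FFT)."""
    n_points = len(frame)
    spectrum = []
    for k in range(n_points // 2 + 1):
        acc = sum(
            frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def windowed_power_spectra(signal, frame_len, hop_len):
    """Frame, window, and transform a signal; returns one spectrum per frame."""
    window = hamming_window(frame_len)
    return [
        power_spectrum([s * w for s, w in zip(frame, window)])
        for frame in split_frames(signal, frame_len, hop_len)
    ]
```

At a 16 kHz sampling rate, a 25 ms frame corresponds to `frame_len=400`, and a half-frame overlap corresponds to `hop_len=200`.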
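The center frequency $f(m)$ can be defined in the form of Equation (16):

$$f(m) = \frac{N}{f_s}\, B^{-1}\left(B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M + 1}\right). \tag{16}$$

Here, $f_h$ is the highest frequency in the frequency range, $f_l$ is the lowest frequency, $N$ is the window width of the DFT or FFT, $f_s$ is the sampling frequency, $M$ is the number of filters, and $B^{-1}$ is the inverse function of $B$, which can be expressed by Equation (17):

$$B^{-1}(b) = 700 \left(10^{b/2595} - 1\right). \tag{17}$$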
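The Mel conversion, its inverse, and the placement of the filter centers can be checked numerically with a short sketch. It assumes the usual form B(f) = 2595·log10(1 + f/700) and centers that are equally spaced on the Mel scale between the lowest and highest frequency; the function names and the example settings (24 filters, 512-point FFT, 16 kHz) are illustrative assumptions.

```python
import math


def hz_to_mel(f_hz):
    """B(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse mapping: 700 * (10 ** (b / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_center_bins(num_filters, n_fft, sample_rate, f_low, f_high):
    """Return num_filters + 2 edge/center positions, in (fractional) FFT bins.

    Points are equally spaced on the Mel scale; entries 1 .. num_filters are
    the filter centers, entry 0 and the last entry are the outer band edges.
    """
    mel_low = hz_to_mel(f_low)
    mel_high = hz_to_mel(f_high)
    step = (mel_high - mel_low) / (num_filters + 1)
    return [
        n_fft / sample_rate * mel_to_hz(mel_low + m * step)
        for m in range(num_filters + 2)
    ]
```

Calling `filter_center_bins(24, 512, 16000, 0.0, 8000.0)` gives positions that start at bin 0, end at bin 256, and grow further apart as frequency rises, which reflects the finer resolution of the Mel scale at low frequencies.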
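According to Equation (18), the logarithmic energy $s(m)$ of each filter output is calculated, which gives a result approximating a homomorphic transformation:

$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X(k)|^{2} H_m(k)\right), \quad 0 \le m < M, \tag{18}$$

where $H_m(k)$ is the frequency response of the $m$th triangular filter and $M$ is the number of filters. The output energies of the filter channels are correlated with one another, so the discrete cosine transform (DCT) is used to decorrelate them and obtain the cepstral coefficients, as shown in Equation (19):

$$c(n) = \sum_{m=0}^{M-1} s(m) \cos\left(\frac{\pi n (m + 0.5)}{M}\right), \quad n = 0, 1, \ldots, L. \tag{19}$$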
Here, $c(0)$ is the 0th-order MFCC, which reflects the spectral energy, and $L$ is the MFCC order, which usually takes a value of 12-16. The feature extraction process assumes that different speech frames are not related to each other, whereas in fact the information in successive frames is related and evolves continuously [17]. In practice, the relationship between frames is approximated by the first-order and second-order difference coefficients. The cepstral features are called static features, and the differences of the static features are called dynamic features [18,19]. Combining the complementary static and dynamic features can improve recognition performance. The calculation of the difference is shown in formula (20).
Here, $d_t$ is the $t$-th difference parameter, $C_t$ is the $t$-th cepstral coefficient, $t$ and $N$ refer to the sequence, and $M$ is the time difference of the derivative.

Speech quality is graded by listening effort. The highest quality means that the speech is easy to understand and listening is effortless. Good quality means that a little attention is needed but listening is still easy. Medium quality means that moderate concentration is required, with slight fatigue. Low quality means that the listener has to work hard to understand the speech. The MOS scoring method is based on a 5-point system, as shown in Table 2.
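Returning to the cepstral features, the DCT step and the difference (dynamic) features can be sketched as follows. The DCT follows the usual form c(n) = Σ s(m)·cos(πn(m + 0.5)/M). For the difference, the sketch uses one common regression form, d_t = Σ k·(C_{t+k} − C_{t−k}) / (2·Σ k²); this may differ in detail from the formula (20) used in the paper, and the edge handling (repeating the first and last frame) and the default `width=2` are assumptions.

```python
import math


def mfcc_from_log_energies(log_energies, num_coeffs):
    """DCT of the log filter-bank energies; returns c(0) .. c(num_coeffs - 1)."""
    m_filters = len(log_energies)
    return [
        sum(
            log_energies[m] * math.cos(math.pi * n * (m + 0.5) / m_filters)
            for m in range(m_filters)
        )
        for n in range(num_coeffs)
    ]


def delta(features, width=2):
    """Regression-style difference features over +/- width frames.

    features is a list of frames, each a list of coefficients. Frames beyond
    either end are replaced by the first or last frame (an assumption).
    """
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    last = len(features) - 1
    deltas = []
    for t in range(len(features)):
        row = []
        for i in range(len(features[t])):
            num = 0.0
            for k in range(1, width + 1):
                nxt = features[min(t + k, last)][i]
                prv = features[max(t - k, 0)][i]
                num += k * (nxt - prv)
            row.append(num / denom)
        deltas.append(row)
    return deltas
```

Applying `delta` to the static features gives the first-order dynamic features, and applying it again to the result gives the second-order ones; the static and dynamic vectors are then concatenated frame by frame.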

Wireless Communications and Mobile Computing
The test sound quality is divided into 5 levels: excellent (5 points), good (4 points), medium (3 points), poor (2 points), and bad (1 point). The learning rate is a very important parameter and is usually set as a constant in the standard gradient algorithm. However, in practical applications, it is difficult to determine a single learning rate that is suitable from beginning to end. If the minibatch optimization algorithm is used, the learning rate adjustment strategy of the baseline (CPU) model ($\alpha = 0.1$) is no longer suitable. The recognition rate is therefore tested in the actual recognition system to observe the impact on model performance and to verify the effectiveness of the parallel optimization algorithm.
DRT reflects the intelligibility of speech. The test method uses characters or words whose vowels are pronounced the same; for example, the vowels of the Chinese characters "you" and "li" have the same pronunciation [20]. The better such sounds can be told apart, the better the sound quality. The DRT score is the percentage of correct judgments made by all testers in the listening test. It is generally accepted that a DRT score above 95% corresponds to a MOS of 5 points, 85%~94% to 4 points, 75%~84% to 3 points, 65%~74% to 2 points, and less than 65% to 1 point. DAM and DRT tests use percentage scores as a comprehensive assessment of speech quality and measure the acceptability of speech from many aspects. The processing of sound quality is therefore extremely important, because the detection of sound quality determines the accuracy of Chinese speech synthesis. The system set up below accordingly places a high requirement on sound quality, which it detects and scores.
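The correspondence between DRT percentage and MOS points given above can be written as a small lookup. The function name is hypothetical, and the treatment of boundary and fractional values (for example, a score of exactly 95% or of 94.5%) is an assumption, since the text gives only whole-number ranges.

```python
def drt_to_mos(drt_percent):
    """Map a DRT score in percent to a MOS value from 1 to 5.

    Boundaries are treated as inclusive lower limits (an assumption).
    """
    if drt_percent >= 95:
        return 5
    if drt_percent >= 85:
        return 4
    if drt_percent >= 75:
        return 3
    if drt_percent >= 65:
        return 2
    return 1
```

For instance, `drt_to_mos(90)` returns 4 and `drt_to_mos(50)` returns 1.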

Voice Sample Collection.
A phoneme is the smallest unit of speech, and the phonemes of a spoken keyword differ in manner and place of articulation. Each phoneme unit is composed of multiple consecutive speech frames, and the feature parameters are highly discriminative at the phoneme level: the feature vectors within one phoneme segment are similar, whereas those of different segments differ considerably. Traditional template matching methods use the frame as the unit for keyword recognition, so recognition efficiency falls when the utterance is long. Therefore, this article proposes a new segment feature. Ten international students from Southeast Asia, five men and five women, participated in our research. In order to prove that there is no relationship between acquisition within the critical period and Chinese proficiency, the important role of age in the critical period hypothesis (CPH) must be considered. The subjects were all adult international students aged 20-30, with an average age of 25.3 years. In addition, language background and country were strictly controlled to ensure accuracy and validity. The Southeast Asian adult learners had to have never learned Chinese before coming to China; if they had mastered Chinese before adolescence or before coming to China, the data would be meaningless. Volunteers also had to come from the same countries and speak the same or a comparable mother tongue to ensure the rigor of the experiment. The function diagram is shown in Figure 6.
First, the international students from Southeast Asia were invited to participate in the Chinese phonetic experiment, whose purpose is to test the Chinese production ability of foreign learners. Chinese was used as the interlanguage throughout the experiment. We recorded the test process with a Tascam DR-44WL recorder and collected the recorded data. To measure the Chinese phonetic production ability of Southeast Asian adult learners accurately, the Chinese phonetic test paper includes two parts: vocabulary and phrases. After the participants finished the Chinese phonetic production task, we obtained 10 recordings, which were then analyzed and exported as data. Second, the subjects filled in a questionnaire after completing the Chinese phonetic test. The questionnaire, especially part B, is used as a tool to test Chinese learning motivation. This research aims to measure the intrinsic and extrinsic motivation of the foreign students and to determine which plays the more important role in the relationship between Chinese learning motivation and Chinese proficiency. Third, after the questionnaire was completed, the experiment ended.
From the results in Table 3, the average keyword recall rate improves as the number of filler models increases. However, as the number of filler models increases, the recall rates of individual keywords vary considerably. The reason is that the speech used to train the filler models is randomly shuffled and divided into 5 parts, which destroys the coverage of the nonkeywords related to each keyword, so each filler model has a different capacity to absorb nonkeywords [11]. Nevertheless, according to the trend of the average recall rate, the result is best when the number of filler models is 5, which indicates that the more nonkeyword speech is used for training, the higher the absorption rate. The keyword recall rates in Table 3 are plotted as a curve in Figure 7.
The experiment showed that the supervised recognition algorithm based on the filler model must be supported by a large amount of annotated speech, especially when an acoustic model with good discriminative performance has to be trained. Relying on such data with the word as the modeling unit can guarantee the recognition rate of the algorithm, but it does not solve the OOV problem, and the identification process depends on the framework. When there are many words to search and the number of keywords is low, performance is poor. It can be seen from the figure that when the number of states of the speech model increases to about 7, the decline in segmentation dispersion is already very small. From this perspective, for the speech model based on the average probability, 6~8 states is a suitable choice.

Classical VAD Algorithm Simulation.
The learning rate is a very important parameter in the training of neural networks. In the first stage of model training, the algorithm learns efficiently. As the model approaches the convergence point, the training process changes little, so the learning rate is kept very low before the model converges in this experiment [21]. The number of nodes in the hidden layer of the network is set to 600, and the training set is divided into 500 categories so that the model can be fully trained. Each model is trained for 12 rounds with a batch value of 64, which means that the network trains on 64 examples at a time. By changing the learning rate and the parameter values on the training data, we compare model convergence and the change in model performance after parallel optimization at different learning rates, and select the matching learning mode. The abscissa is the training cycle [22]; there are 12 training cycles in this article, and starting from the seventh cycle, the learning rate is adjusted to half of its previous value. The ordinate is the value obtained for each model. PPL reflects, to a certain extent, the uncertainty of the collected language data, and the PPL value in the experiment is inversely related to the performance of the model.
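The halving schedule described above (12 training cycles, with the learning rate halved at each cycle from the seventh onward) can be sketched as below. The function name, the 1-based cycle numbering, and the base rate of 0.1 are assumptions made for illustration.

```python
def learning_rate_for_cycle(cycle, base_rate=0.1, halve_from=7):
    """Return the learning rate for a 1-based training cycle.

    The rate stays at base_rate until cycle halve_from, and from then on it
    is half of the value used in the previous cycle.
    """
    if cycle < halve_from:
        return base_rate
    return base_rate * 0.5 ** (cycle - halve_from + 1)


# Schedule for the 12 cycles mentioned in the text.
schedule = [learning_rate_for_cycle(c) for c in range(1, 13)]
```

Under these assumptions the rate is 0.1 for cycles 1 to 6, 0.05 in cycle 7, 0.025 in cycle 8, and so on, so later cycles take progressively smaller steps as the model approaches convergence.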
It can be seen from Figure 8 that when the learning rate is smaller, the model converges better; the difference lies in the performance of the model. As the learning rate increases, the network initially learns better, but once the value has increased beyond a certain level, the performance of the model tends to deteriorate. This indicates that too high a learning rate causes the network to over-learn and affects its overall capacity. This article also tried higher learning rates, which can prevent the model from converging. In the experiment, a good result was obtained with a setting of 1.3, which is very different from the setting of 0.1; in particular, when the batch value is 32, the total learning rate is 1.3. When this learning rate is divided evenly among the examples in a batch, each training cycle still makes learning progress, so the problem of a learning rate that is too low does not arise.
The results are shown in Table 4, where the RNN-type model trained on the CPU serves as the baseline. It can be seen from Table 1 that when the batch value is 1, the training efficiency of the GPU-based model is more than 2 times that of CPU training, and the system recognition rate does not drop much. This demonstrates that RNN training was successfully implemented on the GPU and that some training acceleration was achieved. When the batch value is 64, the number of words processed by the network per second is nearly 19 times that for batch = 1 [23,24], which is about 38 times the performance on the CPU. The system recognition rate after optimization is relatively close to that of the baseline model; although there is a decrease, it is not large. When a 3G model is inserted, the recognition rate decreases to a lesser extent while training efficiency improves significantly, which makes it possible to train RNNs on large data sets. In the recognition algorithm based on LVCSR with words as the modeling unit, the speech signal is recognized and converted into text form, and keyword recognition is then realized by text search.

GSM VAD1 Algorithm Simulation. This article built its own voice database, covering a set of 10 Chinese keywords. The voice sources include recordings and webcasts. The voice types include isolated keywords and sentences containing keywords. Recording was done in a quiet environment, and the sentences containing keywords were drawn up in advance. Each keyword appears in 3~5 sentences, which are read several times. How, then, should the number of states of the speech model be chosen for the speech feature vectors? That is, how many eigenvectors are appropriate? The answer can be read from how the classification gap changes as the number of cluster centers increases. If, once the number of states has reached L, a further increase leaves the classification gap d almost unchanged, then continuing to add states is not worthwhile. Based on Fisher's algorithm, this paper calculates the optimal segmentation deviation of different Chinese syllables for different numbers of states. The four curves in Figure 9 correspond to the sounds "bai," "bang," "dao," and "gai." The curves show that once the number of states of the speech model reaches about seven [25,26], the further decrease in segmentation deviation is already very small; from this perspective, choosing 6-8 states for the speech model, according to the average probability, is appropriate. When the threshold is low, models with few training samples are also trained. These models do not have enough training samples to be trained fully. This not only fails to optimize the model parameters but also produces the opposite effect, so the accuracy of the system is low. When the threshold is increased to about 400, the recognition accuracy is the highest. As the threshold keeps increasing, some models have enough training examples to be trained fully.
But because their sample size is less than the defined threshold, these models are not optimized. Therefore, the recognition rate is reduced. Table 5 shows the recognition results after adding the intersyllable transition model to the context-independent syllable model system [10]. Compared with the context-independent syllable model alone, adding the transition model further reduces the system word error rate. However, although the transition model takes coarticulation between syllables into account, the improvement is not pronounced. There are two reasons: On the one hand, syllable pronunciation is not as seriously affected as the pronunciation. On the other hand, it may be necessary to improve the accuracy of the
Figure 9: Optimal segmentation deviation curves of "bai," "bang," "dao," and "gai" for different numbers of states.
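The state-count selection discussed in this subsection can be sketched in code. The sketch assumes that "Fisher's algorithm" refers to Fisher's optimal partition of ordered samples, in which a frame sequence is split into k contiguous segments so that the within-segment sum of squared deviations is minimal. The function names, the scalar (one-dimensional) frame features, and the 5% stopping tolerance are assumptions made for illustration, not the authors' implementation.

```python
def fisher_partition_losses(frames, max_states):
    """Minimum within-segment scatter when `frames` is split into k = 1..max_states
    contiguous segments. Returns a list whose entry k-1 is the loss for k segments."""
    n = len(frames)
    if not 1 <= max_states <= n:
        raise ValueError("max_states must be between 1 and the number of frames")

    # Prefix sums give the scatter of any segment in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for t, x in enumerate(frames):
        prefix[t + 1] = prefix[t] + x
        prefix_sq[t + 1] = prefix_sq[t] + x * x

    def scatter(i, j):
        """Sum of squared deviations from the mean for frames i..j-1."""
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return max(0.0, sq - s * s / (j - i))

    inf = float("inf")
    # best[k][j]: minimum loss for splitting the first j frames into k segments.
    best = [[inf] * (n + 1) for _ in range(max_states + 1)]
    best[0][0] = 0.0
    for k in range(1, max_states + 1):
        for j in range(k, n + 1):
            best[k][j] = min(best[k - 1][i] + scatter(i, j) for i in range(k - 1, j))
    return [best[k][n] for k in range(1, max_states + 1)]


def choose_states(losses, tolerance=0.05):
    """Smallest k for which adding one more state removes less than
    `tolerance` of the single-segment loss (an assumed stopping rule)."""
    total = losses[0]
    for k in range(1, len(losses)):
        if total == 0 or (losses[k - 1] - losses[k]) / total < tolerance:
            return k
    return len(losses)


if __name__ == "__main__":
    toy_frames = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 9.0, 9.0, 9.0]  # illustrative data
    losses = fisher_partition_losses(toy_frames, 4)
    print(losses, choose_states(losses))
```

Plotting the returned losses against k gives a curve of the same kind as those in Figure 9: the loss falls quickly at first and flattens once extra states no longer separate the frames any better.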

Experimental Results and Analysis
4.1. Analysis of Algorithm Results. In a speech signal, the speech segment within a single phoneme can span multiple consecutive frames, and the pronunciation has a certain degree of continuity. Therefore, in this paper, the average of the frame features over a phoneme segment is used as the segment feature. The detailed theory of the segment feature was introduced above; this section mainly describes the process of extracting segment features. The baseline system used for comparison is a concatenative synthesis system that selects units with traditional cost functions and uses the same sound library. The cost tables of that system were edited manually by audio experts. The test material was chosen from 11 common text-to-speech application domains, with 20 sentences selected in each domain. The sentences were synthesized by each of the two systems, and the results were rated on a MOS scale of 1 to 5 by 5 listeners, all of whom are professional assessors from outside the system. The assessors did not know the test content in advance. During the test, the synthesized sounds of the two systems were played in random order. The statistical data and final average scores of the two systems were compiled for each text domain.
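The segment feature defined at the start of this subsection, the mean of the frame features inside one phoneme segment, can be written down directly. The following is a minimal sketch, assuming that frame features are plain lists of floats and that phoneme boundaries are given as (start, end) frame indices with an exclusive end; both conventions and the function name are assumptions for illustration.

```python
def segment_features(frames, boundaries):
    """Average the frame feature vectors inside each (start, end) phoneme segment.

    frames     -- list of equal-length feature vectors, one per frame
    boundaries -- list of (start, end) frame indices, end exclusive
    """
    features = []
    for start, end in boundaries:
        segment = frames[start:end]
        if not segment:
            raise ValueError(f"empty segment ({start}, {end})")
        dim = len(segment[0])
        features.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return features


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # illustrative 2-D features
    print(segment_features(toy_frames, [(0, 2), (2, 3)]))
```

Averaging discards the order of frames inside a segment, which is acceptable here because the text assumes the pronunciation within one phoneme is largely continuous.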
(1) After the statistical model is used for unit selection, the synthesis results improve consistently in all domains, and the average MOS score increases by about 0.5 points. We also performed a t-test on the results of the two systems, and it confirmed that the improvement is significant (p < 0.05).
(2) Comparison of speech MOS scores. When text from different domains is synthesized with the unit selection method based on noise model statistics, the maximum difference between domains falls from 0.312 points for the baseline system to 0.228 points, which indicates a more stable synthesis quality.
(3) Because the training data are unevenly distributed and some syllables have too few training samples, those syllables cannot be trained sufficiently. Therefore, the recognition accuracy of the syllable model trained directly with the HTK tool is much lower than that of the other models. In the experiments, training of the class-based RNN model on the CPU is used as the baseline. When the batch value is 1, the training efficiency of the model on the GPU is more than 2 times that of training on the CPU, while the recognition rate of the system does not decrease greatly. This indicates that class-based RNN training on the GPU was implemented successfully and that a certain training acceleration was achieved. When the batch value is 64, the number of words processed by the network per second increases further relative to batch = 1.
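The significance check on the MOS scores mentioned in item (1) could be reproduced along the following lines. The text does not say which form of t-test was used, so this sketch assumes a paired t-test over per-sentence MOS scores of the two systems. The scores below are invented for illustration only, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on two equal-length lists of scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)                      # sample standard deviation
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Invented MOS scores for five sentences (not data from the paper).
    baseline = [3.0, 3.2, 3.1, 2.9, 3.3]
    proposed = [3.6, 3.5, 3.7, 3.4, 3.9]
    t = paired_t_statistic(baseline, proposed)
    critical = 2.776   # two-sided 5% critical value of Student's t, 4 degrees of freedom
    print(f"t = {t:.3f}, significant at 5%: {abs(t) > critical}")
```

The critical value depends on the number of pairs (degrees of freedom = pairs minus one), so it has to be looked up again for a different sample size.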

Rational Analysis of Foreign Students' Foreign Language Ability. We analyzed the Hanyu Pinyin data with Praat and then used SPSS to analyze the connection between the results of the Hanyu Pinyin experiment and the motivation to learn Chinese. In Praat, the pronunciation curves of the Southeast Asian learners of Chinese are similar to those of standard Chinese pronunciation: for the same analyzed item, the sound curves run in the same direction. (Because we have too much pinyin data, we chose the most representative example, as shown in Figure 10.) This paper considers that Praat has produced valuable data. In addition, we calculated the accuracy of Chinese phonetic production and plotted the results as a histogram. As the histogram of Chinese phonetic production results shows, 70% of the subjects scored between 55 and 80, 20% below 55, and 10% above 80. HSK means Chinese Proficiency Test, and the full name is Hanyu Shuiping Kaoshi.
This makes students tired and easily distracted. Boring classroom content leads to a dull classroom atmosphere, makes it difficult for students to grasp the key points, and causes them to lose interest in classroom learning. Students who lose interest in classroom learning believe that the problem is caused not only by the traditional teacher-centered teaching method but also by a list-like way of presenting material, which they take as the teacher's disregard for the course or the students. Some teachers also concentrate on lecturing and include few interactive activities, yet their explanations are concise and clear, with key points and highlights, which helps students master the textbook and stay focused. Most problems of this type occur at the intermediate and advanced stages. At the elementary stage of learning Chinese in China, the teacher focuses on practice in class, spends less time explaining the teaching materials, and includes more interactive activities.
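The SPSS analysis at the start of this subsection relates phonetic test scores to learning motivation and summarizes the score distribution. A minimal sketch of both computations is given below, assuming that the relation is measured with a Pearson correlation coefficient (the text does not name the statistic) and that scores are grouped into the three bands quoted above. The function names and all sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def band_shares(scores, low=55, high=80):
    """Fractions of scores below `low`, within [low, high], and above `high`."""
    n = len(scores)
    below = sum(s < low for s in scores)
    above = sum(s > high for s in scores)
    return below / n, (n - below - above) / n, above / n


if __name__ == "__main__":
    # Invented questionnaire and test values, not data from the study.
    motivation = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
    phonetic = [52, 58, 63, 70, 74, 83]
    print(f"r = {pearson_r(motivation, phonetic):.3f}")
    print(band_shares(phonetic))
```

A coefficient near +1 would correspond to the positive relation between intrinsic motivation and test scores reported in the paper, and a coefficient near -1 to an inverse relation such as the one reported for starting age.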

Conclusion
By combining the classic VAD and GSM VAD1 algorithm simulations, this paper concludes that the HSK score is positively correlated with the phonetic test score. The initial age of Chinese acquisition is negatively correlated with intrinsic motivation (interest). The intrinsic motivation (interest) for learning Chinese is positively correlated with HSK scores. There is also an inverse relation between the initial age of Chinese acquisition and extrinsic motivation (status, etc.). These relations appear complex, but they share one connection: the initial age of Chinese acquisition affects the motivation of Southeast Asian students to learn Chinese. As far as intrinsic motivation is concerned, the earlier foreign students come into contact with Chinese, the more obvious the impact of intrinsic motivation on HSK and Chinese phonetic test scores. The more enthusiasm and interest they have in learning Chinese, the better their Chinese performance will be. Conversely, students with low motivation and interest in learning Chinese perform poorly. As for extrinsic motivation, the starting age of Chinese acquisition has an impact on the money, status, purpose, and other extrinsic motivations of Southeast Asian students, but the impact is not significant. According to the results from Praat and SPSS, intrinsic motivation plays a crucial role in CSL acquisition and helps mature learners from Southeast Asia acquire Chinese better. The earlier they learn Chinese, the higher their motivation will be, and the easier it will be for them to set their Chinese learning goals.
The more motivated they are to acquire Chinese, the better their Chinese scores (such as HSK test scores and Chinese phonetic test scores); the higher their interest in acquiring Chinese, the better their Chinese performance. Extrinsic motivation has little influence on the Chinese acquisition of Southeast Asian students, while intrinsic motivation has a profound influence. Extrinsic incentive alone is sometimes insufficient to affect overseas students' Chinese phonetic level. For instance, encouragement and recognition may not be major influences. If learners of CSL desire it, their intrinsic drive will be critical. Extrinsic and intrinsic motivation, on the other hand, can change into one another, and extrinsic motivation can also turn into pressure that affects learners' study. In conclusion, intrinsic motivation plays a vital role in the CSL acquisition of mature Southeast Asian students. How Chinese teachers can use intrinsic motivation to raise the skill and level of CSL learners, and especially how they can help students develop self-confidence, are problems that need long-term research.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.