Automatic Scoring of Spoken Language Based on Basic Deep Learning



Introduction
In recent years, information technology has been widely used in the field of education. In language education, English education is becoming more and more popular in China, and traditional language teaching methods can hardly meet people's needs [1]. In this context, Computer Assisted Language Learning (CALL) has become a research hotspot [2]. CALL systems are used not only in online education but also in English education platforms based on text, image, audio, and video, which play an important role on the Internet. When students' test questions and homework are corrected automatically rather than by teachers between classes, teachers are freed from the time spent on correction. Current automatic correction systems have reached an almost completely correct level on objective questions. For composition questions and oral questions, automatic correction is still a research focus in which breakthroughs are needed. Oral questions can be divided into two types [3]. One is retelling, reading aloud, or reciting known material. In the other, candidates speak freely on a specific question or topic; this type is often called "open spoken English." With the development of speech recognition technology, the first type can be evaluated well by comparing and analyzing the examinee's pronunciation against the standard pronunciation [4], for example with the classical Goodness of Pronunciation (GOP) algorithm. For the second type, it is necessary to evaluate candidates' answers comprehensively along multiple dimensions, for example, fluency, rhythm, intonation, richness of vocabulary, and meaning. For a long time, research on open oral scoring technology made no great breakthrough. With the development of machine learning technology, some scholars have studied how to apply it to automatic oral evaluation.
Thus, the well-known automatic speech scoring system appeared [5].
In universities, the English skills training service system is used for examinations in the teaching of situational English. At least a mid-term and a final evaluation are conducted every semester. For both exams, each teacher is usually responsible for multiple classes. Because manual grading is complicated, the teachers' burden is aggravated, and their energy for teaching is insufficient. Studying the intelligent correction system needed for oral tests in senior high schools in China [6] can greatly reduce the pressure on teachers, who can then put more effort into practical teaching activities and improve their teaching ability.
There are extremely few methods for scoring speakers with speech disorders. We study automatic speech scoring, which is a kind of assessment for speakers with language disorders [7]. With the development of society and the integration of the global economy, people's demand for English learning is increasing day by day, so research on the automatic assessment of oral proficiency is particularly important. In previous automatic evaluation systems, recording conditions, learners' pronunciation, background noise, etc. are a challenge. In addition, it is necessary to deal with nonfluent, nongrammatical, and spontaneous speech whose underlying text is unknown. To solve these problems, we propose a method that combines a speech recognition system based on deep learning with a Gaussian process (GP) scorer, together with a measure to evaluate the performance of the rejection scheme [8]. Like the earlier vector space model, the LSA method [9] also uses vectors to represent words and documents and judges the relationship between words and documents according to the relationship between vectors. The vector space model has two shortcomings: (1) It relies on exact word matching.

Related Technology Research
(2) It ignores the meaning of words and cannot provide semantic search. LSA solves the above problems by statistically analyzing a large text corpus and mapping documents from a sparse n-dimensional space to a low-dimensional space. This low-dimensional vector space is called the semantic space. The document modeling process using the LSA method, shown in Figure 1, is as follows:
(1) Analyze the document set and create a word-document matrix.
(2) Perform singular value decomposition (SVD) of the word-document matrix.
(3) Reduce the dimension of the matrix after singular value decomposition.
The entries of the word-document matrix are weighted by TF-IDF. In Step 2, the word-document matrix A is decomposed as
$$A = U S V^{T}$$
Matrix S is an m × m dimensional diagonal matrix, and each value on the diagonal represents the importance of one topic; this value is also called a singular value. Then, in Step 3, a dimension reduction is performed: only the K largest topics of U are stored, and only the K topic vectors corresponding to S and V are kept. As shown in Figure 2, the resulting matrix can be expressed by the following formula:
$$A_K = U_K S_K V_K^{T}$$
To calculate the similarity between a query text and all the texts in the document set, the query text must first be mapped to the semantic space.
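Step 1 of the LSA process can be illustrated with a small sketch. It assumes a common TF-IDF variant (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency); the function name `tfidf_matrix` and the sample sentences are hypothetical and are not taken from the scoring system.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is the TF-IDF of word i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    matrix = []
    for word in vocab:
        idf = math.log(n_docs / df[word])  # assumed IDF variant, no smoothing
        row = []
        for toks in tokenized:
            tf = toks.count(word) / len(toks) if toks else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

docs = [
    "the great wall is long",
    "the wall is old",
    "the dynasty built the wall",
]
vocab, A = tfidf_matrix(docs)
```

Words that occur in every document, such as "the" and "wall" here, receive a weight of zero, which is why TF-IDF emphasizes topic-specific words. Steps 2 and 3 would then be applied to `A` with a linear algebra library, for example `numpy.linalg.svd`.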

Word Embedding.
To score spoken language with a deep learning model, a neural network model is needed to score the spoken content of the examinee [10,11]. The existing one-hot encoding of words has the following main problems: (1) Each vector marks only one word, so if there are N distinct words in the text, an N-dimensional vector is needed for encoding. Therefore, if the number of distinct words in the text is large, the dimension of the vector becomes large. In addition, as the number of neurons increases, the computation becomes more complex. (2) The simple one-hot encoding scheme cannot describe the semantic relationship between words. With word embedding, words are represented by low-dimensional vectors, and words with similar meanings have vectors that are close to each other. Figure 3 shows one-hot encoding and word embedding, respectively; more abundant information can be embedded into the low-dimensional vectors. When a neural network is used to solve text problems, the network architecture shown in Figure 4 is usually used. The first layer of the network is the word embedding layer, which transforms the words in the input text into word vectors. For example, if the length of the word embedding vector is set to 50, a text containing 20 words becomes a 20 × 50 two-dimensional matrix after passing through the word embedding layer. The word embedding layer can be interpreted as a dictionary model that stores the mapping between each word index and its word vector. This model can be obtained through training on data or loaded from a pretrained model. Word2vec and GloVe are commonly used pretrained word embedding models [12]. Based on the latter, this paper constructs a scoring model of oral content. In a neuron, the weight of each input can be abstractly understood as signal strength. After weighted summation, the input signals are processed by an "activation function" to generate an output signal Y.
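The contrast between one-hot encoding and an embedding lookup can be sketched as follows. The vocabulary, the helper names, and the random vectors are hypothetical; the random values only stand in for trained or pretrained (for example GloVe) vectors.

```python
import random

def one_hot(index, vocab_size):
    """One-hot vector: all zeros except a single 1 at the word's index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def build_embedding_table(vocab_size, dim, seed=0):
    """Hypothetical embedding table mapping word index -> dense vector."""
    rng = random.Random(seed)
    return {i: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for i in range(vocab_size)}

def embed(tokens, word_to_index, table):
    """Look up each token's vector, giving a (len(tokens) x dim) matrix."""
    return [table[word_to_index[t]] for t in tokens]

vocab = [f"word{i}" for i in range(1000)]          # 1000 distinct words
word_to_index = {w: i for i, w in enumerate(vocab)}
table = build_embedding_table(len(vocab), dim=50)

tokens = vocab[:20]                                 # a text containing 20 words
matrix = embed(tokens, word_to_index, table)        # 20 x 50 matrix
sparse = one_hot(word_to_index["word3"], len(vocab))  # 1000-dimensional vector
```

With one-hot encoding the vector length grows with the vocabulary (1000 here), whereas the embedding keeps a fixed length of 50 per word regardless of vocabulary size.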
The learning ability of a neural network is strong because of its activation function. If no activation function is used, the network can only perform simple linear transformations, so its performance is limited. The activation function introduces nonlinear elements into the network, which allows the neural network to approximate various nonlinear curves arbitrarily closely and gives the network strong representation ability. Commonly used activation functions are Sigmoid, Hyperbolic Tangent, and ReLU; Figure 6 shows the function graphs of these three activation functions. The information transfer relationship between nodes is explained in the following formula, in which f denotes the activation function:
$$y_j = f\left(\sum_{i} w_{ij} x_i + b_j\right)$$

Research on
where $x_i$ represents the output value of the i-th node of the previous layer (that is, an input value of the current node j), $w_{ij}$ represents the weight between the i-th node of the previous layer and the j-th node of the current layer, $b_j$ represents the bias of the j-th node of the current layer (the bias is introduced to make the model converge better), and $y_j$ represents the output value of the j-th node of the current layer.
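A minimal sketch of this node computation follows, assuming each node applies the activation function to the weighted sum of its inputs plus its bias. The function `layer_forward` and the numeric values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def layer_forward(x, w, b, activation):
    """Compute y[j] = activation(sum_i w[i][j] * x[i] + b[j]) for every node j."""
    outputs = []
    for j in range(len(b)):
        z = sum(w[i][j] * x[i] for i in range(len(x))) + b[j]
        outputs.append(activation(z))
    return outputs

x = [1.0, 2.0]                    # outputs of the previous layer
w = [[0.5, -1.0], [0.25, 0.5]]    # w[i][j]: weight from node i to node j
b = [0.0, -1.0]                   # bias of each node in the current layer

y_relu = layer_forward(x, w, b, relu)
y_tanh = layer_forward(x, w, b, math.tanh)
y_sigmoid = layer_forward(x, w, b, sigmoid)
```

The weighted sums here are 1.0 and -1.0, so the three activation functions give visibly different outputs for the same inputs: ReLU clips the negative sum to zero, while tanh and Sigmoid squash both values into a bounded range.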

Basic Concepts of Deep Learning.
The differences between deep learning and traditional machine learning are as follows. Feature extraction is fully automated, so people do not have to go to great lengths to find suitable initial input features. Through the network, the data takes on a higher-level and more abstract representation. Feature extraction is the core step in traditional machine learning. Figure 7 describes the training process simply: the network is adjusted until its output matches the expected result (or is as close as possible to it).
Some core terms in the above figure are briefly introduced below.
(1) Loss function: it is used to calculate the difference between the predicted data and the actual data of the neural network. In the current field of deep learning, various deep learning models have been developed. This paper focuses on the convolutional neural network and the recurrent neural network.

BP Neural Network.
The BP neural network has strong nonlinear mapping ability and can approximate any continuous function with high precision [13]. It is an extremely effective model for problems such as regression and classification. The training process of the BP neural network is mainly divided into two stages [14]. The first stage is the forward propagation of signals, from the input layer to the hidden layer and finally to the output layer. In the second stage, the backpropagation algorithm propagates the error backward from the output layer through the hidden layer to the input layer, and the weights and biases are adjusted in turn. When a network is trained on a large amount of data comprising multiple samples, the mean square error is taken as an example of the loss function: it is the mean square value of the error after forward propagation of the training data, where $y_k^{(i)}$ represents the true output value of the i-th data sample and $t_k^{(i)}$ represents the predicted value obtained after the i-th data sample passes through the neural network. The BP neural network uses the learning rate and the gradient descent algorithm to update the connection weights and bias values of each layer. The whole backpropagation process can be explained by the following formula, where E is the loss function and $\eta$ is the learning rate:
$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}, \qquad b_j \leftarrow b_j - \eta \frac{\partial E}{\partial b_j}$$
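The two training stages can be sketched on a very small network. This is only an illustration under stated assumptions: one input, three Sigmoid hidden nodes, one linear output, a mean square error averaged over the samples, and plain full-batch gradient descent. The data, the network size, and the parameter names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, p):
    """Stage 1: forward propagation, input -> hidden -> output."""
    h = [sigmoid(p["w1"][j] * x + p["b1"][j]) for j in range(len(p["w1"]))]
    t = sum(p["w2"][j] * h[j] for j in range(len(h))) + p["b2"]
    return h, t

def mse(data, p):
    """Mean square error between true values y and predictions t."""
    return sum((y - forward(x, p)[1]) ** 2 for x, y in data) / len(data)

def train_step(data, p, lr):
    """Stage 2: backpropagate the error, then update weights and biases."""
    n_hidden = len(p["w1"])
    g_w1, g_b1, g_w2 = [0.0] * n_hidden, [0.0] * n_hidden, [0.0] * n_hidden
    g_b2 = 0.0
    for x, y in data:
        h, t = forward(x, p)
        d_out = 2.0 * (t - y) / len(data)            # dE/dt for this sample
        g_b2 += d_out
        for j in range(n_hidden):
            g_w2[j] += d_out * h[j]
            d_hid = d_out * p["w2"][j] * h[j] * (1.0 - h[j])  # error at hidden node j
            g_w1[j] += d_hid * x
            g_b1[j] += d_hid
    for j in range(n_hidden):
        p["w1"][j] -= lr * g_w1[j]
        p["b1"][j] -= lr * g_b1[j]
        p["w2"][j] -= lr * g_w2[j]
    p["b2"] -= lr * g_b2

rng = random.Random(0)
params = {
    "w1": [rng.uniform(-0.5, 0.5) for _ in range(3)], "b1": [0.0] * 3,
    "w2": [rng.uniform(-0.5, 0.5) for _ in range(3)], "b2": 0.0,
}
data = [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]         # toy (input, target) pairs

loss_before = mse(data, params)
for _ in range(500):
    train_step(data, params, lr=0.1)
loss_after = mse(data, params)
```

Comparing `loss_before` with `loss_after` shows the effect of repeating the forward and backward stages; a practical model would use a framework's automatic differentiation rather than hand-written gradients.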

Convolution Neural Network.
One-dimensional convolutional neural networks work well on sequence data, such as audio signals and text data, and in some cases their performance can match that of recurrent neural networks [15,16]. The computational cost is usually quite small, and the model can still achieve good performance. Figure 8 shows the operation principle of the one-dimensional convolution network, where features is the data length of each feature. The network output data format after the convolution operation is (samples, new_steps, filters), where new_steps is the length of the feature sequence after the convolution operation and filters is the number of convolution kernels.
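How the output length and the number of filters arise can be seen in a small sketch for a single sample. It assumes a stride of 1 and no padding, so that the new sequence length is steps - kernel_size + 1; the function `conv1d` and the numbers are illustrative.

```python
def conv1d(sequence, kernels):
    """One-dimensional convolution for one sample, stride 1, no padding.

    sequence: steps x features
    kernels:  filters x kernel_size x features
    returns:  new_steps x filters
    """
    kernel_size = len(kernels[0])
    new_steps = len(sequence) - kernel_size + 1
    output = []
    for start in range(new_steps):
        row = []
        for kernel in kernels:
            total = 0.0
            for k in range(kernel_size):
                for f, weight in enumerate(kernel[k]):
                    total += weight * sequence[start + k][f]
            row.append(total)
        output.append(row)
    return output

# 5 steps, 2 features per step
sequence = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
kernels = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # sums feature 0 over a window of 3
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],  # sums feature 1 over a window of 3
]
output = conv1d(sequence, kernels)          # 3 steps, 2 filters
```

The 5-step input becomes a 3-step output with one column per convolution kernel, which is the shortening effect that makes a convolution layer useful in front of a recurrent layer.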

Recurrent Neural Network.
Recurrent neural networks (RNN) can circulate information within the network. In networks such as CNN, the output considers only the current input and does not consider the inputs at other times. In an RNN, the output at each moment is related not only to the input at the current moment but also to the inputs at previous moments, so the network has a "memory" function. Therefore, RNN is extremely suitable for processing sequence data, especially text data. The hidden state $h_t$ and the output $o_t$ are calculated at each moment t. The conventional RNN model is suited only to short sequence data. To solve the problem of insufficient "long-term memory" capacity in traditional RNNs, researchers improved the model into the Long Short-Term Memory (LSTM) network. The LSTM model selectively adds new information and selectively forgets previously accumulated information by introducing a gating mechanism. A new state $c_t$ is introduced in the LSTM network for circulating information. The states of the hidden layer and the memory cell are updated at every moment under the control of three gate controllers.
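One LSTM step can be sketched as follows, assuming the commonly used formulation with a forget gate, an input gate, an output gate, and a tanh candidate state. Scalars are used instead of vectors to keep the example short, and all parameter names (`wf`, `uf`, `bf`, and so on) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; returns the new hidden state and memory cell state."""
    f_t = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    i_t = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    o_t = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g_t = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate state
    c_t = f_t * c_prev + i_t * g_t    # forget part of the old state, add new information
    h_t = o_t * math.tanh(c_t)        # expose part of the cell state
    return h_t, c_t

names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc"]
zero_params = {name: 0.0 for name in names}

# With all parameters at zero every gate equals 0.5 and the candidate is 0,
# so half of the previous cell state is kept.
h_1, c_1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)

def run_sequence(xs, p):
    """Process a whole sequence, carrying h and c from step to step."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h, c
```

The forget gate scales the previous cell state and the input gate scales the new candidate, which is how the network selectively keeps or discards accumulated information.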

Overall System Design.
Combining deep learning technology with object-oriented design, the oral scoring system designed in this paper includes six modules, as shown in Figure 9.
(1) Oral scoring module: calls the scoring model module, loads the trained scoring model, automatically scores the oral data, and saves the scoring results as an Excel file.
(2) Sound noise reduction module: to make the results of speech recognition and feature extraction more accurate, noise reduction is applied to the examinee's speech recording.
Intelligent spoken language evaluation is the dynamic process from audio input to total score output and can be described by the scoring flow in Figure 10 [17]. The recording is first denoised by the sound noise reduction module, and the speech recognition engine then converts the clean recording into the corresponding text content. The overall scoring system fits the feature values according to the scoring model. Two scoring models are used here: a speech scoring model and a text scoring model, designed to improve the accuracy of the scoring system. In addition, in the actual grading environment, the teacher also evaluates the examinee's speech at the levels of sound and content.
This design is consistent with the manual scoring method. The design of the core modules of the system is described in detail below.

Design of Speech Noise Reduction Module.
Because of problems with the recording device, recordings of spoken language are often mixed with electrical hum and noise. This affects the correctness of subsequent feature extraction and speech recognition. Traditional noise reduction methods use spectral subtraction or adaptive filtering. In recent years, owing to the successful application of deep learning in sound processing, deep learning techniques for noise reduction have improved and become popular. In this paper, RNNoise, an open source noise suppression library, is used to implement the speech noise reduction module. RNNoise uses gated recurrent units (GRU) to implement its noise reduction neural network; GRU is a variant of LSTM. By introducing a gating mechanism, a GRU network can store information for a long time. RNNoise trains its model on clean speech data (English conversation recordings) and noise data (computer fan sounds, office noise, street crowd noise, etc.). Therefore, a wider range of signal-to-noise ratios is covered, and the noise reduction effect becomes more remarkable. In addition, RNNoise is written in C. In the speech noise reduction module, a Python wrapper for RNNoise is written so that RNNoise can be integrated into the system.
Microsoft uses a local recognition engine. The recognition speed is the fastest, but the ambiguity is extremely high.

Data Cleaning.
Because of the examinee's limited oral fluency and the recognition errors of the speech recognition engine itself, the speech recognition text often contains results that affect the accuracy of the text scoring model, for example: "this video is about the Chinese and China great wall, um; the great wall is built by the king in dynasty." The affected features include the number of syntax errors and the depth of the syntax tree. In addition, there are filler words such as uh and um. Removing these filler words makes the grammatical features of the text more accurate without affecting the overall text content.
To build the LSA topic model, "stop words" such as "the," "is," and "at" must be removed [18]. These stop words have little substantive meaning for the topic model. Removing them also makes the generated model more efficient.
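The cleaning steps described above can be sketched as a single function. The filler and stop-word sets contain only the examples named in the text, and the handling of immediately repeated words is an assumption about how recognition duplicates are treated; the function name `clean_transcript` is hypothetical.

```python
import re

FILLERS = {"uh", "um"}               # filler words named in the text
STOP_WORDS = {"the", "is", "at"}     # examples from the text; a real list is longer

def clean_transcript(text, remove_stop_words=False):
    """Lowercase, tokenize, drop filler words and immediately repeated words.

    Stop words are removed only on request, because they are needed for
    grammar features but not for the topic model.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    cleaned = []
    previous = None
    for token in tokens:
        if token in FILLERS:
            continue
        if token == previous:        # repeated word, e.g. "is is"
            continue
        previous = token
        if remove_stop_words and token in STOP_WORDS:
            continue
        cleaned.append(token)
    return cleaned

raw = "um the the wall is is old uh"
for_grammar = clean_transcript(raw)
for_topic_model = clean_transcript(raw, remove_stop_words=True)
```

Keeping two outputs reflects the two uses in the text: grammar features need the full sentence without fillers, while the LSA topic model works on the content words only.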

Feature Extraction.
Feature extraction is an extremely important step before machine learning and determines the reliability and accuracy of the evaluation model. In feature screening, the importance of each feature is measured by calculating its Pearson correlation coefficient with the manual score, and features with a correlation coefficient below 0.2 are not selected [19]. There is generally no fixed reference answer for open oral scoring, so besides semantic similarity features, we mainly choose general-purpose features. Each feature finally selected and used here is listed in Table 1 and briefly described below.
In this paper, features on four scales are extracted for the oral scoring model. The speed feature is usually called Rate of Speech (ROS); it mainly describes the fluency of spoken language and is calculated by the following formula:
$$\mathrm{ROS} = \frac{N_{\mathrm{words}}}{t - t_s}$$
where $N_{\mathrm{words}}$ represents the total number of words contained in the examinee's spoken language, $t$ represents the total duration of the oral recording, and $t_s$ represents the silent duration of the recording.
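As a small worked example, the sketch below assumes that ROS is the word count divided by the non-silent part of the recording, that is, the total duration minus the silent duration. The function name and the numbers are illustrative.

```python
def rate_of_speech(n_words, total_seconds, silent_seconds):
    """Words per second of actual speaking time (assumed definition of ROS)."""
    speaking_seconds = total_seconds - silent_seconds
    if speaking_seconds <= 0:
        raise ValueError("recording contains no speech")
    return n_words / speaking_seconds

# 90 words in a 60-second recording with 15 seconds of silence
ros = rate_of_speech(n_words=90, total_seconds=60.0, silent_seconds=15.0)
```

Under this definition, the same 90 words with no silence would give 1.5 words per second instead of 2.0, so long pauses raise ROS rather than lower it; the pauses themselves are captured separately by the silence-count feature.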
Besides the speech rate feature, the number of silences during the recording can also reflect the examinee's oral fluency to a certain extent. In the evaluation of pronunciation quality, the pronunciation posterior probability feature is adopted by many oral scoring systems. This paper uses this feature to describe the correctness of the examinee's pronunciation. In addition, the proportion of effective speech in the total recording duration can also reflect, to a certain degree, the richness of the spoken content. In the oral evaluation of traditional read-aloud questions, the standard phoneme sequence corresponding to the reference text is usually obtained, the test speech is force-aligned to it, and the average posterior probability of each phoneme is calculated by the classical GOP algorithm. However, there is no reference text in open oral scoring, so it is necessary to combine the speech recognition engine with an acoustic model trained on standard English pronunciation and calculate the average posterior probability as the pronunciation quality feature.
Chapter structure and similar features are not suitable for a text scoring model designed for such short texts. For such short text, sentence structure is a very good alternative, and the depth of the syntax tree is used to describe the structural features of sentences. Candidates who are not used to spoken dialogue will have a lower syntax tree depth than usual. There are commonly used algorithms for calculating the semantic similarity of articles. Vector Space Model (VSM) [20], Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA) are three methods based on the bag-of-words model, but the degree to which they capture meaning varies by method. Actual testing found that the LSA topic model is more effective on the data set used here. Figure 11 shows the process of building the LSA topic model.
Some common part-of-speech tags are shown in Table 2. Famous original English novels are assumed to contain no grammatical errors. This paper follows the method in EASE, an open source essay scoring system. After the Sherlock Holmes novel collection is part-of-speech tagged, the 3-gram and 4-gram tag combinations are extracted, and the extracted results are saved as a local retrieval library of tag combinations. If a tag combination from the text being scored cannot be found in the library, it is counted as a grammatical error. The correct rate of the text syntax is calculated from these lookups.
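The tag-combination lookup can be sketched as follows. The sketch assumes the correct rate is the share of a text's 3-gram and 4-gram tag combinations that appear in the library; the exact ratio used in this paper may differ. The tag sequences are supplied directly and are hypothetical; in practice they would come from a part-of-speech tagger.

```python
def ngrams(tags, n):
    """All consecutive n-tag combinations in a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_tag_library(tagged_sentences, sizes=(3, 4)):
    """Collect every 3-gram and 4-gram tag combination from a trusted corpus."""
    library = set()
    for tags in tagged_sentences:
        for n in sizes:
            library.update(ngrams(tags, n))
    return library

def good_grammar_ratio(tags, library, sizes=(3, 4)):
    """Share of the text's tag combinations found in the library (assumed ratio)."""
    candidates = [gram for n in sizes for gram in ngrams(tags, n)]
    if not candidates:
        return 1.0  # too short to contain any combination
    found = sum(1 for gram in candidates if gram in library)
    return found / len(candidates)

# Hypothetical tagged corpus standing in for the novel collection
corpus = [["DT", "NN", "VBZ", "JJ"], ["PRP", "VBZ", "DT", "NN"]]
library = build_tag_library(corpus)

ratio_known = good_grammar_ratio(["DT", "NN", "VBZ", "JJ"], library)
ratio_mixed = good_grammar_ratio(["DT", "NN", "VBZ", "DT"], library)
```

In the second call only one of the three tag combinations is in the library, so the ratio drops to one third; a larger trusted corpus makes such misses more likely to indicate real grammatical errors rather than rare but valid constructions.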

Data Conversion.
Deep learning can automatically extract features, so feature engineering is not needed. As shown in Figure 12, the vectorization flow for the text first performs data cleaning on the speech recognition text, removing filler words and eliminating duplicate words caused by recognition errors. Mel Frequency Cepstral Coefficients (MFCC) are extracted from the spoken recording data as input to the speech scoring model. MFCC contains comprehensive voice information. Figure 13 is a schematic flowchart of converting spoken recording data into MFCC feature vectors.
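The first stages that commonly precede MFCC computation can be sketched as follows: pre-emphasis, framing with a Hamming window, and the Hz-to-mel conversion used to place the filter bank. The later stages (Fourier transform, mel filter bank, logarithm, and discrete cosine transform) are omitted. The parameter values are typical defaults chosen for illustration, and it is an assumption that the flow in Figure 13 uses comparable settings.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n - 1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - alpha * signal[i - 1] for i in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window.

    frame_len must be at least 2.
    """
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

samples = [float(i) for i in range(10)]   # toy signal standing in for audio samples
emphasized = pre_emphasis(samples)
frames = frame_signal(emphasized, frame_len=4, hop=2)
```

Each resulting frame would then be converted to one MFCC vector, so a recording becomes a sequence of vectors whose length depends on the hop size.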

Scoring Model Module Design.
All neural networks in this study are constructed with the Keras deep learning framework. Keras is a high-level neural network framework written in Python that can run on TensorFlow, CNTK, or Theano.

Scoring Model Based on BP Neural Network.
Through repeated experiments, the number of hidden layers and the number of neurons are determined. When the training results do not converge, the number of hidden layers or of nodes per layer is increased. After the results converge, the number of nodes is reduced appropriately to observe whether better results can be obtained. The text scoring model is taken as an example; apart from the number of input nodes, which is 4 for the text model, the speech scoring model has the same structure.

Scoring Model.
If the manually extracted features are ineffective, that is, if the correlation between the manually extracted features and the manual scores is low, it is difficult for the trained model to fit the data accurately. Deep learning can mine features automatically and learn deeper representations of the data, which breaks through the limits of hand-designed features. The two networks, the convolutional neural network and the LSTM network, are combined to construct the speech scoring model and the text scoring model. The computational cost of a recurrent neural network is very high on very long sequences, so a one-dimensional convolutional neural network is used as a preprocessing step before the LSTM network; it shortens the sequence and extracts higher-level feature representations for the LSTM layer to process. As shown in Figure 14, the design of the speech scoring model includes two consecutive convolution blocks. Finally, a fully connected layer combines the one-dimensional vectors to output the corresponding speech evaluation result.
The design of the text scoring model is shown in Figure 15; the neural network model has five layers in total. The first layer is the word embedding layer, which is defined by the GloVe model. The second layer is a one-dimensional convolution layer that reduces the length of the network input sequence and extracts more effective features. The third layer is the LSTM layer, which can select which information to "store" and which to "forget." The LSTM output is then reduced to a one-dimensional vector by MeanOverTime processing, and the final layer outputs the evaluation result of the spoken content.
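The MeanOverTime step can be illustrated with a short sketch. It assumes that the operation averages the per-step LSTM output vectors along the time axis and optionally skips padded steps; the function name, the mask argument, and the numbers are hypothetical and do not reproduce the layer used in the system.

```python
def mean_over_time(step_outputs, mask=None):
    """Average a (steps x units) matrix over the time axis.

    mask[t] is False for padded time steps, which are ignored.
    """
    if mask is None:
        mask = [True] * len(step_outputs)
    kept = [vec for vec, keep in zip(step_outputs, mask) if keep]
    if not kept:
        raise ValueError("no unmasked time steps")
    n_units = len(kept[0])
    return [sum(vec[u] for vec in kept) / len(kept) for u in range(n_units)]

lstm_outputs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last step is padding
pooled_all = mean_over_time(lstm_outputs[:2])
pooled_masked = mean_over_time(lstm_outputs, mask=[True, True, False])
```

Averaging over time turns a variable-length sequence of LSTM outputs into one fixed-length vector, which is what allows a final layer to produce a single score.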

Means for Evaluating System Performance.
The first evaluation index used in this paper is the Pearson correlation coefficient, which measures the correlation between two vectors, here the manual scores and the machine scores. The second evaluation index is the man-machine score difference, which describes the difference between manual and machine scoring. The third evaluation index is accuracy: this paper sets a maximum allowed man-machine evaluation error to determine whether an evaluation result is correct.
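The three indexes can be sketched as follows. The Pearson coefficient follows its standard definition; the score difference is assumed to be the mean absolute difference, and accuracy is assumed to be the share of samples whose man-machine error does not exceed a chosen maximum. The scores and the threshold values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_abs_difference(human, machine):
    """Average absolute man-machine score difference (assumed definition)."""
    return sum(abs(h - m) for h, m in zip(human, machine)) / len(human)

def accuracy(human, machine, max_error):
    """Share of samples whose man-machine error is within max_error."""
    correct = sum(1 for h, m in zip(human, machine) if abs(h - m) <= max_error)
    return correct / len(human)

human_scores = [1.0, 2.0, 3.0, 4.0]
machine_scores = [1.5, 2.0, 2.5, 4.5]

r = pearson(human_scores, machine_scores)
difference = mean_abs_difference(human_scores, machine_scores)
acc_loose = accuracy(human_scores, machine_scores, max_error=0.5)
acc_strict = accuracy(human_scores, machine_scores, max_error=0.25)
```

The two accuracy values show how strongly the third index depends on the chosen error limit, which is why the limit has to be fixed before models are compared.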

Effectiveness Analysis of Feature Extraction.
Table 3 and Table 4 show the Pearson correlation coefficients between the different features and the manual speaking score. As can be seen from the two tables, the most notable speech-type features are numbSilence and speakingRatio. This shows that when grading oral English, teachers are most concerned about the fluency of the speech and the length of effective speaking time. In particular, for fluency, the more pauses there are in the recording, the lower the fluency and the lower the score. For the text-type features, the results show that, for oral content, teachers are more interested in candidates' vocabulary mastery and the richness of the content. The ParsetreedDepth and goodGrammerRatio features are affected by the recognition accuracy of the speech recognition engine. On the other hand, although teachers also make great efforts to analyze the grammatical errors and sentence structures of candidates' speech when scoring manually, the correlation between these two features and manual scoring is low. As shown in Table 5, the BP model performs better than the CNN + LSTM model on the Pearson correlation coefficient and accuracy, and the evaluations of the two models are highly correlated. On the average difference index, CNN + LSTM is slightly better than the BP model.

Conclusion
First, this paper introduces the overall design and scoring process of the scoring system. Then the detailed designs of the speech noise reduction module, speech recognition module, data processing module, and scoring model module are explained in turn. Next, we analyze the experimental results of the oral scoring system and evaluate the performance of the scoring models. Three evaluation indexes are introduced for this purpose: the Pearson correlation coefficient, the average man-machine score difference, and the accuracy of the scoring model. Analysis of the training and evaluation results with these indexes shows that the overall performance of the BP model is higher than that of the CNN + LSTM scoring model when the data set is small. Spoken language scoring models, whether based on deep learning or on other algorithms, score differently under different algorithms, which leads to differences in the resulting scores. Therefore, future work on this problem needs to combine the advantages of different algorithms through fusion research.

Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.