Natural Language Processing with Improved Deep Learning Neural Networks

As one of the core tasks in the field of natural language processing, syntactic analysis has long been an active research topic, and it supports tasks such as question answering (Q&A), search query understanding, semantic analysis, and knowledge base construction. This paper studies the application of deep learning and neural networks to natural language syntactic analysis, which has significant research and application value. We first study a transition-based dependency syntax analyzer that uses a feed-forward neural network as its classifier and, by analyzing the model, carefully tune its parameters to improve performance. We then propose a dependency syntactic analysis model based on a long short-term memory (LSTM) neural network. This model builds on the feed-forward neural network model described above, which is used as a feature extractor. After the feature extractor is pretrained, we use an LSTM network as the classifier of transition actions and feed it the features extracted by the syntactic analyzer, training a recurrent neural network classifier that is optimized at the sentence level. The classifier can use not only the features of the current analysis pattern but also richer information such as the history of analysis states. The model therefore models the analysis process of the entire sentence, replacing the approach of modeling each analysis state independently. The experimental results show that the model achieves a performance improvement over the baseline methods.


Introduction
The study of grammar in computational linguistics refers to the study of the specific structures and rules contained in language, such as finding the rules that govern word order in sentences and classifying words [1]. Linear regularities of this kind can be expressed using methods such as language models and part-of-speech tagging. For the nonlinear information in a sentence, we can use the syntactic structure or the dependency relations between its words. Although this analysis and representation of sentence structure may not be the ultimate goal of a natural language processing problem, it is often an important step toward solving it [2], with important applications in search query understanding [3], question answering (QA) [4], semantic parsing, and other problems.
Therefore, as one of the key technologies in many natural language applications, syntactic parsing [5] has always been an active topic in natural language processing research, and it has significant research and application value.
Syntactic analysis is mainly divided into two types: syntactic structure parsing and dependency parsing [2]. The main purpose of syntactic structure parsing is to obtain the parse tree of a sentence, so it is often referred to as full syntactic parsing or simply full parsing. The main purpose of dependency parsing is to obtain a tree-structured representation of the dependency relations between the words in a sentence, which is called a dependency tree.
In the 1940s, researchers introduced the term "neural network" to describe biological information processing systems [6]. The simplest form, the feed-forward neural network, also known as the multilayer perceptron, achieved good results in many applications, but the high computational complexity of the model made training difficult. With the continuous improvement of computer performance, it became possible to train large-scale, deep neural networks. As a result, deep learning has made major breakthroughs in many fields of machine learning. Deep learning learns intricate structural representations from large-scale data. This learning is achieved by adjusting the network parameters with error-driven optimization algorithms, with the error propagated between the layers of the artificial neural network through backpropagation. In recent years, deep convolutional networks have made great breakthroughs in graphics and image processing, video and audio processing, and other fields. At the same time, recurrent networks have achieved good results on sequence data such as text and speech [7]. The recurrent neural network initially achieved good results in handwritten digit recognition [8]. The well-known word vector algorithm Word2Vec originated from a language model learned with an RNN [9]. Because of the vanishing gradient problem of the recurrent neural network (RNN), Long Short-Term Memory (LSTM) was proposed [10]. With the recent popularity of deep learning methods, LSTM has also been applied to tasks such as dialogue systems [11] and language models [12]. The recently proposed neural network model with an attention mechanism [13] has also attracted the attention of researchers.
This attention mechanism has been successfully applied to machine translation [14] and text summarization [15] with some success. The main contributions of this paper are the following:
(i) We propose a feed-forward neural network in which the parameters propagate unidirectionally
(ii) We use a neural network model as a classifier and use the backpropagation algorithm as the learning algorithm
(iii) We propose a well-organized dataset to evaluate the proposed framework
The rest of the paper is structured as follows: Section 2 describes related work and critically analyzes and compares the work done so far. Section 3 presents the proposed methodology, describing the materials and methods adopted in this study. Section 4 covers the experiments that validate the proposed methodology and discusses the results. Section 5 concludes the work.

Related Work
Concepts such as neural networks originated in the 1940s. After the 1980s, backpropagation was successfully applied to neural networks. In 1989, the backpropagation algorithm was successfully applied to the training of a convolutional neural network. From 2006 onward, graphics processing units were used to train convolutional neural networks, which set off a new wave of neural network research. The early neural network models of the 1940s were very simple: they usually had only one layer and could not learn. It was not until the 1960s that neural networks were used for supervised learning, and the models became slightly more complex, with a multilayer structure. In 1979, Fukushima [16] first proposed the concepts of convolutional neural networks and deep networks. After that, pooling and other related methods were proposed one after another. In 1986, the backpropagation algorithm was proposed by Rumelhart et al. [17], which greatly promoted the development of neural network research. A second factor is the emergence of several public datasets, thanks to which the neural network is no longer a toy model. In the field of computer vision, there is the famous ImageNet [18]. In the field of natural language processing, there are the datasets published by Twitter and, for Chinese, the data of Weibo.
Bengio et al. [19] proposed the use of a recurrent neural network to build a language model. The model uses the recurrent neural network to learn a distributed representation for each word while also modeling the word sequence. This model achieved better results in experiments than the best n-gram models of the same period and can use more contextual information. Bordes et al. [20] proposed a method for learning structured embeddings using neural networks and a knowledge base. Experimental results on WordNet and Freebase show that the method can embed structured information. Mikolov et al. [21] proposed the continuous bag-of-words (CBOW) model, which predicts a word in a sentence from the words around it; the same work also proposes the skip-gram model, which uses the word at a given position in a sentence to predict the words around it. Based on these two models, Mikolov et al. [21] open-sourced the tool word2vec for training word vectors, which has been widely used. Kim [22] introduced the convolutional neural network to the sentence classification task in natural language processing. This work uses a convolutional neural network with two channels to extract features from sentences and then classifies the extracted features. The experimental results show that the convolutional neural network is effective for feature extraction from natural language. Similarly, Lauriola et al. [23] critically studied and analyzed the use of deep learning in natural language processing (NLP) and summarized the models, techniques, and tools used so far. Fathi and Shoja [24] also discuss the application of deep neural networks to natural language processing.
Tai et al. [25] proposed a tree-structured long short-term memory neural network. Traditional recurrent neural networks are usually used to process linear sequences, so for data with internal structure, such as natural language, a linear model may lose some information. Therefore, this model applies long short-term memory networks over the parse tree and has achieved good results in sentiment analysis.
In summary, the key limitations of existing deep learning-based approaches to natural language processing include the following: deep neural network models are difficult to train because they need large amounts of data; training requires powerful and expensive graphics cards; there is no uniform representation for different forms of data, such as text and images; and ambiguity in natural language text must be resolved at the word, phrase, and sentence levels. Moreover, deep learning algorithms are not good at inference and decision making, cannot directly handle symbols, are data-hungry and unsuitable for small datasets, and have difficulty with long-tail phenomena; the black-box nature of the models makes them difficult to understand, and the computational cost of the learning algorithms is high. Apart from these limitations, the strengths of deep neural networks include the following: efficiency in pattern recognition, a data-driven approach, high performance on many problems, little or no domain knowledge needed in system construction, the feasibility of cross-modal processing, and gradient-based learning.

Material and Method
In this section, we are going to discuss the recurrent neural network-based model.

Feed-Forward Neural Network.
As the first proposed neural network structure, the feed-forward neural network is the simplest kind of neural network. Inside it, information propagates unidirectionally from the input layer to the output layer; Figure 1 shows a schematic diagram of a four-layer feed-forward neural network.
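The unidirectional propagation described above can be sketched in a few lines. This is a minimal illustration rather than the network used in the paper: the layer sizes, the weight values, and the sigmoid activation are all assumptions chosen only to make the example runnable.

```python
import math

def dense(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sigmoid(v):
    """Component-wise logistic function (assumed activation)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in v]

def feed_forward(x, layers):
    """Propagate x through (weights, bias) pairs, from input to output only."""
    h = x
    for weights, bias in layers:
        h = sigmoid(dense(h, weights, bias))
    return h

# Hypothetical four-layer network: 2 inputs -> 3 hidden -> 3 hidden -> 1 output.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.2, 0.1, -0.3], [0.0, 0.5, 0.4], [-0.1, 0.2, 0.3]], [0.0, 0.0, 0.0]),
    ([[0.3, -0.4, 0.6]], [0.05]),
]
output = feed_forward([1.0, 0.5], layers)
```

Each layer reads only the output of the layer before it, so there are no cycles and a single pass over `layers` computes the result.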

Recurrent Neural Network.
Recurrent neural networks have been an active research area in neural network research in recent years. They have attracted this attention because the feed-forward neural network, or multilayer perceptron, cannot handle data with time series relationships well. The recurrent structure of the network over time allows it to learn the time series information in the data, so it can solve this kind of task well (see Figure 2).
For each time step, the activation of each hidden layer is computed recursively, for t from 1 to T and n from 2 to N, where T is the length of the input sequence and N is the number of hidden layers. Here W denotes a parameter matrix (for example, $W_{ih^n}$ is the connection weight matrix from the input layer to the nth hidden layer), b is a bias vector, and σ is the activation function.
Given the hidden layer sequences, the output sequence is computed as
$$y_t = \mathcal{Y}(\hat{y}_t), \qquad (2)$$
where $\hat{y}_t$ is the unnormalized output of the network at time t and $\mathcal{Y}$ is the output layer function.
The output vector $y_t$ is used to estimate the probability distribution $\Pr(x_{t+1} \mid y_t)$ of the input $x_{t+1}$ at the next time step.
The loss function $L(X)$ of the entire network is defined from these predicted distributions. As with the feed-forward neural network, the partial derivatives of the loss function with respect to the network parameters can be obtained by backpropagation through time, and gradient descent is then used to learn the parameters of the network, as shown in Figure 3.
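The hidden layer recursion and the loss function are not written out in the text above. One common formulation that is consistent with the symbols defined there ($W_{ih^n}$, $b$, $\sigma$, $y_t$, $\Pr(x_{t+1} \mid y_t)$, $L(X)$) is the stacked recurrent network below. The recurrent and inter-layer weight names $W_{h^n h^n}$, $W_{h^{n-1} h^n}$, $W_{h^n y}$, the bias names, and the negative log-likelihood form of the loss are assumptions, not taken from the paper.

```latex
\begin{align}
h^1_t &= \sigma\left(W_{ih^1} x_t + W_{h^1 h^1} h^1_{t-1} + b^1_h\right), \\
h^n_t &= \sigma\left(W_{ih^n} x_t + W_{h^{n-1} h^n} h^{n-1}_t
          + W_{h^n h^n} h^n_{t-1} + b^n_h\right), \quad n = 2, \dots, N, \\
\hat{y}_t &= b_y + \sum_{n=1}^{N} W_{h^n y} h^n_t, \qquad
y_t = \mathcal{Y}(\hat{y}_t), \\
L(X) &= -\sum_{t=1}^{T} \log \Pr(x_{t+1} \mid y_t).
\end{align}
```

Under this reading, each hidden layer at time t depends on the current input, on the layer below at the same time step, and on its own state at time t-1, which is what lets backpropagation through time unroll the network along the sequence.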
Due to the advantages of recurrent neural networks in time series, in recent years, many researchers in the field of natural language processing have applied recurrent neural networks to research such as machine translation, language model learning, semantic role tagging, and part-of-speech tagging and achieved good results.

Realization of Learning Algorithm and Classification Model.
As an essential part of the syntactic analyzer, the classification model predicts the analysis action, and the learning algorithm learns the parameters of the model from the training data. In this model, we use a neural network as the classifier and, accordingly, the backpropagation algorithm as the learning algorithm. This section introduces the implementation of the classification model; some details of model learning are described later.
The role of the embedding layer of the network is to convert the sparse representation of the features into a dense representation. The embedding layer is divided into three parts: a word embedding layer, a part-of-speech embedding layer, and a dependency arc embedding layer. The three embedding layers take their input from the three corresponding kinds of features in the input layer. It is worth noting that, compared with the size of the dictionary, the value sets of parts of speech and dependency arcs are relatively small, so the dimensions of the part-of-speech and arc embeddings are smaller than the dimension of the word embedding. Specifically, each word feature in the analysis pattern c is mapped to a $d_w$-dimensional vector $e_w \in \mathbb{R}^{d_w}$, and the embedding matrix is $E_w \in \mathbb{R}^{d_w \times N_w}$, where $N_w$ is the dictionary size.
Similarly, part-of-speech features and dependency arc features are mapped to $e_p \in \mathbb{R}^{d_p}$ and $e_l \in \mathbb{R}^{d_l}$. After the conversion is completed, the layer outputs 48 dense features, each of which is a real vector. The hidden layer of the model concatenates the 48 output features of the embedding layer end to end to form a feature vector $x_h$ and applies a linear and then a nonlinear transformation to it. Specifically, the nonlinear transformation is a cubic activation function:
$$h = (W_1 x_h + b_1)^3,$$
where $W_1 \in \mathbb{R}^{d_h \times d_{x_h}}$ is the parameter matrix of the hidden layer, $d_{x_h} = 18 d_w + 18 d_p + 12 d_l$, and $b_1$ is the bias vector. The last layer of the network is the softmax layer, whose role is to predict the probability distribution over analysis actions:
$$p = \operatorname{softmax}(W_2 h + b_2),$$
where $W_2$ is the parameter matrix of the softmax layer, $b_2$ is the bias vector, and τ is the set of all actions in the dependency syntax analysis system, so that p has one entry per action in τ.
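The embedding, hidden, and softmax layers described above can be traced with a small forward pass. The sketch below is illustrative only: the embedding sizes, vocabulary sizes, random weights, and the three-action set are assumptions, and the embedding tables are stored with one row per symbol (the transpose of the $d_w \times N_w$ layout in the text) for convenience. Only the feature counts 18, 18, and 12 and the cubic activation come from the text.

```python
import math
import random

random.seed(0)

D_W, D_P, D_L, D_H = 4, 2, 2, 5        # toy embedding/hidden sizes (assumed)
N_WORD, N_POS, N_ARC = 18, 18, 12      # feature counts from the text (48 total)
ACTIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC"]  # toy unlabeled action set (assumed)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(ids, table):
    """Look up one dense vector per sparse feature id and concatenate them."""
    return [value for i in ids for value in table[i]]

def affine(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy tables: 10 words, 5 part-of-speech tags, 4 arc labels (assumed sizes).
E_w, E_p, E_l = rand_matrix(10, D_W), rand_matrix(5, D_P), rand_matrix(4, D_L)
word_ids = [random.randrange(10) for _ in range(N_WORD)]
pos_ids = [random.randrange(5) for _ in range(N_POS)]
arc_ids = [random.randrange(4) for _ in range(N_ARC)]

# Concatenate the 48 dense features end to end.
x_h = embed(word_ids, E_w) + embed(pos_ids, E_p) + embed(arc_ids, E_l)
d_xh = N_WORD * D_W + N_POS * D_P + N_ARC * D_L

W1, b1 = rand_matrix(D_H, d_xh), [0.0] * D_H
W2, b2 = rand_matrix(len(ACTIONS), D_H), [0.0] * len(ACTIONS)

h = [v ** 3 for v in affine(x_h, W1, b1)]   # cubic activation
p = softmax(affine(h, W2, b2))              # distribution over actions
```

With these toy sizes the concatenated vector has 18·4 + 18·2 + 12·2 = 132 entries, which is the $d_{x_h}$ formula from the text evaluated at $d_w = 4$, $d_p = d_l = 2$.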
After obtaining the probability distribution over analysis actions predicted by the model, the loss function of the network can be calculated. As in general multiclass classification problems, we use the cross-entropy loss function. Because the classification task is to select one correct action from multiple analysis actions, the loss simplifies to a sum over the correct actions together with a regularization term, where A is the set of correct analysis actions for the batch, λ is the regularization parameter, and Θ denotes the model parameters. The classifier in the dependency syntax analyzer is a neural network classifier, and its learning algorithm is the same as that of a general neural network, namely backpropagation. With backpropagation, the gradient of the loss function with respect to the parameters is obtained, and gradient descent is then used to update the parameters of the model.
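As a hedged illustration of the simplified loss and the parameter update, the sketch below computes the negative log-probability of the correct actions plus an L2 penalty, followed by one plain gradient descent step. The exact form of the regularizer (here λ/2 times the squared norm of the parameters) and the helper names `batch_loss` and `sgd_step` are assumptions; the paper's own formula is not reproduced in the text.

```python
import math

def batch_loss(probabilities, gold_actions, parameters, lam):
    """Negative log-likelihood of the gold actions plus an L2 penalty.

    probabilities: one softmax distribution (list of floats) per parser state
    gold_actions:  index of the correct action for each state
    parameters:    flat list of all model weights
    lam:           regularization strength (the 1/2 factor is an assumption)
    """
    nll = -sum(math.log(p[a]) for p, a in zip(probabilities, gold_actions))
    l2 = 0.5 * lam * sum(w * w for w in parameters)
    return nll + l2

def sgd_step(parameters, gradients, learning_rate):
    """One plain gradient descent update of the parameters."""
    return [w - learning_rate * g for w, g in zip(parameters, gradients)]

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = batch_loss(probs, [0, 1], [0.5, -0.5], lam=0.01)
updated = sgd_step([1.0, -2.0], [0.5, -1.0], learning_rate=0.1)
```

Only the probability assigned to the correct action enters the data term, which is why the general cross-entropy sum over all actions collapses to a single logarithm per state.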

Experiments and Discussion
In this section, we are going to discuss the dataset and the experimental setup and evaluate the framework.

Long Short-Term Memory Neural Network.
The recurrent neural network is used to map an input sequence to an output sequence, as in sequence recognition or sequence prediction problems. However, many practical tasks expose the difficulty of training recurrent neural networks, because the sequences in these problems often span long time intervals. As Bengio et al. showed, because the gradient of a recurrent neural network eventually "vanishes," it is difficult for such a network to learn long-distance memory, as shown in Figure 4.
To solve this problem, Hochreiter and Schmidhuber [10] proposed Long Short-Term Memory (LSTM). This model adds the concept of a "gate" so that the network can choose when to "forget" and when to add new "memory." As a variant of the recurrent neural network, the long short-term memory network is designed to solve the vanishing gradient problem of ordinary recurrent neural networks. An ordinary recurrent neural network reads an input vector $x_t$ from a vector sequence $(x_1, x_2, \ldots, x_n)$ and computes a new hidden layer state $h_t$. However, because of the vanishing gradient problem, an ordinary recurrent neural network cannot model long-distance dependencies. Long short-term memory networks introduce a "memory cell" and three "control gates," which control when to "memorize" and when to "forget." Specifically, the long short-term memory network uses an input gate, a forget gate, and an output gate. Among them, the input gate determines the proportion of the current input that can enter the memory cell, and the forget gate controls the proportion of the current memory that should be forgotten.
For example, at time t, the long short-term memory network is updated as follows. Given the input $x_t$, the values of the input gate $i_t$, the forget gate $f_t$, and the candidate memory $\tilde{C}_t$ are computed first, where σ denotes the component-wise logistic function and ⊙ the component-wise product.
The value of the new memory cell and the output value are then computed from these quantities.
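The update formulas themselves do not appear in the text. For reference, a widely used LSTM formulation without peephole connections, written with the gate names used above, is given below. The weight matrices $W_\ast$, $U_\ast$, the biases $b_\ast$, and the use of tanh for the candidate memory and the output are assumptions based on the standard formulation, not on the paper.

```latex
\begin{align}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right), \\
\tilde{C}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right), \\
C_t &= i_t \odot \tilde{C}_t + f_t \odot C_{t-1}, \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right), \\
h_t &= o_t \odot \tanh\left(C_t\right).
\end{align}
```

In this form the memory cell $C_t$ is updated additively: the forget gate scales the old memory and the input gate scales the new candidate, which is what allows gradients to survive over long distances.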

Experimental Data.
Because batch training is required and the analysis sequences of sentences of different lengths differ in length, we adopt a masking method for training. Even so, when some sentences are very long, the other sentences in the batch finish processing and must wait for the long sentence to finish. Therefore, to train the model more quickly, we removed sentences with more than 70 words from the training process. There are 76 such sentences, accounting for 0.2% of the sentences in the training dataset. We believe that this does not affect the quality of the final model. After removing part of the training and validation data, the data actually used are shown in Table 1.
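The length filter and the masking method can be sketched as follows. This is a minimal illustration under assumptions: the function names, the padding value, and the way the loss is averaged are hypothetical, and only the 70-word cutoff comes from the text.

```python
MAX_WORDS = 70  # length cutoff stated in the text

def filter_long(sentences, max_words=MAX_WORDS):
    """Drop sentences with more than max_words tokens."""
    return [s for s in sentences if len(s) <= max_words]

def pad_with_mask(sequences, pad_value=0):
    """Pad action sequences to a common length; the mask is 1 for real steps."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (longest - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask

def masked_mean(step_losses, mask):
    """Average per-step losses over real (unmasked) steps only."""
    total = sum(loss * m
                for row, mask_row in zip(step_losses, mask)
                for loss, m in zip(row, mask_row))
    return total / sum(sum(mask_row) for mask_row in mask)

sentences = [["w"] * 5, ["w"] * 71, ["w"] * 70]
kept = filter_long(sentences)
padded, mask = pad_with_mask([[3, 1, 2], [4]])
mean_loss = masked_mean([[1.0, 2.0, 3.0], [2.0, 9.0, 9.0]], mask)
```

The padded positions still occupy computation in the batch, which is why a few very long sentences slow down every batch they appear in even though the mask removes them from the loss.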

Evaluation Index.
The analysis of phrase structure is usually evaluated with precision, recall, and the F1 value.
Recall Rate. The recall rate in phrase structure analysis is the ratio of the number of correct phrases in the analysis result to the total number of phrases in the test set:
$$R = \frac{\text{Number of correct phrases in the analysis result}}{\text{Total number of phrases in the test set}}.$$
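The three measures can be computed together from the sets of predicted and gold phrases. The sketch below uses the standard definitions (precision divides the number of correct phrases by the number of phrases in the analysis result, and F1 is the harmonic mean of precision and recall); the representation of a phrase as a `(label, start, end)` tuple and the example data are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted phrases against gold phrases.

    Each phrase is a hashable tuple such as (label, start, end).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5), ("NP", 3, 5)]
pred = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5), ("PP", 3, 5), ("NP", 4, 5)]
p, r, f1 = precision_recall_f1(pred, gold)
```

Here three of the five predicted phrases are correct and three of the four gold phrases are found, so precision is 3/5 and recall is 3/4.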

Experimental Results and Analysis.
In addition to the comparison with the baseline method, this work is also compared with two other classic dependency parsers: Malt Parser and MST Parser. For Malt Parser, we used the stackproj and nivreeager options for training, which correspond to the arc-standard and arc-eager analysis algorithms, respectively. For MST Parser, we report the results from Chen and Manning (2014). The test results are shown in Table 2.
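For readers unfamiliar with the arc-standard algorithm mentioned above, the sketch below applies a sequence of unlabeled arc-standard actions to a sentence and collects the resulting dependency arcs. It illustrates the transition system only: the action names, the example sentence, and the action sequence are assumptions, and in a real parser a classifier chooses each action.

```python
def arc_standard(words, actions):
    """Apply arc-standard transitions and return the set of (head, dependent) arcs.

    Token 0 is an artificial ROOT; the words are indexed from 1.
    """
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), set()
    for action in actions:
        if action == "SHIFT":            # move the next buffer token onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":       # second-from-top becomes a dependent of the top
            dependent = stack.pop(-2)
            arcs.add((stack[-1], dependent))
        elif action == "RIGHT-ARC":      # top becomes a dependent of the second-from-top
            dependent = stack.pop()
            arcs.add((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs

words = ["She", "reads", "books"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
arcs = arc_standard(words, actions)
```

A sentence of n words is parsed in exactly 2n actions (n shifts and n arc actions), which is the analysis sequence that the classifier has to predict step by step.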
Figure 4: Ordinary recurrent neural networks cannot handle long-distance dependencies.

Table note: A is the total number of sentences; B is the number of projective sentences; C is the percentage of projective sentences; D is the number of sentences of up to 70 words; E is the ratio of the sentences used to the projective sentences.

It can be seen from the table that the dependency syntax analyzer based on the long short-term memory neural network is effective at modeling the analysis sequence of a sentence. The model achieves 91.9% UAS and 90.5% LAS on the development set of the Penn Treebank, an improvement of about 0.7% over the greedy neural network dependency parser used as the baseline. On the test set, our model achieves a UAS of 90.7% and an LAS of 89.0%, an improvement of about 0.6% over the greedy neural network dependency parser of the baseline method. Compared with the most representative transition-based dependency parser, Malt Parser, our method shows a relative improvement of about 1.4%; compared with the well-known graph-based MST Parser, our model obtains an improvement of 0.5% on the development set, while on the test set the UAS is comparable and the LAS is improved by 1.4%. The experimental results show that the dependency syntax analysis model based on the long short-term memory neural network performs better than the greedy feed-forward neural network. Unlike the greedy model, this model uses long short-term memory networks to model the entire sentence and can use historical analysis information and historical pattern information to help classify analysis actions, thereby improving the performance of the dependency syntax analyzer. The results of testing on the Penn Treebank are shown in Table 3. In the testing process, beam search is used, with a beam size of 12.
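The UAS and LAS figures reported above are attachment scores. A minimal sketch of how they are computed is given below; the representation of each token as a `(head_index, relation_label)` pair and the example data are assumptions, and details such as whether punctuation is excluded are not specified here.

```python
def attachment_scores(predicted, gold):
    """Return (UAS, LAS) in percent.

    predicted, gold: one (head_index, relation_label) pair per token.
    UAS counts tokens with the correct head; LAS also requires the correct label.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    head_hits = sum(p[0] == g[0] for p, g in zip(predicted, gold))
    full_hits = sum(p == g for p, g in zip(predicted, gold))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
uas, las = attachment_scores(pred, gold)
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS, which matches the pattern in the reported numbers.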
It can be seen from the data in the table that the dual attention mechanism can effectively reduce the number of errors in the output results. On the valid outputs, the F1 value of the model reached 0.827, and its change over the training process is shown in Figure 5. The change in the various error counts over the training process is shown in Figure 6.
By linearizing the phrase structure tree of a natural language sentence, the phrase structure analysis task is transformed into a sequence-to-sequence conversion task. We carried out a simple implementation of the sequence-to-sequence model and found that end-to-end analysis still needs rule-based restrictions on the decoder side. To this end, we propose a dual attention mechanism model, that is, a sequence-to-sequence model that introduces attention mechanisms at the input and the output at the same time. Experiments show that after the introduction of the dual attention mechanism, the performance of the model on the test set is greatly improved, as shown in Table 4.
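To make the linearization step concrete, the sketch below turns a phrase structure tree into a flat bracket sequence that a sequence-to-sequence model could emit. The specific scheme (labeled opening and closing brackets, with words replaced by their part-of-speech tags) is an assumption chosen for illustration; the text does not state which linearization is used.

```python
def linearize(tree):
    """Turn a nested (label, child, ...) tuple into a bracket token sequence.

    A preterminal is a (tag, word) pair; it is emitted as its tag only, so the
    output contains labels and brackets but no words (an assumed convention).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [label]
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens

tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP", ("VBZ", "reads"), ("NP", ("NNS", "books"))))
sequence = linearize(tree)
```

A decoder that produces such a sequence freely can emit unbalanced brackets or the wrong number of tags, which is one way to understand why decoder-side restrictions are still needed.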

Conclusions
Syntactic analysis is an indispensable part of tasks such as question answering, search query understanding, semantic analysis, and knowledge base construction. This paper studies a neural network model for transition-based dependency syntactic analysis. The model uses a feed-forward neural network as the classifier in the dependency syntax analyzer and tunes its parameters through analysis of the model to achieve better results. The experimental results show that, after this improvement, the performance of the model increases by 0.1 to 0.2 percentage points. We also propose a dependency syntax analysis model based on long short-term memory neural networks. This model builds on the feed-forward neural network model, which is used as a feature extractor. Specifically, the model exploits the properties of the long short-term memory network to memorize the analysis state and analysis history during transition-based dependency syntactic analysis, so that it can capture and use more historical information. In addition, the model models the analysis process of the entire sentence, improving on the greedy model, which models each analysis state independently. The experimental results show that, compared with the baseline method, the model obtains an improvement of 0.6 to 0.7 percentage points.
Through this work and the error analysis, we found that the attention mechanism can be introduced into the model, and the dependency syntax analysis model based on the long short-term memory neural network can be studied further in this direction.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.

Figure legend (number of errors): wrong word count, tree structure error, total number of errors.