Key n-Gram Extraction and Analysis of Different Registers Based on an Attention Network



Introduction
A register refers to the characteristic vocabulary, sentence structure, rhetorical devices, and other features of the language people use when communicating in various social activities, with different audiences, and in different environments [1]. With the development of the information age, a great deal of register information is produced in production and daily life, and registers play a pivotal role in various Internet applications. To better realize automatic text processing, we need to distinguish different registers. As components of texts, the words in a sentence carry rich semantic information and play very important roles in distinguishing different registers. However, previous studies have demonstrated that n-gram features yield better results than single words in register classification tasks [2].
Key n-gram extraction can be thought of as extracting the n-grams that distinguish different registers.
The existing models are mainly based on words, and there are few studies on the extraction of key n-grams (n ≥ 2). Many keyword extraction models have been put forward and have achieved significant effects, thanks to the development of deep learning models and attention mechanisms [3][4][5][6]. To some extent, each feature extraction model has its own advantages and disadvantages.
In terms of keyword extraction, previous scholars have mainly proceeded from two aspects: feature extraction and model design. The features extracted are mainly word frequency, term frequency-inverse document frequency (TF-IDF), latent Dirichlet allocation (LDA), synonym sets, NP phrases, syntactic information, word2vec, or other domain-specific knowledge, such as tags and candidate keywords [7]. The models designed for these features mainly fall into three categories: statistical language models, graph-based models, and machine learning models.
1.1. Statistical Language Models. These models combine linguistic knowledge with statistical methods to extract keywords. Such keyword extraction models are based on word frequency, POS, lexical chains, n-grams, etc. [6, 8, 9]. The advantages of these methods are simple implementation and effective extraction of keywords. Unfortunately, the features chosen by these methods are often based on frequency or counts, without considering the semantic relationships between words and sentences.
These methods lead to several problems. High-frequency words are sometimes not keywords; for example, many stop words appear very often (in Chinese, most stop words are auxiliary words), but they are not important words for registers. Even though some models select high-frequency words after removing stop words, they are still not accurate in capturing the semantics of the registers. This is especially true in novels with a lot of dialogue: in conversation, many words are omitted according to the context, so if the stop words are removed, the meaning of a sentence can change completely. Similar problems exist in features based on TF-IDF.
1.2. Graph-Based Models. Compared with statistical language models, these models map linguistic features to graph structures, in which the words in sentences are represented by nodes and the relationships between words by edges. The linguistic problems are thereby transformed into graph problems, and graph algorithms are applied to feature extraction. In recent years, researchers have tried to use graphical models to mine keywords from texts. Biswas et al. proposed a graph-based method using collective node weights to extract keywords; their model determines the importance of keywords by calculating impact parameters, such as centrality, position, and neighborhood strength, and then chooses the most important words as the final keywords [10]. Zhai et al. proposed a keyword extraction method that constructs a bilingual word set, takes its elements as vertices, and uses the attributes of Chinese-Vietnamese sentences and bilingual words to construct a hypergraph [11].
These graph-based models transform abstract sentences into intuitive graphs for processing and use graph algorithms to extract keywords. The disadvantage is that these algorithms rest on substantial graph theory, which requires researchers to have strong knowledge of both linguistics and graph theory; only then can the two theories be well connected. Besides, the graphs built from texts usually have thousands or even millions of nodes and relations, which brings efficiency problems to the graph algorithms.

1.3. Machine Learning Models.
With the development of the Internet and the growth of corpus sizes, there is more and more corpus-based research [12, 13], and using machine learning models to mine the internal laws of corpora is an inevitable trend. Many scholars have employed machine learning models to extract keywords. Uzun proposed a Bayesian method that extracts keywords according to the frequency and position of words in the training set [14]. Zhang et al. proposed extracting keywords from global and local context information with an SVM [15]. Compared with statistical language models, these early machine learning algorithms based on word frequency, position, and global and local context made significant improvements in feature extraction. From the features selected by these models, it is clear that scholars tried to consider feature selection from more aspects; it is just that these features had to be extracted manually [14][15][16][17]. With the development of computer hardware and neural networks, more complex and efficient models emerged, namely deep learning models, along with various feature representation methods such as word2vec and doc2vec. Many scholars began to use deep learning models to extract keywords. Wang and Zhang proposed a method based on a combined model, a bidirectional long short-term memory (LSTM) recurrent neural network (RNN), which achieved outstanding results [3][4][5]. Keyword extraction based on deep learning not only improved the accuracy of keyword extraction significantly but also enriched the corresponding feature representations. The disadvantage is that models like the LSTM have high hardware requirements and generally need a long time to train.
The attention mechanism was proposed by Bahdanau et al. in 2014 [18]. Models with attention are widely used in various fields for their transparency and effectiveness in aggregating features. Bahdanau et al. applied the attention mechanism to a machine translation system, improving its accuracy significantly; in this process, attention was used to extract the important words in sentences [18]. Pappas and Popescu-Belis proposed a document classification method that applies the attention mechanism to extract the words distinguishing different documents, greatly improving classification accuracy [19]. This improvement implies that the words extracted by the attention mechanism can distinguish different documents well. Applications of the attention mechanism in other fields also support this point [20].
By analyzing and summarizing the advantages and disadvantages of these models, we propose a simple and efficient model based on the attention mechanism and a multilayer perceptron (MLP) to extract key n-grams that can distinguish different registers. We call this model the "attentive n-gram network" (ANN) for short; its structure is shown in Figure 1. The ANN consists of eight parts: the input layer, the embedding layer, the n-gram vectors, the attention layer, the n-gram sentence vectors, concatenation, classification, and the output. In other words, the input layer is the sentence we want to classify, the embedding layer vectorizes the words in the sentence, and the n-gram vectors convert the word vectors into the corresponding n-gram representations. The attention layer scores the n-grams in the sentence. The n-gram sentence vector is a weighted sum of the n-gram vectors, using the attention-layer outputs as weights, to form a sentence vector. Concatenation joins the sentence vectors from n-grams with different n as inputs to the classifier. Classification is a classifier, and the output layer includes three parts: the category of the sentence, the n-grams, and the corresponding n-gram scores. Figure 1 illustrates this with an example. Experimental results show that our ANN achieves significant and consistent improvements compared with the baseline models. In particular, our work contributes the following:

(1) Using the attention mechanism to extract key n-grams which can distinguish different registers

(2) Compared with machine learning methods such as SVM and Bayesian models, significantly improved classification accuracy by using the ANN based on semantic information

(3) During training, the attention mechanism assigns low scores to stop words, which automatically filters them out

2. Methodology

2.1. Attentive n-Gram Network.
In computer science, deep learning has become a popular method and has shown powerful modeling ability in many areas, such as computer vision and natural language processing [21]. Among the basic neural network design patterns, there is a structure called the attention mechanism, which can automatically analyze the importance of different pieces of information. In natural language processing, for example in machine translation, people use the attention mechanism to identify the key source words [18]. In our case, the task is to analyze which keywords or 2-gram phrases carry key information for differentiating registers. We conduct a classification task on texts of different registers and apply the attention mechanism to the words: the attention mechanism calculates the importance of the words for identifying the register, and higher weights are assigned to the words that help the classification task. Words with higher weights are more important to the register, in contrast to those that appear in every register, e.g., stop words.
Formally, suppose we have a word dictionary W = {w_1, w_2, ⋯, w_n} and a set of sentences S = {s_1, s_2, ⋯, s_m}. Each sentence s_i consists of a sequence of words s_i = {w_{i,1}, w_{i,2}, ⋯, w_{i,l_i}}, where l_i = |s_i| is the length of the sentence s_i. Here, we highlight vectors in bold; for example, w_i and s_i are the vector of the word w_i and the sentence vector of the sentence s_i. The word vectors in our model can be randomly initialized or pretrained with word2vec, which corresponds to the embedding layer in Figure 1.

Figure 1: The overview of the attentive n-gram network (ANN) structure with 1,2-grams. As an example, the input sentence is s_i = {这是我妹妹。} (This is my younger sister.). The embedding layer converts the words in s_i into the corresponding vectors s_i = {w_{i,1}, w_{i,2}, w_{i,3}, w_{i,4}, w_{i,5}}. Then, the n-gram vectors come from the concatenation of these word vectors; in the figure, the 2-gram vectors s^2_i = {g^2_{i,1}, g^2_{i,2}, g^2_{i,3}, g^2_{i,4}} come from the concatenation of every two adjacent word vectors. When k = 1, the attention layer scores each word w_{i,j} in the sentence s_i to obtain the score a^1_{i,j}; similarly, when k = 2, the score corresponding to each 2-gram g^2_{i,j} in s_i is a^2_{i,j}. Weighting the sum of all w_{i,j} (g^2_{i,j}) by the weights a^1_{i,j} (a^2_{i,j}), we obtain the sentence vector s^1_i (s^2_i). In the concatenation layer, s^1_i and s^2_i are concatenated to form the sentence vector s_i, which is the input to the MLP classifier. After the classification layers, we get the output, the probabilities of the sentence belonging to each category. The symbols are described in detail in Section 2.1.

2.1.1. Attention Mechanism on n-Grams. The attention mechanism in our model takes n-gram vectors as inputs and returns sentence vectors as outputs. In particular, the vector of a k-gram is formed by the concatenation of k word vectors. For example, the sentence s_i = {w_{i,1}, w_{i,2}, ⋯, w_{i,l_i}} can also be represented as the k-grams s_i = {g^k_{i,1}, g^k_{i,2}, ⋯, g^k_{i,l_i+1−k}}, where g^k_{i,j} is the jth k-gram of the sentence; its vector corresponds to the n-gram vectors in Figure 1. Then, the attention network first uses a fully connected layer to calculate the latent attention vector u^k_{i,j} of each k-gram g^k_{i,j}:

u^k_{i,j} = tanh(A_k g^k_{i,j} + b_k),  (1)

where A_k ∈ ℝ^{t×kv} and b_k ∈ ℝ^t are the parameters of the attention network, t denotes the hidden layer size of the attention network, and v is the size of the word vectors (so kv is the size of the k-gram vectors). tanh is the activation function, which introduces nonpolynomial factors into the neurons [22]. It has been proved that multilayer feed-forward networks with a nonpolynomial activation function can approximate any function [23]; therefore, it is common to apply a nonpolynomial activation function after a fully connected layer. The vector u^k_{i,j} is the hidden attention vector, which contains the word-importance information. Then, a weighted sum is conducted over the latent attention vector:

ũ^k_{i,j} = h_k^⊤ u^k_{i,j},  (2)

where h_k ∈ ℝ^t holds the weights over the dimensions of u^k_{i,j} and is also a parameter of the attention network. The result ũ^k_{i,j} is the score the attention mechanism gives to the k-gram g^k_{i,j}. Note that ũ^k_{i,j} ∈ (−∞, ∞); if the scores of different k-grams were used directly in a weighted sum over the k-gram vectors to form the sentence vector, the length and scale of the sentence vectors would be out of control. To normalize the weights, a softmax function is applied over all ũ^k_{i,j} (in mathematics, the softmax function, also known as softargmax [24] or the normalized exponential function [25], takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities):

a^k_{i,j} = exp(ũ^k_{i,j}) / Σ_{j′} exp(ũ^k_{i,j′}),  (3)

where a^k_{i,j} ∈ (0, 1) is the attention weight of the k-gram g^k_{i,j} in the sentence s_i. Note that Σ_j a^k_{i,j} = 1.
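As a concrete illustration of the attention scoring above (latent vector, raw score, softmax), here is a minimal NumPy sketch; the dimensions and the randomly initialized parameters are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (our own choices): v = word vector size, t = attention hidden size.
v, t, k, sent_len = 4, 3, 2, 5
words = rng.normal(size=(sent_len, v))            # word vectors w_{i,j}

# k-gram vectors: concatenation of k adjacent word vectors (dimension kv).
grams = np.stack([np.concatenate(words[j:j + k])
                  for j in range(sent_len + 1 - k)])

# Attention parameters A_k, b_k, h_k (randomly initialized here).
A, b, h = rng.normal(size=(t, k * v)), rng.normal(size=t), rng.normal(size=t)

u = np.tanh(grams @ A.T + b)                      # eq. (1): latent attention vectors
scores = u @ h                                    # eq. (2): raw k-gram scores
a = np.exp(scores - scores.max())
a /= a.sum()                                      # eq. (3): softmax attention weights
```

Subtracting `scores.max()` before exponentiating is the standard numerically stable softmax and does not change the resulting weights.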
Each n-gram in a sentence can be scored by equations (1), (2), and (3). For example, for s_i = {这是我妹妹。} in Figure 1, when k = 1, the attention layer produces a score a^1_{i,j} for each word in the sentence s_i. In other words, equations (1), (2), and (3) constitute the attention layer in Figure 1, through which the n-grams in a sentence are scored.
The k-gram sentence vector s^k_i is formed as follows:

s^k_i = Σ_j a^k_{i,j} g^k_{i,j}.  (4)

In general, the k-gram sentence vector s^k_i comes from a weighted sum of the k-gram vectors g^k_{i,j}, but the weights are dynamically generated by the attention network: different k-grams have different weights in different sentences, and the attention network learns to evaluate their importance, returning the weights during training.
Specifically, when k = 1, the sentence vector s^1_i = Σ_j a^1_{i,j} w_{i,j} is a weighted sum of the word vectors w_{i,j}. To take different n-grams into consideration, we concatenate the sentence vectors s^k_i for different k. For example, in our experiments, when considering both words and 2-grams, i.e., ANN (1,2-gram), the final sentence representation is

s_i = s^1_i ⊕ s^2_i,  (5)

where ⊕ denotes vector concatenation. This part corresponds to the concatenation in Figure 1. The final sentence vectors s_i are then fed into higher layers for register classification.
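Putting the pieces together, the k-gram sentence vectors and their concatenation can be sketched as follows; the toy dimensions are our own, and `attention_weights` is a hypothetical stand-in for the trained attention network:

```python
import numpy as np

rng = np.random.default_rng(1)
v, sent_len = 4, 5
words = rng.normal(size=(sent_len, v))            # word vectors of sentence s_i

def attention_weights(grams, t=3, seed=2):
    """Toy attention scorer with random parameters (stands in for the
    trained attention network)."""
    prng = np.random.default_rng(seed)
    A = prng.normal(size=(t, grams.shape[1]))
    b, h = prng.normal(size=t), prng.normal(size=t)
    scores = np.tanh(grams @ A.T + b) @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()

def kgram_sentence_vector(words, k):
    grams = np.stack([np.concatenate(words[j:j + k])
                      for j in range(len(words) + 1 - k)])
    return attention_weights(grams) @ grams       # weighted sum of k-gram vectors

s1 = kgram_sentence_vector(words, 1)              # 1-gram sentence vector (size v)
s2 = kgram_sentence_vector(words, 2)              # 2-gram sentence vector (size 2v)
s = np.concatenate([s1, s2])                      # ANN (1,2-gram) representation
```

The concatenated vector `s` plays the role of the final sentence representation fed to the classifier.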

Register Classification.
We utilize a multilayer perceptron (MLP) [21] to classify the registers. An MLP is a kind of feed-forward artificial neural network consisting of at least three layers of nodes: the input layer, the hidden layer, and the output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP is trained with the supervised learning technique called backpropagation to distinguish different registers. In our paper, this module takes the sentence vector s_i as input and returns the probabilities of s_i coming from different registers as output; it corresponds to the classification part in Figure 1, whose structure is shown in the red part of Figure 1.
Although the task of our model is register classification, what we really focus on is the different keywords used in different registers, represented by the attention weights. It is not necessary to design a complex or elaborate classification module, because what we want is a powerful attention network, as mentioned in Section 2.1.1. Suppose that C is the set of all registers and |C| is the number of classes. In an efficient and effective way, we use two fully connected layers to build the classification module:

p_i = B_2 tanh(B_1 s_i + c_1) + c_2,  (6)

where B_1 ∈ ℝ^{t×v_s}, c_1 ∈ ℝ^t, B_2 ∈ ℝ^{|C|×t}, and c_2 ∈ ℝ^{|C|} are the model parameters, v_s is the size of the sentence vector s_i, and t is the size of the hidden layer. Then, p_i has size |C|; its entries are the unnormalized probabilities of s_i belonging to the different registers. To normalize the probabilities, a softmax layer is applied to p_i:

p(c_j | s_i) = exp(p^{(j)}_i) / Σ_{j′} exp(p^{(j′)}_i),  (7)

where p^{(j)}_i is the jth value of p_i and p(c_j | s_i) is the probability of the sentence s_i belonging to the class c_j. To give the prediction, we let the c_j with the maximum p(c_j | s_i), i.e., argmax_{c_j} p(c_j | s_i), be the predicted class.
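A minimal NumPy sketch of this two-layer classification module follows; the parameter names, toy sizes, and random initialization are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
v_s, t, n_classes = 12, 8, 3     # sentence vector size, hidden size, |C| (toy values)
s_i = rng.normal(size=v_s)       # sentence vector from the attention layers

# Two fully connected layers (parameters randomly initialized here).
W1, b1 = rng.normal(size=(t, v_s)), rng.normal(size=t)
W2, b2 = rng.normal(size=(n_classes, t)), rng.normal(size=n_classes)

p_i = W2 @ np.tanh(W1 @ s_i + b1) + b2    # unnormalized class scores
e = np.exp(p_i - p_i.max())
probs = e / e.sum()                        # softmax: p(c_j | s_i) over registers
predicted = int(np.argmax(probs))          # index of the predicted register
```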
The model is trained with the cross-entropy loss [26], a well-known classification loss in machine learning:

L = −Σ_i log p(y_i | s_i),  (8)

where y_i is the label (true class) of the sentence s_i. The loss function is used to optimize the model; usually, the closer the loss is to 0, the better the model. In our work, we use the Adam optimizer [27].
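For a single sentence, the cross-entropy loss reduces to the negative log probability assigned to the true class; a tiny worked example with made-up probabilities:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.1])   # predicted p(c_j | s_i) over three registers
y = 0                               # index of the sentence's true register
loss = -np.log(probs[y])            # cross-entropy for this example, -log 0.7
```

If the model assigned probability 1 to the true class, the loss would be exactly 0, which is why lower loss indicates a better fit.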

Key n-Gram Extraction.

After training, the ANN gives low weights to most of the n-grams in a sentence, because these n-grams are common ones that occur frequently in many sentences and are therefore not representative. Suppose that a feature f_i occurs in m documents d_1, ⋯, d_m and its weights in these m documents are α^1_i, ⋯, α^m_i. Then, the feature importance β_i can be computed as a length-weighted average of the α_i:

β_i = (1/m) Σ_{j=1}^{m} l_j α^j_i,  (9)

where l_j is the number of input features of the document d_j (e.g., the number of words when the input features are words). The factor l_j normalizes the importance, because getting a high weight in a long document is more difficult. The features can then be sorted by the importance β_i, and the features whose importance exceeds a predefined threshold of 1.0 are selected.
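The importance computation and threshold selection can be sketched as below; the features, attention weights, and document lengths are made-up examples, and the length-weighted average follows the description above:

```python
def feature_importance(occurrences):
    """occurrences maps a feature to its (attention weight, document length)
    pairs across the m documents it appears in; importance is the average of
    length-scaled weights, matching the normalization described above."""
    return {f: sum(w * l for w, l in pairs) / len(pairs)
            for f, pairs in occurrences.items()}

# Hypothetical weights and document lengths, for illustration only.
occ = {"经济": [(0.05, 40), (0.10, 20)],
       "的":   [(0.01, 40), (0.01, 20)]}
beta = feature_importance(occ)
selected = [f for f, b in beta.items() if b > 1.0]   # predefined threshold 1.0
```

Here the content word scores 2.0 and survives the threshold, while the stop word scores 0.3 and is filtered out.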

Type/Token Ratio (TTR).

The TTR is the ratio of types to tokens in a corpus. The type count refers to how many different words are in the text; the token count refers to how many running words are in the text. To some extent, the ratio of types to tokens reflects the richness of the vocabulary of the text. But the TTR calculated this way is influenced by the length of the text, so here we use a modified measure, Herdan's log TTR [28, 29]:

Herdan's log TTR = log(number of types) / log(number of tokens).  (10)

In our experiments, we calculate Herdan's log TTR of the different registers to measure their richness, namely Herdan's log TTR_novel, Herdan's log TTR_news, and Herdan's log TTR_textbook.

The news corpus is a public dataset (https://www.sogou.com/labs/resource/list_yuliao.php) covering ten topics: domestic, international, sports, social, stock, hot spots, education, health, finance, and real estate. It is a public corpus that many scholars have used in their research.
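Herdan's log TTR is straightforward to compute; a minimal sketch with a toy segmented token list:

```python
import math

def herdan_log_ttr(tokens):
    """Herdan's log TTR: log(number of types) / log(number of tokens)."""
    return math.log(len(set(tokens))) / math.log(len(tokens))

# Toy segmented text: 8 tokens, 6 types.
tokens = "我 喜欢 看 书 我 喜欢 写 字".split()
ratio = herdan_log_ttr(tokens)        # log 6 / log 8 ≈ 0.86
```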
The statistics of these datasets are shown in Table 1. The training and test data are 80% and 20% of the novel, news, and text book corpora, which we use to train and test the models, respectively. Moreover, to train the model in the best way, we divided the datasets into different proportions; that is, the training and test sets were split 0.7 : 0.3, 0.8 : 0.2, and 0.9 : 0.1, respectively. We found that the accuracy of the model is highest, as shown in Table 2, when the ratio of the training set to the test set is 0.8 : 0.2.
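A random 0.8 : 0.2 split like the one described above can be sketched as follows (the seed and sentence placeholders are arbitrary choices for reproducibility):

```python
import random

def split_corpus(sentences, train_ratio=0.8, seed=42):
    """Randomly split sentences into training and test sets (0.8 : 0.2 by default)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

data = [f"sentence {i}" for i in range(100)]
train, test = split_corpus(data)      # 80 training and 20 test sentences
```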

Research Procedures.
Our experiments are divided into the following steps, as shown in Figure 2. Next, we describe each part of the flow chart in Figure 2 in detail.
(1) Corpus preprocessing includes corpus collection, preprocessing, and corpus vectorization. Preprocessing uses toolkits, including Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/), to clean the data and segment words. Corpus vectorization translates each sentence into the corresponding sequence of word indices. The result corresponds to the input layer shown in Figure 1

(2) The model consists of two parts, the attention mechanism and the MLP classifier. The attention mechanism scores every n-gram in every sentence using equations (1), (2), and (3), as described in Section 2.1. The MLP classifier performs the register classification. These two parts correspond to the attention layer and classification in Figure 1, and their working process is described in Figure 1

(3) Key n-gram extraction averages the scores of the n-grams in each register, whether they appear once or many times. When an n-gram appears in all three registers at the same time, it is regarded as a key n-gram of the register in which it has the highest score

(4) Key n-gram analyses are composed of key 1,2-gram clustering and key n-gram analyses. Key 1,2-gram clustering clusters the key n-grams extracted in the previous step. Key n-gram analyses then carries out both linguistic analyses on the clustering results and statistical analyses on the extracted key n-grams

In addition, the most important task is to find out the key n-grams of each register.
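The corpus vectorization in step (1) — mapping each segmented sentence to a fixed-length sequence of word indices — can be sketched as follows; the padding convention and example sentences are our own assumptions:

```python
def build_vocab(sentences):
    """Map each distinct word to an integer id; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sent in sentences:
        for w in sent:
            vocab.setdefault(w, len(vocab))
    return vocab

def vectorize(sent, vocab, max_len):
    """Translate a segmented sentence into a fixed-length list of word ids."""
    ids = [vocab[w] for w in sent][:max_len]
    return ids + [0] * (max_len - len(ids))

sents = [["这", "是", "我", "妹妹", "。"],
         ["我", "喜欢", "看", "书", "。"]]
vocab = build_vocab(sents)
x = [vectorize(s, vocab, max_len=8) for s in sents]
```

The id sequences in `x` are what the embedding layer consumes.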

Experimental Settings.
To train our model, we employ grid search to select the best combination of parameters. These parameters include the learning rate ∈ {0.001, 0.01, 0.1, 1, 10} and the batch size ∈ {32, 64, 128, 256, 512}. Also, our inputs are sentence vectors, so we need to set the length of each sentence. According to Figure 3(a), the first quartile of the sentence length is 10, the average sentence length is 20, the third quartile is 40, and the longest sentence is 128. Since our corpus is composed of three registers, we also calculate the average sentence length of each register as a reference value for choosing the sentence length parameter. From Figure 3(b), the average sentence lengths of the novel and the text book are close to 20 and the average sentence length of the news is close to 30. Hence, we set the sentence length ∈ {10, 20, 30, 40, 50, 80, 100, 120, 130}. The vocabulary size is the total number of words in the three registers, and the word vector size is 32. The best combination of parameters is shown in black bold. For the other parameters, which have less impact on our model, we adopt the default values. To reduce the impact of the different corpus sizes, we use random sampling and take corpora of equal size as the training and test sets for our model. We adopt accuracy to evaluate our model: accuracy measures how many sentences are assigned to their true register. Besides, when n = 1, which means keyword extraction, we use "keywords" instead of "1-grams"; this structure is the green part of Figure 1. When n = 2, we use 2-grams, shown in the blue part of Figure 1.
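The grid search over the parameter sets above can be sketched with `itertools.product`; `evaluate` is a hypothetical stub standing in for training the ANN once per configuration and returning its test accuracy:

```python
import itertools

# Hypothetical stub: in practice this would train the ANN with the given
# parameters and return its accuracy on the test set.
def evaluate(lr, batch_size, sent_len):
    return 1.0 / (abs(lr - 0.01) + abs(batch_size - 128) + abs(sent_len - 30) + 1)

grid = itertools.product(
    [0.001, 0.01, 0.1, 1, 10],                    # learning rate
    [32, 64, 128, 256, 512],                      # batch size
    [10, 20, 30, 40, 50, 80, 100, 120, 130],      # sentence length
)
best = max(grid, key=lambda params: evaluate(*params))
```

With the toy stub, the search returns the configuration closest to its built-in optimum; with a real `evaluate`, it returns the highest-accuracy combination.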

Results and Result Analyses.

In Figures 4(a) and 4(b), "number" refers to the cumulative number of features in a certain range and "density" means the number of features in a certain interval. From Figure 4(a), we find that the distribution of keyword scores is mainly concentrated in the interval [0.1, 0.5]. Our experiments are divided into three groups, namely ANN (keywords), ANN (2-grams), and ANN (1,2-grams), whose structures consist of the green and the blue parts of Figure 1. Through these models, we extract the keywords and 2-grams for the novel, news, and text book. The experimental results are compared with the baseline in Table 2 on the training and test sets.

Linguistic Analyses.
We mainly analyze the differences among the three registers from four aspects: corpus content analyses, lexical richness analyses, keyword analyses, and key 2-gram analyses.

Corpus Content Analyses.
To better understand the key n-grams of the different registers, we mainly analyze the content characteristics of the novel, news, and text book. Their specific statistics are shown in Table 1.
(i) Novel. We choose Mo Yan's and Yu Hua's novels as our collection of novels. Mo Yan is the 2012 Nobel Prize winner, whose novels often use bold and colorful words. The sentences in his novels are rich in style, containing long sentences, compound sentences, and simple sentences, and he describes things in an unconstrained way. The language of Yu Hua's novels is profoundly influenced by Western philosophical language; its traits are simplicity, vividness, fluidity, and dynamism

(ii) News. As a register for recording and reporting facts, the news usually has several characteristics, such as authenticity, timeliness, and correctness. Authenticity means that the content must be accurate. Timeliness means that the content is time limited. Correctness means that the reporting of time, place, and characters must be consistent with the facts

(iii) Text Book. As a register for imparting knowledge to students, the text book focuses on training students' listening, speaking, reading, and writing skills, with the aim of broadening their vision and scope of knowledge. So there are many kinds of articles in a text book, mainly prose, novels, inspirational stories, patriotism, ideological and moral education, and other stories

Since Herdan's log TTR_novel is greater than Herdan's log TTR_textbook, which is greater than Herdan's log TTR_news, comparatively speaking, the novel has the richest vocabulary, followed by the text book and then the news.

Keyword Analyses.
We analyze the differences among the novel, news, and text book in terms of the proportions of POS and syllables. The statistical distribution of POS is shown in Figure 5(a) and the distribution of syllable proportions in Figure 5(b); the data in Figures 5(a) and 5(b) are based on the training and test sets. From Figure 5(a), we can read off the POS in each register from high to low. In Figure 5, we find that the number of nouns (NN) in each register is the highest. For better analysis, we subdivide the nouns (NN) into smaller groups according to their semantic information, as shown in Table 4.
The specific meanings of the abbreviations in Tables 5-7 are given in Tables 4 and 8. The abbreviations in Table 4 are designed by ourselves, and the contents of Table 8 come from the Chinese Treebank annotation guidelines of the University of Pennsylvania [30]. We analyze the distribution of each POS in the novel, news, and text book, taking the following POS as examples.
(1) POS-NN. In Figure 5(a), we find that the proportion of nouns (NN) is the highest in the novel, then in the text book, and the lowest in the news. Combined with Table 6, we find that the news focuses on a wide range of groups, not individuals. The nouns (NN) in the text book are names, time, plants, animals, events, natural phenomena, etc.; specific examples of these abbreviations are shown in Table 7. Hence, we find that textbooks focus on describing people, things, etc.
(2) POS-VV. In Figure 5(a), verbs (VV) are the most frequent in the text book, followed by the novel, and last the news. Combining Tables 5-7, we find that the verbs (VV) in the novel are mainly body-related verbs, such as "笑" (laugh), "哭" (cry), "跑" (run), "走" (walk), "跳" (jump), and "唱" (sing). Among them, "说" (say) and "问" (ask) are related to the mouth, "走" (walk) and "跑" (run) to the feet, and "抱" (embrace) to the hands; this is related to the characteristics of the novel. In the news, the verbs (VV) are mainly dummy verbs and continuous verbs; for example, "进行" (do) is a dummy verb and "上涨" (go up) is a continuous verb. The news uses these verbs to express its formal and solemn tone. The text book includes not only body verbs but also personalized verbs; the latter are rich in the text book because of the wide range of registers selected for it

(3) POS-CD. Also in Figure 5, we find that CD is the most frequent in the news, followed by the text book and the novel. As demonstrated in Table 6, there are a lot of numbers in the news. It can be said that exact figures are used in the news to quantify what is mentioned, rather than vague words such as the approximate grade words "一半" (half) and "大量" (lots of), which are often found in the novel and text book. In addition, the correctness of the news is also reflected in its use of a large number of numerals

In Figure 5(b), we find the distribution of syllabic words in the different registers from high to low; for example:

(i) Novel. 2 syllables, 4 syllables, 3 syllables, monosyllable, multisyllable

We analyze the distribution of each syllable type in each register, taking these syllable types as examples:

(1) Monosyllable. In Figure 5(b), monosyllabic words are the most frequent in the novel, followed by the text book and the news. As shown in Tables 5-7, we find that most of the monosyllabic words in the novel are body-related words; these verbs are related to specific parts of the body.
According to the description of the novel corpus in Section 3.5.1, this is consistent with the characteristics of the novel, which mainly depicts the specific actions of the characters. Because the text book contains many novels, there are also more monosyllables in the text book. With the simplification of Chinese phonetics, homonyms have significantly increased; if monosyllabic words were still widely used in the news, misunderstandings would be inevitable, hindering the role of language as a tool. Therefore, more precise polysyllabic words are used in the news

(2) 2 Syllables. In Figure 5(b), we find that disyllabic words are the most frequent in the news, followed by the text book, and last the novel. Combined with Tables 5-7, we find that the news uses disyllabic words, for example "表决" (vote) and "申明" (instruction) instead of "说" (say) in the novel and text book, to express a formal and solemn tone. In addition, there are more disyllabic verbs in the news, while in the novel and the text book, disyllabic words are mostly nouns (NN), such as "鼻子" (nose) and "眼窝" (eye socket).

Key 2-Gram Analyses.

In Figure 6, we can read off the main 2-gram structures of each register from high to low. Here, *** + SP denotes a sentence or phrase ending with SP. Combining Tables 9-11, we analyze the distribution of the 2-gram structures in the different registers, mainly taking the following structures as examples:

(1) NN/NR + VV. In Figure 6, we find that the proportion of this structure is the highest in the novel and text book and the lowest in the news. Examples of the structure NN/NR + VV in the novel are shown in Table 9. Combining Section 3.5.1, we find that the novel contains many dialogues; since some novels were selected for the text book, it also contains many conversations. Referring to Tables 9 and 11, the structure NN/NR + VV can be regarded as a description of the action and behavior of an NN or NR. From Section 3.5.3, we know that the verbs (VV) in the structure NN/NR + VV are body-related verbs, which is consistent with the characteristic of the novel that it mainly describes the actions of the characters. By contrast, the news contains many dummy verbs, such as "进行" (do), shown in Table 10. Since the text book has a lot of novels in it, the structure NN/NR + VV behaves the same there as in the novel

(2) PN/NR/NN + NN. In Figure 6, we find that the structure PN/NR/NN + NN is the most frequent in the news, then in the novel, and then in the text book. In conjunction with Table 10, we find that examples of the structure PN/NR/NN + NN are composed of two disyllabic words, such as "市场经济" (market economy) and "试点阶段" (pilot stage). Wang and Zhang once pointed out that such a "disyllabic word + disyllabic word" structure has pan-temporal characteristics; that is, it is widely used in the news, as it can describe things more accurately and from a higher vantage point [31].
This is in line with the characteristics of the news; therefore, a large number of such two-disyllable structures are used in the news

(3) NN/NR + VA. In Figure 6, the structure NN/NR + VA is the most frequent in the text book, followed by the novel, and last the news. From the overall content of Table 11, compared with the news, we find that the text book descriptions are more meticulous, for example "脸色惨白" (pale face), "旧衬衣" (old shirt), "严格的" (strict), and "哪个混蛋" (which bastard), which are shown in Table 11. Combining the description of the text book in Section 3.5.1, the trait of the text book is this more meticulous kind of description, which can better help students learn and improve their writing ability. Besides, as a formal written language, the news is simple and serious, while the language of the novel and the text book is more casual and flexible; therefore, in the dialogues of the novel and the text book, this kind of structure often appears with the noun of the structure omitted

3.6. Cluster Verification. To verify the effect of our extracted keywords and 2-grams, we use the t-SNE method [32] to cluster them. The input to t-SNE is the n-gram vectors trained by the attention network, which are high-dimensional vectors. Compared with other clustering methods, t-SNE can distinguish high-dimensional data very well and has a good visualization effect. The clustering results are shown in Figure 7; here, we only show several main keywords and 2-grams.
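A t-SNE projection like the one used here can be produced with scikit-learn; the random vectors below are stand-ins for the trained n-gram vectors, and the perplexity value is an illustrative choice:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 64))   # stand-ins for trained n-gram vectors

# 2 components for a 2-D plot; perplexity must be smaller than the sample count.
emb = TSNE(n_components=2, perplexity=10, init="random",
           random_state=0).fit_transform(vectors)
```

The resulting two-dimensional embedding `emb` is what gets scattered and colored by register in a figure like Figure 7.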
From the results of the keyword clustering in Figure 7(a) and the 2-gram clustering in Figure 7(b), we find that the effect of keyword clustering is better than that of 2-gram clustering, which is consistent with our conclusion in Section 4. In Figure 7, we find that the news is the most concentrated, centered on the core "经济" (economy). The novel can be divided into two groups, because our novel corpus consists of works by two authors, Mo Yan and Yu Hua, as indicated by the red circles in Figure 7(a); each red circle represents one class. The text book is more scattered. This is