An Improved Math Word Problem (MWP) Model Using Unified Pretrained Language Model (UniLM) for Pretraining

Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the general methods that support machine understanding of text content. They play a very important role in text information processing systems, including recommendation and question answering systems. Much NLU research has built on models such as bag of words, N-grams, and neural network language models. These models have achieved good performance in NLU and NLG tasks. However, they require large amounts of training data, and rich data are difficult to obtain in practical applications. Thus, pretraining becomes important. This paper proposes a semisupervised way to deal with math word problem (MWP) tasks using unsupervised pretraining and supervised tuning methods, which are based on the Unified pretrained Language Model (UniLM). The proposed model requires less training data than traditional models since it uses model parameters of tasks that have been learned before to initialize the model parameters of new tasks. In this way, old knowledge helps new models perform new tasks from old experience instead of from scratch. Moreover, in order to help the decoder make accurate predictions, we combine the advantages of autoregressive (AR) and autoencoding (AE) language models to support one-way, sequence-to-sequence, and two-way predictions. Experiments carried out on MWP tasks with 20,000+ mathematical questions show that the improved model outperforms the traditional models, with a maximum accuracy of 79.57%. The impact of different experiment parameters is also studied in the paper, and we found that a wrong arithmetic order leads to incorrect solution expression generation.


Introduction
The basic research of natural language processing (NLP) concerns human-computer language interaction, which represents human language with algorithms that machines can understand. NLP covers a vast array of tasks such as text summarization, generating completely new pieces of text, and predicting which word comes next. At its core is a statistical language model (LM), and LMs are a crucial first step for most advanced NLP tasks. This paper begins with basic LMs that can be created with a few lines of Python code and moves to state-of-the-art language models that are trained on very large datasets and are currently used by companies such as Google, Amazon, and Facebook. An LM is a probability distribution over sequences of words, which quantitatively evaluates how likely a string of characters is. LMs are used in speech recognition, machine translation, part-of-speech tagging, parsing, optical character recognition, handwriting recognition, information retrieval, and many other daily tasks. Their ability to model the rules of a language as probabilities gives great power for NLP-related tasks. The general process predicts each subsequent word from the words before it, and the probabilities of all words are then combined to evaluate how likely the text is. There are two types of LM: Statistical Language Models and Neural Language Models [1][2][3][4]. Statistical LMs use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words. For example, Mezzoudj and Benyettou [5] augment naive Bayes models with statistical n-gram language models to address the shortcomings of the standard naive Bayes text classifier. In [6], the authors propose a fast and simple algorithm for training neural probabilistic language models (NPLMs) based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. Experimental results show that the algorithm reduces training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well.
However, the estimation will be difficult in practice if the text is very long. Thus, there is a simplified method: the N-gram model. In the N-gram model, the conditional probability of a word is estimated from only the N-1 words that precede it. Unigram, bigram, and trigram models are the commonly used N-gram models. Typed character N-grams reflect information about their content and context. According to previous research, typed character N-grams improve the accuracy of authorship attribution [7,8]. However, the problems of data sparseness and inaccuracy get worse with larger texts in these models. In order to solve the problem of data sparseness when estimating probabilities with the N-gram model, researchers have turned to neural networks to study language models, such as UniLM and Transformer.
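The N-gram idea described above can be illustrated with a small bigram model. This is only a minimal sketch under stated assumptions: the toy corpus, the function names, the `<s>`/`</s>` boundary markers, and the add-alpha smoothing (used here as a simple remedy for data sparseness) are all illustrative choices and are not taken from the paper.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])                   # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))   # (previous word, word) pairs
    return contexts, bigrams


def bigram_prob(prev, word, contexts, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * vocab_size)


def sentence_prob(tokens, contexts, bigrams, vocab_size, alpha=1.0):
    """Probability of a sentence as the product of its bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        prob *= bigram_prob(prev, word, contexts, bigrams, vocab_size, alpha)
    return prob


# Hypothetical toy corpus, for illustration only.
corpus = [["tom", "has", "three", "apples"], ["tom", "has", "two", "pears"]]
contexts, bigrams = train_bigram(corpus)
vocab = {w for s in corpus for w in s} | {"</s>"}   # words that can be predicted
V = len(vocab)

seen = sentence_prob(["tom", "has", "three", "apples"], contexts, bigrams, V)
scrambled = sentence_prob(["apples", "three", "has", "tom"], contexts, bigrams, V)
```

With these counts, the sentence seen in training receives a higher probability than its scrambled version. Smoothing keeps unseen bigrams from getting zero probability, but it does not remove the sparseness problem for larger N.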
This paper proposes a semisupervised approach based on UniLM, which uses unsupervised pretraining and supervised tuning for language processing tasks. The goal of this approach is to learn a universal representation that requires very little adaptation when transferred to various downstream tasks. The training process of the algorithm is divided into two stages: the first stage uses language modeling objectives on unlabeled data to learn the initial parameters of the neural network; the second stage uses the corresponding supervised objectives to adapt these parameters to the target task. Moreover, to compare the performance of our model with that of other models, we carried out a highly challenging deep QA task on Math23K [9], a large-scale and template-rich dataset of math word problems. The results show that the model reaches a maximum accuracy of 79.57%. There are three advantages and contributions of the proposed model: (1) Although there are three language model tasks in the pretraining process, we do not need to train three models separately, because the parameters of the Transformer are shared thanks to the self-attention masking of UniLM. (2) Parameter sharing makes the learned text representation more universal because these parameters are jointly optimized with different language models. It also alleviates the problem of over-fitting on a specific language model task. (3) The proposed model is suitable for both NLU and NLG problems.

Related Work
In 2000, researchers first put forward the idea of using neural networks to study language models [10][11][12]. In 2011, Collobert and Weston [13] used a simple deep learning model to achieve SOTA results in NLP tasks such as named entity recognition (NER), semantic role labeling (SRL), and part-of-speech (POS) tagging. Since then, more and more researchers have focused on methods based on deep learning. In 2013, word vectors such as Word2vec [14] and those of Pennington et al. [15] became popular. Further research has sought to improve language models from the perspective of word vectors, focusing on the semantics of words and their context. In 2014, Kim proposed TextCNN [16], a model based on pretrained Word2vec for sentence classification tasks. In 2016, Joulin et al. [17] proposed FastText, a simple and lightweight deep learning model for text classification. The architecture is similar to the Word2vec CBOW model described by Rong et al. [18]. Experimental results show that FastText achieves good performance efficiently.
In addition, researchers have tried various architectures to strengthen language models, such as CNN, RNN, and Transformer [19,20]. The CNN-LSTM architecture uses Convolutional Neural Network (CNN) layers for feature extraction on input data, combined with LSTMs to support sequence prediction. As shown in Figure 1, the LSTM unit in a common CNN-LSTM model is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. CNN-LSTM networks are well suited to classifying, processing, and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. In a CNN, a convolution operation is used to obtain multiple feature maps. Then, the pooling operation filters out noise and extracts the key information for classification. Among these approaches, pretraining combined with downstream task fine-tuning is the most eye-catching trend. For example, the authors of [21] investigate the benefits of integrating CNNs and LSTMs and report improved accuracy for Arabic sentiment analysis on different datasets. They also consider the morphological diversity of particular Arabic words by using different sentiment classification levels.
In AI, pretraining imitates the way human beings process new knowledge: it uses model parameters of tasks that have been learned before to initialize the model parameters of new tasks. In this way, old knowledge helps new models perform new tasks from old experience instead of from scratch. In recent years, ELMo, GPT, and BERT have repeatedly set new SOTA results [22]. For example, the authors of [23] trained a BERT language understanding model for the Italian language (AlBERTo). AlBERTo focuses on the language used in social networks, specifically on Twitter. To demonstrate its robustness, they evaluated AlBERTo on the EVALITA 2016 task SENTIPOLC (SENTIment POLarity Classification), obtaining state-of-the-art results in subjectivity, polarity, and irony detection on Italian tweets.
Transformer [24], which is based on the attention mechanism, completely abandons CNN and RNN and relies only on attention to capture the global relationship between the input and the output. As shown in Figure 2, the Transformer architecture is composed of two parts: the encoder and the decoder. The encoder is on the left and the decoder is on the right. Both the encoder and the decoder are composed of modules that can be stacked on top of each other multiple times, which is denoted by N× in the figure. The modules consist mainly of multi-head attention and feed-forward layers. The inputs and outputs (target sentences) are first embedded into an n-dimensional space since we cannot use strings directly.
Transformer architectures have facilitated building higher-capacity models, and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. The effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice [25]. The framework is easier to compute in parallel, which reduces the training time for tasks such as machine translation and parsing. The Transformer has proven its capability and has been applied to pretraining models such as GPT, BERT, and XLM. In 2018, Brown et al. [26] proposed GPT at OpenAI, a unidirectional neural network language model based on generative pretraining, which became one of the most popular pretraining models of the year. They use a two-stage fine-tuning method: the first stage uses the Transformer decoder for generative pretraining on an unlabeled corpus; the second stage performs discriminative fine-tuning on specific tasks, such as text classification, sentence pair relationship discrimination, text similarity, and multiple-choice tasks. Instead of adopting the traditional fully connected layers used for classification in CNNs, GPT directly feeds the resulting vector into the softmax layer.
Moreover, in 2018, Devlin et al. [25] proposed BERT, a pretraining model based on a deep bidirectional Transformer. Unlike GPT, the feature extractor used by BERT is the Transformer encoder. Like GPT, BERT is divided into two stages, pretraining and downstream task fine-tuning. BERT changes the unidirectional language model of GPT into a bidirectional one. Instead of using the standard left-to-right prediction of the next word as the target task, BERT proposes two new tasks. The first pretraining task is called MLM, or Masked Language Model. In the input word sequence of this model, 15% of the words are randomly masked and the task is to predict what they are. Unlike previous models, BERT can predict these words from both directions, not just left-to-right or right-to-left. For example, Yu et al. [27] presented a replication study of BERT pretraining that carefully measures the impact of many key hyper-parameters and of the training data size. Experimental results show that BERT achieved SOTA results on GLUE, RACE, and SQuAD. Moreover, ERNIE [27] is an exploratory framework for continuous learning and understanding based on knowledge enhancement, proposed by Baidu. The framework combines big-data pretraining with multi-source knowledge. Through learning technology, it continuously absorbs knowledge of text structure from massive text data to refine the model. ERNIE has achieved SOTA results in more than 40 classic NLP tasks and has won more than 10 first places on international benchmarks such as GLUE, VCR, XTREME, and SemEval.
UniLM is a BERT-based model, which is a simple but effective multimodal pretraining method for text. Unlike BERT, UniLM can be configured using different self-attention masks to aggregate context for different types of language models. It consists of Transformer models jointly pretrained on large amounts of text and optimized for language modeling. The UniLM model uses three types of language modeling (one-way model, two-way model, and sequence-to-sequence prediction model) for pretraining [28]. Within a shared Transformer network, a specific self-attention mask controls the context that each prediction is conditioned on, thereby achieving unified modeling. For example, the authors of [29] propose UniVL, a Unified Video and Language pretraining model for both multimodal understanding and generation. It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. The training techniques in [30][31][32][33] are applied in this paper.
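The role of the self-attention mask in the three language modeling types can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `unilm_mask`, the mode names, and the convention that `mask[i][j]` is `True` when position `i` may attend to position `j` are assumptions made for this example.

```python
def unilm_mask(src_len, tgt_len=0, mode="bidirectional"):
    """Build a boolean self-attention mask; mask[i][j] is True if token i may attend to token j.

    The sequence is assumed to be a source segment of src_len tokens
    followed by a target segment of tgt_len tokens.
    """
    if mode not in ("bidirectional", "left_to_right", "seq2seq"):
        raise ValueError("unknown mode: " + mode)
    n = src_len + tgt_len

    def allowed(i, j):
        if mode == "bidirectional":
            return True                     # two-way LM: every token sees every token
        if mode == "left_to_right":
            return j <= i                   # one-way LM: only the token itself and its left context
        # seq2seq: the source is fully visible to everyone; target tokens
        # additionally see themselves and earlier target tokens.
        if j < src_len:
            return True
        return i >= src_len and j <= i

    return [[allowed(i, j) for j in range(n)] for i in range(n)]


# Two source tokens followed by two target tokens.
seq2seq = unilm_mask(2, 2, mode="seq2seq")
```

Because the three modes differ only in the mask, the same Transformer parameters can serve all three objectives, which is the parameter sharing the text refers to.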
In this paper, a semisupervised approach based on UniLM is proposed. The model allows unsupervised pretraining and supervised tuning for language processing tasks. Experimental results show that the proposed model reaches a maximum accuracy of 79.57%. The contributions of this paper are as follows: this paper proposes a semisupervised way to deal with math word problem (MWP) tasks using unsupervised pretraining and supervised tuning methods, which are based on the Unified pretrained Language Model (UniLM). It combines the advantages of AR and AE language models to support one-way, sequence-to-sequence, and two-way prediction tasks. Experiments carried out on MWP tasks with 20,000+ mathematical questions show that the improved model outperforms the traditional models, with a maximum accuracy of 79.57%. The paper is structured as follows: we first introduce our methodology in Section 2, and then describe the test-bed and evaluate the proposed model according to several evaluation metrics in Section 3. After the performance evaluation, the summary and a discussion of future work are given in Section 4.

Methodology
Researchers found that BERT could be useful for more than just Google searches [34,35]. BERT seems to promise improvements in key areas of computational linguistics, including chatbots, question answering, summarization, and sentiment detection. It is described as a "groundbreaking" technique for NLP because it is the first bidirectional and completely unsupervised technique for language representation, which means that it builds an understanding of each word from its entire context at once. This represents a clear advantage in the field of context learning. It will continue revolutionizing the field of NLP because it provides an opportunity for high performance on small datasets for a large range of tasks. The proposed model is also a multi-layer Transformer network based on UniLM, which is a BERT-based generative model. Compared to BERT, however, the proposed model can complete the three pretraining goals at the same time. Besides the mentioned pretraining methods, a new sequence-to-sequence training method is added to the model, which leads to its good performance on NLU and NLG tasks. Moreover, the proposed model predicts a masked word from the context of that word, which is also a cloze task. For different training objectives, the context is different. The general processes of our proposed model are shown below:
(i) Input representation: Each input x is a sequence of word tokens. The sequence can be either a single sentence or a pair of sentences combined together. The input representation is the same as in UniLM. For each input token $t_i$, its vector representation $x_i$ is obtained by summing the corresponding token embedding, position embedding, and segment embedding. We add a special classification token (CLS) at the beginning of the sequence and a special end-of-sequence token (SEP) at the end of each segment.
(ii) Transformer Encoder: Then the multi-layer bidirectional Transformer encoder is used to encode the context information of the input representation. Given the input vectors $X = \{x_i\}_{i=1}^{n}$, an $L$-layer Transformer encodes the input as $H^l = \mathrm{Transformer}_l(H^{l-1})$, where $l \in [1, L]$, $H^0 = X$, and $H^L = [h_1^L, \ldots, h_n^L]$. The hidden vector $h_i^L$ is used as the contextual representation of $t_i$.
Pretraining Objectives: After the encoding step, we extend the original UniLM pretraining objective in two ways to make full use of the rich intra-sentence and inter-sentence structure of language: a word structural objective (mainly used for single-sentence tasks) and a sentence structural objective (mainly used for sentence-pair tasks). The two auxiliary objectives are pretrained together with the original masked LM objective to capture the internal language structure in a unified model. The structure is shown in Figures 3 and 4.
Word Structural Objective: Figure 3 shows how the new word objective and the masked language model objective are trained jointly. For each input sequence, as in UniLM, we first randomly mask 15% of the tokens and send the output vectors to the softmax classifier to predict the original tokens. Next, a subsequence of tokens is randomly shuffled, and the model has to recover the original order. The word objective is equivalent to maximizing the likelihood of placing each shuffled token in its correct position, which can be formulated as formula (1):
$$\arg\max_{\theta} \sum \log P(\mathrm{pos}_1 = t_1, \mathrm{pos}_2 = t_2, \ldots, \mathrm{pos}_K = t_K \mid t_1, t_2, \ldots, t_K, \theta) \quad (1)$$
Here, $\theta$ represents the trainable parameters of our model, and $K$ is the length of each shuffled subsequence. A larger $K$ forces the model to reconstruct longer sequences while injecting more disturbed input. We take $K = 3$ to balance the model's reconstruction ability and robustness.
(iii) Sentence Structural Objective: The original UniLM model is very effective at predicting the next sentence (97%-98% accuracy). In our model, it is necessary to predict not only the next sentence but also the previous sentence, so that the pretrained language model perceives the order of sentences in a bidirectional manner. As shown in Figure 4, given a pair of sentences $(S_1, S_2)$, there is a two-thirds probability that $S_2$ is either the next or the previous sentence of $S_1$ and a one-third probability that the two sentences are unrelated. We use the SEP token to connect $S_1$ and $S_2$, and the CLS encoded vector is then fed into the softmax classifier for the three-class prediction.
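The data construction behind the two structural objectives can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names `shuffle_span` and `make_sentence_pair`, the label encoding (0 = next, 1 = previous, 2 = unrelated), the uniform draw over the three classes, and the fallback to the unrelated class at document boundaries are all assumptions. The span length K = 3 follows the text.

```python
import random


def shuffle_span(tokens, k=3, rng=None):
    """Shuffle one random span of k tokens.

    Returns the perturbed sequence, the span start, and the original span,
    which serves as the reconstruction target for the word objective.
    """
    rng = rng or random.Random()
    if len(tokens) < k:
        return list(tokens), None, []
    start = rng.randrange(len(tokens) - k + 1)
    target = list(tokens[start:start + k])
    shuffled = target[:]
    rng.shuffle(shuffled)                  # note: may occasionally leave the order unchanged
    perturbed = list(tokens[:start]) + shuffled + list(tokens[start + k:])
    return perturbed, start, target


def make_sentence_pair(doc, index, other_docs, rng=None):
    """Build one three-class example for the sentence objective.

    Assumed labels: 0 = S2 follows S1, 1 = S2 precedes S1, 2 = S2 is unrelated.
    """
    rng = rng or random.Random()
    s1 = doc[index]
    label = rng.randrange(3)               # each class drawn with probability 1/3
    if label == 0 and index + 1 < len(doc):
        s2 = doc[index + 1]
    elif label == 1 and index > 0:
        s2 = doc[index - 1]
    else:
        label = 2                          # also the fallback at document boundaries
        s2 = rng.choice(rng.choice(other_docs))
    return "[CLS] " + s1 + " [SEP] " + s2 + " [SEP]", label


# Hypothetical inputs, for illustration only.
perturbed, start, target = shuffle_span(list("abcdefg"), k=3, rng=random.Random(0))
text, label = make_sentence_pair(["A.", "B.", "C."], 1, [["X.", "Y."]], rng=random.Random(0))
```

A classifier over the CLS vector would then be trained on `label`, while the word objective would be trained to restore `target` at positions `start` to `start + 2`.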

Experiments
In this section, we evaluate the effectiveness of the proposed model on math problems from the widely used benchmark MAWPS. MAWPS [36,37] is an online repository of Math Word Problems and provides a unified test-bed to evaluate different algorithms. MAWPS allows for the automatic construction of datasets with particular characteristics, providing tools for tuning the lexical and template overlap of a dataset as well as for filtering ungrammatical problems from web-sourced corpora. The online nature of this repository facilitates easy community contribution. At present, the repository has amassed 3320 problems, including the full datasets used in several prominent works. Moreover, we study the effect of different parameters in our model. In the experiments, almost every hyper-parameter is the same in the training recipes of both models. Specifically, we carefully control the following hyper-parameters: the same batch size: 256.
Macro Precision (MP): As shown in formula (2), MP is the quotient of the number of correctly selected answers and the total size of the dataset. MP measures the accuracy of the model.
Macro Recall (MR): MR is the ratio of the number of shared words to the total number of words in the ground truth. As shown in formula (3), S is the amount of data that is predicted. It measures the completeness of the result.
F1: The F1 score is a common metric for classification problems and is widely used in QA. It is appropriate when we care equally about precision and recall. It is calculated as the harmonic mean of the two: $F_1 = \frac{2 \cdot \mathrm{MP} \cdot \mathrm{MR}}{\mathrm{MP} + \mathrm{MR}}$.
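The metric definitions above can be made concrete with a short sketch. It rests on assumptions: `answer_accuracy` implements the MP description (correct answers divided by dataset size) with a hypothetical numeric tolerance, and `f1_score` is the standard harmonic mean of precision and recall. The paper's exact formulas (2) and (3) are not reproduced here.

```python
def answer_accuracy(predicted, gold, tol=1e-4):
    """Fraction of problems whose predicted answer matches the gold answer within tol."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(abs(p - g) <= tol for p, g in zip(predicted, gold))
    return correct / len(gold)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, for illustration only.
accuracy = answer_accuracy([1.0, 2.0, 3.5], [1.0, 2.5, 3.5])
f1 = f1_score(0.5, 1.0)
```

The harmonic mean penalizes an imbalance between the two inputs, so a model with precision 0.5 and recall 1.0 scores about 0.67 rather than the arithmetic mean of 0.75.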
Table 1 shows that our model achieves considerable progress in Macro Precision, Macro Recall, and F1 score. Because MWP is a mature research area, it is very hard for a model to make a large improvement over existing math word problem solvers.

Data Preparation.
The dataset provides a training set containing 1674 question and answer pairs and a test set containing 865 question and answer pairs. We choose 900 questions from the total training set as the development set and use the remaining 1639 question and answer pairs as the actual training set.

Results.
The experiment in this paper consists of two parts. Experiment 1 compares the proposed model with other benchmark models. Table 1 lists the accuracy results of the proposed model and various baselines. The proposed model outperforms all baselines in the experiments and achieves a best accuracy of 79.57%. For example, the proposed method raised the F1 score to 0.78, compared with 0.73 and 0.76 for Graph2Tree and GTS [38], respectively. This is because UniLM combines the advantages of both AR and AE models, which makes up for a disadvantage of LSTM, i.e., that an LSTM only stores information from one direction. The proposed model performs the best in all tasks. To get a better understanding of how the proposed model is able to perform so well, we further carry out experiments to test the effect of different parameters in our model.
Figure 4: The architecture of the sentence structural objective.

Impact of the Length of the Sentence.
We first study the effect of the length of the sentence. The experiments are carried out on the test set to investigate how the proposed model performs as the sentence length increases. Comparisons are made between our model and state-of-the-art models using explicit tree decoders. As shown in Table 2, we find the following. First, the proposed model performs better than the other models in most cases, except when the number of operators equals 5. In the other cases, with fewer than 5 operators, the model shows a good improvement compared to the other models. Second, when the complexity of the sentence grows, the performance of all models decreases. This is because longer sentences lead to more complex questions, which are more difficult to predict.

Impact of Numerical Comparison.
Since a wrong arithmetic order leads to incorrect solution expression generation, our proposed model aims to solve this problem. Experiments are carried out to investigate how much the model improves the arithmetic order problem. We first retrieve the MWPs with incorrectly predicted expressions. In the experiment, we check whether the length of each incorrectly predicted expression is equal to the length of its corresponding ground-truth expression. As shown in Table 3, the proposed model produces 101 incorrectly predicted sentences, while GTS produces 119 and Graph2Tree produces 103. We then check the number of incorrectly predicted sentences against the initially retrieved set. The results lead to the same conclusion: our proposed model always generates fewer sentences with arithmetic order errors. This suggests that the proposed model is able to significantly improve the arithmetic order in MWP tasks.
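One way to carry out the length check described above is sketched below. It is an assumption-laden illustration and not the authors' evaluation script: the tokenizer, the category names, and the "reordered" heuristic (same tokens in a different order) are hypothetical, and comparing token sequences treats arithmetically equivalent expressions such as `2 + 3` and `3 + 2` as different.

```python
import re
from collections import Counter

# Numbers, identifiers such as n1, and basic operators or parentheses.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/^()]")


def tokenize(expression):
    """Split an expression string into number, identifier, and operator tokens."""
    return TOKEN.findall(expression)


def classify_errors(predicted, gold):
    """Count correct predictions and wrong predictions by length and token order."""
    stats = Counter()
    for pred, ref in zip(predicted, gold):
        pred_tokens, ref_tokens = tokenize(pred), tokenize(ref)
        if pred_tokens == ref_tokens:
            stats["correct"] += 1
        elif len(pred_tokens) == len(ref_tokens):
            stats["wrong_same_length"] += 1
            if Counter(pred_tokens) == Counter(ref_tokens):
                stats["reordered"] += 1    # same tokens in a different order
        else:
            stats["wrong_other_length"] += 1
    return stats


# Hypothetical predictions and references, for illustration only.
stats = classify_errors(
    ["3 - 2", "4 + 5", "2 * 3 + 1", "7 / 2"],
    ["2 - 3", "4 + 5", "2 * 3", "7 * 2"],
)
```

In this toy example, `3 - 2` against `2 - 3` is the kind of same-length prediction with a wrong operand order that the analysis targets.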

Conclusions
This paper proposed an improved MWP model, which improves task performance by adding UniLM for pretraining. UniLM completes unidirectional, sequence-to-sequence, and bidirectional prediction tasks. Through experiments, we show the superiority of our model over state-of-the-art models on math problem tasks.
There are three advantages of the proposed model: (1) Although there are three language model tasks in the pretraining process, we do not need to train three models separately, because the parameters of the Transformer are shared thanks to the self-attention masking of UniLM. (2) Parameter sharing makes the learned text representation more universal because these parameters are jointly optimized with different language models. It also alleviates the problem of over-fitting on a specific language model task. (3) The proposed model is suitable for both NLU and NLG problems.
For future work, since the proposed model has difficulties dealing with long and complex sentences, we aim to consider the relationships among quantities and other attributes to better understand the context. Moreover, in future research, since advanced optimization algorithms also have been applied in many domains of NLP tasks, we may explore a comparison between advanced optimization algorithms and our model.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Consent
Not applicable.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors' Contributions
Dongqiu Zhang conceptualized the study; Wenkui Li wrote, reviewed, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.