Model and Simulation of Maximum Entropy Phrase Reordering of English Text in Language Learning Machine

This paper proposes a feature extraction algorithm based on the maximum entropy phrase reordering model for statistical machine translation in language learning machines. The algorithm extracts more accurate phrase reordering information, in particular feature information for reversed phrases, which resolves the imbalance of feature data during maximum entropy training in the original algorithm and improves the accuracy of phrase reordering in translation. In the experiments, word posterior probability features were combined with linguistic features such as words, parts of speech, and syntactic features extracted with a syntax analyzer, and a maximum entropy classifier was used to predict translation errors; experimental verification and comparison were performed on a Chinese-English translation data set. The results show that different word posterior probabilities have a significant impact on the classification error rate, and that combining linguistic features with word posterior probability features can significantly reduce the classification error rate and improve translation error prediction performance.


Introduction
Phrase-based statistical machine translation is one of the current mainstream methods of machine translation. The basic unit of translation shifts from the word to the phrase, and a continuous word string is processed as a whole during translation, which alleviates the problem of word context dependence [1]. When translating, the input sentence is matched against the phrase dictionary, the best phrase segmentation is selected, and the resulting phrase translations are reordered to obtain the best translation. Reordering at the phrase level is therefore an important research problem in phrase-based machine translation. Many systems use a distortion model probability to adjust the order of the target language phrases [2]. The distortion probability of each target phrase is calculated from the distance between the starting position of its source language phrase and the last position of the previous target phrase's source language phrase. Obviously, such a simple length-penalty strategy limits the accuracy of the phrase reordering model. Introducing syntactic knowledge into the machine translation system can effectively improve reordering accuracy [3].
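The distance-based distortion penalty described above can be sketched in a few lines of Python (the function name and the zero-cost convention for monotone continuation are illustrative, not taken from any specific system):

```python
def distortion_cost(prev_end: int, curr_start: int) -> int:
    """Distance-based distortion: |start(curr) - end(prev) - 1|.

    A cost of 0 means the current source phrase immediately
    follows the previous one (monotone order); any jump, forward
    or backward, is penalized in proportion to its distance.
    """
    return abs(curr_start - prev_end - 1)

# Monotone continuation: previous phrase ended at source word 3,
# current phrase starts at word 4 -> zero penalty.
print(distortion_cost(3, 4))  # 0
# A jump back to source word 1 incurs a distance-based penalty.
print(distortion_cost(3, 1))  # 3
```

As the paper notes, this penalty depends only on distance, not on the phrases themselves, which is exactly why it limits reordering accuracy.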
In recent years, with the development of statistical machine translation (SMT), many different types of machine translation (MT) systems have emerged, such as phrase-based, hierarchical phrase-based, and syntax-based systems, and translation performance has improved significantly [4]. Automatic translation quality evaluation is a hot spot in statistical machine translation research. It can be divided into two types: automatic evaluation with reference answers and automatic evaluation without them [5]. In the field of software localization, the latter refers to automatically assigning a confidence score to translation quality, or identifying and classifying translation errors, without a reference answer, so as to help translation editors quickly locate translation errors and improve work efficiency. To improve the quality of machine translation, automatic error detection and classification play a vital role in the postprocessing of MT output. On the one hand, they help posteditors work more efficiently; on the other hand, the translation of the corresponding source language can be analyzed on the basis of the detected errors, and the source language input can be transformed and redecoded, thereby improving translation performance [6].
Among them, the bracket transcription grammar proposed by Alkazemi et al. [7] has also been widely used in the field of machine translation. However, because the bracket transcription grammar does not contain linguistic knowledge, it cannot predict well the combination order of two adjacent target phrases. Wang et al. [8] used the boundary words of bilingual phrases as features on the basis of the bracket transcription grammar to perform maximum entropy training, obtained a reordering model, and computed the order-preserving and reverse-ordering probabilities from the features of adjacent bilingual phrases; this predicts the order between adjacent phrases better and thereby effectively improves the output of the translation system. Observing the features used in maximum entropy training for the maximum entropy phrase reordering model, we find that the number of instance features of order-preserving phrases is much greater than the number of instance features of reversed phrases, because the word orders of Chinese and English are roughly the same [9]. Using maximum entropy to reorder phrases can also be regarded as a classification problem with an order-preserving class and a reverse-ordering class, and the feature data used to train the classifier suffer from data imbalance, which may affect the classifier's actual classification performance. For example, if FBIS is selected as the training corpus, the baseline feature extraction system extracts 4,839,390 feature instances, of which order-preserving feature instances account for 82.7% while reverse-order feature instances account for only 17.3% [10]. Taking 100,000 sentences of all feature instances as the open test set for the reordering model and the remaining data as the maximum entropy training set, the test results show that the reordering model achieves a judgment accuracy of 97.55% for order-preserving features [10].
The judgment accuracy rate for reverse-order features is only 72.03% [11]. In addition, the bracket transcription grammar assumes that if the source language phrases are adjacent, the corresponding target language phrases are also adjacent, but actual Chinese-English sentence pairs contain adjacent source language phrases whose target language phrases are not adjacent. In view of this, this paper improves the maximum entropy feature extraction algorithm from three aspects: the order-preserving instance selection strategy, the introduction of combined features, and the addition of a new phrase order category, so as to improve the judgment accuracy of the reordering model and ultimately improve translation quality.

Statistical Machine Translation Based on the Maximum Entropy Phrase Reordering Model
Wang et al. [8] proposed a statistical translation model based on bracket transcription grammar. The simplified bracket transcription grammar contains only the following two types of rules:

A ⟶ α/β,
A ⟶ [A1 A2] | ⟨A1 A2⟩.

The first is the vocabulary rule, which translates the source language phrase α into the target language phrase β. The second is the merging rule, in which the order of the source language phrases relative to the target language phrases can be order preserving ([A1 A2]) or reverse ordering (⟨A1 A2⟩). In the process of phrase reordering, prior order-preserving and reverse-ordering probabilities can be set for the two different orders in the merging rule.
This method ignores the differences between different source language and target language phrase pairs.
Maučec and Donaj [12] improved the ordering model of the abovementioned bracket transcription grammar model and proposed a phrase ordering model based on maximum entropy bracket transcription grammar, that is, using the maximum entropy model for phrase ordering:

p(δ | A1, A2) = exp(Σ_i ω_i f_i(δ, A1, A2)) / Σ_δ′ exp(Σ_i ω_i f_i(δ′, A1, A2)),

where f_i is a feature function defined on the order δ and the two adjacent bilingual phrases A1 and A2, ω_i is the corresponding feature weight, and δ takes the value order preserving or reverse ordering; the tail words of the phrases are selected as the features for maximum entropy model training. Experiments show that the performance of the phrase ordering model based on maximum entropy bracket transcription grammar is significantly better than both the traditional distortion-based phrase ordering model and the bracket transcription grammar-based ordering model. However, the experiments also show that the number of order-preserving instances is much higher than the number of reverse-order instances, which may affect the performance of the maximum entropy model. This paper approaches the problem from two aspects, the reordering instance extraction algorithm and feature selection, aiming to solve the imbalance of the maximum entropy training data. In the experiments, the statistical machine translation system [13] based on the maximum entropy ordering model is used as the baseline system. The maximum entropy phrase reordering model is shown in Figure 1.
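The log-linear form above can be sketched as a small Python function. This is a minimal illustration of the maximum entropy (softmax over weighted features) computation, not the paper's implementation; the feature names and weights are invented for the example:

```python
import math

def maxent_order_prob(features, weights):
    """p(delta | A1, A2) = exp(sum_i w_i * f_i(delta)) / Z,
    where delta ranges over {"straight", "inverted"}.

    `features` maps each order to the list of binary feature names
    that fire for it; `weights` maps feature names to their weights.
    """
    scores = {d: sum(weights.get(f, 0.0) for f in feats)
              for d, feats in features.items()}
    z = sum(math.exp(s) for s in scores.values())  # partition function
    return {d: math.exp(s) / z for d, s in scores.items()}

# Hypothetical boundary-word features of two adjacent phrase pairs.
feats = {"straight": ["src_last=de", "tgt_last=of"],
         "inverted": ["src_last=de"]}
w = {"src_last=de": 0.3, "tgt_last=of": 1.2}
probs = maxent_order_prob(feats, w)
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Since "tgt_last=of" fires only for the straight order and carries a positive weight, the model here prefers order preservation, mirroring how boundary-word features drive the decision in the model above.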

Reordering Instance Extraction Algorithm.
The extraction algorithm for reordering instances in the maximum entropy phrase reordering system in this paper is more flexible and concise to implement and easy to extend, so it can accommodate the different extraction strategies used in the experiments. The input of the reordering instance extraction algorithm is a word alignment matrix produced by bidirectional GIZA alignment, and the output is the set of order-preserving phrase instances and the set of reverse-order phrase instances [14]. The extraction algorithm first traverses all consecutive word sequences in the source language and extracts the maximum span of the target language aligned with each continuous sequence. Then, target language word sequences and source language word sequences that do not satisfy alignment consistency are filtered out; that is, the target language span is scanned in reverse to check whether the corresponding source language span falls within the range of the original continuous word sequence. Finally, reordering instances are extracted according to the given extraction strategies.
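The span enumeration and consistency check described above can be sketched as follows. This is a generic alignment-consistent phrase pair extractor under assumed conventions (0-based indices, alignment given as (source, target) pairs, a hypothetical `max_len` limit), not the paper's exact algorithm:

```python
def extract_consistent_pairs(align, src_len, max_len=7):
    """Enumerate source spans [i, j], take the min/max target span
    aligned to each, then check alignment consistency: no target
    word inside the span may align to a source word outside [i, j].

    `align` is a list of (source_index, target_index) links.
    """
    pairs = []
    for i in range(src_len):
        for j in range(i, min(src_len, i + max_len)):
            tgt = [t for s, t in align if i <= s <= j]
            if not tgt:
                continue  # unaligned source span
            lo, hi = min(tgt), max(tgt)
            # Reverse scan: every link landing in [lo, hi] must
            # originate inside the original source span.
            if all(i <= s <= j for s, t in align if lo <= t <= hi):
                pairs.append(((i, j), (lo, hi)))
    return pairs

# Toy alignment 0-0, 1-2, 2-1: source words 1 and 2 swap on the target side.
pairs = extract_consistent_pairs([(0, 0), (1, 2), (2, 1)], 3)
```

On this toy input the crossing links make the span (0, 1) inconsistent (its target span pulls in source word 2), while (1, 2) ↔ (1, 2) is extracted as a valid swapped block.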

Variable Definition.
Before introducing the reordering instance extraction algorithm, we first define the variables related to the algorithm:
(1) Align set: stores all alignment matrices from the source language to the target language
(2) Straight set: stores the instances in which the target language phrases preserve order
(3) Inverted set: stores the instances in which the target language phrases are in reverse order
(4) Else set: stores the instances in which the source language phrases are adjacent but the target language phrases are not adjacent
The last lines of the algorithm describe the framework of the improved instance extraction algorithm. Based on this framework, it is convenient to formulate various extraction rules. Among them, the 10th step examines the extracted bilingual word alignment matrix, checks whether it can be split into two adjacent bilingual phrase pairs, and judges the combination order of the split adjacent pairs. In the final step, the algorithm introduces a new classification, namely, nonadjacent bilingual phrase pairs.
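The decision that routes a split pair into the Straight, Inverted, or Else set can be sketched as a simple classifier over the two target spans. The function below is an illustrative reconstruction under the assumption that spans are inclusive index pairs; it is not taken from the paper's pseudocode:

```python
def classify_adjacent(tgt_span_1, tgt_span_2):
    """Route a split bilingual pair into the three instance sets:
    'straight'  -> target spans adjacent in the same order,
    'inverted'  -> target spans adjacent in reverse order,
    'else'      -> source phrases adjacent, target phrases not.
    """
    (lo1, hi1), (lo2, hi2) = tgt_span_1, tgt_span_2
    if lo2 == hi1 + 1:
        return "straight"
    if lo1 == hi2 + 1:
        return "inverted"
    return "else"

assert classify_adjacent((0, 1), (2, 3)) == "straight"
assert classify_adjacent((2, 3), (0, 1)) == "inverted"
assert classify_adjacent((0, 1), (4, 5)) == "else"
```

The third branch is exactly the new nonadjacent classification introduced in the final step of the algorithm.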

Reordering Instance Selection Strategy.
The baseline system uses a simple method to control the number of reordering instances: only the smallest block is kept for order-preserving instances and only the largest block is kept for reverse-order instances. Obviously, some phrase boundary features are lost this way, and the number of order-preserving instances still far exceeds the number of reversed instances. This imbalance of feature data affects the judgment accuracy of the maximum entropy reordering model, especially its judgment of reverse-order instance features [15]. In an open test with 100,000 instances, of which 17,286 are reverse-order instances, the test accuracy on reverse-order instances is only 72.03%. Under the algorithm framework proposed in Section 3, this paper makes the following three attempts, in sequence, at the reordering instance selection strategy: (1) To resolve the imbalance of feature data during maximum entropy training, the most direct idea is to adopt a selection strategy that directly limits the number of order-preserving instances [16, 17]. Instead of keeping only the minimum block in each order-preserving example as the baseline system does, this paper uses a random algorithm to select the order-preserving examples, which avoids the loss of long phrase boundary features that the previous method may cause. (2) In bilingual sentences, there are cases in which the source language phrases are adjacent but the target language phrases are not. To handle this situation, this paper adds a new classification on the basis of (1), which reduces the imbalance of feature data to a certain extent: if an extracted instance belongs to neither the order-preserving nor the reverse-ordering category, it is assigned to this new category [18].
(3) Because of misalignments in the alignment results, extending unaligned words into the instances can improve the recall rate of phrase feature extraction. Here, we define an expansion flag i ∈ {0, 1} for the order-preserving and reverse-ordering rules, where i = 0 means that the extracted instance is not expanded with unaligned words and i = 1 means that it is.
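Strategy (1), random selection of order-preserving instances instead of keeping only the minimum block, amounts to random downsampling of the majority class. A minimal sketch, with an assumed `ratio` parameter controlling the target class balance (the paper does not specify this interface):

```python
import random

def balance_instances(straight, inverted, ratio=1.0, seed=13):
    """Randomly subsample order-preserving (straight) instances so
    that their count is at most `ratio` times the reverse-order
    (inverted) count, rather than keeping only the smallest block.
    Random selection preserves a mix of short and long phrase
    boundaries instead of systematically dropping the long ones.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    cap = int(ratio * len(inverted))
    if len(straight) > cap:
        straight = rng.sample(straight, cap)
    return straight, inverted

# 100 straight vs 20 inverted instances, capped at a 1.5 : 1 ratio.
s, v = balance_instances(list(range(100)), list(range(20)), ratio=1.5)
assert len(s) == 30 and len(v) == 20
```

This mirrors the paper's motivation: the imbalance is reduced without the systematic loss of long-phrase boundary features that the baseline's minimum-block rule causes.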

Feature Extraction.
Features are extracted from the reordering instances for maximum entropy training. A reordering instance can be represented as <a1, a2>, where a = <b, c>, b is the source language phrase, c is the target language phrase, and a1 and a2 are the adjacent (or nonadjacent) phrase pairs. Here, b.f denotes the first word of the source language phrase and b.l the last word of the source language phrase; the same notation is used for the target phrase c. The baseline system, to limit the scale of feature extraction, uses only the tail words of the reordering instances. In the feature extraction experiments, in addition to the above four tail word features, first-word features and combination features are added [19]. Because of the different grammatical structures of Chinese and English, the English translation corresponding to the phrases or clauses before and after a Chinese punctuation mark may express them in reverse order [20]. The decoding method of the baseline system is that if a punctuation mark is found in the reordering window, the window does not perform the reverse-order operation. This method is quite effective for paired symbols such as "<<>>" and "{}"; however, a mark such as "." cannot be judged this simply. Therefore, in this paper, on the basis of adding the first-word feature and combination feature of the reordering instance, a punctuation feature is added for maximum entropy training. The features of the reordering instances are shown in Table 1.
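The feature templates above (tail word, first word, combination, punctuation) can be sketched for one phrase pair a = <b, c>. The template names and the punctuation set are illustrative, not the paper's actual identifiers:

```python
PUNCT = set(",.!?;:")  # assumed punctuation inventory for the sketch

def reorder_features(b, c):
    """Feature templates for one side of a reordering instance:
    tail words b.l / c.l (the baseline features), first words
    b.f / c.f, a tail-word combination feature, and a punctuation
    indicator for the source phrase.
    """
    feats = [f"src_last={b[-1]}", f"tgt_last={c[-1]}",   # b.l, c.l
             f"src_first={b[0]}", f"tgt_first={c[0]}",   # b.f, c.f
             f"comb={b[-1]}|{c[-1]}"]                    # combination
    if any(w in PUNCT for w in b):
        feats.append("src_has_punct")
    return feats

f = reorder_features(["in", "the", ","], ["my", ","])
assert "src_has_punct" in f and "comb=,|," in f
```

Per phrase pair this yields the four boundary-word features plus the combination and punctuation features, matching the enlarged template set used in the experiments.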

English Translation System Evaluation Criteria
The evaluation criteria adopted for the effectiveness of the error detection method are the classification error rate (CER), accuracy rate (AR), recall rate (RR), and F criterion. The classification error rate is calculated as follows:

CER = (number of words whose classification category is wrong) / (total number of words).
In the Chinese-to-English translation error detection and classification task, the number of words whose true category in the translation hypothesis is "incorrect" is greater than the number of "correct" words, so the baseline level of the classification error rate is usually determined by marking all words as "incorrect"; the baseline classification error rate is then the number of "correct" samples divided by the total number of samples. The accuracy rate is the ratio of the number m_n of words of true category i that the classifier classifies correctly to the number t_n of words that the classifier marks as category i:

AR = m_n / t_n.

The recall rate is the ratio of m_n to the total number g_n of words whose true category is i:

RR = m_n / g_n.

The F criterion is the trade-off between accuracy and recall:

F = 2 × AR × RR / (AR + RR).

Therefore, the test accuracy of the maximum entropy reordering model cannot be taken directly as a measure of translation performance, but it can still be used as a reference indicator.
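The four criteria can be computed together. This sketch follows the definitions above for a single class label; the function name and label strings are illustrative:

```python
def error_metrics(gold, pred, label="incorrect"):
    """CER over all words, plus per-class AR (precision), RR
    (recall), and F for one class, per the definitions above."""
    cer = sum(g != p for g, p in zip(gold, pred)) / len(gold)
    m_n = sum(g == p == label for g, p in zip(gold, pred))
    t_n = sum(p == label for p in pred) or 1  # words marked as the class
    g_n = sum(g == label for g in gold) or 1  # words truly in the class
    ar, rr = m_n / t_n, m_n / g_n
    f = 2 * ar * rr / (ar + rr) if ar + rr else 0.0
    return cer, ar, rr, f

gold = ["correct", "incorrect", "incorrect", "correct"]
pred = ["correct", "incorrect", "correct", "incorrect"]
cer, ar, rr, f = error_metrics(gold, pred)
assert cer == 0.5 and ar == 0.5 and rr == 0.5
```

Marking every word "incorrect" in this toy example would give CER = 0.5 as well, which illustrates how the baseline level (number of "correct" samples / total samples) is obtained.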

It can be seen from Figure 3 that the test accuracy of experiment 1 reached the highest value, 92.48%. Because experiment 2 limits the number of order-preserving instances, the total number of extracted instances was reduced by 60% compared with experiment 1, leaving insufficient data for maximum entropy training, so its test accuracy is only 85.38%. Considering that when the number of instances is reduced, the amount of feature data generated by each single instance needs to be increased, experiment 3 adds the first-word feature and combination feature to each instance, and the test accuracy reaches 91.39%. However, adjacent source language phrases do not imply that the target language phrases are adjacent, so experiment 4 introduces the third category, namely, nonadjacent target language phrases. The test accuracy of experiment 4 dropped to 75.38%, because a new category also increases the uncertainty of the maximum entropy reordering model's judgments. Experiment 5 builds on experiment 4 and expands unaligned words to increase the number of examples, but its result is slightly lower than that of experiment 4. Experiments 4 and 5 are both based on experiment 3, and the introduction of the third category causes a large decrease in test accuracy, which shows to a certain extent that introducing the third category does not improve the judgment accuracy of the maximum entropy model. Therefore, this paper designs experiment 6, which expands the unaligned words on the basis of experiment 3 and then adds punctuation features on that basis. The test accuracy of these experiments is only slightly lower than that of experiment 1.
This paper pays particular attention to how the feature extraction strategy affects the accuracy with which the maximum entropy model judges reverse-order instances. Figure 4 shows the test accuracy of the maximum entropy reordering model on the order-preserving subset and the reverse-order subset (Invert) of the test set. On the subset of order-preserving instances, except for experiments 4 and 5, where the introduction of the new classification increases the uncertainty in judging order-preserving features, the test accuracy of experiments 2, 3, and 6 differs from that of experiment 1 by no more than 4%. On the subset of reverse-order instances, the test accuracy of experiment 2 is lower because its amount of training data for reverse-order features is small, while the test accuracies of experiments 3, 4, 5, and 6 are all better than that of experiment 1. Among them, the test accuracy of experiment 6 is 6% higher than that of experiment 1. These experimental data show that the maximum entropy reordering model feature extraction algorithm proposed in this paper resolves the inaccurate judgment of reverse-order features caused by the imbalance of feature data.

Comparison of Translation Results.
The case-sensitive BLEU value was tested on NIST-MT 05. Figure 5 shows the impact of the 6 groups of maximum entropy reordering models trained with different feature data on the final translation results. The BLEU value of baseline system experiment 1 is 0.2283. As can be seen from Figure 5, in experiment 2 the performance of the maximum entropy reordering model drops sharply during translation because there is too little feature training data. Experiments 3, 4, 5, and 6 all add feature information on the basis of experiment 2, and while still limiting the number of order-preserving instances, their reordering models outperform the baseline system. In experiment 4, the nonadjacent classification reduces translation performance, but the BLEU value is still higher than that of the baseline system. Experiment 6 adds punctuation features, and its BLEU value reaches the highest value, 0.243. The reordering instance extraction and feature extraction algorithms proposed in this paper can significantly improve the performance of the reordering model and improve translation quality by limiting the number of order-preserving instances and increasing the number of features.

Misclassification Experiment
The feature function of the maximum entropy classifier is a feature vector that takes context into account; that is, in addition to each current feature variable, it also considers the variables before and after it. The experimental design is as follows: (1) perform classification experiments on 3 typical word posterior probability features and compare and analyze their performance; (2) perform maximum entropy model classification experiments on individual linguistic features and analyze them; (3) combine the three typical word posterior probability features with the linguistic features, perform classification experiments, and compare and analyze the results.
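The context-aware feature vector described above can be sketched as follows. The window size and template names (`w0`, `w-1`, `w+1`) are illustrative conventions, with sentence-boundary padding assumed:

```python
def context_features(words, i, window=1):
    """Build the feature vector for word i, including its left and
    right neighbours within `window` positions, padding with <s>
    and </s> at the sentence boundaries."""
    feats = [f"w0={words[i]}"]
    for k in range(1, window + 1):
        feats.append(f"w-{k}={words[i - k] if i - k >= 0 else '<s>'}")
        feats.append(f"w+{k}={words[i + k] if i + k < len(words) else '</s>'}")
    return feats

f = context_features(["the", "cat", "sat"], 0)
assert f == ["w0=the", "w-1=<s>", "w+1=cat"]
```

Each word's vector then feeds the maximum entropy classifier, so the prediction for a word depends on its neighbours as well as the word itself.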

Classification Experiment Based on Word Posterior Probability Features.
Table 3 shows the classification experiment results based on the 3 typical word posterior probability features. In Table 3, Dir represents the word posterior probability feature based on a fixed position, Win represents the word posterior probability feature based on a sliding window (with sliding window t = 2), and Lev represents the word posterior probability feature based on Levenshtein alignment. When aligning the 1-best translation hypothesis in the N-best list with the other translation hypotheses, the open-source toolkit TER [13] is used with its "shift" function turned off, which reduces it to WER alignment. The three posterior probabilities above are discretized before use [10].
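The difference between the fixed-position (Dir) and sliding-window (Win) posteriors can be sketched as follows, assuming uniform hypothesis weights for simplicity (real systems weight hypotheses by their translation scores):

```python
def wpp_fixed(nbest, i, word):
    """Dir: word posterior probability at a fixed position i, i.e.
    the fraction of hypotheses whose i-th word equals `word`."""
    hits = sum(1 for hyp in nbest if i < len(hyp) and hyp[i] == word)
    return hits / len(nbest)

def wpp_window(nbest, i, word, t=2):
    """Win: the word may occur anywhere in positions [i-t, i+t],
    tolerating local reordering across hypotheses."""
    hits = sum(1 for hyp in nbest
               if word in hyp[max(0, i - t): i + t + 1])
    return hits / len(nbest)

nbest = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
assert wpp_fixed(nbest, 0, "a") == 2 / 3
assert wpp_window(nbest, 0, "a", t=2) == 1.0
```

In this toy N-best list the word "a" appears at position 0 in only two of three hypotheses, but within the window in all three, which is precisely the extra alignment flexibility credited to Win in the analysis below.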
It can be seen from Figure 6 that, in terms of CER, the features Dir, Win, and Lev reduce the error rate relative to the baseline system by 2.34%, 3.97%, and 3.42% (relative values), respectively, with Win performing best. Analyzing these results, we can conclude the following: (1) the Win feature replaces the fixed position with a sliding window, which gives higher alignment flexibility, so it better matches the reordering phenomena caused by the different word orders of the source and target languages, although the sliding window is limited to local reordering; (2) the Lev feature is based on Levenshtein alignment, so its alignment is better, but it also introduces many editing operations, such as insertion, deletion, and replacement, and because it ignores word order, its alignment is better than Dir while its flexibility is lower than Win. From this analysis and the data, combining CER and the F value, the feature Win has the best overall performance.

Classification Experiment Based on Linguistic Features.
Table 4 shows the error detection results based on linguistic features, namely, word identity (Word), part-of-speech tagging (POS), and syntactic relationship (Link). Compared with the baseline system, as shown in Figure 7, Word, POS, and Link reduce the CER by 5.36%, 4.98%, and 1.72% (relative values), respectively, among which Word performs best. In terms of F value, Link performs better than the other two features, and POS is better than Word. Analyzing these results and comparing them with Table 3, we can conclude the following: (1) except for the Link feature, the classification error rates of the linguistic features Word and POS are lower than those of the 3 word posterior probability features; (2) the Link feature has the highest recall rate and the lowest accuracy rate, mainly because the number of Link features is relatively small; when classifying, the classifier therefore tends to mark the target word as category i, so the number of words assigned to the other category is relatively small, which yields a high recall rate and a low accuracy rate; (3) the classification result of the Word feature is better than that of the POS feature. The reason may be that the development set and the test set are closely related (both from the news domain) and that the number of Word features far exceeds the number of POS features, so its tendency (or probability) to predict the target word as category i is lower than that of POS, which has relatively few features, resulting in a lower recall rate but better accuracy.

Combination Feature Classification Experiment.
In classification tasks in natural language processing research, feature combination can often reduce the classification error rate more effectively. Table 5 lists the classification experiment results of the maximum entropy model after combining the three typical word posterior probability features described in this paper with the three linguistic features. It can be seen from Figure 8 that, in terms of CER, compared with the baseline system, the CER of the three combined features is reduced by 13.14%, 14.25%, and 13.92% (relative values), respectively, and the F value is also significantly improved. Although the differences in classification performance among the three feature combinations are small, their classification characteristics are consistent with those of the single WPP features; that is, the combination "Win + Word + POS + Link" has the lowest classification error rate and the combination "Dir + Word + POS + Link" has the highest F value, indicating that the word posterior probability feature based on the sliding window position can capture more contextual information, so its ability to distinguish translation errors is stronger than that of the word posterior probability feature based on a fixed position. This ability is manifested not only in the comparison of individual features but also in the combined features. While comparing the combined effects of the three different WPP features, Table 5 also reveals the contribution of the linguistic features to error detection, indicating that linguistic features can effectively reduce the classification error rate and improve the ability to predict errors.
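A combined feature vector such as "Win + Word + POS + Link" simply concatenates the discretized WPP feature with the three linguistic features for each target word. A minimal sketch, with illustrative template names and an assumed discretization bin as input:

```python
def combined_features(wpp_bin, word, pos, link):
    """One combined feature vector for a target word: a discretized
    word posterior probability bin (e.g. from the Win feature) plus
    the Word, POS, and Link linguistic features."""
    return [f"wpp={wpp_bin}", f"word={word}", f"pos={pos}", f"link={link}"]

v = combined_features(7, "bank", "NN", "obj")
assert v == ["wpp=7", "word=bank", "pos=NN", "link=obj"]
```

Feeding such concatenated vectors to the maximum entropy classifier lets it weight the WPP evidence and the linguistic evidence jointly, which is where the 13-14% relative CER reductions above come from.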

Conclusion
This paper proposes a new reordering instance extraction algorithm and adds new features on this basis to achieve better translation results. First, the problem of data imbalance in the maximum entropy training process is addressed directly by limiting the number of order-preserving instances, although this reduces translation performance because too little feature information remains; on this basis, adding first-word features and combination features improves translation performance. Second, a third type of phrase combination order is introduced, namely, the nonadjacent case beyond order preserving and reverse ordering; although the BLEU value decreases, it is still higher than that of the baseline system. Finally, this paper expands the unaligned words in aligned phrases in the experiments, increasing the amount of reordering instance feature data and achieving the best translation performance. In future work, we will continue to study the impact of reordering instance features on translation performance, focusing on the integration of syntactic knowledge features, hoping to further improve translation performance. In addition, we will further explore improving the decoder based on the bracket transcription grammar framework so that it can handle the situation where the source language phrases are adjacent but the target language phrases are not.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The author declares that there are no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.