Design of New Word Retrieval Algorithm for Chinese-English Bilingual Parallel Corpus



Introduction
In the information age, a large amount of digital information has appeared, and text is its most common and basic form. To find what they need in this mass of text, people need efficient retrieval tools, and how to store and query unstructured data is a key research topic. In the 1990s, the demand for information retrieval grew steadily: instead of being satisfied with retrieval in a single language, users expected retrieval results to include multilingual information. With the continuous development of the Internet, the number and types of online information resources are becoming more abundant, and their languages are diverse and unevenly distributed. The number of people using the Internet is growing, and their language skills differ. Language barriers therefore emerge in networked information retrieval as a result of the diversity of languages of network resources and the differences in language mastery of network users, which hampers users in non-English-speaking countries in utilizing network information. Therefore, research on the design of English-Chinese cross-language information retrieval is of great significance.
With the spread of the Internet around the world, it provides people with more and more electronic information. As the amount of information grows rapidly and international exchanges become more frequent, web pages are often published in bilingual form to meet the needs of people in different countries [1]. Although such bilingual websites partly solve the problem of bilingual communication, language barriers remain obvious, so the potential demand for machine translation technology is increasing, and translation studies between different languages are becoming more urgent [2]. This places more practical requirements on machine translation on the one hand and higher requirements on mass text translation on the other. Increasingly, a new means is needed for effective data acquisition and integration of professional and technical information [3]. The continuous progress and maturity of machine translation technology will greatly raise the information level of the whole society and produce substantial social and economic benefits in many fields and applications. Even though a large number of bilingual corpora are available on the Internet, the majority are concentrated on international conferences or religious texts [4], which limits machine translation research and leaves the available data unbalanced.
Existing corpora mostly pair the official languages of a country or organization, such as English and French or English and Chinese, so bilingual corpora are unbalanced: most of them cover official subject matter, which constrains corpus research, and even a thoroughly studied corpus cannot be applied flexibly outside its subject area. Bilingual parallel corpora in professional fields are even scarcer, which greatly increases the difficulty of international academic communication. Researchers in various countries have to communicate through English as a third-party language, which prolongs academic exchange. Moreover, it is difficult to obtain such official data even for some very common languages. A large-scale corpus is the basis of machine translation research; as the saying goes, "the more data, the better" [5]. Because the lack of parallel corpora in professional fields stems from this imbalance, how to expand professional corpora is now a research topic for many scholars. To advance the expansion of professional corpora, the problem to be solved is the alignment of sentences in English-Chinese parallel corpora so that the corpora can better promote academic communication at home and abroad. This paper takes this problem as its starting point. Bilingual corpora are applied not only in machine translation but also in Chinese and English information processing, the compilation of Chinese-English bilingual dictionaries, and research on cross-language information retrieval [6].
Corpus construction has had some achievements, such as the Canadian parliament corpus (Canadian Hansards). Because Canada has two official languages and the content is the record of parliamentary debates, most of this material is bilingual. As a result, research on English-French sentence alignment has made qualitative progress [7]. What is needed, however, is a balanced corpus: not political news and the like, but a specialized bilingual corpus.
There are many professional bilingual corpora on the Internet, but most of them are paragraph aligned, and few are aligned at the sentence level for English and Chinese, which is a great obstacle to machine translation research. Once English-Chinese paragraph-aligned corpora have been found, the problem of unknown terms must be handled in order to align professional English-Chinese bilingual sentences [5]. As general concept words in a specific professional field, unknown professional words are highly specialized. They convey the knowledge of professional literature, especially in complex technical fields, and have a decisive impact on translation efficiency and quality. Professional vocabularies embody the core knowledge of a discipline, so they reflect changes in professional terms and the development of the discipline to a certain extent and also affect the international communication of disciplines. The First Workshop on Computational Terminology was organized at Coling-ACL '98, an international conference on computational linguistics, in 1998, where the term "Computational Terminology" was used for the first time. Later, the study of unknown words became an important topic in machine translation and information processing [8]. Unknown words in bilingual professional texts are of great practical significance to natural language processing topics such as machine translation, information extraction, data mining, information retrieval, and the establishment of domain concept systems, and to understanding the development status and future direction of a discipline [9].
Therefore, if these unknown words are not analyzed, understood, and recorded in a timely manner, the dissemination of scientific and technological information among international scholars is bound to suffer, which makes effective information harder to acquire and becomes an obstacle to integration between Chinese and international academic circles. Moreover, given today's explosion of Internet information, traditional manual translation of professional unknown words is far from meeting the actual needs of Chinese-English translation [10]. Extracting unknown words from Chinese-English bilingual sentences with information technologies such as computers and the web has become an inevitable trend. Therefore, how to efficiently solve the problem of sentence alignment in English-Chinese parallel corpora is the focus of this paper.
If the bilingual sentences with professional unknown words obtained in the experiments are entered into a translation system, professional translation can become more accurate, and researchers in the field no longer have to work through the literature in English. In this way, international academic exchange can also be accelerated. The professional words obtained can also be included in dictionaries, which better serves practitioners in the field and eases the compilation of professional dictionaries. Compared with traditional translation mining, aligning translations at scale is faster and yields a large number of scientific English unknown words. These aligned sentences containing professional unknown words can better serve machine translation and contribute to research in machine learning, natural language processing, and other fields. The rest of the paper is organized as follows. Section 1 contains the introduction. Section 2 discusses the research status at home and abroad as well as the length-based approach, the lexicon-based approach, the hybrid approach, and the SVM-based alignment method. Section 3 discusses aligning English and Chinese bilingual sentences containing unknown words. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

Research Status at Home and Abroad
At present, more and more researchers at home and abroad are paying attention to the construction and application of bilingual corpora [11]. Large-scale corpora have emerged around the world, such as the Canadian Hansards corpus mentioned previously. Early research on bilingual corpora was carried out on this corpus because, at that initial stage, no other relevant corpus was available [12]. For Chinese-English, more and more bilingual corpora have emerged in recent years, such as the bilingual corpus based on Hong Kong Laws, whose content is the English-Chinese translation of Hong Kong legislation and other legal provisions. There are also English-Chinese bilingual corpora based on the Hong Kong Hansards, the minutes of the Legislative Council, which were the corpora used by scholars in that period. Many types of corpora can be extracted in Hong Kong partly because of its history of British influence and the wide use of English there. Because of the limited corpora, domestic scholars started corpus research relatively late, but in recent years some have begun to study bilingual corpora. Interest has grown with the increase in international communication, these studies have made considerable progress, and the resulting bilingual corpora have laid a solid foundation for later research. For example, the Institute of Computational Linguistics of Peking University developed the Babble Chinese-English bilingual corpus, which serves machine translation in the field of journalism. Even so, this corpus is one of the largest at the present stage.
Other research institutions have also established large-scale corpora, such as Tsinghua University, Northeastern University, the Institute of Software of the Chinese Academy of Sciences, Peking University, and Nanjing Normal University [13][14][15]. As corpus construction research has developed, most scholars now focus on the alignment and annotation of existing corpora, and relatively few focus on the initial construction of corpora [16]. Because the content of existing corpora is out of date, machine translation research that draws knowledge from them cannot reflect the latest translation practice. In addition, the bias of these corpora towards law or politics limits their practical application to a great extent. In the initial stage, acquiring corpora online requires manual collection and judgment, which further limits the rapid development, diversity, and timeliness of corpora [17]. Therefore, it is of great significance to develop a professional, efficient, and large-scale construction scheme for corpora [18]. The core of bilingual text processing is alignment technology. Alignment has been explained in different ways, but most definitions take it to mean identifying pairs of segments in different languages that are translations of each other. The alignment of a bilingual corpus can generally be divided into paragraph alignment, sentence alignment, and word alignment. Some corpora are aligned after being acquired from the web, and these methods have achieved certain results; among the better-known systems are the PTMiner system [19] developed by Nie Jianyun from the University of Montreal in Canada and the STRAND system [20] developed by Philip Resnik from the University of Maryland in the United States.
These systems mine mainly according to document structure, which shows that this approach can find corresponding information between web pages in two different languages. A common feature of the two systems is that they use some obvious web page information as heuristics and then either conduct experiments on the information in these bilingual websites or use characteristics of the uniform resource locator (URL) and other features to select relevant websites. When using URL features, however, the former system uses simple document prefixes and suffixes, while the latter uses regular expressions. Another mining method is content-based: bilingual web pages are collected, the bilingual text of the collected pages is aligned with a bilingual dictionary, and a similarity is then calculated to determine whether the two documents are translations of each other. A representative system of this method is BITS (Bilingual Internet Text Search, Ma and Liberman 1999) [21]. BITS takes a systematic approach to bilingual alignment; because it uses some basic linguistic knowledge, it improves alignment quality to a certain extent, and it also suggests that the characteristics and conventions of the two languages need to be considered during alignment. However, there is relatively little such research in China. Researchers from the City University of Hong Kong obtained a bilingual corpus of Hong Kong news from bilingual websites.

Mathematical Problems in Engineering
The research on these systems and methods indicates, to some extent, how the construction of bilingual corpora has developed [22]. Domestic and foreign research has achieved preliminary results, and the reported alignment quality is good. However, there are still some problems with the bilingual corpora obtained: few open source resources are available for download, and most systems are still at the research stage and not yet useful in practice. Therefore, if a system at the experimental stage is applied in practice, the translation quality of its bilingual sentence pairs may not be ideal [23]. Most of the acquired corpora are in the general domain, and few are in professional fields.
Therefore, because of the imbalance of the corpora, the academic community attaches great importance to research on the alignment of sentences containing unknown terms. During professional bilingual sentence alignment, it is inevitable to encounter unknown words. Generally, unknown words fall into the following types: morphologically derived words (MDWs), fixed expressions (factoids), named entities (NEs), and new words (NWs) [24]. The unknown words involved in this paper generally have academic characteristics, so how to overcome the difficulties caused by professional unknown words in English-Chinese bilingual sentence alignment is one of the focuses of this paper.
As early as the early 1990s, some foreign researchers began to work in this field, mainly Brown [25], Gale [26], Chen [27], and others. The methods they used can be divided into two categories: one is based on length and the other on words. When Brown studied corpora, he introduced a new concept, the anchor point. In his view, the role of anchor points is to divide a large corpus into relatively small fragments, and the corresponding sentence groups within them are called beads. Liu Xin [28], Qian Liping [29], and others have also made corresponding algorithm improvements in domestic sentence alignment studies. However, most current scholars conduct research on existing paragraph-aligned bilingual text, which causes certain difficulties for sentence-level machine translation [30,31], and most corresponding bilingual web pages on the Internet are likewise paragraph aligned, with few aligned at the sentence level. Nevertheless, paragraph alignment is the first step towards sentence alignment, which is why a paragraph-aligned bilingual corpus is needed [32]. The bilingual corpora mentioned above are aligned at the paragraph, sentence, or word level. Paragraph alignment can be found in most web pages and is relatively simple, since a simple word-frequency calculation is enough to perform it. The results of word alignment are mainly used in the compilation of dictionaries. In machine translation research, sentence alignment is often used to further analyze the relationships between words in sentences, and this information can also be used in natural language processing and other research. This paper studies sentence-level alignment. Before studying sentence alignment, the current mainstream methods should be understood, since improvements can only be built on previous research results.
Currently, the mainstream sentence alignment methods can be basically divided into four categories.

Length-Based Approach.
The length-based approach was first proposed by Brown and Gale [33]. Sentence length is an intuitive property of a sentence, so the initial studies were based on it. If the lengths of two sentences satisfy the correlation set by the experiment, the two sentences in different languages are taken to be translations of each other. The length-based method was shown to be feasible by experiments on the original corpus, namely, the English-French bilingual corpus of Canadian parliamentary minutes, and some Chinese researchers have also applied it to English-Chinese corpora. For example, researchers from Harbin Institute of Technology and Tsinghua University applied the length-based alignment method to English-Chinese sentence-level alignment of the Microsoft NT 3.5 Server installation instructions and of relevant legal documents, respectively, and obtained satisfactory experimental results [34]. In the initial stage, when corpora were simple and research on sentence alignment had just begun, most scholars used the length-based method.
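The length test described above can be illustrated with a minimal sketch. It is not the statistical model of the cited work: the character-based length measure, the function names, and the ratio bounds `low` and `high` are assumptions chosen only for illustration, and the Chinese sentences are hypothetical examples.

```python
def char_length(sentence: str) -> int:
    """Count the characters of a sentence, ignoring whitespace."""
    return sum(1 for ch in sentence if not ch.isspace())


def plausible_pair(en: str, zh: str, low: float = 0.8, high: float = 4.0) -> bool:
    """Return True when the English/Chinese length ratio lies in [low, high].

    The bounds are hypothetical; in practice they would be estimated
    from a manually aligned sample of the corpus.
    """
    en_len, zh_len = char_length(en), char_length(zh)
    if en_len == 0 or zh_len == 0:
        return False
    return low <= en_len / zh_len <= high


if __name__ == "__main__":
    # Ratio 11 / 8: accepted as a candidate translation pair.
    print(plausible_pair("Is this a pen?", "这是一支钢笔吗？"))
    # Ratio 4 / 20: rejected because the lengths are too different.
    print(plausible_pair("Yes.", "这是一个非常非常长的中文句子，内容很多。"))
```

A filter of this kind only rules out implausible candidates; it cannot by itself confirm that two sentences of compatible length are translations of each other.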

A Lexicon-Based Approach.
In the lexicon-based approach, Kay and Chen [27] used the distribution information of bilingual words and the lexical translation model, respectively, in English-German and English-French sentence alignment. In literature [9], English-Chinese bilingual dictionaries were directly used in sentence alignment of college English textbooks, which also achieved good results.

Hybrid Approach.
A hybrid approach is a combination of a length-based approach and a lexicon-based approach. The two methods above each have advantages and disadvantages. The length-based model is very simple, so errors propagate easily and cannot be suppressed well, and its robustness is poor. When real data violate the assumptions of the length model, the method cannot be expected to work well, which can lead to errors in later research. Because the lexicon-based method uses a dictionary, its results are relatively accurate and reliable, but the calculation is quite complicated and consumes considerable manpower and material resources, which limits experimental speed. Scholars therefore combine the two methods for sentence alignment: an initial alignment is produced with the length-based approach and then refined with the lexicon-based approach.

SVM-Based Alignment Method.
The maximum entropy model is the theoretical basis of the maximum entropy classifier. Its core idea is to build a bilingual sentence alignment model from all known sentence information while making no assumptions about unknown information, so that the outcomes are not biased. A basic SVM, by contrast, handles only two-class problems and is not directly suited to multiclass problems. This suits sentence alignment, because the result of evaluating two sentences against the alignment criteria is either "aligned" or "unaligned," which is a two-class problem, so applying SVM to alignment is also a trend of current research.

Align English and Chinese Bilingual Sentences Containing Unknown Words
Since most of the bilingual unknown words in this paper are professional terms, and the dictionary used in the experiment cannot contain a large number of them, keywords need to be extracted from the prepared corpus and added to the dictionary. Then, the abstract part is extracted from the corpus, and the stems of the English words are extracted. Finally, the paragraph-aligned text is handed to the bilingual sentence alignment system for processing, which yields Chinese-English sentence-aligned text containing unknown words.

Bilingual Sentence Alignment System Based on Length.
We chose the length-based alignment approach. Some parameters are then adjusted in the experimental part to achieve satisfactory results. The following is a concrete example of how this paper makes use of English and Chinese sentence length for alignment. Example 1: English: Is this a pen? Chinese: Is this a pen?
At a certain point in the bilingual sentence alignment experiment, when a sentence pair needs to be aligned, the processing is as follows: first, punctuation marks are removed; then, following the order of the words in the English sentence, the longest dictionary sense of each word is searched for in the Chinese sentence. For example, "This" has many senses, but the longest one that can be matched in this Chinese sentence is "这," so its match is "这." The alignment effect is shown in Figure 1.
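The longest-sense lookup described above can be sketched as follows. The dictionary `TOY_DICT`, the function names, and the Chinese sentence used in the demonstration are hypothetical; a real system would load a full bilingual dictionary extended with the extracted professional keywords.

```python
import string

# Hypothetical toy dictionary: English word -> candidate Chinese senses.
TOY_DICT = {
    "this": ["这", "这个", "本"],
    "is": ["是"],
    "pen": ["笔", "钢笔"],
}


def longest_match(word: str, zh_sentence: str, dictionary: dict) -> str:
    """Return the longest dictionary sense of `word` that occurs in the
    Chinese sentence, or an empty string when no sense occurs."""
    candidates = dictionary.get(word.lower(), [])
    matched = [sense for sense in candidates if sense in zh_sentence]
    return max(matched, key=len) if matched else ""


def match_sentence(en_sentence: str, zh_sentence: str, dictionary: dict) -> list:
    """Strip ASCII punctuation, then match each English word in order."""
    cleaned = en_sentence.translate(str.maketrans("", "", string.punctuation))
    return [(w, longest_match(w, zh_sentence, dictionary)) for w in cleaned.split()]


if __name__ == "__main__":
    for word, sense in match_sentence("Is this a pen?", "这是钢笔吗", TOY_DICT):
        print(word, "->", sense or "(no match)")
```

Choosing the longest matching sense prefers "钢笔" over "笔" when both occur, which covers more of the Chinese sentence per English word.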

Stem Extraction.
The same English word takes different forms for different parts of speech and tenses; for example, "run" is the base form of the verb and "ran" is its past tense, and both are translated in the same way. Therefore, stemming can greatly improve the accuracy and speed of the system. Table 1 is used as an example.
In the example, "application" and "applicable" are reduced to a common stem after stem extraction. Stem extraction takes out the main stem of a word so that the various forms of the same word need not be retained; only the stem is kept. Stem extraction greatly reduces the space occupied by dictionary entries and by the program and improves the accuracy and efficiency of the system.
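The effect of stem extraction can be illustrated with a deliberately naive suffix stripper. This is only a sketch: the suffix list and the minimum stem length are assumptions, it is not the stemmer used in the experiments, and irregular forms such as "ran" would need a lookup table that is not shown here.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ation", "able", "ing", "ed", "es", "s")


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Lowercase the word and strip the first matching suffix, provided
    at least `min_stem` characters remain."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower


if __name__ == "__main__":
    sentence = "Image registration is an applicable application"
    print([naive_stem(w) for w in sentence.split()])
```

With this rule set, "applicable" and "application" collapse to the same stem, so a single dictionary entry can serve both forms.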

Bilingual Text Clause.
The data used in the sentence alignment experiment in this system come from the "abstract" part of the bilingual corpus, which has no complex grammatical forms and relatively simple sentence patterns. A manual search of the corpus found no "!" or "?" symbols, so it is sufficient to split clauses on ";" and "." only. Manual inspection showed that only 1 of the 400 clauses produced by the program was wrong, an error rate of 0.25%, which meets the experimental requirements.
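The clause splitting rule can be sketched with a regular expression. The inclusion of the full-width terminators "；" and "。" for the Chinese side is an assumption, as are the names used; note that this simple rule would also split inside decimal numbers such as "0.25".

```python
import re

# A clause is a run of non-terminator characters followed by an optional
# terminator. Full-width "；" and "。" are assumed for the Chinese side.
_CLAUSE = re.compile(r"[^;.；。]+[;.；。]?")


def split_clauses(text: str) -> list:
    """Split text into clauses ending in ';' or '.' (or full-width forms)."""
    return [m.strip() for m in _CLAUSE.findall(text) if m.strip()]


if __name__ == "__main__":
    print(split_clauses("We propose a method; it is fast. Results are good."))
    print(split_clauses("本文提出一种方法；实验结果良好。"))
```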

Sentence Alignment Dynamic Programming Algorithm.
The method of sentence alignment requires an appropriate evaluation function to measure how well a sequence of bilingual sentence pairs corresponds as mutual translations. If the evaluation function is poor, the result will be unsatisfactory even if the corresponding bilingual sentence pairs are found. The evaluation value designed in this paper is the algebraic sum of the evaluation function values of each sentence pair in the sequence and is used to judge whether the sequence is the best bilingual alignment. Assuming that the i-th sentence pair sequence has h sentence pairs A_1, ..., A_h, the evaluation value of the sequence is the sum of the evaluation function values of its sentence pairs, namely,

$$\mathrm{Score}_i = \sum_{j=1}^{h} \mathrm{Score}(A_j). \tag{1}$$

In formula (1), Score(A_j) is the evaluation function of the English-Chinese bilingual sentence pair A_j (1 ≤ j ≤ h), which evaluates how well the two sides match as translations of each other.
This paper adopts a matching method from English words to Chinese sentences, which avoids complex Chinese processing, such as Chinese analysis and part-of-speech tagging, and also speeds up processing. For an English-Chinese bilingual sentence pair A, n and m represent the number of Chinese sentences and English sentences contained in the pair, respectively, and p represents the number of English words contained in the pair. The specific matching algorithm is as follows.
Find the translation of each English word e_j in the corresponding Chinese sentence, and take the longest matching translation as the matching length of e_j in the Chinese sentence. After matching, some words will be covered in the English sentence, and their corresponding meanings will be covered in the Chinese sentence. The MaxMatchLen function returns the longest translation of e_j found in the Chinese sentence. The overall evaluation value of the bilingual sentence pair A can be calculated by

$$\mathrm{Score}(A) = -\log(\mathrm{CScore} \times \mathrm{EScore}).$$
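A dynamic programming search over sentence pair sequences can be sketched as below. Since the evaluation value of a sequence is the sum of the scores of its pairs, and a score of the form −log(·) is smaller for better matches, the sketch looks for the sequence with the minimum total. The set of allowed groupings (`BEADS`), the function names, and the stand-in cost `toy_cost` are assumptions for illustration; they do not reproduce the paper's scoring function.

```python
import math

# Assumed groupings (Chinese sentences, English sentences) per pair.
BEADS = ((1, 1), (1, 0), (0, 1), (2, 1), (1, 2))


def align(zh, en, cost):
    """Return the pair sequence with minimum total cost as a list of
    (Chinese sentence numbers, English sentence numbers), 1-based."""
    n, m = len(zh), len(en)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                total = best[i][j] + cost(zh[i:ni], en[j:nj])
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the sequence.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi + 1, i + 1)), list(range(pj + 1, j + 1))))
        i, j = pi, pj
    path.reverse()
    return path


def toy_cost(zh_group, en_group):
    """Hypothetical stand-in for Score(A): 0 when both sides carry the
    same letters, 10 otherwise (including empty groupings)."""
    same = bool(zh_group) and "".join(zh_group) == "".join(en_group)
    return 0.0 if same else 10.0


if __name__ == "__main__":
    # Letters stand in for sentence content; "b" + "c" translate as "bc".
    print(align(["a", "b", "c"], ["a", "bc"], toy_cost))
```

In this toy run, the first sentences pair one to one and Chinese sentences 2 and 3 are grouped with English sentence 2, which corresponds to the "2 3-2" style of grouping used in the result files.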
Given a paragraph-aligned bilingual text, the set Ar is the set of correct sentence alignments obtained by manual alignment. Set A is the set of aligned sentence pairs found by the machine during the experiment. Set A includes the pairs that match the standard answer as well as pairs that do not match it but were judged to be aligned by the machine. The ratio of the number of pairs in set A that match the manually produced standard answer Ar to the number of correct pairs in Ar is the recall rate used in this paper, namely,

$$\mathrm{Recall} = \frac{|A \cap A_r|}{|A_r|}.$$

In other words, the recall rate of bilingual sentence alignment is the ratio of the number of correct machine-aligned bilingual sentence pairs (relative to the manual reference set Ar) to the number of all correct bilingual sentence pairs. A higher recall rate indicates that a greater absolute number of correct bilingual sentence pairs was found in set A. Precision is the ratio of the number of pairs in set A that match the standard answer Ar to the number of all bilingual sentence pairs in set A; it is also known as the alignment accuracy rate of set A relative to set Ar and measures the accuracy of the system. The alignment accuracy of A relative to Ar can be calculated by

$$\mathrm{Precision} = \frac{|A \cap A_r|}{|A|}.$$

The criteria for evaluating an experiment include not only the recall rate and accuracy rate but also their harmonic mean, the F value. The F value integrates the recall rate and accuracy rate and reflects the overall performance of the system, namely,

$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

The F value reflects the recall rate and accuracy rate in a balanced way. It is a single-point index that characterizes the system globally. Generally speaking, it is difficult to improve the accuracy and recall rate of bilingual sentence alignment at the same time: improving one of them often leads to a decline of the other.
An appropriate operating point, neither too strict nor too loose, is usually chosen according to need in order to balance the recall rate and accuracy rate, and the indicator that matters most for the application at hand is the one to improve.
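The three indicators can be computed directly from the two sets. The sketch below assumes that each alignment is represented as a hashable item, here a tuple of Chinese and English sentence numbers; this representation and the function name are illustrative choices.

```python
def evaluate(found, reference):
    """Return (precision, recall, F) of the machine alignments `found`
    (set A) against the manual reference alignments (set Ar)."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


if __name__ == "__main__":
    machine = {((1,), (1,)), ((2, 3), (2,)), ((4,), (3,))}
    manual = {((1,), (1,)), ((2, 3), (2,)), ((4,), (4,)), ((5,), (5,))}
    # Two of three machine pairs are correct; two of four reference
    # pairs are found.
    print(evaluate(machine, manual))
```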

Experimental Results
Firstly, the bilingual sentence alignment system is used to align the obtained corpus 1. Each line of the result has the form "c1 ... cn - e1 ... em". The left side of "-" lists the Chinese sentence numbers, and the right side lists the English sentence numbers. For example, "2 3-3" means that Chinese sentences nos. 2 and 3 and English sentence no. 3 are translations of each other. The sentences are then combined according to their numbers: for "2 3-3", Chinese sentences nos. 2 and 3 are put together on the first line and English sentence no. 3 on the second line. If the result is "3 -," the second line is empty. The sentences corresponding to the numbers are extracted and put together in a file to form file 1. The manually aligned result is converted to the same form and stored in file 2. Finally, file 1 and file 2 are compared, and the results are obtained, as shown in Table 2.
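The result format can be parsed and rendered with a few lines of code. The function names are hypothetical, and the separators used when joining several sentences on one line (no separator for Chinese, a space for English) are assumptions.

```python
def parse_alignment(line: str):
    """Parse a result line such as "2 3-3" into ([2, 3], [3]).
    Commas are tolerated as separators; an empty side gives []."""
    zh_part, _, en_part = line.partition("-")
    zh_ids = [int(tok) for tok in zh_part.replace(",", " ").split()]
    en_ids = [int(tok) for tok in en_part.replace(",", " ").split()]
    return zh_ids, en_ids


def render(line: str, zh_sentences, en_sentences) -> str:
    """Build the two-line record: Chinese sentences on the first line,
    English sentences on the second (empty when there are none)."""
    zh_ids, en_ids = parse_alignment(line)
    zh_text = "".join(zh_sentences[i - 1] for i in zh_ids)
    en_text = " ".join(en_sentences[i - 1] for i in en_ids)
    return zh_text + "\n" + en_text


if __name__ == "__main__":
    zh = ["一。", "二。", "三。"]
    en = ["One.", "Two.", "Three."]
    print(render("2 3-3", zh, en))
```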
The system has a total of 4 parameters. The system does not adjust all parameters one by one but only adjusts PARA1 and carries out the experiment, and the results are shown (the higher the recall rate and accuracy are, the better they are). Note: as parameter PARA1 increases, the difference between the weights corresponding to different alignments becomes larger, so the larger PARA1 is, the worse the effect becomes. The use of GIZA++ improved the quality of the dictionaries and, as a result, the performance of the sentence alignment system. This demonstrates that the word alignment of GIZA++ has a positive effect and can be used to extract dictionaries in future tests.

Table 1: Example of stem extraction.
The original sentence: Image registration is an applicable application
After the stem is extracted: image registration is an applicable application
For current information retrieval systems, retrieval performance is measured by the accuracy rate and recall rate. In the retrieval process, the following evaluation method is used: after the same query has been run on multiple retrieval systems, the top 100 most relevant documents returned by each are combined, and relevance is judged manually on the pooled document set. This method reduces the evaluation workload and improves the accuracy of evaluation. On the English training corpus, the best result is an average accuracy of 0.3869. In the test with the Chinese query set and the English corpus, apart from the training part, the automatic query mode builds the index by word segmentation, while in the monolingual setting the index is built by the n-gram segmentation method. Figure 2 shows the test results, and Table 3 compares the run results with the average and median values for Chinese-English cross-language information retrieval. The comparison shows that C-ECLIR1 has the best performance in Chinese-English CLIR.
At present, the established cross-language information retrieval system has begun to take shape. The results show that the system performance of the query translator and the Chinese search engine meets the requirements.
Because the sentence alignment work in this paper is aimed at web page text, there is no authoritative data set for verification. Therefore, the data adopted in this paper are 100 pairs of web pages randomly selected from the obtained bilingual web pages, on which the sentence alignment methods are run. Finally, the correctness of each alignment is verified manually. Three methods are compared in the experiment. The direct document object model (DOM) tree alignment method aligns the DOM trees and then aligns lines, while the lexical information-based method follows the implementation in the literature. A dictionary containing 58,000 sets of lexical information is used in this paper. The method proposed in this paper first extracts anchor points from web pages for rough alignment and segmentation and then uses the lexical information-based method to align the text between anchor points. Precision, recall, and F1 values are used as evaluation indicators.
As can be seen from Table 4, the alignment method using only DOM tree information is the least effective. The reason is that, on some bilingual websites, the DOM trees parsed from the hypertext markup language (HTML) source of the two language versions are not completely aligned. In this case, direct line alignment easily leads to error propagation. When most web pages are designed in accordance with the rules of web design, the DOM tree alignment method still achieves an accuracy rate of 86.1%. In this experiment, although the method in the reference did not achieve the best effect on the open data set, it still achieved a high accuracy and recall rate. With the addition of HTML anchor points, the proposed method achieves satisfactory results in this experiment.

Conclusion
The research on new words in professional English-Chinese parallel corpora addresses the fact that most existing bilingual text is aligned only at the paragraph level. The goal is to use a bilingual sentence alignment system to go from paragraph alignment to precise sentence alignment, which is one of the hot spots in bilingual alignment research. In this paper, we set up a length-based sentence alignment system for English-Chinese parallel corpora containing unknown words and study bilingual sentence alignment in depth. Various possible sentence alignment scenarios are evaluated, and routines are constructed to establish different weights for each alignment condition. The processing flow and alignment calculation method of the bilingual alignment system are then designed, and the system is implemented. Finally, the evaluation indices of the bilingual alignment system are introduced, and a comparative experiment is designed to check the experimental performance. The final results show that the recall rate and accuracy are satisfactory and that the system performance is good.

Data Availability
The data used to support the findings of this study are available from the author upon request.

Conflicts of Interest
The author declares that he has no conflicts of interest.