A Multimodal Retrieval and Ranking Method for Scientific Documents Based on HFS and XLNet

To address the shortcomings of traditional full-text retrieval models in handling mathematical expressions, which are special objects distinct from ordinary text, a multimodal retrieval and ranking method for scientific documents based on hesitant fuzzy sets (HFS) and XLNet is proposed. This method integrates multimodal information, such as mathematical expression images and context text, as keywords to realize the retrieval of scientific documents. In the image modality, the images of mathematical expressions are recognized, and hesitant fuzzy set theory is introduced to calculate the hesitant fuzzy similarity between the query expressions and the mathematical expressions in candidate scientific documents. Meanwhile, in the text modality, XLNet is used to generate word vectors of the mathematical expression context to obtain the similarity between the query text and the mathematical expression context of the candidate scientific documents. Finally, the multimodal evaluations are integrated, and a hesitant fuzzy set is constructed at the document level to obtain the final scores of the scientific documents and the corresponding ranked output. The experimental results show that the recall and precision of this method are 0.774 and 0.663, respectively, on the NTCIR dataset, and the average normalized discounted cumulative gain (NDCG) of the top-10 ranking results is 0.880 on the Chinese scientific document (CSD) dataset.


Introduction
Scientific literature retrieval and ranking is an important way for workers to obtain scientific and technological information. As an important part of scientific documents, mathematical expressions and contextual texts with mathematical semantics are the primary basis for scientific document retrieval and ranking. However, the traditional full-text retrieval model, which is designed for one-dimensional text, is not effective when facing the two-dimensional structure of mathematical expressions. Research on mathematical expression retrieval and ranking has made some progress, and methods and prototype systems [1][2][3][4][5][6] with mathematical retrieval functions have been proposed.
In terms of mathematical expression retrieval, Wiki-Mirs3.0 [7] constructed a hybrid index composed of a formula index and a context index to enable more comprehensive use of mathematical information. In addition, the importance of each formula in the document is calculated so that formulas can be distinguished. Zhang and Youssef [8] proposed a multidimensional similarity index based on a vector model to determine and evaluate five factors: system distance, data type level, matching depth, query coverage, and whether it is a formula. According to these five factors, the similarity between the query expression and the matching expression parsed by MathML can be calculated.
In the research of mathematical expression retrieval and ranking that fuses mathematical expressions with textual information, MIaS [9] used the LRO (Leave Rightmost Out) method to split the original query generated by the combination of keywords and mathematical expressions into subqueries and merged the results using appropriate weighting to obtain results more relevant to the original topic. Zai and Tian [10] used FDS [11,12] to parse the formulas and retrieved relevant documents using the obtained operators. The cosine distance between the input word vectors and the keyword vectors in the documents, both produced by the word embedding model, is calculated to obtain the similarity between the two, which enables a more reasonable and comprehensive retrieval and ranking. The textual information of a mathematical expression is usually contained in the context of the expression. Kristianto [13] proposed the concept of mathematical expression dependency, using rich semantic information to obtain better accuracy and improve the retrieval results of the mathematical search system.
Multimodality refers to any combination of two or more modalities. Piergiovanni and Ryoo [14] proposed a joint multimodal representation space method, using adversarial formulas for unmatched text and video data to improve the joint embedding space. Frome et al. [15] proposed a deep visual semantic embedding model based on the semantic information in the labeled image data and unlabeled text to identify visual objects. Jin et al. [16] proposed a generalized deep multimodal hashing framework for scalable image-text and video-text retrieval that explored feature representation learning, intermodality similarity preserving, intramodality semantic label preserving, and hash function learning with different types of loss functions simultaneously. Shen et al. [17] proposed a novel unsupervised hashing method (multiview discrete hashing) to learn compact hash codes from multiview data. The proposed method jointly learned the hash codes and cluster labels via factorization techniques and spectral analysis. They also developed an efficient alternating algorithm to optimize the proposed model. The generated hash codes not only reflected the underlying semantics from multiple views but also enjoyed high discrimination. Lu et al. [18] proposed an Online Multimodal Hashing with Dynamic Query-adaption (OMHDQ) method that was designed to adaptively preserve the multimodal feature information in hash codes. Moreover, the online module was parameter-free, so it could avoid time-consuming and inaccurate parameter adjustment in the unsupervised query hashing process.
In the image recognition of mathematical expressions, the INFTY mathematical document system [19] used optical character recognition techniques to analyze the structure of mathematical expressions and converted printed mathematical expressions into LaTeX and XML markup formats. Deng et al. [20] explored image-to-text generation technology and applied it to mathematical expression recognition, using a convolutional neural network (CNN) to extract image features and a recurrent neural network (RNN) for encoding and decoding. The abovementioned research on the recognition and retrieval of mathematical expressions has achieved certain results. However, the single-modal retrieval model has great limitations because mathematical expressions in scientific documents often exist in multiple forms, such as embedded descriptions and images. Based on this, this study proposes a multimodal retrieval method for scientific documents based on HFS [21,22] and XLNet [23]. This method integrates mathematical expression images and contextual text to improve the accuracy of retrieval results. In this study, the input form of mathematical expressions is no longer limited: mathematical expression information can be input in image or text format, which increases the flexibility and practicability of retrieval. In addition, the context of a mathematical expression is closely related to the expression itself in scientific documents, and combining the expression with its context makes the retrieval and sorting of scientific documents more reasonable. The contributions of this study can be summarized as follows:
(1) Multimodal retrieval is introduced into the retrieval task of scientific documents, and the complementarity between the image modality and the text modality is utilized to retrieve scientific documents.
(2) Mathematical expressions and their context are combined for retrieval and ranking, and XLNet is used to generate word vectors, so that a richer semantic representation of the mathematical expression context can be obtained.
(3) The hesitant fuzzy set, which considers the attributes of the documents, is used to calculate the hesitant fuzzy measure of scientific documents. In addition, Chinese scientific documents (CSD) are added to the retrieval dataset.

Model Framework
The multimodal retrieval and ranking process of scientific documents based on HFS and XLNet is shown in Figure 1.
First, in the query module, mathematical expression images and text keywords are input. The processing module of the image modality is used to calculate the similarity between the mathematical expressions in the query images and those in the candidate scientific documents. The LaTeX forms of the input mathematical expressions are obtained by recognizing the input images, and FDS is used to parse the recognition result. Then, hesitant fuzzy set theory is introduced to calculate the similarity between the mathematical expressions, and the results are returned to the document processing module. The processing module of the text modality is used to calculate the similarity between mathematical expression contexts. The text in the context of the mathematical expressions in the dataset is extracted and used to pretrain XLNet. XLNet is then used to calculate the similarity between the query text and the mathematical expression context of the candidate scientific documents. The document processing module is used to output the documents in order: the document attributes are designed, the scores of the documents are calculated by the hesitant fuzzy set, and the ranking results are output in descending order of similarity.

Mathematical Expression Image Recognition.
The ViT and Transformer models proposed in the literature [24][25][26] for processing image tasks and sequence problems are shown in Figure 2.
The model consists of a ViT [24] encoder with a deep residual network (ResNet) [25] backbone and a Transformer [26] decoder. The encoder is used for feature extraction, and the decoder is used to convert the mathematical expression information in the image into LaTeX form. The experimental results show that the model reaches a Bilingual Evaluation Understudy (BLEU) score of 0.88.

Mathematical Expression Image Similarity.
The hesitant fuzzy set proposed by Torra [21,22] is used to measure the similarity between query expressions and candidate expressions. The membership value in a hesitant fuzzy set is a set containing several possible membership degrees. Therefore, the results can be evaluated from multiple aspects. This approach avoids the errors caused by relying on a single evaluation, and the degree of hesitation that people show when making judgments can be reflected more objectively.
Definition 1 (hesitant fuzzy set). Let $X$ be a nonempty set. The hesitant fuzzy set on $X$ is defined as
$$E = \{\langle x, h_E(x) \rangle \mid x \in X\},$$
where $h_E(x)$ represents the set of possible membership degrees for $x \in X$, which is a subset of the interval $[0, 1]$ [21,22]. Here, $h_E(x)$ corresponds to the evaluation attributes, of which there may be one or more. Each group of evaluation attributes contains multiple evaluation indicators. The similarity of the mathematical expressions parsed by FDS [11,12] is calculated by the hesitant fuzzy set. The evaluation attribute of the mathematical expression is defined as a triple $(h_S, h_O, h_N)$ [27], where $h_S$ is the structural attribute of the expression, $h_O$ is the operator attribute of the expression, and $h_N$ is the operand attribute of the expression. The structure and operator characteristics of the expression are evaluated separately. Each evaluation attribute contains several evaluation indicators. By setting a membership function for each indicator, the hesitant membership degrees of the query expression $E_q$ and of each result expression $E_D$ are evaluated for each attribute.
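As a rough illustration of Definition 1, a hesitant fuzzy element can be held as a list of candidate membership degrees, and the evaluation of one expression as a mapping from the three attributes $h_S$, $h_O$, and $h_N$ to such elements. The names `HesitantFuzzyElement` and `make_expression_hfs` and the sample degrees below are hypothetical; they are not taken from the original method.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class HesitantFuzzyElement:
    """Hypothetical container: candidate membership degrees, each in [0, 1]."""

    degrees: List[float]

    def __post_init__(self) -> None:
        if not self.degrees:
            raise ValueError("a hesitant fuzzy element needs at least one degree")
        if any(d < 0.0 or d > 1.0 for d in self.degrees):
            raise ValueError("membership degrees must lie in [0, 1]")


def make_expression_hfs(
    structure: Iterable[float],
    operators: Iterable[float],
    operands: Iterable[float],
) -> Dict[str, HesitantFuzzyElement]:
    """Bundle the three evaluation attributes of one expression."""
    return {
        "h_S": HesitantFuzzyElement(list(structure)),
        "h_O": HesitantFuzzyElement(list(operators)),
        "h_N": HesitantFuzzyElement(list(operands)),
    }


# Made-up degrees for one candidate expression (illustrative only).
example = make_expression_hfs([0.8, 0.6, 0.7], [0.9, 0.5], [0.4, 0.6])
```

Keeping every attribute as a list, rather than a single number, is what lets several evaluation indicators coexist for the same attribute.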
In conclusion, the set of hesitating fuzzy evaluation attributes (h S , h O , h N ) and the set of hesitating fuzzy ele-

Scientific Programming 3
Definition 2. Following the subformula weight distribution method [28] used in the traditional tree index structure, the flag, length, and operator level of the subexpression replace the structural complexity, length, and depth of nodes used in the traditional method. Here, $f_{E_q}$ is the lowest form of the flag bit of the current subexpression, $f_{E_D}$ is the flag of the subexpression in the expression, $l_{E_q}$ is the length of the subexpression, $l_{E_D}$ is the length of the entire expression, and $level$ is the level of the operators in the subexpression. When the subexpression appears several times in the query results, the average is taken as its level attribute value.
(2) Operator Attribute $h_O$. Here, the BM25 algorithm is used as the membership function of the operator index. The formula can be decomposed into three components. In the first component, $N$ represents the total number of expressions in the database, and $N_{E_m}$ represents the number of expressions that contain $E_q$. The second component is the weight of the query term in the database, where $f_{E_q}$ represents the frequency of the operator in the database, and $k_1$ and $k$ are empirical parameters. The third component is the weight of the query operator itself, where $f_q$ represents the frequency of the operator in the user's query, which is usually set to 1 for shorter queries, and $k_2$ is an empirical parameter. The evaluation of the operand attribute $h_N$ is similar to that of the operator attribute $h_O$, so the description is not repeated.
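The three components described for the operator attribute match the usual three-factor form of BM25, so a minimal sketch can be written under that assumption. The function name, the argument names, and the default values of `k1`, `k`, and `k2` are illustrative choices rather than values reported in this study; note also that a raw BM25 value is not confined to $[0, 1]$, and any normalization into a membership degree is not shown here.

```python
import math


def bm25_operator_weight(
    n_total: int,        # N: total number of expressions in the database
    n_containing: int,   # N_{E_m}: expressions that contain the query operator
    freq_db: float,      # f_{E_q}: frequency of the operator in the database
    freq_query: float = 1.0,  # f_q: frequency in the query (1 for short queries)
    k1: float = 1.2,     # assumed empirical parameter
    k: float = 1.2,      # assumed empirical parameter
    k2: float = 100.0,   # assumed empirical parameter
) -> float:
    """Three-factor BM25 weight, assuming the standard textbook form."""
    # Component 1: inverse-document-frequency style rarity term.
    rarity = math.log((n_total - n_containing + 0.5) / (n_containing + 0.5))
    # Component 2: saturating weight of the operator in the database.
    db_weight = (k1 + 1.0) * freq_db / (k + freq_db)
    # Component 3: saturating weight of the operator in the query itself.
    query_weight = (k2 + 1.0) * freq_query / (k2 + freq_query)
    return rarity * db_weight * query_weight


rare = bm25_operator_weight(n_total=1000, n_containing=10, freq_db=3)
common = bm25_operator_weight(n_total=1000, n_containing=400, freq_db=3)
```

With `freq_query` equal to 1 the third factor is exactly 1, which is why short queries are effectively scored by the first two components only.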
(3) Similarity Calculation. Here, $l_{x_i}$ is the number of evaluation values, and $h_{\sigma(j)}$ is the $j$th evaluation value once the values are ordered. Let $E_q$ be $a^2 + b^2$; some of the retrieval results and the corresponding hesitant fuzzy sets are shown in Table 1.
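One common way to compare two hesitant fuzzy sets is through a normalized Hamming distance over their ordered membership degrees, with similarity taken as one minus the distance. The sketch below assumes that measure, and also assumes that a shorter element is padded with its smallest degree so that both elements have the same length; neither choice is confirmed by the text, and the function names are hypothetical.

```python
from typing import Dict, List


def pad_sorted(values: List[float], length: int) -> List[float]:
    """Sort degrees in descending order and repeat the smallest to reach `length`."""
    ordered = sorted(values, reverse=True)
    return ordered + [ordered[-1]] * (length - len(ordered))


def hfs_similarity(
    query: Dict[str, List[float]], candidate: Dict[str, List[float]]
) -> float:
    """1 minus the normalized Hamming distance between two hesitant fuzzy sets.

    Both arguments map attribute names (for example h_S, h_O, h_N) to
    non-empty lists of membership degrees in [0, 1].
    """
    total = 0.0
    for attribute, query_degrees in query.items():
        candidate_degrees = candidate[attribute]
        length = max(len(query_degrees), len(candidate_degrees))
        q = pad_sorted(query_degrees, length)
        c = pad_sorted(candidate_degrees, length)
        # Average absolute difference between the j-th ordered degrees.
        total += sum(abs(a - b) for a, b in zip(q, c)) / length
    return 1.0 - total / len(query)


same = hfs_similarity({"h_S": [0.8, 0.6]}, {"h_S": [0.6, 0.8]})
partial = hfs_similarity({"h_S": [0.8, 0.6]}, {"h_S": [0.6]})
```

Sorting before comparing makes the measure independent of the order in which the evaluation values were recorded.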
The mathematical expression similarity calculation algorithm is as follows:

Mathematical Expression Context Similarity Measure.
XLNet [23] is a generalized autoregressive pretraining model. The text in the documents is extracted, and one-third of it is annotated to train XLNet, so that a richer semantic representation of the mathematical expression text can be obtained. The main structure is shown in Figure 3 (assuming the factorization order is 3 → 2 → 4 → 1).
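To make the factorization order concrete, the short sketch below lists, for each token position, the positions it may condition on when it is predicted under a given order. It illustrates only the ordering idea behind permutation language modeling; the function name `visible_context` is hypothetical, and no part of the actual XLNet implementation is reproduced.

```python
from typing import Dict, List


def visible_context(order: List[int]) -> Dict[int, List[int]]:
    """Map each position to the positions that precede it in the factorization order."""
    context: Dict[int, List[int]] = {}
    seen: List[int] = []
    for position in order:
        # A token may only use tokens that come earlier in the order.
        context[position] = list(seen)
        seen.append(position)
    return context


ctx = visible_context([3, 2, 4, 1])
for position in sorted(ctx):
    print(f"token {position} is predicted from tokens {ctx[position]}")
```

Under the order 3 → 2 → 4 → 1, token 3 is predicted with no other token visible, while token 1 can use tokens 3, 2, and 4, so information from both sides of a position can be used across different orders.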

The same keyword may have different meanings in different contexts, and textual information that explains a mathematical expression often appears around the expression. For example, in the document "Parasitic capacitance.html," the expression is $i = C\,dv/dt$, and its contexts are "When two conductors at different potentials are close to one another, they are affected by each others' electric field and store opposite electric charges like a capacitor" and "where C is the capacitance between the conductors." The words "potentials," "electric charges," and "capacitance" may have different meanings in other contexts, so the constructed vectors also differ.
This study introduces the XLNet [23] language model to generate semantically rich word vectors. XLNet addresses a limitation of BERT, which does not model the dependencies among the masked words during training; that is, BERT assumes that the masked words are independent of one another. The XLNet model implements a new bidirectional encoding based on the autoregressive (AR) language model. When calculating text similarity, XLNet fully considers the semantic information of the word vectors, and therefore the accuracy of the text similarity calculation is improved. The TF-IDF algorithm is used to extract keywords and their weights from the context of mathematical expressions. An analysis of a large body of scientific literature shows that the context of a mathematical expression is used to analyze the expression and explain its symbols. The context is therefore closely related to the expression, and extracting it is very important for the retrieval of mathematical expressions. The context and keywords corresponding to two mathematical expressions are shown in Table 2.
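The keyword extraction and vector comparison steps can be sketched in plain Python. The example below uses a basic TF-IDF weighting (term frequency times the logarithm of inverse document frequency) and cosine similarity between two vectors; the exact TF-IDF variant used in this study is not specified, the token lists are invented, and in the method itself the vectors passed to `cosine` would come from XLNet rather than being written by hand.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Basic TF-IDF weights for each tokenized context."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


def top_keywords(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k terms with the highest TF-IDF weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented token lists standing in for two expression contexts.
contexts = [
    ["capacitance", "between", "conductors"],
    ["electric", "charges", "between", "conductors"],
]
context_weights = tfidf(contexts)
keywords = top_keywords(context_weights[0], 1)
```

Terms that occur in every context receive a weight of zero under this variant, which is why "capacitance" is the only keyword that stands out in the first context.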

Calculation of the Similarity of Scientific Documents
The retrieval and ranking of scientific and technological documents is a comprehensive measurement over multiple attributes, including mathematical expressions and keywords. Different scientific documents have different meanings, even if they contain the same formula. Therefore, hesitant fuzzy sets are used in this study to evaluate scientific documents from all aspects and to achieve the final sorting.


Definition 5. The function $U_{exp}(E_q, E_{m_i})$ is used to calculate the similarity between the query expression and the expression in the candidate document, where $sim(E_q, E_{m_i})$ represents the similarity between the mathematical expression $E_q$ of the query and the mathematical expression $E_{m_i}$ in the candidate scientific document.
Definition 6. The function $U_{word}(W_q, W_{E_{m_i}})$ is used to express the similarity between the query keyword $W_q$ and the keyword $W_{E_{m_i}}$ in the context, where $W_{E_{m_i}}$ represents the keyword in the document retrieved in the candidate scientific document.

Definition 7. The function $U_{loc}(E_{m_i}, D_{N_i})$ is used to express the position of the expression $E_{m_i}$ in the document $D_{N_i}$, where $loc_{exp}$ is the position where the query expression $E_q$ appears for the first time in the document $D_{N_i}$, and $num$ represents the total number of characters contained in the document $D_{N_i}$.

Definition 8. The function $U_{ef}(E_q, D_{N_i})$ is used to express the frequency of the query expression $E_q$ in the document $D_{N_i}$.
Mathematical expression similarity calculation algorithm:
Input: the LaTeX form of the recognized mathematical expression $E_q$
Output: a set of mathematical expressions similar to $E_q$
(1) // Initialize the feature vector database $E_D$(id, expstring)
(2) $FDS_{E_q}$ // parsed by FDS
(3) $FDS_{E_{m_i}}$
(4) while ($FDS_{E_q}$) do
(5)   for $FDS_{E_q}$ in $FDS_{E_{m_i}}$:
(9)     $sim(E_q, E_{m_i}) = sim(HFS_{E_q}, HFS_{E_{m_i}})$ // The similarity between expressions $E_q$ and $E_{m_i}$ is transformed into the similarity between the hesitant fuzzy sets $HFS_{E_q}$ and $HFS_{E_{m_i}}$
(10)    Add to table simexp(id, expstring, $sim(E_q, E_{m_i})$)
(11)  end for
(12) end while

Table 2 (entry for one mathematical expression).
Context: In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the form: For arbitrary real constants, and it is named after the mathematician Carl Friedrich Gauss.
Keywords: Gaussian function, Gaussian
Here, $\alpha$ is the feature weight coefficient of the number of mathematical expressions in the document, which is obtained by counting the number of expressions in all documents in the database; $k_{exp}$ represents the number of expressions in the document $D_{N_i}$ that match the query expression $E_q$, and $k_{sum}$ represents the total number of expressions contained in the document $D_{N_i}$.

Definition 9. The function $U_{wf}(W_q, D_{N_i})$ is used to express the frequency of the query keyword $W_q$ in the document $D_{N_i}$, where $\alpha$ is the feature weight coefficient of the number of keywords in the document, which is obtained by counting the number of keywords in all documents in the database; $t_{W_q}$ represents the number of keywords in the document $D_{N_i}$ that match the query keyword $W_q$, and $t_{sum_w}$ represents the total number of keywords contained in the document $D_{N_i}$.
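The two frequency attributes are both described in terms of a weight coefficient, a count of matches, and a total count. A minimal sketch, assuming the attribute is simply the weighted ratio of matches to the total and is capped at 1 so that it stays a valid membership degree, is given below. The ratio form, the cap, and the function name are assumptions made for illustration, not the formula of this study.

```python
def frequency_membership(matches: int, total: int, alpha: float = 1.0) -> float:
    """Weighted share of matching items, capped at 1.0.

    matches: k_exp (matching expressions) or t_{W_q} (matching keywords)
    total:   k_sum (all expressions) or the total number of keywords
    alpha:   feature weight coefficient (assumed to be given)
    """
    if total <= 0:
        return 0.0  # a document without expressions or keywords contributes nothing
    return min(1.0, alpha * matches / total)


expression_frequency = frequency_membership(matches=2, total=8)
keyword_frequency = frequency_membership(matches=3, total=4, alpha=2.0)
```

The same helper can serve both the expression frequency and the keyword frequency, since the two definitions differ only in what is counted.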

Definition 10. The function $S(D_{N_i})$ is used to calculate the score of scientific document retrieval results; see Table 3. The sorting algorithm of retrieval result documents is as follows:
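The scoring and sorting step can be sketched as follows, assuming that the five document attributes have already been evaluated as membership degrees and that the document score is their arithmetic mean, which is a common score function for a hesitant fuzzy element. The score function actually used for $S(D_{N_i})$ is not recoverable from the text, so the mean is an assumption, and the file names and attribute values are invented.

```python
from statistics import mean
from typing import Dict, List, Tuple


def document_score(attributes: Dict[str, float]) -> float:
    """Score a document as the mean of its attribute membership degrees."""
    return mean(attributes.values())


def rank_documents(
    candidates: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Return (file name, score) pairs in descending order of score."""
    scored = [(name, document_score(attrs)) for name, attrs in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented attribute values for two hypothetical candidate documents.
candidates = {
    "doc_a.html": {"U_exp": 0.9, "U_word": 0.8, "U_loc": 0.7, "U_ef": 0.6, "U_wf": 0.5},
    "doc_b.html": {"U_exp": 0.4, "U_word": 0.9, "U_loc": 0.2, "U_ef": 0.1, "U_wf": 0.4},
}
ranking = rank_documents(candidates)
```

Because every attribute enters the score, a document that matches the formula well but uses it in an unrelated context can still be ranked below a document with a slightly weaker formula match and a closer context.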

Experimental Data.
For the image recognition part of mathematical expressions, we use the IM2LATEX-100K dataset for training and testing. The IM2LATEX-100K dataset contains 103,556 images of different mathematical expressions. The label data consist of the LaTeX format of the mathematical expressions.
For the scientific document retrieval and ranking part, the public dataset Ntcir-MathIR-Wikipedia-Corpus (NTCIR) is used; 31,742 documents are extracted, which contain 518,929 mathematical expressions. In addition, Chinese scientific documents (CSD), comprising 10,372 documents and 121,495 mathematical expressions, are added to expand the dataset.

Image Recognition of Mathematical Expressions.
The image recognition model [24][25][26] is used to recognize mathematical expression images, and extensive experiments are conducted on different types of mathematical expression images in this study. Measured by the BLEU evaluation standard, the model reaches 0.88.
Five different types of mathematical expression images are selected to demonstrate the recognition algorithm, and the recognition results are shown in Table 4 (the content of each image is given there as text).

Ablation Study.
Ten groups of formulas and keywords, listed in Table 5, are selected as queries for retrieval. The proposed method includes three main parts, and the performance improves steadily as each part is added. The baseline experiment is image expression retrieval. The final reordering stage of our method has the best performance. The average recall rates in this study are 77.4% and 77.8%, and the average precision rates are 66.3% and 69.2%. All results are shown in Table 6.
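For reference, recall and precision for a single query can be computed from the set of retrieved documents and the set of relevant documents, as in the sketch below; averaging over the ten queries then gives average rates of the kind reported here. The document identifiers are invented, and how relevance was judged in the experiments is not shown.

```python
from typing import Iterable, Tuple


def precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Precision and recall for one query, based on document identifiers."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Invented identifiers: four documents returned, three relevant overall.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"], relevant=["d1", "d3", "d5"]
)
```

Here two of the four returned documents are relevant, and two of the three relevant documents are found.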

Performance on NTCIR Dataset.
In this section, the method in this article is compared with some traditional methods and current existing methods using the NTCIR dataset. FDS + Word Embedding [10] combines FDS and Word Embedding to retrieve scientific documents: FDS is used to parse expressions, and Word Embedding is used to generate the word vectors of keywords in scientific documents; it is hereinafter referred to as Method 1. SearchOnMath [29] is a mathematical formula retrieval tool that aims at accurately matching mathematical expressions. However, SearchOnMath implements pure mathematical expression retrieval and does not consider the important information of the scientific document itself; it is hereinafter referred to as Method 2. MIaS [4] is based on the full-text search engine Apache Lucene. MIaS processes text and math separately, and the text is tokenized and stemmed to unify inflected word forms; it is hereinafter referred to as Method 3.
In this study, NDCG is used to evaluate the ranking results; it is the result of normalizing the DCG (discounted cumulative gain) of the search results. The calculation method is as follows:
$$NDCG_l = \frac{DCG_l}{IDCG_l},$$
where
$$DCG_l = \sum_{i=1}^{l} \frac{r_i}{\log_2(i + 1)},$$
$l$ is the number of search results, $r_i$ is the relevance score, $IDCG_l$ is the ideal DCG value, and $|REL|$ indicates that the search results are all related to the query expression.
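The NDCG computation can be sketched directly from the DCG definition with a gain of $r_i$ and a discount of $\log_2(i + 1)$. The sketch assumes that the ideal ordering is obtained by sorting the same list of relevance scores in descending order, which holds when all judged results are in the list; the relevance scores shown are invented.

```python
import math
from typing import List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain: sum of r_i / log2(i + 1) with i starting at 1."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))


def ndcg(relevances: List[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Invented relevance scores for the top results of one query.
perfect = ndcg([3, 2, 1])
swapped = ndcg([1, 3, 2])
```

A ranking that already lists the results in descending order of relevance scores exactly 1, and any misplaced result lowers the value.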
The query formulas and keywords in Table 5 are taken as the queries, and the top-10 ranking results of the method in this study and of the other methods, as judged by experts, are shown in Figure 4. Method 2 starts with a higher value than the method in this article, but as the number of retrieved expressions increases, the method in this article is consistently higher than Method 2. The average NDCG of this method is higher than that of the other three methods, and the average value of NDCG (n = 10) is 0.865 on the NTCIR dataset in this study. The experimental results show that the ranking performance of the proposed method is better and the retrieval results are more reasonable.

Performance on CSD Dataset.
In this section, the method in this article is compared with Method 1 using the NTCIR dataset. Chinese scientific documents (CSD) are added to expand the dataset, which contains 10,372 documents and 121,495 mathematical expressions. The experimental results are shown in Figure 5. It can be seen that the NDCG of the method in this study is higher than the comparison method. The average value of NDCG (n = 10) is 0.88 on the CSD dataset in this study. So the results of the method in this study are more reasonable, and the retrieval and ranking performance is improved.

Sorting algorithm of retrieval result documents:
Input: document collection of the retrieval results of scientific documents
Output: the ranking sequence of the documents
(1) while (Result) do:
      location($E_{m_i}$, $D_{N_i}$) // The position of the mathematical expression in the document

Table 5 (excerpt; columns: query number, formula, keyword).
$\tan\theta = \sin\theta/\cos\theta$; Trigonometric function (三角函数)
10; $P(X = k) = \frac{\lambda^k}{k!}e^{-\lambda}$; Poisson (泊松)

Retrieval System.
A large number of experiments are conducted for different expressions, and the first ten search results are selected for display in this study. When the input formula image is "$P(X = k) = \lambda^k e^{-\lambda}/k!$" and the keyword is "Poisson," some of the search results are shown in Table 7.
First, the method in this study recognizes the formula image and obtains its LaTeX form, "P\left({X = k} \right) = \frac {{{\lambda^k}}}{{k!}}{e^{-\lambda }}," and finds a collection of documents containing similar formulas. The XLNet model is then used to obtain the word vectors of "Poisson" and of the context keywords of the document expressions, and the similarity between them is calculated. Finally, according to the keyword and formula information, the similarity of the documents is calculated again using the hesitant fuzzy set, and the documents are sorted and output. In Table 7, FileName is the name of the document where the expression is located, and Score is the document score.

Conclusion
Based on a retrieval and ranking mode that combines mathematical expression images and text, this study proposes a multimodal retrieval and ranking method for scientific documents based on HFS and XLNet. This method obtains the LaTeX structure information of mathematical expressions through image recognition algorithms and addresses the single-modal limitation of scientific document retrieval. The similarity between mathematical expressions is obtained through evaluation by hesitant fuzzy sets, which addresses the reliance of traditional mathematical expression evaluation on a single criterion. In combination with the context of the mathematical expression, words similar to the query keywords are obtained with XLNet, which alleviates the limitation of retrieving by mathematical expressions alone. Finally, the similarity between the attributes of the mathematical expressions and the keywords in the documents is calculated through the hesitant fuzzy set, which makes the ranking of the retrieval results of scientific documents more reasonable.
This experimental method also has some shortcomings. In the future, the following points will be considered for improvement:
(1) Only mathematical expressions whose recognition results are in LaTeX form are analyzed; other forms of mathematical expressions (such as MathML) will be analyzed.
(2) The evaluation attributes of documents will be further improved, and the number of evaluation attributes of document similarity will be increased.
(3) Only images and texts are analyzed; an attempt will be made to extend the multimodality more widely and apply voice or video to retrieval.

Data Availability
Our data still need further study in the next stage, so it is not convenient to provide them directly. The data can be made available upon request via e-mail to the corresponding author.

Conflicts of Interest
The authors declare no conflicts of interest.