ContextAD: Context-Aware Acronym Disambiguation with Siamese BERT Network



Introduction
Acronyms are shortened forms of longer phrases and are often used in writing, especially academic writing, to save space and streamline expression. However, natural language processing tasks such as question answering, machine reading comprehension, information extraction [1], sensitive word detection, and retrieval often need the definitions of acronyms. Acronym disambiguation provides an effective way to interpret acronyms: its purpose is to select the most appropriate definition from an acronym dictionary according to the meaning of the sentence containing the acronym. Acronym dictionaries are mainly built from web data or constructed manually.
Early studies mainly built acronym dictionaries from web information. For example, some studies directly obtained web pages containing acronyms and definitions [2] or automatically extracted acronyms and their definitions from the interaction between users and network data [3]. Then, machine learning methods [1], pattern matching [4][5][6], and semantic network generation [7] were used to achieve acronym disambiguation. However, the quality of network data is uneven, so the quality of the resulting acronym dictionary is hard to guarantee. Moreover, methods based on network data often require the device to be connected to the network and cannot be applied offline. Therefore, some scholars use manually constructed dictionaries and machine learning algorithms for disambiguation [8]. Examples are shown in Figure 1. The model needs to pick the accurate definition from the corresponding dictionary based on the acronym and its contextual information.
In recent years, dictionary-based acronym disambiguation methods have made great progress. Early researchers used statistical methods for feature extraction, such as support vector machines, naive Bayes, and k-nearest neighbors [9,10]. This kind of method is simple, but it has low precision and recall. After machine learning algorithms, especially deep neural networks, demonstrated powerful feature extraction, neural network-based acronym recognition methods began to proliferate, e.g., convolutional neural networks (CNNs) and long short-term memory (LSTM) networks [11]. However, traditional deep neural networks have difficulty incorporating prior knowledge and can only extract features from the dataset, while the transformer-based methods represented by BERT [12] and its derivative models are able to obtain features from a large amount of unlabelled data and apply these features (prior knowledge) to downstream tasks. This approach can greatly improve the accuracy of the model. Singh and Kumar used the SpanBERT [13] model to transform the acronym disambiguation problem into a span prediction problem [14]. Pan et al. tried different BERT models and found that the SciBERT model [15] has a significant advantage in the acronym disambiguation task [16]. Weng et al. then used the DeBERTa [17,18] model for their experiments [19], while Song et al. verified the validity of the T5 model [20], short for text-to-text transfer transformer, whose basic idea is that all NLP problems can be defined as "text-to-text" problems, i.e., "input text and output text" [21].
Existing methods tend to compare paraphrases directly with the original sentences, without taking full advantage of the similarity between paraphrases and acronyms. In this paper, we design a framework based on Siamese (twin) networks that relies on the property that an acronym is completely substitutable by its exact paraphrase. Complete substitutability means that an accurate paraphrase can replace the acronym in the original text without changing the meaning of the sentence. This property is similar to the way humans think: when they disambiguate acronyms, they usually replace the acronym with a candidate paraphrase and analyse it in context to determine its suitability.
Therefore, this paper proposes a context-aware acronym disambiguation method with a Siamese BERT network (ContextAD), which combines candidate paraphrases with the context of acronyms to form new sentences and uses the Siamese network model to obtain the similarity between the new sentence and the original sentence. This paper also verifies the robustness of the model by expanding the candidate paraphrase dictionary, showing that the model extends well to larger dictionaries. The contributions of this paper include the following four points: (1) An analysis of the advantages and disadvantages of existing methods shows the impact of lacking a context-aware approach. (2) A context-aware approach is proposed that achieves better disambiguation by combining candidate paraphrases with acronym contexts. To the best of our knowledge, this is the first work to perform disambiguation at the sentence level. (3) The overall approach obtains sentence-level and phrase-level similarity simultaneously, which provides more information. (4) Experiments show that the proposed method can outperform the state-of-the-art methods on the public dataset when using the same BERT model.

Related Work
This section describes dictionary-based acronym disambiguation methods and Siamese neural networks.

Dictionary-Based Acronym Disambiguation.
Existing dictionary-based acronym disambiguation methods can be divided into five categories: feature matching (including statistics-based and classic machine learning-based methods), multiclassification, span prediction, binary classification, and similarity ranking [22].

Feature Matching Methods.
The feature matching approach involves extracting features (e.g., discourse tags and special characters) from the input sentences. Statistical models are then used to predict the exact acronym interpretation. Statistics-based methods rely on formulas for word frequency and similarity, such as BM25 and TF-IDF. However, these methods usually cannot capture the semantic correlation between sentences, and they treat the semantics of words and sentences separately, which does not reflect how language actually works. With the development of machine learning, traditional machine learning methods based on maximum entropy, decision trees, and support vector machines gradually emerged [14]. These algorithms treat dictionary-based acronym disambiguation as a classification problem. Maximum entropy aims to select the model with the largest entropy among all possible probability models (probability distributions) [23]. However, the maximum entropy model binarizes the features: it records only whether a feature appears and cannot capture the feature strength. The decision tree model is a tree structure describing instance classification, mainly composed of nodes and directed edges. It usually starts from the root node, tests a feature of the instance, and then assigns the instance to one of its child nodes [24]. The features learned by the decision tree model are easily affected by the amount of data. The support vector machine finds the support vectors that determine the optimal classification hyperplane from the training samples by maximizing the classification margin [25]. The kernel function directly determines the performance of the support vector machine, but there is no generally suitable method for selecting the kernel function.

Multiclassification Methods.
The multiclassification approach trains a model with each candidate interpretation as a category label. With the development of word vector models and neural network models, textual information can be transformed into low-dimensional dense vectors. Current methods for acronym disambiguation usually work on such text embeddings, which makes more contextual information available. The benchmark model GAD given by Veyseh et al. [9] obtains sentence embeddings through a Bi-LSTM and obtains context embeddings with the help of grammatical structure (such as the dependency tree) and a GCN (graph convolutional network) model. Finally, the acronym and sentence embeddings from the two encoders are concatenated as the input of the evaluation layer, and the interpretation of the acronym is predicted through a two-layer feedforward classifier. The number of neurons in the last classifier layer is equal to the number of candidate definitions of the acronym in the dictionary, which also means that when the number of acronym definitions in the dictionary increases, the model structure changes significantly.
Among the feature-based methods, Jaber et al. combined three supervised machine learning models (support vector machine, naive Bayes, and k-nearest neighbor) with cosine similarity for acronym disambiguation. They found that naive Bayes combined with cosine similarity performs best [9]. Pereira et al. combined a support vector machine with the doc2vec method for acronym disambiguation [10]. These methods mainly extract the corresponding features from the text and predict the acronyms and corresponding interpretations by statistical methods. Challenge participants that used neural network models relied mainly on LSTM and CNN [26].

Span Prediction Methods.
Transformer-based models mainly encode sentences with BERT and its variants (such as Sci-BERT [15] or RoBERTa [27]). Still, they differ in how they use the output of these language models for prediction. Pan et al. [16] and Zhong [28] regarded the task as a classification task, while Egan and Bohannon [29] adopted an information retrieval method that scores and ranks each candidate by the cosine similarity between the candidate embedding and the input. Singh and Kumar modelled the problem as a span prediction problem: the accurate interpretation is obtained from the concatenated text of the acronym, the candidate interpretations, and the sentence, according to the predicted probability of each subsequence [14].

Binary Classification Methods.
Binary classification combines the interpretation of a single acronym with the original sentence, exploiting the fact that BERT can process two sentences simultaneously [11]. Following BERT's two-sentence input format, the [CLS] identifier, the candidate interpretation (one for each entry in the dictionary), and the [SEP] identifier are concatenated with the original sentence as the model input, and a binary classification model is then trained to produce the score. This method is more robust and can handle longer dictionaries. However, it does not consider the matching degree and correlation between the candidate interpretation and the original context. When disambiguating an acronym, we can infer the interpretation from the meaning of the acronym in the context and judge whether the interpretation conforms to the original context information.
Similarity Ranking Methods.
The similarity ranking method ranks candidates by comparing the similarity scores of two inputs. Egan and Bohannon evaluated the similarity of the candidate paraphrases by directly comparing them with the original sentences and used the candidate with the highest similarity score as the predicted result [29]. This approach resembles the binary classification approach. Nevertheless, the candidate interpretations contain limited information, and the model may fail to discriminate when two candidates are themselves similar. In fact, there is complete substitutability between exact paraphrases and acronyms in contextual scenarios. In other words, replacing an acronym in the original sentence with an exact paraphrase will not change the meaning of the sentence at all.
Therefore, we propose a method that fuses the similarity of sentences with the similarity of the candidate interpretation itself. This approach combines the candidate interpretations with the context and conveys more features to the model. However, because the model can handle only a limited amount of text, the input may exceed the length that the model handles well if the binary classification approach is used. We therefore propose an acronym disambiguation method based on similarity ranking.

Siamese Neural Networks.
Siamese neural networks, also known as Siamese networks, were first proposed by Bromley et al. [30] to verify signatures on credit cards. They have since been applied to many different fields, such as one-shot learning [31], text recognition [29,30], and face similarity recognition [32]. Unlike the traditional neural network model, the Siamese neural network model comprises two networks sharing weights. By transforming the two inputs into high-dimensional vectors and letting their features interact, the Siamese neural network model can perform classification or similarity prediction. The advantage of a Siamese neural network is that it identifies the differences and similarities between the two inputs. That is, the Siamese network can directly measure the degree of correlation between two inputs, where network-1 and network-2 are two identical network models, such as CNN [33], LSTM [34], transformer [35], or attention [36]. When the two networks do not share weights or are two different neural networks, such as an LSTM network and a CNN network, the model is called a pseudo-Siamese network [37]. With the development of BERT, Reimers and Gurevych proposed transforming sentence pairs into two vectors with the same dimension through the same BERT model and then using different loss functions according to the task [38]. In existing acronym disambiguation tasks, methods based on pretrained language models perform better than those based on features and traditional neural networks. Therefore, this paper studies the Siamese network based on BERT.

Limitations of Existing Methods
(1) Feature matching method: feature matching methods (including statistics-based and traditional machine learning-based methods) are prone to performance degradation and high cost when faced with large quantities of data. This type of method usually analyses only how often the acronym occurs together with the paraphrase. For example, such methods tend to select "cable news network" as the paraphrase of CNN, and this selection is context-independent.
(2) Multiclassification method: the advantage of the multiclassification approach is the ability to select from multiple interpretations with only one calculation. However, the number of definitions of acronyms in the dictionary is often uncertain, and the number of categories is closely related to the shape of the last layer of the classification model. Therefore, the multiclassification method is easily disturbed by the number of candidate definitions of acronyms. For example, the acronym "CA" has 20 paraphrases, which means the output dimension of the model is 20 × 1, while "RF" has only 5 paraphrases and the output dimension of the model is 5 × 1. This variability increases the difficulty of model training.
(3) Span prediction method: this method also attempts to perform interpretation recognition through a single computation. However, the input to the span prediction method is the concatenation of the acronym, all candidate paraphrases, and the original sentence, separated by [SEP], which means that the length of the input text is directly related to the number of candidate interpretations. Neural network models often need to zero-pad the inputs to align them. When the lengths of the inputs differ too much, the input mean and variance can differ greatly between batches, which is not conducive to robust processing by the model. Similarly, the acronym "CA" has 20 paraphrases, which means that the input is the original sentence plus 20 paraphrases, i.e., 40 words, while "RF" has only 5, adding only 10 words, so compensating with padding introduces too much extraneous content.
(4) Binary classification and similarity ranking algorithm: binary classification algorithms splice each candidate paraphrase with the original sentence, i.e.,

[CLS] (Acronym [SEP]) Expansion_i [SEP] Sentence [SEP]

and then use a binary classifier to determine whether the paraphrase is correct. Similarity ranking sorts the candidates by the vector similarity between the original sentence and each candidate paraphrase, which is closer to the human way of thinking. Both methods are more robust and are not disturbed by the length of the lexicon. However, neither considers whether the candidate paraphrase matches the context of the acronym. The candidate paraphrase should not only be semantically similar to the original acronym but should also be able to replace the acronym in the original sentence directly.
In the example in Table 1, the phrases that are semantically similar to the original sentence are random forest, regression function, and regression forest. Most models choose regression function or regression forest. However, when people who do not work on machine learning are asked, they mostly choose random forest. Because the original sentence mentions a tree, they think the choice should be between random forest and regression forest. Also, since RF is followed by "regression", and it would be rather unusual for two identical words to appear next to each other in a sentence, they tend to choose random forest. Accordingly, we propose a context-aware method that analyses candidate paraphrases at both the sentence and phrase levels.

Problem Description.
Given a sentence $S = [w_1, w_2, \ldots, w_n]$, the acronym position is $P$, $w_P$ is the acronym, the correct interpretation is $d_i$, and the acronym dictionary is $D = [d_1, d_2, \ldots, d_s]$, where $s$ is the number of definitions of each acronym in the dictionary. The acronym disambiguation task is to select the accurate interpretation $d_i$ from the interpretation dictionary $D$ according to the acronym $w_P$ and the sentence $S$. That is, the prediction of the model is $\arg\max_{d_i} p(d_i \mid S, w_P, D)$.

Overview of the Model.
Unlike traditional methods, which only evaluate the similarity between the candidate interpretation and the original text, this paper also integrates the matching degree between the candidate interpretation and the acronym context. The candidate interpretation replaces the acronym in the original text to form a set of new sentences, and the new sentences and the original text are then fed to the Siamese network. Training minimizes the loss function: if the label is 1 (similar), the embeddings of the two sentences are pulled as close as possible; otherwise, they are pushed as far apart as possible. This paper constructs sentence pairs at two levels: the phrase level and the sentence level. The first pairs the candidate interpretation directly with the original text (the interpretation combination, a single interpretation), as in [29]. The second pairs the new sentence, formed by replacing the acronym in the sentence with the candidate interpretation, with the original sentence (the sentence combination). Specific examples are shown in Table 2.
It can be seen from Table 2 that in this experiment there is no need to add any special characters to focus the attention of the model; the two inputs are simply fed into the model directly. The interpretation combination simulates the scenario in which the candidate interpretation is compared directly with the original text: the candidate interpretation is encoded, the encoding is transformed through the pooling layer into a vector with the same dimension as the encoding of the original text, and the two are then compared. The sentence combination compares the original text with the new sentence formed by replacing the corresponding acronym in the original text with the candidate interpretation. It is essentially a judgment among multiple sentences with the same context, so the model needs to learn the differences from the sentence perspective. The acronym disambiguation task constructed in this section falls within the scope of Siamese neural networks. We evaluate the semantic similarity of the two kinds of sentence pairs by cosine similarity and take it as the score of the corresponding interpretation in the combination. The procedure is as follows: the s sentence combinations and s interpretation combinations (s is the number of candidate definitions of the target acronym in the dictionary) are fed into the Siamese network separately, cross-entropy loss is used as the loss function for both the interpretation combination and the sentence combination, and the encoding of each sentence pair is obtained. Figure 2 shows the general framework of the Siamese network structure based on BERT. We use the interpretation combination to obtain the similarity score between the original sentence and the paraphrase and use the sentence combination to obtain the similarity score between the original sentence and the paraphrase in the context-aware case. Finally, we use the weighted sum of the two as the final result.
The SiameseNet in the figure represents the Siamese network model based on BERT, in which BERT mainly refers to the currently common BERT models, including BERT [12], RoBERTa [27], and Sci-BERT [15].
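The two kinds of sentence pairs described in this overview can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `build_pairs`, the whitespace tokenization, and the example sentence are assumptions.

```python
def build_pairs(tokens, acronym_index, candidates, gold=None):
    """Build interpretation and sentence combinations for one sample.

    Each pair is (original_sentence, other_text, label). The label is 1 for
    the gold expansion, 0 otherwise, and None when no gold answer is given.
    """
    original = " ".join(tokens)
    interpretation_pairs, sentence_pairs = [], []
    for cand in candidates:
        label = None if gold is None else int(cand == gold)
        # Phrase level: the candidate expansion alone against the sentence.
        interpretation_pairs.append((original, cand, label))
        # Sentence level: the acronym is replaced by the candidate expansion.
        replaced = tokens[:acronym_index] + [cand] + tokens[acronym_index + 1:]
        sentence_pairs.append((original, " ".join(replaced), label))
    return interpretation_pairs, sentence_pairs


# Hypothetical example sentence; "RF" sits at token index 2.
tokens = "We use RF regression on tree features".split()
interp, sent = build_pairs(
    tokens, 2, ["random forest", "regression function"], gold="random forest"
)
```

Each acronym with s candidates yields s pairs of each kind, so the pair construction does not depend on the size of the dictionary.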

Siamese Neural Networks Based on BERT (SiameseNet).
We use Siamese neural networks based on BERT [39] to evaluate two correlations: that between the acronym candidate and the original text, and that between the original text and the new sentence formed by replacing the acronym with the candidate. Then, we rank the candidate interpretations according to the two correlations. The candidate interpretation with the highest correlation is selected as the answer. The Siamese neural network structure can be divided into a regression objective structure and a classification objective structure according to the task, as shown in Figure 3.
In the figure, the model feeds the two sentences ($s_1$ and $s_2$) into the BERT model for embedding and unifies the sentence embedding dimension through the pooling layer to obtain two sentence vectors $u$ and $v$ with the same dimension:

$$u = \mathrm{Pooling}(\mathrm{BERT}(s_1)), \quad v = \mathrm{Pooling}(\mathrm{BERT}(s_2)). \tag{1}$$

Then, we augment the embedding by $|u - v|$, which means subtracting the two vectors $u$ and $v$ element-wise and taking the absolute value. The vectors $u$, $v$, and $|u - v|$ are concatenated and passed into the fully connected layer, followed by the softmax layer, to obtain the final predicted score [40]. The objective function can be expressed as follows:

$$o = \mathrm{softmax}(\mathrm{FC}([u; v; |u - v|])), \tag{2}$$

where FC means the fully connected layer. We use the classification-based method and turn the acronym disambiguation task into a classification task based on sentence similarity.
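The classification head described above (concatenate u, v, and |u − v|, apply a fully connected layer, then softmax) can be illustrated with a small pure-Python sketch. The vectors and weights here are random stand-ins, not BERT outputs or trained parameters, and the function names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(u, v, weights, bias):
    """Compute softmax(FC([u; v; |u - v|])) for two pooled sentence vectors."""
    features = u + v + [abs(a - b) for a, b in zip(u, v)]
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy setup: 4-dimensional "sentence vectors" and an untrained 2-class layer.
rng = random.Random(0)
dim = 4
u = [rng.uniform(-1, 1) for _ in range(dim)]
v = [rng.uniform(-1, 1) for _ in range(dim)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(3 * dim)] for _ in range(2)]
bias = [0.0, 0.0]
probs = classification_head(u, v, weights, bias)
```

In a real model, u and v would come from the shared BERT encoder and pooling layer, and the fully connected layer would be trained jointly with the encoder.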
To enhance the robustness and generalization of the model, adversarial loss is used to train the SiameseNet network [39]. In the objective loss function, CE is the cross-entropy loss, which measures the similarity between the actual output probability and the expected output probability, m represents the margin value, and N is the number of training samples. Y takes the value 1 or 0: it is 0 if the two inputs are similar and 1 otherwise. If the difference between the two inputs is less than the margin, the loss is calculated; otherwise, the loss is 0.
In the process of adversarial training, this loss function can be split into a loss function for similar samples and a loss function for different samples (formula (4)).

Score Fusion.
We use the loss function of formula (4) to train the SiameseNet models of the interpretation combination and sentence combination structures separately. After that, the two trained models are used for inference to obtain semantic similarity, and score fusion is performed on this basis, as shown in Figure 2.
The output of SiameseNet is the softmax score in equation (2), which can be regarded as $p(d_i \mid S, w_P, D)$ and represents the similarity of the i-th candidate paraphrase to the original sentence. The scores of the sentence dimension are $p = [p_1, p_2, \ldots, p_s]$. The scores of the word dimension and the sentence dimension are then added with weights to give the final score, where α is the weighting coefficient.
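As a hedged sketch of the score fusion, the snippet below assumes the final score is α times the sentence-level score plus (1 − α) times the phrase-level score, which matches the later description of α as the weight of the sentence combination. The function names and the scores are invented for illustration.

```python
def fuse_scores(sentence_scores, phrase_scores, alpha=0.9):
    """Weighted sum of sentence-level and phrase-level scores per candidate."""
    if len(sentence_scores) != len(phrase_scores):
        raise ValueError("both score lists must cover the same candidates")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        alpha * s + (1.0 - alpha) * p
        for s, p in zip(sentence_scores, phrase_scores)
    ]


def predict(candidates, sentence_scores, phrase_scores, alpha=0.9):
    """Return the candidate with the highest fused score and all fused scores."""
    fused = fuse_scores(sentence_scores, phrase_scores, alpha)
    best_index = max(range(len(fused)), key=lambda i: fused[i])
    return candidates[best_index], fused


# Made-up scores for three candidate expansions of "RF".
candidates = ["random forest", "regression function", "regression forest"]
sentence_scores = [0.80, 0.40, 0.60]
phrase_scores = [0.30, 0.70, 0.50]
best, fused = predict(candidates, sentence_scores, phrase_scores, alpha=0.9)
```

With these toy numbers the sentence-level evidence dominates at α = 0.9, while α = 0 falls back to the phrase-level ranking alone.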

Dataset and Benchmarks.
We selected the AD dataset of the SDU challenge at AAAI-2021 for experimental demonstration [9]. The original data include 50,034 training samples, 6,189 validation samples, and 6,218 test samples. However, because the labels of the test samples are not disclosed and the challenge has been closed, this paper randomly selects 10% of the 50,034 training samples as the validation set and uses the original validation set as the test set. Therefore, our experiment has 45,031 training samples, 5,003 validation samples, and 6,189 test samples. The baselines in this paper are the benchmark model of the dataset and single interpretation scoring, which are compared with sentence matching and dual scoring. As for the encoder, this paper runs experiments with BERT-base, RoBERTa-base, and Sci-BERT-base to provide a reference for follow-up research.
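The 10% hold-out described above can be sketched as a seeded random split. The seed, the function name, and the use of integer placeholders for samples are assumptions; the paper does not state how the random selection was implemented.

```python
import random


def split_train_validation(samples, validation_fraction=0.1, seed=42):
    """Randomly hold out a fraction of the training samples for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_validation = int(len(samples) * validation_fraction)
    validation_indices = set(indices[:n_validation])
    train = [s for i, s in enumerate(samples) if i not in validation_indices]
    validation = [s for i, s in enumerate(samples) if i in validation_indices]
    return train, validation


# Placeholder samples standing in for the 50,034 original training samples.
samples = list(range(50034))
train, validation = split_train_validation(samples)
```

Truncating 10% of 50,034 gives 5,003 validation samples and leaves 45,031 for training, which agrees with the counts reported in the text.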
BERT refers to the BERT model initially proposed by Google. BERT-base uses 12 stacked transformer encoder layers, each with 12 attention heads and a hidden size of 768, and the model has about 110 million parameters in total [12].
RoBERTa stands for robustly optimized BERT pretraining approach and is a carefully tuned version of BERT [27]. The model mainly makes the following improvements to BERT: ① Dynamic masking is adopted for model training. BERT uses static masking, that is, the data are masked in advance, while RoBERTa generates the mask when a sequence is fed to the model, which means that the same data may be masked differently in different epochs. The RoBERTa authors argue that this lets the model learn more language representations. ② More training data, more model parameters, a larger batch size, and a longer training time are used. ③ In RoBERTa, the next sentence prediction task is removed, and multiple sentences are input contiguously until the maximum length is reached (whether inputs may cross document boundaries is configurable). This means that the model can read longer text sequences at once. This training method is called full sentences. ④ BERT uses a character-level BPE vocabulary of about 30 K subword units, while RoBERTa uses a byte-level BPE vocabulary of 50 K subword units, without any additional preprocessing or word segmentation of the input.
Sci-BERT is pretrained on a total of 1.14 million scientific papers, of which 82% are from biomedicine, 12% from computer science, and 6% from other disciplines [15], so it is more suitable for natural language processing tasks on scientific papers. In the SDU task of AAAI-2021, the data are collected from scientific and technological papers. Therefore, among the existing models, those based on Sci-BERT often perform better than the others. The encoder can also be replaced with other network models, but the performance may be relatively poor. In this way, the relationship between the acronym and the interpretation and the relationship between the context of the acronym and the interpretation can be considered simultaneously, while robustness is also taken into account. The model structure is independent of the length of the acronym dictionary and can deal with candidate dictionaries of any length.

Evaluation Protocol.
In acronym disambiguation, precision, recall, and F1 score are usually used for evaluation:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2PR}{P + R},$$

where TP represents the number of entities predicted correctly, that is, the number of sequences whose predicted sequence is consistent with the real sequence, FP represents the number of entities predicted incorrectly, and FN represents the number of correct entities that the model failed to predict.
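A minimal sketch of macro-averaged precision, recall, and F1 over expansion labels is given below. It assumes the macro F1 is the mean of per-label F1 values; the official challenge scorer may aggregate differently, and the function name and example labels are hypothetical.

```python
from collections import Counter


def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over all labels seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[g] += 1  # the gold label gains a false negative
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


gold = ["random forest", "random forest", "regression function", "regression forest"]
pred = ["random forest", "regression function", "regression function", "regression forest"]
macro_p, macro_r, macro_f1 = macro_prf(gold, pred)
```

Macro averaging gives every expansion the same weight regardless of how often it occurs, which matters here because most acronyms have only a few frequent expansions.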

Experimental Process.
The experiment is divided into five steps: validation set division, dataset processing, model training, validation evaluation, and result evaluation. The specific operations are as follows.
Step 1: validation set partition. Since the submission channel of the original challenge task has been closed, 10% of the data are extracted from the training set as the validation set, and the official validation set is used as the test set.
Step 2: data processing. Replace the corresponding acronyms in the original text with the candidate interpretations to form a new sentence set and pair it with the original text. The sentence pair containing the accurate interpretation is labelled 1, and the sentence pairs containing the other candidate interpretations are labelled 0 (for the single interpretation, the candidate interpretation is paired directly with the original text to form a sentence pair, and the labelling is the same as for sentence matching). In effect, this also expands the dataset.
Step 3: model training. For each BERT model (mainly BERT-base, RoBERTa-base, and Sci-BERT-base), training is carried out with the Sentence-Transformers framework, and the loss function is a contrastive loss.
Step 4: validation evaluation. The trained model is used to encode each sentence pair in the test set, and the cosine similarity is calculated. The candidate interpretation corresponding to the sentence pair with the highest cosine similarity is considered to be the correct interpretation. The values of α are 0, 0.1, 0.2, 0.3, ..., 0.9, 1.0, respectively.
Step 5: result evaluation. Results were evaluated using precision, recall, and their harmonic mean, the F1 value, i.e., P, R, and F1 in the table. The official ranking is mainly based on the macro F1 value. Because the challenge submission channel has been closed, this paper compares directly with the binary classification model on the official validation set (the test set in this paper).
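Step 4 above ranks candidates by cosine similarity between encoded sentence pairs. The sketch below shows that ranking on hand-written vectors; in the actual pipeline the vectors would come from the trained Siamese encoder, and the names and numbers here are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(original_vec, candidate_vecs):
    """Sort candidates by similarity to the original sentence, best first."""
    scores = {
        name: cosine_similarity(original_vec, vec)
        for name, vec in candidate_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for encoder outputs.
original_vec = [0.9, 0.1, 0.3]
candidate_vecs = {
    "random forest": [0.8, 0.2, 0.3],
    "regression function": [0.1, 0.9, 0.2],
}
ranking = rank_candidates(original_vec, candidate_vecs)
predicted = ranking[0][0]
```

The top-ranked candidate is taken as the prediction, so the procedure works unchanged for any number of candidates.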

Experimental Results.
The pretrained models in this paper are mainly taken from Hugging Face (https://huggingface.co/models), and the Sentence-Transformers framework is used to build the model [22]. At the same time, this paper does not add any characters to the text or carry out any preprocessing, in order to test the robustness of the model to unprocessed data. The experimental results are shown in Table 3.
The binary classification model in the table indicates the results obtained by the current state-of-the-art model on the official validation set (i.e., the test set of this experiment) using the corresponding pretrained model [16]. All other papers report results only on the test dataset, but we cannot obtain results for our method on the original test set because official access to it has been closed. It can be seen from Table 3 that, under the same training conditions, sentence matching performs significantly better than matching the interpretation directly with the original document. The final score is the weighted sum of sentence matching and single interpretation. Experiments were conducted with values of α from 0.0 to 1.0, and the results on the validation dataset showed that a sentence combination weight of 0.9 and a paraphrase combination weight of 0.1 worked best. Among the standard models, the dual scoring macro F1 value of the Siamese network using Sci-BERT, that is, the F1 value used for the official ranking, is the highest, reaching 91.965, 2.95% higher than that of the official embedding method. However, RoBERTa performs relatively poorly. The reason may be that the model based on the Siamese network mainly obtains the sentence embedding through fine-tuning, while RoBERTa's dynamic masking and full-sentences training mechanisms may lead the model to attend to the same sentence differently in different epochs. Even so, the Siamese network method is better than the binary classification model across the four models.

Effect of the Fusion Hyperparameter
The score fusion has an adjustable parameter α, ranging from 0.0 to 1.0, which serves as the weight of the sentence combination similarity. Its influence on the results is shown in the table: a suitable choice of α affects the classification results. Figure 4 shows the F1 results for various values of α. The most accurate classification is obtained when α is 0.9. When α equals 0, the result is that of the interpretation combination alone; when α equals 1, it is that of the sentence combination alone.
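The role of α can be made concrete with a short sketch of the weighted sum. It assumes the fused score takes the form α · (sentence combination score) + (1 − α) · (interpretation combination score), which matches the two limiting cases described above. The function names and the example scores are hypothetical, not taken from the experiments.

```python
def fuse_scores(sentence_scores, interpretation_scores, alpha=0.9):
    """Weighted sum of the two similarity scores for each candidate.

    alpha weights the sentence combination; (1 - alpha) weights the
    interpretation combination. alpha=0.9 is the best value reported.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(sentence_scores) != len(interpretation_scores):
        raise ValueError("both score lists must have the same length")
    return [
        alpha * s + (1.0 - alpha) * i
        for s, i in zip(sentence_scores, interpretation_scores)
    ]


def pick_best(candidates, sentence_scores, interpretation_scores, alpha=0.9):
    """Return the candidate interpretation with the highest fused score."""
    fused = fuse_scores(sentence_scores, interpretation_scores, alpha)
    best_index = max(range(len(candidates)), key=fused.__getitem__)
    return candidates[best_index]


# Illustrative scores only (made up for this example).
candidates = ["convolutional neural network", "cable news network"]
sentence_scores = [0.80, 0.60]
interpretation_scores = [0.20, 0.90]
best = pick_best(candidates, sentence_scores, interpretation_scores)
```

With these made-up scores, the two components disagree, and the choice of α decides which candidate wins; this is the kind of effect that a sweep over α is meant to expose.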

Data Preprocessing
A statistical analysis of the provided acronym dictionary shows that it contains a total of 732 acronyms. Each acronym has about 3 interpretations on average, with a maximum of 20 and a minimum of 2. Of these, 660 acronyms have fewer than five interpretations, accounting for 90.16% of the total; 55 acronyms have between 5 and 10 interpretations, accounting for 7.51%; 13 acronyms have between 10 and 15 interpretations, accounting for 1.78%; and only four acronyms have more than 20 interpretations, accounting for 0.55%. An analysis of the test set shows that the numbers of samples containing these four acronyms, namely "CA," "CS," "CC," and "SC," are 44, 40, 35, and 18, respectively, accounting for only 2.21% of the total number of samples. Most samples involve acronyms with fewer than five interpretations. Moreover, these four acronyms all correspond to two-word phrases; acronyms for such phrases carry little meaning and instead make the text harder to understand. A detailed overview is shown in Figure 5. However, an advantage of machines over humans is that they can process more information and data, so this section expands the existing lexicon using the AcronymFinder website (https://acronymfinder.com) and conducts experiments to verify the robustness of the model.
Figure 5 shows that more than half of the acronyms in the lexicon have only two interpretations. Therefore, we verify the sensitivity of the model by increasing the number of interpretations per acronym with the help of the acronym website. The expansion threshold is denoted Num: every acronym with fewer than Num interpretations is expanded to Num interpretations. The distribution of the expanded dictionary is shown in Figure 6, and the analysis in this section uses this extended lexicon. First, only the test set and validation set are expanded, and predictions are made with the model trained on the initial training set to demonstrate the sensitivity of the model. Then, the entire dataset is expanded, and the model is retrained on the expanded training set to demonstrate its scalability.
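The expansion rule with threshold Num can be sketched as follows. This is only an illustration under stated assumptions: the extra interpretations are taken to be available as a lookup table already collected from an external source, and the function name `expand_dictionary`, the argument `external`, and the example entries are hypothetical.

```python
def expand_dictionary(dictionary, external, num):
    """Expand every acronym with fewer than `num` interpretations up to `num`.

    dictionary: acronym -> list of existing interpretations
    external:   acronym -> list of extra interpretations (assumed to have
                been collected beforehand from an external resource)
    Acronyms that already have at least `num` interpretations are unchanged.
    If too few extra interpretations exist, the entry stays below `num`.
    """
    expanded = {}
    for acronym, interpretations in dictionary.items():
        result = list(interpretations)  # copy so the input is not mutated
        for extra in external.get(acronym, []):
            if len(result) >= num:
                break
            if extra not in result:  # skip duplicates of existing entries
                result.append(extra)
        expanded[acronym] = result
    return expanded


# Illustrative entries only.
dictionary = {"CNN": ["convolutional neural network", "cable news network"]}
external = {"CNN": ["cable news network", "cellular neural network"]}
expanded = expand_dictionary(dictionary, external, num=3)
```

Every added interpretation becomes an extra negative candidate for each sample of that acronym, which is why the number of negative samples grows as Num increases.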

Sensitivity Experiment Results
An overview of the dataset augmented according to the expanded dictionary is shown in Table 4.
From Table 4, when Num = 3, the test dataset grows by 7.70%, i.e., the number of negative samples in the test dataset grows by 7.70%. When Num = 6, the test set grows by a total of 50.23% in negative samples. When Num = 10, the increase is 120.39%, and the ratio of positive to negative samples in the dataset is nearly 1 : 9.
This experiment uses the Siamese network framework based on SciBERT, which has the highest F1 value. The model is trained for 4 epochs with a batch size of 16 and a maximum encoding length of 400. Validation is performed once every 500 batches, and the model with the best performance on the validation set is retained. The detailed experimental results are shown in Figure 7. Overall, the model performance fluctuates as Num changes. Although there is an overall decreasing trend, the total fluctuation is within 2%. For the interpretation combination, the recall value is the lowest, followed by the F1 value, while precision is the highest. It is noteworthy that both precision and F1 reach their maximum at Num = 3, when the growth rate of the test set is 7.70%. The mean, range, and variance of the F1 values of the three models are shown in Table 5.
Most existing studies directly compare the candidate paraphrases with the original sentences; that is, they use the interpretation combination. However, Table 5 shows that the interpretation combination has a lower F1 score than the sentence combination, although it is more stable: both its range and its variance are lower. The score fusion combines the advantages of both, with a higher average F1 value together with a lower range and variance. Therefore, the score fusion is more robust and efficient than existing interpretation combination methods.
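The three stability statistics used in this comparison can be computed as in the sketch below. It assumes population variance, since the text does not say which variance definition is used, and the F1 values in the example are placeholders rather than results from the experiments. The function name `stability_summary` is hypothetical.

```python
from statistics import mean, pvariance


def stability_summary(f1_values):
    """Mean, range, and variance of F1 values measured across Num settings.

    Assumption: population variance (pvariance); the sample variance
    would differ by a factor of n / (n - 1).
    """
    if not f1_values:
        raise ValueError("at least one F1 value is required")
    return {
        "mean": mean(f1_values),
        "range": max(f1_values) - min(f1_values),
        "variance": pvariance(f1_values),
    }


# Placeholder values for illustration only, not results from the paper.
summary = stability_summary([90.0, 91.0, 92.0])
```

A higher mean with a smaller range and variance is what "more robust" means in this comparison: the score stays high and moves little as the dictionary grows.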

Scalability Experiment Results
Scalability experiments are performed on an expanded training set using the same model with the same parameters and environment. An overview of the dataset after expanding the training set is shown in Table 6. It can be seen from the table that after the training set is expanded, the growth rate of the total sample size is similar to that of the test set. However, with the same batch size of 16, a single epoch on the original training data requires 12,715 iterations, and when training on an RTX 3090, a single epoch takes about 43 minutes (only the duration of the first epoch is recorded). With the expanded dataset, when Num = 3, a single epoch requires 13,759 iterations and takes about 57 minutes; validation is run once every 500 iterations during training, and the size of the validation set also increases. When Num = 4, a single epoch takes about 1 hour and 27 minutes; when Num = 10, a single epoch takes about 5 hours and 2 minutes, so four epochs take more than 20 hours, and the consumption of electricity and computing resources is enormous. A comparison of the F1 values of the original model and the model retrained on the expanded dataset is shown in Figure 8.
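The training cost reported above follows from simple arithmetic, sketched here. The per-epoch durations are the approximate figures quoted in the text, converted to minutes, and the four-epoch setting is the one used in the sensitivity experiment; the function names are hypothetical.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def total_training_time(minutes_per_epoch, epochs):
    """Total training time as (hours, minutes), assuming equal epoch length."""
    return divmod(minutes_per_epoch * epochs, 60)


# Approximate per-epoch durations quoted in the text, in minutes.
EPOCH_MINUTES = {"original": 43, "Num=3": 57, "Num=4": 87, "Num=10": 302}

for setting, minutes in EPOCH_MINUTES.items():
    hours, rest = total_training_time(minutes, epochs=4)
    print(f"{setting}: about {hours} h {rest} min for 4 epochs")
```

The estimate assumes every epoch takes as long as the first one, which is the only epoch whose duration was recorded, so the totals are rough.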
The time and resource cost of retraining is several times that of training the original model, but as Figure 8 shows, the results of the retrained model are very close to, or even worse than, those of the original model. This means that the model has good scalability: in practice, a small-scale dictionary can be used for training, and the trained model can then be applied to a large-scale dictionary to save resources.

Conclusion
In this paper, we propose ContextAD, a context-aware similarity ranking method that exploits the complete substitutability between an exact paraphrase and its acronym. ContextAD performs ranking prediction by comparing the similarity between new sentences containing candidate paraphrases and the original sentences containing acronyms. A score fusion method is then designed to weight and rank candidates according to the similarity scores of the interpretation and sentence combinations, improving performance and robustness. The experimental results show that the model surpasses the state of the art without requiring additional trained models or data. In addition, we design an experiment that extends the number of acronym paraphrases, which effectively verifies the robustness of the model.
In future work, we will conduct further research in two directions. (1) Multilingual applications. Acronyms are not unique to English; Chinese (Pinyin), Spanish, and French all exhibit this phenomenon. Therefore, we will carry out multilingual and even cross-lingual disambiguation to better understand scientific literature. (2) Large-model generative disambiguation. With the development of large-scale generative language models, we will study disambiguation methods that directly generate acronym paraphrases.

Data Availability
The datasets and evaluation scripts can be accessed through the following link: https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD. The other supplementary data are described in the article. All of them are publicly available.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.