Leveraging Pretrained Language Models for Enhanced Entity Matching: A Comprehensive Study of Fine-Tuning and Prompt Learning Paradigms

Pretrained Language Models (PLMs) acquire rich prior semantic knowledge during the pretraining phase and utilize it to enhance downstream Natural Language Processing (NLP) tasks. Entity Matching (EM), a fundamental NLP task, aims to determine whether two entity records from different knowledge bases refer to the same real-world entity. This study, for the first time, explores the potential of using a PLM to boost the EM task through two transfer learning techniques, namely, fine-tuning and prompt learning. Our work also represents the first application of the soft prompt to an EM task. Experimental results across eleven EM datasets show that the soft prompt consistently outperforms the other methods in terms of F1 scores across all datasets. Additionally, this study investigates the capability of prompt learning in few-shot learning and observes that the hard prompt achieves the highest F1 scores in both the zero-shot and one-shot contexts. These findings underscore the effectiveness of prompt learning paradigms in tackling challenging EM tasks.


Introduction
In the era of big data, extensive Knowledge Bases (KBs) or Knowledge Graphs (KGs) have been constructed, serving as structured repositories of knowledge about the world [1, 2]. However, the entities coming from different KBs are often heterogeneous and presented using different attributes. Figure 1 illustrates the disparities in attribute values for the same product in two different online shopping KBs. When integrating KBs to build recommendation systems or question-answering systems [3-5], these disparities can lead to increased redundancy and reduced performance in downstream tasks. Entity Matching (EM), a fundamental knowledge extraction task in Natural Language Processing (NLP), aims to determine whether two entity records from different KBs refer to the same real-world entity, thereby addressing the aforementioned challenge [6].
Early EM methods were based on edit distance, which is convenient but of limited practical use. Machine learning-based approaches transform EM into a binary classification problem using classifiers such as the Support Vector Machine (SVM) [7]. However, because these methods require manual feature engineering, their generalization is limited. With the rise of deep learning, researchers have also attempted to tackle the matching problem with techniques such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) [8, 9]. However, these deep learning-based approaches can only capture the semantic knowledge implicit in the training set, and obtaining labelled training data is challenging.
In light of the aforementioned drawbacks associated with deep learning, researchers have proposed Pretrained Language Models (PLMs) consisting of multiple layers of Transformer blocks [10], such as BERT [11] and ERNIE [12].
Initially, these models acquire prior semantic knowledge from extensive unlabelled text corpora through pretraining tasks such as the Masked Language Model (MLM) [11] and Next Sentence Prediction (NSP) [11]. Subsequently, this semantic knowledge can be employed to enhance a variety of downstream NLP tasks [13-15]. This can be regarded as a form of transfer learning, and there are currently two popular paradigms: fine-tuning and prompt learning [16].
The fine-tuning paradigm involves modifying the model structure for downstream tasks, for example by adding an additional classifier on top of the PLM's encoder and discarding the decoder part [17]. Fine-tuning therefore introduces a discrepancy between the training objectives of the downstream and pretraining tasks, making it challenging for the model to fully leverage the semantic knowledge acquired during pretraining [18]. In contrast, prompt learning reformulates downstream tasks based on the pretraining task of a PLM, utilizing all the parameters of the PLM, including both the encoder and the decoder, rather than only the encoder. Taking the most representative BERT-series PLMs as an example, prompt learning bridges the gap between downstream tasks and the pretraining task by wrapping the raw input with prompt templates containing [MASK] tokens, stimulating the PLM's semantic understanding capability by reproducing the MLM process. Downstream tasks are thus transformed into predictions for these placeholders, resulting in performance enhancement [19-21]. For example, Jin et al. [17] proposed a "Word Transferred LM" for sentiment analysis, transferring the target words of a sentence into pivot tokens via MLM. Zhao et al. [22] developed a series of prompt learning approaches, called PromptMR, investigating how prompt learning can improve metonymy resolution. The methods they proposed achieved competitive accuracy compared to baseline models.
Evidently, the key to applying prompt learning lies in the textual prompt templates. Depending on how templates are generated, prompt learning can be categorized into the hard prompt (or discrete prompt) and the soft prompt (or continuous prompt) [18]. For the former, templates consist of fixed tokens [23], while for the latter, templates are vectors that can be learned in a continuous space [24]. Therefore, although the templates generated by the soft prompt may not be understandable as natural language by humans, they have the capability to discover more suitable template embeddings. However, there is no published work that comprehensively compares the fine-tuning and prompt learning paradigms on EM tasks, and the properties of prompt learning in EM remain unexplored.
The present research provides, for the first time, a comprehensive comparison between the fine-tuning and prompt learning paradigms for the EM task and explores the capabilities of prompt learning in the context of few-shot learning. This is also the first study on how to apply soft prompts to EM tasks. The main contributions of this study can be summarized as follows: (1) We conduct a comprehensive comparison of the fine-tuning and prompt learning paradigms when applied to EM. Specifically, we transform the structured attribute values of two entity records into textual descriptions. Given that the BERT-series PLMs are widely used and representative, we chose ERNIE-2.0-en, which shares the same architecture and pretraining tasks as BERT, as the backbone model. For fine-tuning, we train a binary classifier using the representation of [CLS] to determine whether the two entity records are "consistent." For the hard prompt, we utilize a template consisting of fixed tokens to convert the original input into sequences with [MASK] tokens and predict these placeholders; in this way, the downstream EM task is transformed into the pretraining task MLM. The approach for the soft prompt is similar to that of the hard one, but the template consists of pseudo tokens whose embeddings are searched for in a continuous space using a Multilayer Perceptron (MLP).

Related Works
Early research into EM primarily utilized methods based on edit distance or machine learning, such as [25-27]. However, these methods either proved to be impractical or exhibited poorer generalization. Therefore, the majority of current EM research is based on deep learning or pretrained language models.

Deep Learning.
Deep learning has achieved remarkable results in the field of EM, driven by the development of computer hardware, especially the Graphics Processing Unit (GPU) [8]. For example, Di Cicco et al. [28] introduced a methodology to produce explainable deep learning models for the EM task. Nie et al. [29] proposed a deep sequence-to-sequence entity matching model, denoted as Seq2Seq-Matcher, which can effectively solve heterogeneity problems by modelling EM as a token-level sequence-to-sequence matching task. Koolin et al. [30] proposed an EM approach that is mainly based on a record linkage process and detects records that refer to the same entity. Gottapu et al. [31] used a single-layer convolutional neural network to perform an EM task. Kasai et al. [32] explored the performance of deep learning in the low-resource EM setting, designing an architecture that can transfer a model learned in a high-resource setting to a low-resource one. These deep learning-based methods can learn features from training data, eliminating the need for manual feature engineering. However, the semantic knowledge they acquire is limited to the training set, which constrains the performance of deep learning-based EM models, especially considering that obtaining labelled training data is challenging.

Pretrained Language Models.
PLMs consisting of multiple Transformer blocks, such as BERT [11] and ERNIE [12], can acquire prior semantic knowledge from large-scale unlabelled corpora during the pretraining phase and apply this knowledge to downstream tasks. Consequently, PLM-based approaches outperform deep learning-based methods in various NLP tasks. There has also been research focusing on the application of PLMs to EM. For example, Chen et al. [33] proposed a transfer-learning EM approach, leveraging a knowledge base constructed through PLMs. Mehdi et al. [34] investigated whether PLM-based EM models can be trusted in real-world applications where the data distribution differs from that of training. Owing to the differences in training objectives between fine-tuning and pretraining, recent efforts have focused on employing prompt learning to bridge the gap between pretraining and downstream tasks, namely, utilizing all the parameters in both the encoder and the decoder of a PLM. The key to conducting prompt learning lies in reformulating the downstream target task based on textual prompts. There are two types of textual prompts: cloze prompts, which fill in the blanks of a textual string, and prefix prompts, which continue a string prefix [18]. In addition, prompt learning can generally be categorized into two types: the hard prompt [23, 35] and the soft prompt [24, 36]. The difference is that the hard prompt uses fixed templates, whereas the soft prompt allows the template to be learned in a continuous space. According to our literature review, there has been no comprehensive analysis of fine-tuning and prompt learning specifically for EM.

Methods
This section first provides a detailed introduction to the problem definition of the EM task, followed by a thorough presentation of the specific model structures for the two paradigms: fine-tuning and prompt learning.

Problem Definition.
The EM task aims to determine whether two entity mentions or records refer to one real-world entity. Specifically, we are given a dataset $D = \{(E_1, E_2), A, Y\}$, where $E_1$ and $E_2$ are sets of entity mentions, $A$ denotes the set of attributes, and $Y$ denotes the set of true labels. For any $e_i \in E_1$ and $e_j \in E_2$, both are composed of $n$ attributes, i.e., $e_i = \{a_{i1}, a_{i2}, a_{i3}, \ldots, a_{in}\}$ and $e_j = \{a_{j1}, a_{j2}, a_{j3}, \ldots, a_{jn}\}$, where $(a_{i1}, \ldots, a_{in}, a_{j1}, \ldots, a_{jn}) \in A$. Assuming the relation between two entity mentions is represented by a mapping function $f(\cdot)$, the predicted label $y_k' = f(e_i, e_j)$ should be the same as the true label $y_k \in Y$. The goal of this study is to construct an appropriate model to represent the mapping function $f(\cdot)$.
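To make this definition concrete, the following minimal Python illustration (with invented attribute values that do not come from the benchmark datasets) shows a candidate pair under a shared attribute set $A$ and its true label:

```python
# A toy entity pair under the shared schema A = {title, manufacturer, price}.
# Attribute values are invented for illustration only.
e_i = {"title": "adobe photoshop elements 6 (pc)", "manufacturer": "adobe", "price": "99.99"}
e_j = {"title": "adobe photoshop elements 6",       "manufacturer": "adobe", "price": "95.00"}
y_k = 1  # true label: 1 = same real-world entity ("consistent"), 0 = "different"

# The EM model approximates the mapping f(e_i, e_j) -> {0, 1}; the following
# subsections realize f via fine-tuning or prompt learning.
```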

Fine-Tuning.
For the fine-tuning paradigm, we follow the method outlined in Figure 2. First, the structured key-value pairs are transformed into unstructured textual data denoted as $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ denotes the length of $X$. Then, the embeddings $E = \{e_{cls}, e_1, e_2, \ldots, e_n, e_{sep}\}$ are acquired by adding the [CLS] and [SEP] tokens around $X$ and performing an embedding table lookup. The PLM generates the representations, denoted as $H = \{h_{cls}, h_1, h_2, \ldots, h_n, h_{sep}\}$, for each token within the input sequence. In this context, the EM task can be regarded as a binary classification task, and the objective is to ascertain whether the two entities are identical or dissimilar. Finally, the representation corresponding to [CLS], denoted $h_{cls}$, is used to calculate the predicted label $y_p$ through the following equation:

$$y_p = \mathrm{softmax}(W_c \cdot h_{cls} + b_c),$$

where $W_c \in \mathbb{R}^{M \times H}$ and $b_c \in \mathbb{R}^{M}$ are the learnable weight matrix and bias, initialized with random values, $M$ is the number of labels ($M = 2$ in this study), and $H$ is the dimension of the hidden layer.
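A minimal sketch of this fine-tuning pipeline is given below, written with PyTorch and the Hugging Face transformers API; the `bert-base-uncased` checkpoint is used only as a stand-in for ERNIE-2.0-en, and the serialization format and helper names are our own illustration rather than the authors' released code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def serialize(record: dict) -> str:
    # Flatten structured key-value pairs into a textual description.
    return " ".join(f"{key} {value}" for key, value in record.items())

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for ERNIE-2.0-en
encoder = AutoModel.from_pretrained("bert-base-uncased")

M, H = 2, encoder.config.hidden_size       # number of labels and hidden dimension
classifier = torch.nn.Linear(H, M)         # W_c and b_c, randomly initialized

def predict(e_i: dict, e_j: dict) -> int:
    # Builds "[CLS] serialize(e_i) [SEP] serialize(e_j) [SEP]" automatically.
    enc = tokenizer(serialize(e_i), serialize(e_j), return_tensors="pt", truncation=True)
    h_cls = encoder(**enc).last_hidden_state[:, 0]        # representation of [CLS]
    y_p = torch.softmax(classifier(h_cls), dim=-1)        # y_p = softmax(W_c * h_cls + b_c)
    return int(y_p.argmax(dim=-1))
```

In training, the classifier and encoder parameters would be updated jointly with a cross-entropy loss over the labelled pairs.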

Prompt Learning.
In contrast to fine-tuning, prompt learning transforms the downstream task into the form of the pretraining task, aligning the objectives of the pretraining and downstream tasks. In this study, the downstream task is transformed into the MLM task, since we select ERNIE-2.0-en, a BERT-like PLM, as the backbone model. According to the construction method of the prompt templates, prompt learning can be categorized as either the hard prompt or the soft prompt.
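As a rough illustration of how the MLM reformulation is read out at inference time, the sketch below scores two label words at the [MASK] position; the checkpoint name is again a stand-in for ERNIE-2.0-en, and the label words "consistent"/"different" follow the paper's wording, so the snippet should be read as an assumption-laden sketch rather than the authors' implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in for ERNIE-2.0-en
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Verbalizer: which dictionary words stand for which EM label.
label_words = {"consistent": 1, "different": 0}

def predict_from_prompt(prompted_text: str) -> int:
    enc = tokenizer(prompted_text, return_tensors="pt", truncation=True)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    logits = mlm(**enc).logits[0, mask_pos]               # MLM-head scores over the vocabulary
    scores = {w: logits[tokenizer.convert_tokens_to_ids(w)].item() for w in label_words}
    return label_words[max(scores, key=scores.get)]

# Usage with the hard-prompt template described in the next subsection, e.g.:
# predict_from_prompt(serialize(e_i) + " and " + serialize(e_j) + " they are [MASK].")
```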

Hard Prompt.
For the hard prompt, the structured key-value pairs are first transformed into two unstructured text chunks, as shown in Figure 3(a). Then, a new input sequence $X = \{x_1, x_2, \ldots, x_n\}$ is constructed based on the template "<sentence_1> and <sentence_2> they are [MASK]", where $x_n$ corresponds to the [MASK] token. Subsequently, $E = \{e_{cls}, e_1, e_2, \ldots, e_n, e_{sep}\}$ is obtained in the same way as in fine-tuning and fed into the PLM. Finally, the representation of the [MASK] token, $h_m \in H = \{h_{cls}, h_1, h_2, \ldots, h_n, h_{sep}\}$, generated by the PLM is passed into the MLM head, and the most likely word $w$, which represents the predicted label $y_p$, is selected from the dictionary through the following equation:

$$w = \arg\max \, \mathrm{softmax}(W_m \cdot h_m + b_m),$$

where $W_m \in \mathbb{R}^{K \times H}$ and $b_m \in \mathbb{R}^{K}$ are the learnable weight matrix and bias in the MLM head, initialized with the values learned during the pretraining process, $K$ is the size of the word dictionary, and $H$ is the dimension of the hidden layer.

Soft Prompt.
For the soft prompt, the structured key-value pairs are also transformed into two text chunks, as illustrated in Figure 3(b). Then, a new input sequence $X = \{x_1, x_2, \ldots, x_{p1}, x_{p2}, \ldots, x_n\}$ is constructed based on the template "<sentence_1> and <sentence_2> pseudo pseudo [MASK]", where $x_n$ again corresponds to the [MASK] token. It is worth noting that the template contains pseudo tokens, which can be represented using [UNK]; the number of pseudo tokens is a hyperparameter (set to 10 in this study). $E = \{e_{cls}, e_1, e_2, \ldots, e_{p1}, e_{p2}, \ldots, e_n, e_{sep}\}$ is still acquired through an embedding lookup operation, and the embeddings of the pseudo tokens are mapped by a Multilayer Perceptron (MLP) layer:

$$r_{p} = W_p \cdot e_{p} + b_p,$$

where $W_p \in \mathbb{R}^{L \times L}$ and $b_p \in \mathbb{R}^{L}$ are the learnable parameters of the MLP layer and $L$ is the embedding dimension. The final input sequence to the PLM is $\{e_{cls}, e_1, e_2, \ldots, r_{p1}, r_{p2}, \ldots, e_n, e_{sep}\}$. Finally, $h_m$ generated by the PLM is passed into the MLM head, and $w$ is selected from the dictionary through the same equation as above, where $W_m \in \mathbb{R}^{K \times H}$ and $b_m \in \mathbb{R}^{K}$ are again initialized with the values learned during pretraining. Through this approach, the soft prompt can find more appropriate template embeddings in a continuous space.
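The following sketch shows one way to realize the pseudo-token reparameterization of the soft prompt, assuming the backbone PLM is later fed through its inputs_embeds interface; the module name and the single linear layer mirroring the equation above are our own simplification.

```python
import torch

class SoftPromptMLP(torch.nn.Module):
    """Maps pseudo-token embeddings e_p into trainable prompt vectors r_p = W_p * e_p + b_p."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(embed_dim, embed_dim)   # W_p in R^{L x L}, b_p in R^L

    def forward(self, pseudo_embeds: torch.Tensor) -> torch.Tensor:
        return self.proj(pseudo_embeds)

L = 768                              # embedding dimension of the backbone PLM (assumed)
num_pseudo = 10                      # number of pseudo tokens used in this study
prompt_mlp = SoftPromptMLP(L)

# In practice, e_pseudo would be looked up from the [UNK] embedding row of the PLM;
# here a random tensor stands in for that lookup.
e_pseudo = torch.randn(num_pseudo, L)
r_pseudo = prompt_mlp(e_pseudo)      # continuous template embeddings, trained end to end
# The final PLM input is {e_cls, e_1, ..., r_p1, ..., r_p10, ..., e_mask, e_sep}.
```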

Experiments
This section evaluates the methods presented in Section 3 and describes the selected datasets, evaluation metrics, hyperparameters, and experimental results. For the software environment, we utilized the PaddlePaddle deep learning framework introduced by Baidu (https://github.com/paddlepaddle/paddle). As for the hardware environment, we employed a 2-core CPU, 16 GB of RAM, and an NVIDIA V100 GPU with 16 GB of memory. Additionally, given that this study marks the first application of the soft prompt to EM, we refer to the EM approach based on the soft prompt as "our model."

4.1. Datasets. We evaluated the proposed entity matching method using the datasets provided by Mudgal et al. [37]. These datasets differ in terms of type, domain, and size, allowing us to assess the generalizability of the entity matching model. Table 1 presents an overview of the datasets, indicating that they consist of two types: structured and dirty. The dirty datasets are obtained by modifying the structured datasets and are differentiated using indices 1 and 2. Specifically, for each attribute except "title," there is a 50% chance that its value will be randomly moved to the "title" attribute (a short sketch of this corruption rule is given below). This simulates a common kind of dirty data seen in real-life scenarios while keeping the modifications simple. The "Size" column represents the total number of labelled samples for each dataset. We split each dataset into three parts with a ratio of 3 : 1 : 1.

We also conduct a comparative analysis between the fine-tuning and prompt learning paradigms on both structured and dirty datasets. The corresponding results are presented in Tables 5 and 6, with the abbreviations "FT," "HP," and "SP" denoting "fine-tuning," "hard prompt," and "soft prompt," respectively; "ΔF1" quantifies the enhancement in F1 scores. Based on the experimental findings, it is evident that the two prompt learning approaches consistently outperform the fine-tuning paradigm across the majority of datasets, with the exception of the structured iTunes-Amazon and DBLP-Scholar datasets. For these particular datasets, fine-tuning and the hard prompt yield nearly identical F1 scores. Table 5 shows that, for the structured BeerAdvo-RateBeer dataset, the hard prompt delivers a notable increase of 3.6 percentage points in F1 score, whereas for the structured DBLP-ACM and DBLP-Scholar datasets, the F1 scores attained by the hard prompt are almost equivalent to those achieved through fine-tuning.
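As referenced in the dataset description above, a minimal sketch of the attribute-shuffling procedure used to derive the dirty datasets is shown below; it implements the described 50% move-to-title rule on simple dictionary records and is our reconstruction, not the original preprocessing script (whether the moved value replaces or is appended to the title is an assumption).

```python
import random

def make_dirty(record: dict, p_move: float = 0.5, rng=None) -> dict:
    """Simulate a dirty record: every attribute except 'title' has a p_move chance
    of having its value moved into the 'title' attribute and blanked out."""
    rng = rng or random
    dirty = dict(record)
    for attr, value in record.items():
        if attr != "title" and value and rng.random() < p_move:
            dirty["title"] = f"{dirty['title']} {value}".strip()  # assumption: append to title
            dirty[attr] = ""
    return dirty
```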
Notably, the utilization of the soft prompt consistently exhibited superior performance, yielding an improvement in F1 scores across all datasets. It is noteworthy that our observations indicate a potential correlation between the magnitude of the F1 improvement achieved through prompt learning and the scale of the training dataset: prompt learning tends to generate higher F1 gains for smaller datasets. To illustrate, consider the structured datasets BeerAdvo-RateBeer, DBLP-ACM, and DBLP-Scholar, all of which have four attributes. The BeerAdvo-RateBeer dataset, however, comprises a modest 450 instances, significantly fewer than the other two, and it is on this dataset that prompt learning brings the largest gain.

4.4.3. Loss Values of Different Paradigms. Considering that Section 4.4.2 shows comparable performance between prompt learning and fine-tuning on the structured iTunes-Amazon and DBLP-Scholar datasets, we recorded the loss values of the EM models under these paradigms at each training epoch, with the intention of conducting a detailed investigation into their behaviour in the EM task. The outcomes are presented in Tables 7 and 8. Furthermore, Figures 4 and 5 provide a visual depiction of the descending trend of the loss values. Notably, during the initial phases of training on the iTunes-Amazon dataset, the hard prompt demonstrated the most favourable performance. As the training progressed, fine-tuning manifested a rapid reduction in loss; however, upon reaching complete convergence, its loss value remained higher than those of the two prompt learning methods. Our model, which is based on the soft prompt paradigm, ultimately achieved the most remarkable outcome, with the lowest loss value of 6.27e-5. In order to provide a clearer representation of the descending trends of the loss values for the different methods at the end of training, we took the base-10 logarithm of the loss values, as shown in the lower part of Figure 4. For the DBLP-Scholar dataset, the observations in the first epoch are consistent with those of the iTunes-Amazon dataset. Nevertheless, as the training advanced, all three methods converged to nearly identical loss values.
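The log-scale view mentioned above is simply a base-10 logarithm applied to the recorded per-epoch losses before plotting; a small helper such as the following (with matplotlib assumed as the plotting backend) reproduces that transformation:

```python
import math
import matplotlib.pyplot as plt

def plot_log_loss(losses_by_method: dict) -> None:
    """losses_by_method maps a method name (e.g. 'FT', 'HP', 'SP') to its per-epoch losses."""
    for name, losses in losses_by_method.items():
        plt.plot([math.log10(value) for value in losses], label=name)
    plt.xlabel("epoch")
    plt.ylabel("log10(loss)")
    plt.legend()
    plt.show()
```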

F1 Scores of Few-Shot Learning under Different Paradigms.
The aforementioned experiment underscores that prompt learning exhibits lower loss values in the initial phases of training compared with the fine-tuning-based approach. This can be attributed to the narrowing of the gap between downstream and pretraining tasks. To further substantiate this finding, we systematically investigated the performance of few-shot learning under different paradigms. Specifically, we conducted zero-shot and one-shot learning using the test sets of the structured iTunes-Amazon and DBLP-Scholar datasets as the query sets. In the zero-shot setting, we appraised the performance of the different paradigms on the query set without any prior training. In the one-shot setting, we randomly selected one sample labelled as "different" and another labelled as "consistent" from the training dataset, constituting a support set for training; subsequently, we evaluated the performance of the different paradigms trained on this constructed support set. Figures 6 and 7 depict the F1 scores for zero-shot and one-shot learning on the iTunes-Amazon and DBLP-Scholar datasets, respectively. It becomes evident that, for the iTunes-Amazon dataset, the F1 scores yielded by fine-tuning are notably inferior in both zero-shot and one-shot learning when contrasted with the outcomes of prompt learning. It is worth noting that in every few-shot scenario, the hard prompt consistently attains the highest F1 score. The outcomes derived from the DBLP-Scholar dataset substantiate a similar assertion, wherein prompt learning surpasses the performance of fine-tuning. This congruity echoes the observations drawn from the experiment detailed in Section 4.4.3, particularly during the early training stages, underscoring the efficacy of the hard prompt paradigm in the context of few-shot learning.
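A minimal sketch of how the one-shot support set described above can be drawn, assuming the training data is available as (e_i, e_j, label) triples; the helper name and sampling details are our own illustration.

```python
import random

def build_one_shot_support(train_pairs, seed: int = 0):
    """Randomly pick one 'consistent' (label 1) and one 'different' (label 0) pair."""
    rng = random.Random(seed)
    positives = [pair for pair in train_pairs if pair[2] == 1]
    negatives = [pair for pair in train_pairs if pair[2] == 0]
    return [rng.choice(positives), rng.choice(negatives)]
```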

Discussion
5.1. Performance of Different Models. The experimental results presented in Tables 3 and 4 provide a comprehensive assessment of the proposed EM model in comparison with four established methods: DeepER [38], DeepMatcher [37], Magellan [40], and MCA [41]. Some of these models are based on deep learning architectures such as RNNs or LSTMs; even with the incorporation of attention mechanisms, they can only capture semantic knowledge from the training set, constraining their potential for performance enhancement. Some models introduce word embeddings such as GloVe [39], but the semantic knowledge embedded in them falls short of the richness found in PLMs.
In contrast, our model leverages the advantages of the PLM, namely ERNIE-2.0-base-en, to generate enriched representations with contextual information, greatly benefiting the EM task. Furthermore, we incorporate prompt learning to train the EM model. Prompt learning utilizes the parameters in both the PLM's encoder and decoder; it therefore narrows the gap between the pretraining task and the downstream task (in this case, entity matching), enabling the model to conduct entity matching in a manner similar to the pretraining task. The "ΔF1" column clearly demonstrates the improvement in F1 scores achieved by our model compared to the previous methods. This improvement underscores the efficacy of our approach across diverse datasets (structured and dirty), reaffirming its robustness in various data contexts.

Comparison of F1 Scores.
We also compared the performance of the fine-tuning and prompt learning paradigms on both structured and dirty datasets; the corresponding results are presented in Tables 5 and 6. Evidently, prompt learning consistently exhibits superior performance over fine-tuning across the majority of datasets, highlighting its robustness and universality. However, exceptions were observed for the structured iTunes-Amazon and DBLP-Scholar datasets, where fine-tuning and the hard prompt yielded F1 scores that were almost indistinguishable. As mentioned earlier, this could be attributed to dataset characteristics, including the number of attributes and the size of the training sets. Given that prompt learning diminishes the gap between pretraining and downstream tasks, it is more suitable for small-scale datasets with fewer attributes. For datasets with ample training samples, as training progresses, the model can acquire more task-specific semantic knowledge from the training set; thus, for the structured iTunes-Amazon and DBLP-Scholar datasets, fine-tuning and the hard prompt achieved nearly equivalent F1 scores. However, the soft prompt, compared to the hard one, allows the model to search for a prompt template in a continuous vector space, which is more conducive to prompt learning. Therefore, it consistently obtains the highest F1 scores across all datasets.

Comparison of Loss Values.
Considering the similarity in F1 scores obtained by fine-tuning and the hard prompt on the structured iTunes-Amazon and DBLP-Scholar datasets, we also recorded the average loss values at each training epoch for the different paradigms, as listed in Tables 7 and 8, to explore their fitting capabilities on the training set over the entire training phase. The results for the structured iTunes-Amazon dataset indicate that, compared to prompt learning, fine-tuning consistently yields higher loss values throughout the entire training process. The hard prompt, although starting with the lowest loss value, performs less effectively than the soft prompt at the end of training. This phenomenon reaffirms the prior analysis: prompt learning, by adopting the MLM formulation for downstream tasks, can leverage the prior semantic knowledge embedded in PLMs more effectively. As a result, prompt learning fits the training set better, resulting in lower loss values than fine-tuning. Additionally, the soft prompt searches for suitable templates in a continuous space; thus, although it exhibits higher loss values than the hard prompt in the early stages of training, it ultimately achieves the lowest loss value. The experiments conducted on the DBLP-Scholar dataset demonstrated similar results, indicating that in the early stages of training, fine-tuning exhibited a lower fitting capacity on the training set compared to prompt learning, and the hard prompt achieved the lowest loss value. However, the final loss value attained by fine-tuning aligns with those of prompt learning. This may still be attributed to the size of the dataset: with a larger number of training samples, fine-tuning can acquire more latent semantic knowledge as training progresses, compensating for its structural differences from prompt learning.

Comparison of Few-Shot Performance.
The preceding discussion elucidates how prompt learning can effectively harness the prior semantic knowledge embedded in PLMs. To further substantiate this assertion, we systematically explored the capabilities of few-shot learning under the different paradigms. The experimental results indicate that, compared to prompt learning, fine-tuning yields lower F1 scores in both zero-shot and one-shot learning. Regardless of the type of few-shot learning, the hard prompt exhibits an advantage in terms of F1 scores. This result aligns with the observations detailed in Section 4.4.3, particularly during the initial training phase. Fundamentally, the phenomena observed in few-shot learning can be attributed to the efficacy of prompt learning in bridging the gap between pretraining and downstream tasks, enabling both the soft and hard prompt methods to obtain superior F1 scores. Considering that the soft prompt requires optimizing the template embeddings in a continuous space, the experimental outcome further underscores the effectiveness of the hard prompt in the domain of few-shot learning.

Conclusions
In this study, we have explored the potential of leveraging PLMs to enhance EM. Our investigation involves a comprehensive analysis of two transfer learning paradigms, fine-tuning and prompt learning, across eleven EM datasets. The results indicate that the soft prompt consistently outperforms the other approaches across all datasets, demonstrating that generating template embeddings in a continuous space can enhance the performance of EM. Furthermore, our exploration of few-shot learning unveiled the potential of the hard prompt, showing its effectiveness in both zero-shot and one-shot contexts. In summary, this research contributes to our understanding of how PLMs can be harnessed to augment the EM task. For future work, we will continue to delve into the application of large language models to the EM task. By integrating EM tasks with language models, we aim to enhance knowledge extraction and data integration in various NLP applications [43-45].

Figure 3: The model architecture of the hard prompt and the soft prompt. (a) The entity matching task with a hard prompt. (b) The entity matching task with a soft prompt.


Figure 4: Loss values at each epoch on the structured iTunes-Amazon dataset using different methods. The lower panel shows the base-10 logarithm of the loss values.
Figure 5: Loss values at each epoch on the structured DBLP-Scholar dataset using different methods.
Figure 6: F1 scores of zero-shot and one-shot learning on the structured iTunes-Amazon dataset using different methods.

Figure 7: F1 scores of zero-shot and one-shot learning on the structured DBLP-Scholar dataset using different methods.
Figure 1: The illustration of entity matching. The tables display the product (entity) records of "Adobe Photoshop" in Amazon and Google. The EM task involves determining whether these two entity records represent the same real-world entity.

Table 1: Overview of the eleven EM datasets.

Table 2: Overview of the hyperparameters.

Table 3: F1 scores of different EM models on structured datasets.

Table 4: F1 scores of different EM models on dirty datasets.

Table 5: F1 scores of different paradigms on structured datasets.

Table 6: F1 scores of different paradigms on dirty datasets.

Table 7: The loss values of different methods at each training epoch on the structured iTunes-Amazon dataset.

Table 8: The loss values of different methods at each training epoch on the structured DBLP-Scholar dataset.