Optimizing Ontology Alignment through Linkage Learning on Entity Correspondences



Introduction
The Semantic Web (SW) was proposed by Tim Berners-Lee in 1998 to enable intelligent applications to understand the meaning of words at the semantic level. Ontologies are the solution to the issue of data heterogeneity on the SW, since they establish consensus on the meaning of the concepts of a field and provide abundant domain knowledge and semantic vocabulary for the interaction between application systems. However, due to the SW's decentralized nature, different ontologies may define the same concept differently, which leads to the issue of ontology heterogeneity [1]. Ontology matching is regarded as an effective method to address it, and swarm intelligence algorithm- (SIA-) based ontology matching techniques have achieved good performance in past studies [2], such as the genetic algorithm (GA) [3], particle swarm optimization (PSO) [4], the firefly algorithm (FA) [2], and the artificial bee colony algorithm (ABC) [5]. However, existing SIA-based approaches have two drawbacks: (1) they require massive time and memory consumption, which heavily hinders the efficiency of the ontology matching process; (2) they require a domain expert or a reference alignment during the matching process, which is usually not available in real applications. To overcome these drawbacks, an extended compact genetic algorithm-based ontology entity matching technique (ECGA-OEM) is proposed in this work, which uses both a compact encoding mechanism [6, 7] and a linkage learning approach to efficiently match ontologies. In particular, our contributions are as follows: (i) a new evaluation metric for ontology alignment is proposed, which is able to work without a reference alignment or domain experts; (ii) an optimization model of the ontology entity matching problem is constructed; (iii) ECGA-OEM is proposed, which uses linkage learning and the compact encoding mechanism to efficiently address the ontology entity matching problem.
The rest of this paper is organized as follows: related works are reviewed in Section 2; ontology, ontology matching, and the similarity measures are presented in Section 3; the ontology entity matching approach through ECGA proposed in this paper is described in Section 4; Section 5 presents the experimental results; and finally, Section 6 draws the conclusion and presents future work.
Generally, ontology matching techniques are classified into two categories: ontology metamatching techniques and ontology entity matching techniques [27]. The former is dedicated to the problem of how to aggregate different similarity measures with appropriate weights, and the latter tries to directly determine the entity correspondence set between two ontologies. The first SIA-based ontology metamatching system is genetics for ontology alignment (GOAL), which aims at optimizing the aggregating weight set for different matchers [3, 28–30]. Memetic algorithms (MAs), which introduce a local search (LS) strategy into evolutionary algorithms (EAs) to improve their local optimization capability, have been proposed to solve the ontology metamatching problem [31]. To overcome the drawback of overreliance on the reference alignment, Xue et al. presented a partial reference alignment (PRA), in which only part of the standard reference is used to assess the quality of an alignment [32]. Furthermore, Xue and Wang proposed an innovative metric named unanimous improvement ratio (UIR) to assess an alignment's quality without requiring the reference alignment [33]. Besides, the artificial bee colony (ABC) algorithm has also been adopted to address the ontology metamatching problem, which further improves the solution's quality [5].
During the matching process, ontology metamatching techniques need to maintain several similarity matrices, which leads to huge memory consumption. For this reason, ontology entity matching techniques, which aim at directly determining the optimal pair set, have attracted researchers' interest. Genetic algorithm-based ontology matching (GAOM) first regarded a certain set of matching pairs as the optimization objective [34]. MAs have also been utilized to solve the ontology entity matching problem, outperforming GA [35]. Bock et al. [4] used PSO to solve ontology entity matching; in detail, it evaluates the fitness of chromosomes through a certain aggregation strategy over multiple objective functions. Alves et al. [36] argue that the instances contained in an ontology can be used to improve the alignment when knowledge is embedded in them. For this reason, Xue et al. [37] also take instance-level matching into consideration to further improve the quality of the alignment.

Ontology and Ontology Matching
Definition 1 (ontology). An ontology O is a 5-tuple [33] O = (C, P, I, Λ, Γ), where C is a nonempty set of classes, P is a nonempty set of properties, I is a (possibly empty) set of individuals that represent the instances of classes in the real world, Λ is a nonempty set of axioms that are used to check the consistency of ontologies or deduce new information, and Γ is a set of annotations that provide metadata so that researchers can understand the ontology. In particular, C, P, and I make up the entities of an ontology.
Definition 2 (ontology matching). Ontology matching can be considered as a function f(O_1, O_2, A′, p, r), where O_1 and O_2 are the two ontologies to be matched; A′ is an existing initial alignment of O_1 and O_2; p is a set of parameters, e.g., a threshold, used in the process of ontology matching; and r is a set of external resources, e.g., background knowledge bases and dictionaries, which assist in ontology matching. The process of ontology matching is depicted in Figure 1, where A is the obtained ontology alignment.
An example of matching two ontologies is presented in Figure 2, where O_1 and O_2 are the two ontologies to be matched. The strings in the rounded rectangles are the classes, e.g., "Reference," "Entry," and "Book." The black lines between two classes of the same ontology represent their "has a" or "is a" relation; e.g., "Reference" has a "Book" and "Book" is a "Reference," meaning that "Reference" is the superclass of "Book" and "Book" is the subclass of "Reference." There are datatype properties that describe the features of a class; e.g., "Data," "Title," and "Human Creator" are the properties of "Reference." The instances of a class are in a rectangle; e.g., "Of Natures Obvious Laws & Processes in Vegetation" is an instance of "Article." The corresponding entities of the two ontologies are linked by lines with double arrowheads, annotated with the symbols "≡," "⊆" (or "⊇"), and "⊥," which, respectively, denote equivalence, subsumption (more specific or more general), and disjointness.

Syntactic Measures.
Two syntactic measures, i.e., SMOA (string metric for ontology alignment) [37] and Levenshtein [38], are employed in this paper. Given two strings s_1 and s_2, the SMOA similarity is defined as

SMOA(s_1, s_2) = Comm(s_1, s_2) − Diff(s_1, s_2) + WinklerImpr(s_1, s_2),

where Comm(s_1, s_2) stands for the common length of s_1 and s_2, Diff(s_1, s_2) for their differing lengths, and WinklerImpr(s_1, s_2) is the improvement of the result yielded by the method proposed by Winkler [39]. The Levenshtein similarity is defined as

Levenshtein(s_1, s_2) = 1 − d(s_1, s_2) / max(|s_1|, |s_2|),

where |s_1| and |s_2| are the numbers of letters contained in s_1 and s_2, respectively, and d(s_1, s_2) is the number of letter edits needed to transform s_1 into s_2. The final syntactic similarity is the average of the SMOA and Levenshtein similarities.
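As an illustration of the Levenshtein half of the syntactic measure (the SMOA term with its Winkler improvement is omitted here), a minimal sketch could look like this:

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character edits turning s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (c1 != c2)))    # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(s1: str, s2: str) -> float:
    """1 - d(s1, s2) / max(|s1|, |s2|), as in the syntactic measure above."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))
```

For example, `levenshtein_similarity("kitten", "sitting")` gives 1 − 3/7 ≈ 0.571, since three edits separate the two strings.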

Linguistic Measures.
The linguistic similarity between two strings is computed by considering semantic relations (such as synonymy and hypernymy), which usually requires a thesaurus or dictionary. In this work, WordNet [23, 40], an electronic lexical database that collects the meanings of a large number of words, is used. Given two words w_1 and w_2, LinguisticSimilarity(w_1, w_2) equals: (i) 1, if w_1 and w_2 are synonyms in WordNet; (ii) 0.5, if w_1 is a hypernym of w_2 or vice versa in WordNet; (iii) 0, otherwise.
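The three-valued rule can be sketched as follows; note that the tiny hand-coded synonym and hypernym tables below are purely illustrative stand-ins for WordNet lookups, not real WordNet data:

```python
# Toy stand-in for WordNet: these synonym pairs and hypernym links are
# hand-coded for illustration only; a real system would query WordNet.
SYNONYMS = {frozenset({"book", "volume"}), frozenset({"author", "writer"})}
HYPERNYMS = {"novel": "book", "essay": "article"}  # child -> parent

def linguistic_similarity(w1: str, w2: str) -> float:
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2 or frozenset({w1, w2}) in SYNONYMS:
        return 1.0   # synonyms score 1
    if HYPERNYMS.get(w1) == w2 or HYPERNYMS.get(w2) == w1:
        return 0.5   # hypernym/hyponym pairs score 0.5
    return 0.0       # otherwise unrelated
```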

Taxonomy-Based Measures.
The core idea of taxonomy-based measures is to make full use of the hierarchical relationships of an ontology and determine two entities' similarity by considering their neighbors' similarity. In this work, mutual reasoning between class and property (MRCP) is proposed as the taxonomy-based measure, which is shown in Figure 3.
In Figure 3, a circle is a class of the ontology, a triangle is a property of the ontology, and a one-way arrow represents a hierarchical relationship; e.g., class c_a1 is the superclass of class c_a3. A dividing-line arrow between a class and a property indicates that the property belongs to that class, a bidirectional arrow indicates that there is a high similarity between the two entities, and a dashed two-way arrow indicates that the similarity between them is improved after the operation. Subgraph (a) depicts the classes' similarity gained from their neighbors, i.e., superclasses and subclasses.
There is a high similarity between classes c_a1 and c_b1, so the similarity between their subclasses c_a3 and c_b2 is supposed to increase. Likewise, the similarity of classes c_a3 and c_b2 would be increased because their subclasses c_a6 and c_b4 are highly similar. Subgraph (b) shows the properties' similarity gained from their superproperties and subproperties. The similarity of properties p_a3 and p_b2 will be improved since their superproperties p_a1 and p_b1 and subproperties p_a6 and p_b4 are highly similar, respectively. Subgraph (c) shows the properties' similarity gained from the classes to which they belong. c_a1 and c_b1 are classes of the two ontologies, and p_a1, p_a2, p_a3, p_b1, p_b2, and p_b3 are their properties, respectively. The similarities between the properties of class c_a1 and the properties of class c_b1 would be improved because of the high similarity of c_a1 and c_b1; i.e., the similarity of p_a3 and p_b1 would be promoted, and so are the remaining eight combinations. Conversely, classes' similarity will be increased when they share the same or highly similar properties, as depicted in subgraph (d). Since the pairs p_a1 and p_b3, p_a2 and p_b2, and p_a3 and p_b1 are highly similar, the similarity of c_a1 and c_b1 is increased too.
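One direction of this mutual reasoning (subgraph (a): propagating a highly similar class pair down to its subclasses) can be sketched as below; the `boost` and `high` parameters are illustrative assumptions, not values from the paper:

```python
def propagate_class_similarity(sim, sub1, sub2, boost=0.1, high=0.8):
    """MRCP-style sketch: if two classes are highly similar, raise the
    similarity of their respective subclasses (Figure 3, subgraph (a)).
    sim  : dict mapping (class_in_O1, class_in_O2) -> similarity value
    sub1 : dict mapping a class of O1 -> list of its subclasses
    sub2 : likewise for O2."""
    updated = dict(sim)
    for (c1, c2), s in sim.items():
        if s >= high:                       # the anchor pair is highly similar
            for d1 in sub1.get(c1, []):
                for d2 in sub2.get(c2, []):
                    pair = (d1, d2)
                    updated[pair] = min(1.0, sim.get(pair, 0.0) + boost)
    return updated
```

For instance, if "Reference" and "Entry" are highly similar, the similarity of their subclasses is nudged upward, capped at 1.0.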

Aggregation Strategy.
Three similarity matrices are generated after the three measures have been applied. In this work, the three matrices need to be aggregated into one matrix.
The final similarity value S_a(s_1, s_2) between two entities s_1 and s_2 is defined as follows: where S_s, S_l, and S_t are, respectively, the syntactic, linguistic, and taxonomy-based similarities of s_1 and s_2; S_sl is the average of S_s and S_l; and Threshold is a given parameter to filter out the matching pairs with low similarity.
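A minimal sketch of the aggregation step is given below. The paper only states that S_sl averages the syntactic and linguistic similarities and that pairs below Threshold are filtered; combining S_sl with the taxonomy-based similarity by a further average is our assumption, not the paper's exact formula:

```python
def aggregate(s_syn, s_lin, s_tax, threshold=0.7):
    """Aggregate three similarity values into one, then threshold.
    NOTE: averaging s_sl with s_tax is an assumption for illustration."""
    s_sl = (s_syn + s_lin) / 2.0   # average of syntactic and linguistic
    s_a = (s_sl + s_tax) / 2.0     # assumed combination with taxonomy-based
    return s_a if s_a >= threshold else 0.0   # filter low-similarity pairs
```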

Extended Compact Genetic Algorithm-Based Ontology Entity Matching
GA is an excellent methodology for solving the ontology matching problem due to its potentially parallel search characteristic and good search ability. In our work, an ECGA is proposed to efficiently address the ontology entity matching problem.

Optimal Model.
The optimization model of the ontology entity matching problem is given as follows: max ξ(σ).

The Framework of ECGA-OEM.
ECGA-OEM takes two ontologies to be matched as input and produces an alignment as output; its framework is shown in Figure 4, whose critical components are described in the rest of this section. The two ontologies, generally in XML or RDF format, are parsed into two hierarchy schemas in the preprocessing module. The operation of the ECGA optimization module relies on the similarity matrix obtained in the similarity measure module, which has been stated in detail in Section 3.2. Finally, the alignment is generated by the solution generation module. In detail, the ECGA optimization module is described in Section 4.3.

ECGA Optimization Model.
The inputs are the virtual population's maximum generation, MaxGeneration = 2000 (normally, the number of iterations before the convergence of ECGA is much less than MaxGeneration), Threshold = 0.7, and the hierarchy schemas of the two ontologies; the output is the final alignment.
The pseudocode of ECGA is presented in Algorithm 1, where PV and BB are the probability vector (see Section 4.3.1) and the building blocks (see Section 4.3.7), respectively.

Probability Vector Initialization.
Different from binary coding, the probability vector (PV) in this work is two-dimensional.
The initialized PV and a convergent PV are shown in Tables 1 and 2, respectively. The value in the i-th row and j-th column represents the probability of a match between the i-th entity in O_1 and the j-th entity in O_2; e.g., the probability of matching the 0th entity of O_1 ("Reference") with the 0th entity of O_2 ("Entry") is 1/(1 + 1 + 1 + 1 + 1) = 1/5, as shown in Table 1 (in particular, the header "−1 (null)" denotes the probability of no match). The convergence condition is that the probability of taking a unique value at each locus of the PV is 1; e.g., in Table 2, the probability of matching the 0th entity of O_1 ("Reference") with the 0th entity of O_2 ("Entry") is 23.72/(0 + 23.72 + 0 + 0 + 0) = 1, and so on for the rest.
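The uniform initialization of such a two-dimensional PV can be sketched as follows (one extra column per row represents the "−1 (null)" no-match option):

```python
def init_pv(n1: int, n2: int):
    """Initialize the two-dimensional probability vector: row i holds the
    matching probabilities of the i-th entity of O1 against every entity
    of O2 plus one extra '-1 (null)' column for 'no match'. All n2 + 1
    options start equally likely, as in Table 1."""
    p = 1.0 / (n2 + 1)
    return [[p] * (n2 + 1) for _ in range(n1)]
```

With 4 entities in O_2, each of the 5 options in a row starts at probability 1/5, matching the 1/(1 + 1 + 1 + 1 + 1) example above.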

Chromosomes Generation.
A certain number of chromosomes are produced in each generation through the PV. An example of a chromosome is given in Figure 5. In particular, subgraph (a) shows the loci of the chromosome and the corresponding codes, and subgraph (b) illustrates the decoding of the chromosome; i.e., the fourth entity of O_1, "Article," corresponds to the third entity of O_2, "Article," since the code of the fourth locus is "3" ("−1 (null)" indicates no match).
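Sampling a chromosome from the PV amounts to drawing, for each row, one column index with probability proportional to its entry; a sketch:

```python
import random

def sample_chromosome(pv):
    """Draw one chromosome from the PV: for each entity of O1, pick a
    column index weighted by its row of the PV. The last column stands
    for 'no match' and is decoded as -1."""
    chromosome = []
    for row in pv:
        j = random.choices(range(len(row)), weights=row, k=1)[0]
        chromosome.append(-1 if j == len(row) - 1 else j)
    return chromosome
```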

Fitness Function.
The fitness function is used to determine which chromosomes in the population better adapt to the environment. In the context of ontology matching, the objective of the fitness function is to find, at algorithm convergence, the best chromosome, i.e., the one whose corresponding alignment is of the highest quality. The objective function of the optimization model is used as the fitness function of this work; given a chromosome σ, its fitness ξ(σ) combines ϕ(|A|) and f(A), where A is the alignment determined by σ; |A| is the cardinality of A; β, a fraction in the range [0, 1], is the relative weight of ϕ(|A|) and f(A), which is set to 0.25 in this paper; ϕ is a normalization function; and f is a function that calculates the mean of the matched entity pairs' similarity values in A.
In addition, ϕ(|A|) and f(A) are defined as follows:

ϕ(|A|) = |A| / min(|O_1|, |O_2|),  f(A) = (1/|A|) Σ_{i=1}^{|A|} η_i,

where |O_1| and |O_2| are, respectively, the cardinalities of O_1 and O_2, and η_i is the similarity of the i-th matching pair in alignment A. In particular, ϕ(|A|) is the ratio of the number of matching pairs found to the smaller of the two ontologies' entity counts, and f(A) is the average similarity of the matching pairs found; they, respectively, approximate the recall and precision values.

Selection Operator.
The selection operator selects the best chromosomes in the current population to participate in the next step [41] and updates the PV. First, the chromosomes are sorted in descending order of fitness. Second, the first half of the chromosomes is retained as a temporary population. Finally, with fitness as the weight, roulette selection draws chromosomes from the temporary population for the subsequent operations. The goal of the elite strategy is to keep the historical optimal solution and prevent the fitness of the optimal chromosome from "degenerating" during evolution. In this work, the elite strategy has two steps: the historical optimal solution Elite first competes with the current optimal solution Best, and the winner becomes the new Elite; the historical optimal solution then participates in the PV update in each generation.
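The truncation, roulette, and elite steps described above can be sketched as follows (a simplified illustration returning a single parent rather than a full mating pool):

```python
import random

def select(population, fitnesses, elite):
    """Sketch of the selection step: sort by fitness, keep the top half,
    draw one parent by fitness-weighted roulette; the stored elite
    competes with the current best and the winner becomes the new elite."""
    ranked = sorted(zip(population, fitnesses), key=lambda t: t[1], reverse=True)
    half = ranked[: max(1, len(ranked) // 2)]      # truncation step
    chromosomes, weights = zip(*half)
    parent = random.choices(chromosomes, weights=weights, k=1)[0]  # roulette
    best = ranked[0]                               # current-generation best
    new_elite = best if elite is None or best[1] > elite[1] else elite
    return parent, new_elite
```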

Probability Vector Updating.
An example of updating the PV is presented in Figure 6, where subgraph (a) is a chromosome generated by the initialized PV. The similarity of each locus is derived from the similarity matrix according to the chromosome's code. It should be noted that the code of the third locus is "0," which means that the entity "Part" with sequence number "3" in O_1 does not match any entity in O_2; its similarity is therefore equal to 1 minus the highest similarity between "Part" and all entities in O_2, i.e., 1 − 0.38 = 0.62. The PV is updated by the normalized similarity of each locus, as shown in subgraph (b). The updated probabilities in the PV are in bold; e.g., the probability of the matching pair "Reference" and "Entry" changed from "1" to "1.2" since the normalized similarity of the corresponding locus of the chromosome is "0.20."

Linkage Learning.
Building blocks are saved through linkage learning, thus improving the efficiency of the algorithm and the quality of the solution [42]. In a simple GA, linkage learning identifies good loci and protects them so that they are not destroyed by the subsequent crossover and mutation operations; in ECGA, linkage learning keeps good probability distributions so that they are not disturbed in the subsequent update process. A linkage learning approach is proposed in this work, whose details are shown in Figure 7.
For clarity, only column 1 of the original probability vector is selected for narration. In each generation, a low-probability clearing operation is performed. The value of each entry is divided by the maximum value of its row, and the entry is cleared if the quotient is less than a specific value (0.2 in this work); i.e., the 0, 2, 3, and −1 bits of the PV are cleared and marked. A row of the PV is a "good" probability distribution when all but one bit of the row are zero. Linkage learning generates building blocks based on the "good" probability distributions; e.g., the building block enclosed in the rounded rectangle, the pair of index "0" and code "0," is produced by linkage learning. After that, all the building blocks are directly put into each chromosome (the bold numbers), which reduces both the runtime and the memory consumption.
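The low-probability clearing and building-block extraction can be sketched as follows (the 0.2 ratio is the paper's value; the dict/list representation is an illustrative choice):

```python
def clear_low_probabilities(pv, ratio=0.2):
    """Sketch of the clearing operation: in each PV row, zero out any
    entry below `ratio` times the row maximum. A row with a single
    surviving entry is a 'good' distribution and yields a building
    block (locus index, surviving column index)."""
    building_blocks = []
    for i, row in enumerate(pv):
        m = max(row)
        cleared = [v if m > 0 and v / m >= ratio else 0.0 for v in row]
        pv[i] = cleared
        survivors = [j for j, v in enumerate(cleared) if v > 0]
        if len(survivors) == 1:          # 'good' probability distribution
            building_blocks.append((i, survivors[0]))
    return building_blocks
```

Building blocks found this way are then copied directly into every sampled chromosome, so those loci are no longer re-drawn each generation.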

Experiment Setup.
In the experiments, the Biblio benchmark provided by the Ontology Alignment Evaluation Initiative (OAEI) is used to verify the effectiveness of our approach. Each testing case includes two ontologies to be matched and a reference alignment that serves as the standard for evaluating the quality of the matching results. The testing cases can be classified into five categories, which are briefly described in Table 3.
In this work, the method is compared with OAEI's participants and with GA- and CGA-based ontology matching techniques. The experimental results are the harmonic mean (H-mean) values of 30 independent runs.

Alignment Evaluation Metrics according to the Reference Alignment.
A criterion is needed to evaluate the quality of matching systems. Given an alignment result A, two measures, recall and precision, are employed in this work [22, 23]:

recall = |R ∩ A| / |R|,  precision = |R ∩ A| / |A|,

where |R| is the cardinality of the set of matching pairs in the reference alignment provided with the case set, |A| is the cardinality of the set of matching pairs in the alignment produced by the matching system, and |R ∩ A| is the number of matching pairs that exist in both the reference alignment and the alignment found. All the matching pairs in the reference alignment have been found when recall is 1, while all the matching pairs found are correct when precision is 1.
Both recall and precision are important and should be considered at the same time. A weighted harmonic mean of recall and precision, the F-measure, is used in this work [43]:

F-measure = (precision · recall) / ((1 − α) · precision + α · recall),

where α ∈ [0, 1] is the relative weight of recall and precision; it is set to 0.5 in this work, giving the F1-measure.
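The three metrics over sets of matching pairs can be computed as below (pairs are represented as tuples for illustration):

```python
def evaluate(reference, found, alpha=0.5):
    """Recall, precision, and the weighted F-measure over sets of
    matching pairs; alpha = 0.5 gives the usual F1 = 2PR / (P + R)."""
    correct = len(reference & found)                       # |R ∩ A|
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(found) if found else 0.0
    if precision == 0.0 and recall == 0.0:
        return recall, precision, 0.0
    f = (precision * recall) / ((1 - alpha) * precision + alpha * recall)
    return recall, precision, f
```

For example, with two reference pairs of which one is found among two returned pairs, recall, precision, and F1 are all 0.5.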

Comparison with OAEI's Participants.
The participants from OAEI 2016, 2015, and 2014 are selected for comparison with our approach. In particular, if a matching system participated in more than one year, only its latest results are used. The harmonic mean comparison of the participants and ECGA-OEM is shown in Figure 8.
The vertical axis represents the different matching systems, and the horizontal axis represents the score of the corresponding metric. In terms of f-measure, ECGA-OEM ranks third, slightly lower than Lily and CroMatch. Wiki has been used as the linguistic measure in Lily and CroMatch, which improves matching performance at the expense of efficiency. In addition, ECGA-OEM has an unparalleled balance between precision and recall, whereas for the participants one of the two is typically much higher than the other; this balance is very important when evaluating result quality.
A further f-measure comparison of the OAEI participants and ECGA-OEM over a total of 32 test cases is given. Figure 9 shows the numbers of participants to which ECGA-OEM is superior, equal, and inferior, respectively. The horizontal axis is the set of different test cases, and the vertical axis is the number of matching systems. In the vast majority of test cases, the number of matching systems to which ECGA-OEM is superior is much higher than the number to which it is inferior. Only in cases No. 246, No. 247, and No. 254 is the ranking of ECGA-OEM relatively low (refer to Table 4 for the specific f-measure values in each testing case).
In Table 4, the numbers from 1 to 18 in the first row denote edna, AML, CroMatch, Lily, LogMap, LogMapLt, Xmap, LogMapBio, AML-2014, Gmap, LogMap-C, Mamba, AOT-2014, AOTL, MassMatch, OMReasoner, RSDLWB, and Xmap2, respectively. ECGA-OEM is the matching system proposed by us. The value in each column is the f-measure score of the matching system in the corresponding case. The f-measures of participants higher than that of ECGA-OEM are in bold, and the equal ones are underlined. The "+," "=," and "−" in the last column, respectively, indicate the numbers of matching systems to which ECGA-OEM is superior, equal, and inferior.

Comparison among GA, CGA, and ECGA.
To verify the benefit of linkage learning, we compare ECGA with GA and CGA. The detailed f-measure and runtime of the three competitors are, respectively, shown in Tables 5 and 6. All the GA, CGA, and ECGA results are the mean values of 30 independent runs. It can be seen that replacing the crossover and mutation operators (GA) with the probability vector (CGA) improves the f-measure and significantly reduces the runtime. The average f-measure is slightly improved, while the average runtime is reduced from 31.975 seconds to 3.540 seconds; this largely improves the algorithm's efficiency, taking only about 1/10 of the runtime. Except for testing cases No. 301 and No. 304, CGA is more stable than GA in terms of both f-measure and runtime, since its standard deviations are smaller. Testing cases No. 301 and No. 304 are representatives of real-world cases with unique heterogeneity, which makes the f-measure produced by CGA decrease slightly. Linkage learning, the technique applied in ECGA, further increases the f-measure score and reduces the runtime to an average of 1.749 seconds compared with CGA. ECGA also obtains a smaller standard deviation than CGA, which certifies its strong stability. It is worth noting that the f-measure score of ECGA in testing cases No. 301 and No. 304 is almost the same as that of GA (only 0.003 score lower in testing case No.

Figure 7 :
Figure 7: An example of linkage learning.

Figure 9 :
Figure 9: The number of participants to which ECGA-OEM is superior, equal, and inferior.
|O_1| and |O_2| are, respectively, the cardinalities of the two ontologies O_1 and O_2; x_i, i = 1, 2, ..., |O_1|, is the i-th matching pair. In particular, x_i = −1 means there is no match for the i-th entity of O_1. The objective of this work is to maximize ξ(σ); for its details, refer to Section 4.3.3.

Table 1 :
An example of an initialized PV.

Algorithm 1. Input: the hierarchy schemas of O_1 and O_2; the aggregated similarity matrix, M_as. Output: the best chromosome, Best_chromosome. (1) PV = initialization(len of O_1, len of O_2).

Table 2 :
An example of a convergent PV.

Table 3 :
A brief description of OAEI Biblio benchmark.