Relative Entropy-Based Similarity for Patterns in Graph Data

Measuring the similarity between patterns correctly is foundational work in data mining, especially for graph data. Although existing methods obtain good results, they still have limitations, for instance, in measuring the similarity of patterns in directed weighted graph data. Here, we introduce a new approach that takes the so-called second-order neighbors into consideration. The proposed approach is named relative entropy-based similarity for patterns in graph data, wherein relative entropy provides a new perspective for measuring the difference between patterns in directed weighted graph data. The proposed similarity measure is divided into three phases. First, the strength set is built from the degree and weight of patterns; in this phase, four variables holding the strength of out-degree, in-degree, out-weight, and in-weight are constructed. Then, with the help of the Euclidean metric, the pattern's probability set is constructed, which captures the similarity between a pattern and all of its first-order neighbors. Finally, relative entropy is used to measure the difference between patterns. To examine the validity of our approach and its advantage over state-of-the-art approaches, two kinds of experiments are carried out on real-world and synthetic graph data. The experimental results indicate that the proposed method measures similarity efficiently and produces accurate results.


Introduction
At present, many practical networks, such as Facebook social networks, protein interaction networks, aviation networks, and disease transmission networks, can be represented as graph data. This type of data is no longer a plain description of a pattern's attribute information; it additionally contains topological information between patterns, i.e., the degree and weight of patterns. Owing to the extensive use of graph data, many practical problems, including pattern analysis, link prediction, and community detection, can be abstracted into problems on graph data. Among these, how to calculate the similarity between patterns in graph data is considered one of the fundamental problems. Much research on graph data builds on a pattern similarity measure, for example, in traffic networks [1], image classification [2], and pattern recognition [3,4].
Over the past few decades, discovering similarity between patterns has attracted substantial attention [5,6]. Scholars have proposed a range of methods to measure the similarity of patterns, for example, shared neighbor-based similarity, random walk-based similarity, path-based similarity, and information theory-based similarity; these methods discuss the similarity of patterns from different perspectives.
The shared neighbor-based similarity measure takes into account the shared information of the connected neighbors between patterns: the greater the overlap of shared neighbors, the higher the similarity of two patterns. Cosine index [7], Sorensen index [8], Jaccard index [9], AA (Adamic-Adar) index [10], and WAA (weighted Adamic-Adar) index [11] are common methods of this kind, which take into account the number of shared neighbors. Besides, LP (local path) index [12] is an improvement of CN index [13]; on the basis of CN index, it adds the influence of neighbors with path length 3 on the connection between patterns. These indices reduce the computation time and achieve good results in identifying the most similar patterns. Unfortunately, significant challenges remain: only the topological information of first-order neighbors is taken into account, and many patterns with high similarity have no common neighbors, which limits such indices.
Random walk-based similarity is widely used to measure the topological similarity of patterns, such as MLRW (Multiplex Local Random Walk) index [14], BRW (Biased Random Walk) index [15], and LRW (Local Random Walk) index [16]. In the calculation, these methods measure similarity by moving from one pattern to another through a multistep random walk, without using the global information of graph data. They simplify the similarity measure to some extent, but all three rely on large-degree patterns, and the most similar patterns tend to be large-degree patterns, which makes the similarity results sensitive to such patterns.
Path-based similarity is another important way to measure the similarity of patterns; examples are Katz index [17,18] and ACT (average commute time) [19]. Compared with a local index, a global index requires the overall topological information. Besides, Aziz et al. [20] proposed global and quasilocal extensions of some commonly used local similarity indices. Although a global index provides more accurate similarity than a local index, the computation of global metrics is time-consuming and generally not applicable to large-scale graph data, and sometimes, global topology information is unavailable, especially when the measure is implemented in a decentralized manner.
In addition to the similarity measures mentioned above, information theory-based similarity is a kind of similarity measure that is often used. Herein, relative entropy is an important concept of information theory, which is used to measure the similarity of patterns. Scholars have proposed pattern similarity measures based on relative entropy, such as LRE (local relative entropy) [21,22], LRWE-SNM (Local Network Relative Weighted Entropy Based Similar Node Mining) [23], and RE-model (relative entropy model) [24]. These methods have advantages in their respective fields and can measure the similarity of patterns to a certain extent. Although they are faster and simpler to compute, some of the pattern's information and the complex relationships between patterns are lost, for example, information about the second-order neighbors of patterns. That is to say, it is hard to distinguish between patterns with similar degree. In addition, there are many other ways to calculate pattern similarity; see [25][26][27][28] for details.
For patterns in directed weighted graph data, similarity is affected by the direction of the edges, and edges in different directions have an impact on the weight. Besides, each pattern carries information such as out-degree, in-degree, out-weight, and in-weight, and the relationship between a pattern and its neighbors at different scales is complex. Therefore, the similarity measure of patterns in directed weighted graph data cannot start from a single direction. Generally speaking, the above similarity measures have been used extensively. Nonetheless, there are still some inevitable limitations. The indices that use mutual information are limited to the common neighbor structure or local information of patterns; so, patterns of larger degree easily become general patterns in the similarity calculation. Even if existing methods simplify the measure of topological information, they ignore the directivity of the pattern's connections and the corresponding diversity of degree and weight in the relationship between patterns. Under these circumstances, some edge information of the pattern is lost, which prevents further improvement in calculating the similarity of patterns. In particular, the above indices may perform poorly when applied to link prediction. To sum up, calculating the similarity of patterns from the aspects of degree and weight diversity is still a hotspot [29][30][31].
In this paper, we focus on the similarity of patterns in directed weighted graph data. To this end, an extended similarity measure from a relative entropy point of view is proposed. The overall process consists of three stages. First, compute the strength set. By using the degree and weight information of a pattern and its first-order neighbors, four variables that contain the influence of topological information about degree and weight diversity are constructed. Second, generate the probability set. To take advantage of the second-order neighbor information of patterns, the Euclidean metric is used to measure the similarity between a pattern and its first-order neighbors. On this basis, the similarity values are normalized to construct the probability set of each pattern. Third, quantify the similarity of patterns. With the help of relative entropy, the dissimilarity of any two patterns is measured, and the similarity is obtained subsequently. We numerically simulated the proposed similarity measure and verified its effectiveness and efficiency in similarity measurement and link prediction. The main contributions of this paper are as follows.
(1) This paper presents a similarity measure based on relative entropy, which considers the information of second-order neighbors of patterns
(2) In the process of measuring pattern similarity, the proposed method considers both degree information and weight information
(3) Compared with most benchmark methods, the proposed similarity measure has a great advantage in measuring the similarity of patterns and performs well in link prediction
The remainder of this paper is organized as follows. Section 2 contains some preliminaries. Section 3 describes the generation of the strength set for patterns in detail. Section 4 proposes the probability set calculated from the similarity set. Section 5 constructs a measure to compute the similarity of patterns in graph data and proposes a novel algorithm. Section 6 and Section 7 carry out two types of experiments to demonstrate the effectiveness of the proposed method.

Wireless Communications and Mobile Computing

Preliminaries
In this section, we introduce some basic concepts used in this paper, such as graph data [26], relative entropy [32], and pattern's neighbor [23].

Graph Data.
Graph data $G$ is defined as a set of patterns and a set of edges. Formally, directed weighted graph data can be expressed as a 4-tuple $G = (V, E, D, W)$, where

(i) $V = \{v_i \mid i = 1, 2, \dots, n\}$ is the set of patterns, and $v_i \in V$ represents the $i$th pattern

(ii) $E = \{e_{ij} \mid i, j = 1, 2, \dots, n\}$ is the set of edges. Herein, $e_{ij} = 1$ if there is an edge from pattern $v_i$ to pattern $v_j$; otherwise, $e_{ij} = 0$

(iii) $D = \{(d(v_i), d^+(v_i), d^-(v_i)) \mid i = 1, 2, \dots, n\}$ is the set of degrees of the patterns, where $d^+(v_i)$ and $d^-(v_i)$ represent the in-degree and out-degree of $v_i \in V$, respectively. Their values can be determined by

$$d^+(v_i) = \sum_{j=1}^{n} e_{ji}, \qquad d^-(v_i) = \sum_{j=1}^{n} e_{ij}. \tag{1}$$

Moreover, the degree $d(v_i)$ is the sum of in-degree and out-degree, i.e.,

$$d(v_i) = d^+(v_i) + d^-(v_i). \tag{2}$$

(iv) $W = \{(w(v_i), w^+(v_i), w^-(v_i)) \mid i = 1, 2, \dots, n\}$ is the set of weights with respect to the corresponding edges. Analogously, $w(v_i)$, $w^+(v_i)$, and $w^-(v_i)$ represent the weight, in-weight, and out-weight of pattern $v_i$, respectively. The values of in-weight and out-weight can be determined by

$$w^+(v_i) = \sum_{j=1}^{n} w_{ji}, \qquad w^-(v_i) = \sum_{j=1}^{n} w_{ij}, \tag{3}$$

where $w_{ij}$ represents the weight on the edge from $v_i$ to $v_j$. Furthermore, the weight is the sum of in-weight and out-weight, i.e.,

$$w(v_i) = w^+(v_i) + w^-(v_i). \tag{4}$$

2.2. Relative Entropy.
As is well known, relative entropy is an asymmetric measure and can be applied to measure the difference between two probability distributions. In general, it can be expressed as

$$D_{KL}(P \| Q) = \sum_{i=1}^{m} P(i) \log \frac{P(i)}{Q(i)}, \tag{5}$$

where $P$ and $Q$ are two probability distributions, and $m$ in equation (5) represents the number of variables that $P$ and $Q$ depend on. A greater value of $D_{KL}(P \| Q)$ reflects a smaller similarity of $P$ and $Q$, and vice versa.
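As an illustration of the degree set $D$ and the weight set $W$, the following is a minimal Python sketch. It assumes that the graph is stored as a weight matrix `w` in which `w[i][j]` is the weight of the edge from pattern $v_i$ to pattern $v_j$ and `0` means that there is no such edge; the function name `degree_weight_sets` and the dictionary keys are hypothetical and are not taken from the paper.

```python
def degree_weight_sets(w):
    """Return degree and weight statistics for every pattern.

    Assumption of this sketch: w is an n x n list of lists, w[i][j] is the
    weight of the edge from pattern i to pattern j, and 0 means "no edge".
    """
    n = len(w)
    stats = []
    for i in range(n):
        d_out = sum(1 for j in range(n) if w[i][j] != 0)  # out-degree
        d_in = sum(1 for j in range(n) if w[j][i] != 0)   # in-degree
        w_out = sum(w[i][j] for j in range(n))            # out-weight
        w_in = sum(w[j][i] for j in range(n))             # in-weight
        stats.append({
            "d_in": d_in, "d_out": d_out, "d": d_in + d_out,
            "w_in": w_in, "w_out": w_out, "w": w_in + w_out,
        })
    return stats


# Toy graph with edges 0->1 (weight 2), 0->2 (weight 5), 1->2 (weight 3).
weights = [
    [0, 2, 5],
    [0, 0, 3],
    [0, 0, 0],
]
for index, entry in enumerate(degree_weight_sets(weights)):
    print(index, entry)
```

In this toy graph, pattern 0 only has outgoing edges and pattern 2 only has incoming edges, so the in and out quantities separate clearly.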
Generally speaking, if pattern $v_i$ has $p$ first-order neighbors, then its first-order neighborhood $N(v_i)$ can be represented as

$$N(v_i) = \{v_i^1, v_i^2, \dots, v_i^p\}. \tag{6}$$

Obviously, the elements of $N(v_i)$ reflect the topological information of $v_i$ directly. For the case that $e_{ij} \neq 0$ and $e_{jk} \neq 0$ but $e_{ik} = 0$, how to depict the relationship between $v_i$ and $v_k$ in terms of topological information is no longer obvious. To this end, the next definition gives the concept of second-order neighborhood to depict this situation.

Definition 2 (second-order neighborhood). Given graph data $G = (V, E, D, W)$ and $v_i \in V$, the second-order neighborhood of pattern $v_i$, denoted by $N(v_i, 2)$, is the set containing the neighbors of all its first-order neighbors, which can be expressed as

$$N(v_i, 2) = \bigcup_{k=1}^{p} N(v_i^k). \tag{7}$$

Definition 3 (local neighborhood). Given graph data $G = (V, E, D, W)$ and $v_i \in V$, the so-named local neighborhood of $v_i$ is expressed by equation (8), where $v_i^k \in N(v_i)$ is the $k$th first-order neighbor of $v_i$, for $k = 1, 2, \dots, p$.
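The first-order and second-order neighborhoods can be sketched in a few lines of Python. The sketch assumes that a neighbor is any pattern joined to $v_i$ by an edge in either direction, and it follows the wording of Definition 2 literally, so $v_i$ itself may appear in its own second-order neighborhood; both choices, as well as the function names, are assumptions of this example rather than statements from the paper.

```python
def first_order(adj, i):
    """Patterns joined to pattern i by an edge in either direction (assumed)."""
    n = len(adj)
    return {j for j in range(n) if j != i and (adj[i][j] or adj[j][i])}


def second_order(adj, i):
    """Union of the first-order neighborhoods of all first-order neighbors."""
    result = set()
    for k in first_order(adj, i):
        result |= first_order(adj, k)
    return result


# Directed path 0 -> 1 -> 2: e_01 and e_12 are nonzero while e_02 is zero.
adjacency = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(first_order(adjacency, 0))   # pattern 1 is the only direct neighbor
print(second_order(adjacency, 0))  # pattern 2 is reached through pattern 1
```

Whether $v_i$ and its first-order neighbors should be removed from the second-order set is a design choice that the sketch leaves open.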

Degree and Weight-Based Pattern's Strength Set
In this section, we investigate the problem of how to construct the pattern's strength set in terms of degree and weight.
For any pattern $v_i$ in $G = (V, E, D, W)$, its first-order neighborhood $N(v_i)$ depends on the corresponding topological connection. Whatever the connection, the topological information of each pattern can be described by four variables: out-degree, in-degree, out-weight, and in-weight. In what follows, we introduce the concept of strength set for any pattern $v_i$ in graph data $G$.

Definition 4 (strength set). Given graph data $G = (V, E, D, W)$, for any pattern $v_i \in G$, its strength set $U(v_i)$ is expressed by equation (9), where $k = 1, 2, \dots, p$ and $p = |N(v_i)|$. Each variable in $U(v_i)$ contains four strength values covering in-degree, out-degree, in-weight, and out-weight. Taking $v_i$ as an example,

(i) $u_{d^-}(v_i)$ represents the strength of out-degree and can be computed by equation (10)
(ii) $u_{d^+}(v_i)$ represents the strength of in-degree and can be computed by equation (11)
(iii) $u_{w^-}(v_i)$ represents the strength of out-weight and can be computed by equation (12)
(iv) $u_{w^+}(v_i)$ represents the strength of in-weight and can be computed by equation (13)

Analogously, $u_{v_i^k}$ represents the strength of the $k$th first-order neighbor of $v_i$, which can be calculated by the equations mentioned above. The strength defined in this way depicts both the pattern's own properties and the topological information of its first-order neighbors.
As discussed above, if $v_i$ and $v_j$ are two different patterns, then $N(v_i) \neq N(v_j)$ is nothing unusual. In particular, the two neighborhoods may differ in size. A closer look at relative entropy shows that the pattern with more neighbors then loses certain information, because relative entropy only evaluates the nonzero elements of the probability set, and the information of the nonzero elements in the probability set of the corresponding pattern is lost as well. Considering this deficiency, we introduce a concept, the scale of the strength set, to depict the strength set. Before doing this, we suppose that in graph data $G$ there exists at least one pattern having the most neighbors, and we denote this number by $n_p = \max_{v_i \in V} |N(v_i)|$. Then, for pattern $v_i$ with $|N(v_i)| = p_1$ and pattern $v_j$ with $|N(v_j)| = p_2$, we take the following cases into consideration:

Case 1. If $p_1 = p_2 = p = n_p$, then $U(v_i)$ and $U(v_j)$ keep their original form and need no padding.

Case 2. If $p_1 = p_2 = p < n_p$, then $n_p - p$ zeros, i.e., $\mathbf{0} = (0, 0, 0, 0)$, are appended to the end of $U(v_i)$ and $U(v_j)$.

Case 3. If $p_1 < p_2 < n_p$, the $p_2 - p_1$ missing locations of $U(v_i)$ are filled with the strength value $u_{v_i^*}$ calculated with the help of equation (16), and the remaining locations of $U(v_i)$ and $U(v_j)$ are filled with $n_p - p_2$ zeros, i.e., $\mathbf{0} = (0, 0, 0, 0)$.

Case 4. If $p_2 < p_1 < n_p$, $U(v_i)$ and $U(v_j)$ are changed analogously, where the strength value $u_{v_j^*}$ of $v_j$ is calculated in the same way as $u_{v_i^*}$.
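The four padding cases can be sketched in Python as follows. The sketch assumes that each strength value is a 4-tuple and that the fill value for the shorter set is the component-wise average of that set's existing strength values, which matches the verbal description of the algorithm later in the paper but is not a transcription of equation (16); the names `pad_pair` and `average_strength` are hypothetical.

```python
ZERO = (0.0, 0.0, 0.0, 0.0)


def average_strength(strengths):
    """Component-wise mean of a non-empty list of 4-tuples."""
    count = len(strengths)
    return tuple(sum(s[c] for s in strengths) / count for c in range(4))


def pad_pair(u_i, u_j, n_p):
    """Bring two strength sets to the common length n_p.

    The shorter set is first extended with its own average strength value
    (Cases 3 and 4); both sets are then extended with zeros (Cases 2 to 4).
    """
    u_i, u_j = list(u_i), list(u_j)
    target = max(len(u_i), len(u_j))
    for u in (u_i, u_j):
        if len(u) < target:
            filler = average_strength(u) if u else ZERO
            u.extend([filler] * (target - len(u)))
    for u in (u_i, u_j):
        u.extend([ZERO] * (n_p - len(u)))
    return u_i, u_j


# Case 3 example: p_1 = 1, p_2 = 2, n_p = 3.
a, b = pad_pair([(1, 1, 1, 1)], [(2, 0, 2, 0), (4, 0, 4, 0)], 3)
print(a)
print(b)
```

Padding with the average rather than with zeros keeps the shorter set from contributing empty positions that relative entropy would otherwise ignore.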

Generating Probability Set
Relative entropy is applied to compare the difference between two probability sets, and similarity can be derived from this difference. We therefore calculate the similarity between patterns from the viewpoint of relative entropy. Before doing so, constructing the probability set of each pattern $v_i \in G$ constitutes the first step of the similarity measure. We already know that the strength set $U(v_i)$ of $v_i$ and its first-order neighbors can be determined in terms of degree and weight. To make full use of relative entropy for similarity measurement, in what follows, we construct an approach to generate the probability set of patterns $v_i \in V$ for $i = 1, 2, \dots, n$.
Each strength value of the $p$ first-order neighbors is composed of four variables; hence, the Euclidean metric can be applied to compute the similarity between $v_i$ and its $j$th first-order neighbor $v_i^j \in N(v_i)$, for $j = 1, 2, \dots, p$, with respect to its strength set. The concrete formula is given by equation (20). Obviously, the value $s(v_i, v_i^j)$ describes the similarity between $v_i$ and its $j$th first-order neighbor, and it is only a local description from the viewpoint of $N(v_i)$. With the help of equation (20), a global description of the similarity of pattern $v_i$ is obtained by equation (21). The probability set $P(v_i)$ can then be created by equation (22), whose elements are defined in equation (23). Based on the above, the relative entropy between the probability sets $P(v_i)$ and $P(v_j)$ of two patterns $v_i, v_j$ in $G$ can be determined by

$$D_{KL}(P(v_i) \| P(v_j)) = \sum_{k=1}^{m'} p_k(v_i) \log \frac{p_k(v_i)}{p_k(v_j)}, \tag{24}$$

where $p_k(v_i)$ denotes the $k$th element of $P(v_i)$, and $m'$ represents the maximal number of neighbors of patterns $v_i$ and $v_j$; that is, $m' = \max \{|N(v_i)|, |N(v_j)|\}$.
It can be seen that, in the process of calculating the relative entropy of patterns $v_i$ and $v_j$, the strength sets of the first-order neighbors of $v_i$ and $v_j$ are constructed, and these contain second-order neighbor information. That is to say, with the help of the Euclidean metric, the information of the pattern's second-order neighbors is indirectly used during the similarity calculation.
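The construction of the probability set and the relative entropy step can be sketched in Python. The sketch rests on three assumptions that are not confirmed by the recoverable text: the local value $s(v_i, v_i^j)$ is the plain Euclidean distance between two 4-component strength values, the probability set is obtained by dividing each local value by their sum, and terms in which either probability is zero are skipped. All function names are hypothetical.

```python
import math


def similarity_set(u_center, u_neighbors):
    """Euclidean distance between a pattern's strength and each neighbor's."""
    return [math.dist(u_center, u_k) for u_k in u_neighbors]


def probability_set(similarities):
    """Normalize the local values so that they sum to 1 (assumed form)."""
    total = sum(similarities)
    if total == 0:
        raise ValueError("all local values are zero; cannot normalize")
    return [s / total for s in similarities]


def relative_entropy(p, q):
    """Relative entropy of p with respect to q over m' = max(len(p), len(q)).

    The shorter set is padded with zeros, and terms with a zero entry on
    either side are skipped (a simplification of this sketch).
    """
    m = max(len(p), len(q))
    p = list(p) + [0.0] * (m - len(p))
    q = list(q) + [0.0] * (m - len(q))
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0 and b > 0)


p = probability_set(similarity_set((0, 0, 0, 0), [(3, 4, 0, 0), (0, 5, 12, 0)]))
q = [0.5, 0.5]
print(p)
print(relative_entropy(p, q), relative_entropy(q, p))  # the two values differ
```

The two printed relative entropy values differ, which illustrates the asymmetry that the similarity step has to deal with.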

Similarity and Algorithm
The calculation of relative entropy among the patterns has been discussed in detail. In this section, the calculated value of relative entropy will be used to compute the similarity between patterns. And then, an algorithm is proposed.

Quantify Similarity of Pattern.
From the process mentioned above, the relative entropy of any two patterns is obtained based on the sorted probability sets. Therefore, the relative entropy matrix $R$ of graph data with respect to any two patterns can be represented as

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nn} \end{pmatrix}, \tag{25}$$

where $r_{ij}$ is the relative entropy of patterns $v_i$ and $v_j$. Then, the similarity matrix $S$ of graph data $G$ is given by equation (26).
Because relative entropy is asymmetric, both $D_{KL}(P(v_i) \| P(v_j))$ and $D_{KL}(P(v_j) \| P(v_i))$ describe the dissimilarity of patterns $v_i$ and $v_j$. To obtain a more accurate similarity, the element $s_{ij}$ of $S$ is computed from these values of relative entropy by equation (27).

5.2. Algorithm.
For a better understanding of the proposed pattern similarity measure, this section gives an algorithm containing a detailed description of the measure. For brevity, "relative entropy-based similarity for patterns in graph data" is abbreviated as "RESG." With this algorithm, the similarity of any two patterns is computed, after which the most similar patterns can be obtained. The algorithm has four stages. The input of the RESG algorithm is directed weighted graph data $G$, and the output is a matrix $S$ composed of the similarity between any two patterns in $G$.
The first stage of the RESG algorithm is lines 1-4: the strength set $U$ of each pattern in $G$ is generated, and each strength value has four variables in terms of in-degree, out-degree, in-weight, and out-weight. The second stage is lines 5-11: to fully utilize the information of the pattern's first-order neighbors, the pattern with fewer neighbors has the average strength value appended to the end of its strength set. The third stage is lines 12-15: the similarity between each pattern and its first-order neighbors is computed, and the similarity set is generated. With the help of the similarity set, the pattern's probability set is obtained. During the generation of the similarity set, the information of the pattern's second-order neighbors is used indirectly. The last stage is lines 16-20: by taking the above information into account, the relative entropy and the similarity of patterns are measured.
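The last stage, going from a relative entropy matrix to the most similar pattern of each pattern, can be sketched in Python. Since a greater relative entropy reflects a smaller similarity, the sketch selects, for each pattern, the other pattern with the smallest symmetrized relative entropy. Symmetrizing by the sum $r_{ij} + r_{ji}$ is an assumption of this example and is not claimed to be the paper's equation (27); the function name `most_similar` is hypothetical.

```python
def most_similar(r):
    """For each pattern i, return the index j != i that minimizes r[i][j] + r[j][i].

    r is an n x n relative entropy matrix with n >= 2. Ties are resolved in
    favor of the smallest index.
    """
    n = len(r)
    result = []
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        best = min(candidates, key=lambda j: r[i][j] + r[j][i])
        result.append(best)
    return result


# Toy asymmetric relative entropy matrix for three patterns.
entropy_matrix = [
    [0.0, 0.1, 0.9],
    [0.2, 0.0, 0.5],
    [0.8, 0.4, 0.0],
]
print(most_similar(entropy_matrix))
```

Working on the symmetrized values avoids the situation in which pattern $v_i$ looks close to $v_j$ in one direction of the relative entropy but far in the other.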

Experimental Materials
In this section, we introduce the experimental materials, namely, the experimental environment, the graph data used in the experiments, and the benchmark algorithms. The experimental environment is listed in Table 1.

6.1. Data.
This subsection gives a detailed description of the directed weighted graph data used in the experiments.
Synthetic graph data Data1, generated by means of the graph generator Gephi, is applied in the first experiment. Data1 contains 21 patterns and 31 edges and is used to illustrate the feasibility of the proposed RESG algorithm in the following illustrative example.

Algorithm 1: Relative entropy-based similarity for patterns in graph data.
Input: A directed weighted graph data $G$.
Output: Similarity matrix $S$ of patterns in $G$.
1: for each $v_i \in G$ do
2:   Calculate first-order neighbors $L(v_i)$
3:   Calculate $U(v_i)$ by equations (10)-(13)
4: end
5: for each $v_i \in G$ with $p_1$ neighbors and $v_j \in G$ with $p_2$ neighbors do
6:   if $p_1 > p_2$ then
7-13: ⋯ (20) and (21)
14:   Compute $P(v_i)$ by equation (22)
15: end
16: for each $v_i, v_j \in G$ do
17:   Compute $D_{KL}(P(v_i) \| P(v_j))$ by equation (24)
18:   Compute entropy matrix $R$ by equation (25)
19:   Compute similarity matrix $s_{ij}$ by equation (27)
20: end
21: return Similarity matrix $S$

The graph data used in the second experiment are described as follows. Data2 and Gene [33] are used to compare the proposed RESG index with other similarity measures. For the edge connections of Data2 and Gene [33], see Figures 1 and 2. Stmarks, FWEW, FWMW, FWFW, Celegans, and Email167 are directed weighted graph data collected from Stanford Dataset. Each of them is used to show the effectiveness of the proposed RESG algorithm in link prediction. The topology information of these eight graph data is shown in Table 2, where $n$ is the number of patterns, $m$ is the number of edges, $\langle d \rangle$ is the average shortest distance, $\langle \rho \rangle$ is the density, $\langle k \rangle$ is the average degree, and $\langle c \rangle$ is the clustering coefficient.

Benchmark Algorithms.
Here, we introduce several benchmark similarity indices, which are commonly used for similarity measurement and link prediction.
AA (Adamic-Adar) index is the extended version of CN index, which is defined as

$$s_{ij}^{AA} = \sum_{v_z \in N(v_i) \cap N(v_j)} \frac{1}{\log d(v_z)}.$$

WAA index is the weighted version of AA index. Because the quantity $a$ inside the logarithm of its definition may be smaller than 1, $\log(1 + a)$ is used to avoid a negative value. CN index directly takes the number of all common neighbors between patterns as the similarity, which is defined as

$$s_{ij}^{CN} = |N(v_i) \cap N(v_j)|.$$

LRE index is a similarity measure based on relative entropy and the local structure of patterns. In its definition, $\Delta(G)$ is the maximum degree of graph data $G$, and $P_i$ is the probability set of pattern $v_i$ with respect to degree.
Katz index is based on the global information of graph data, which is defined as

$$s_{ij}^{Katz} = \sum_{l=1}^{\infty} \beta^{l} \, |\mathrm{paths}_{ij}^{\langle l \rangle}|,$$

where $|\mathrm{paths}_{ij}^{\langle l \rangle}|$ represents the number of paths of length $l$ between patterns $v_i$ and $v_j$, and $\beta$ is the damping factor used to control the path weight. LP index considers the third-order paths on the basis of common neighbors, which is defined as

$$S^{LP} = A^2 + \alpha A^3,$$

where $A$ is the adjacency matrix of graph data [34], $(A^3)_{ij}$ represents the number of paths of length 3 between patterns $v_i$ and $v_j$, and $\alpha$ is an adjustable parameter. LRW index is proposed based on the local random walk of particles between two patterns, which is defined as

$$s_{ij}^{LRW}(t) = \frac{d(v_i)}{2|E|} \pi_{ij}(t) + \frac{d(v_j)}{2|E|} \pi_{ji}(t),$$

where $|E|$ is the number of edges in the graph data, and $\pi_{ij}(t)$ is obtained according to the density vector evolution equation $\vec{\pi}_i(t + 1) = P^{T} \cdot \vec{\pi}_i(t)$, in which $P$ is the transition probability matrix and $T$ denotes the matrix transpose.
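Three of the benchmark indices, CN, AA, and LP, can be sketched in plain Python in their standard textbook forms. For simplicity the sketch assumes an undirected, unweighted adjacency matrix stored as a list of lists; the function names and the value `alpha = 0.5` are illustrative choices and are not settings reported in the paper.

```python
import math


def neighbors(adj, i):
    """Indices of the patterns adjacent to pattern i."""
    return {j for j, linked in enumerate(adj[i]) if linked}


def cn_index(adj, i, j):
    """CN: number of common neighbors of patterns i and j."""
    return len(neighbors(adj, i) & neighbors(adj, j))


def aa_index(adj, i, j):
    """AA: common neighbors weighted by the inverse log of their degree."""
    common = neighbors(adj, i) & neighbors(adj, j)
    return sum(1.0 / math.log(len(neighbors(adj, z))) for z in common)


def matmul(a, b):
    """Product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def lp_index(adj, alpha):
    """LP: A^2 + alpha * A^3, returned as a full matrix."""
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + alpha * a3[i][j] for j in range(n)] for i in range(n)]


# Undirected 4-cycle 0 - 1 - 3 - 2 - 0.
square = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
print(cn_index(square, 0, 3), aa_index(square, 0, 3))
print(lp_index(square, 0.5)[0])
```

In the 4-cycle, the opposite corners 0 and 3 share both remaining patterns as common neighbors, whereas adjacent corners share none and are linked only through paths of length 3, which is the contribution that LP adds on top of CN.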

Experimental Analysis
In this section, we evaluate the proposed RESG index on different graph data, and two forms of experiments are used to demonstrate the experimental results, which aims to further prove the effectiveness and efficiency of the proposed RESG index.

7.1. Illustrative Example.
Data1 is used to illustrate the proposed RESG index; for the edge connections of Data1, see Figure 3. Taking patterns $v_{15}$ and $v_{10}$ as an example, we deal with the problem of pattern similarity step by step in terms of RESG index. First, we find the first-order neighbors of each of the two patterns and put them in $L(v_i)$, and the relevant strength values $u_{d^-}$, $u_{d^+}$, $u_{w^-}$, $u_{w^+}$ of $v_{15}$ and $v_{10}$ are calculated and shown in Tables 3 and 4, respectively. It can be easily found that $d_{15} = 5$ and $d_{10} = 4$. Based on this, a pattern $v_0$ with the average value of $v_{10}$ for $u_{d^-}$, $u_{d^+}$, $u_{w^-}$, and $u_{w^+}$ is added as a first-order neighbor of $v_{10}$. After that, the two patterns have the same number of neighbors, which avoids the partial information loss of $v_{15}$ in the subsequent calculation of relative entropy.
Second, the similarity sets are generated in the process of calculating the similarity between the patterns and their first-order neighbors. Then, with the help of equation (24), the relative entropy $r_{10,15}$ of patterns $v_{15}$ and $v_{10}$ can be computed.
Finally, by computing the pattern similarity of graph data $G$, the maximum value of pattern similarity can be found from Figure 4; in terms of equation (25), the similarity of $v_{15}$ and $v_{10}$ is 0.8901. The similarity calculation process of $v_{15}$ and $v_{10}$ helps to better understand RESG index. The details of the relevance matrix of graph data $G$ are shown in Figure 4, and the most similar pattern of each pattern in $G$ is shown in Table 5.
According to Figure 3, compared with pattern $v_{10}$, patterns $v_5$ and $v_{15}$ have more similar topological structures. According to Table 5, the most similar pattern of $v_{15}$ is exactly identified as pattern $v_5$. The illustrative example shows that RESG index is simple, efficient, and reliable, with highly satisfactory accuracy.

Result Analysis.
To further illustrate the efficiency of the proposed RESG algorithm in measuring pattern similarity, this subsection gives comparative experiments with several benchmark similarity measures.
[Table header: serial number, $k_{out}$, $k_{in}$, $w_{out}$, $w_{in}$, $u_{k_{out}}$, $u_{k_{in}}$, $u_{w_{out}}$, $u_{w_{in}}$]
A good similarity measure is one whose scatter plot is dispersed over the plane, rather than concentrated on both sides of the diagonal. The reason is that if the points are concentrated on both sides of the diagonal, the method tends to identify a pattern's neighbors as its most similar patterns, which is not accurate enough. In the following, on Data2 and Gene, the scatter plots of the proposed RESG index and the other seven similarity indices are used to further validate the performance of the similarity measures; they are shown in Figures 5 and 6, respectively. Figures 5(a), 6(a), 5(b), 6(b), 5(c), and 6(c) show the scatter plots formed by AA index, WAA index, and CN index, respectively. As we can see, the most similar patterns are concentrated near the diagonal. These three indices have low computational complexity; nevertheless, they use very limited information. Generally speaking, their similarity is determined by the number of common neighbors between patterns. Accordingly, the most similar patterns are distributed near the corresponding patterns. Although the symmetry of patterns is good, it is difficult to accurately describe the similarity between patterns when only one path is considered.

From an overall view, LRE takes information of the local structure into consideration, yet obtaining an accurate similarity value remains a daunting challenge. In addition, as the size of graph data increases, the symmetry between patterns decreases significantly. Figures 5(e) and 6(e) show the scatter plots formed by Katz index. It is worth noting that the adjustable parameter of Katz index is set to $\alpha = 0.001$, whereby the scatter plot under Katz index is concentrated around the diagonal. Furthermore, the symmetry is not desirable. Katz index relies more on the paths among patterns in graph data, and patterns with larger degree are more likely to lie on the paths between different patterns; so, there is a greater probability that most patterns are similar to the patterns with greater degree in graph data.
Figures 5(f) and 6(f) show the scatter plots formed by LRW index, and the number of random walk steps in this experiment is set to 3. One can see that the scatter plots are unevenly distributed, and the accuracy of the similarity obtained by this measure needs to be further improved. Moreover, LRW index considers the random walk with a finite number of steps, and the computational complexity of this measure is higher.
The scatter plots of LP index are shown in Figures 5(g) and 6(g). The advantage of LP index is its low computational complexity. However, owing to the limited information used, the distribution of similarity values is too concentrated, which makes it difficult to distinguish the similarity between patterns.
Figures 5(h) and 6(h) show the scatter plots formed by RESG index. As we can see, the most similar patterns are not distributed near the diagonal, and as the size of graph data increases, the scatter plot formed by RESG index still maintains good symmetry. RESG index measures the similarity between patterns using the influence of pattern degree and weight and takes the information of first-order and second-order neighbors of patterns into account, which yields a more accurate similarity of any two patterns. Under these circumstances, most patterns avoid becoming general patterns, that is, patterns with a common structure that are identified as most similar to multiple patterns. In terms of runtime, RESG index takes longer than LRW index, CN index, and AA index. However, compared with LRE index, which is the same type of relative entropy-based similarity, the running time of RESG index is only 1/4 of that of LRE index. In addition, compared with conventional algorithms, RESG index is simple and can measure the similarity of patterns in large graph data efficiently.
From a different angle, a quantity named most similar pattern, listed in Table 6, is used to demonstrate the difference between RESG index and three existing measures, LRE index, CN index, and EI index [35], so as to verify the good effect of RESG index from another perspective. The first row of Table 6 gives the pattern labels; three of every 100 patterns are selected randomly, and a total of 20 are listed as experimental patterns. "/" indicates that the pattern does not have a most similar pattern. Since a pattern in graph data may have more than one most similar pattern, only the same pattern sequence numbers are listed, and the rest of the most similar patterns are shown in the table as a count in parentheses. Taking pattern $v_5$ under EI index as an example, (148) means that pattern $v_5$ has 148 most similar patterns.
As shown in Table 6, pattern $v_7$ is identified as the most similar pattern of 7 different patterns under LRE index, including patterns $v_5$, $v_{52}$, and $v_{172}$. LRE index considers only the degree of patterns; so, many patterns may have the same degree distribution, which leads to the same similarity between patterns. Analogously, under EI index, several patterns have more than one most similar pattern. For example, 148 most similar patterns are identified for each of patterns $v_5$, $v_{104}$, $v_{297}$, and so on. However, there are also patterns without a most similar pattern, for instance, patterns $v_{52}$ and $v_{81}$.
As we can see, under RESG index there is no case in which multiple patterns identify the same most similar pattern. RESG index takes the information of a pattern's first-order and second-order neighbors into account, which allows it to calculate the similarity accurately. Meanwhile, the weight of patterns also contains a lot of topological information, and two patterns may have the same degree distribution but different weights. RESG index starts from the perspective of both degree and weight, which helps it distinguish the similarity precisely. As a result, RESG index is feasible and effective.
To further verify the feasibility of the proposed similarity measure, RESG index is applied to link prediction, and its prediction performance is compared with that of CN index, LP index, Katz index, and LRW index. The experiment is carried out on six graph data collected from Stanford Dataset, and AUC is selected as the index to evaluate the prediction performance of effective path topology stability. For more information on AUC, see reference [34]. Figure 7 shows the comparison of AUC results for RESG index and the other four similarity measures. Among them, CN index considers only the degree information of patterns, whereas LRW index, LP index, and Katz index consider either the local paths or the global paths of graph data, so their time complexity is relatively high. As we can see from Figure 7, compared with RESG index and LRW index, the AUC values of CN index, LP index, and Katz index on Stmarks and FWEW are not ideal. However, LRW index has higher time complexity than RESG index. The AUC of RESG index is the highest on four graph data (FWMW, FWFW, Celegans, and Email167) and second only to LRW index on Stmarks and FWEW. Meanwhile, compared with the AUC of the other four measures, the improvement rate ranges from 2% to 21%. That RESG index achieves the highest AUC value on four of the six graph data shows, to some extent, its effectiveness and feasibility.
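For readers unfamiliar with the metric, the sketch below shows how AUC is commonly estimated in link prediction. It assumes the usual sampling definition (draw a held-out link and a nonexistent link, compare their scores, and count ties as one half); the paper defers the exact definition to reference [34], so this may differ in detail. The function name, its parameters, and the toy scores are hypothetical.

```python
import random


def auc_link_prediction(score, missing_links, nonexistent_links,
                        n_samples=10000, seed=0):
    """Estimate AUC by repeated random comparisons.

    score             -- callable (u, v) -> similarity of the pattern pair
    missing_links     -- held-out links that really exist (test set)
    nonexistent_links -- pattern pairs with no link in the graph data
    """
    rng = random.Random(seed)
    higher = equal = 0
    for _ in range(n_samples):
        s_missing = score(*rng.choice(missing_links))
        s_nonexistent = score(*rng.choice(nonexistent_links))
        if s_missing > s_nonexistent:
            higher += 1
        elif s_missing == s_nonexistent:
            equal += 1
    # A value near 0.5 means the scores are no better than chance.
    return (higher + 0.5 * equal) / n_samples


# Hypothetical similarity scores for two held-out and two nonexistent links.
toy_scores = {(0, 1): 0.9, (2, 3): 0.7, (0, 3): 0.2, (1, 2): 0.1}
held_out = [(0, 1), (2, 3)]
absent = [(0, 3), (1, 2)]
auc = auc_link_prediction(lambda u, v: toy_scores[(u, v)], held_out, absent)
```

In this toy case every held-out link scores higher than every nonexistent link, so the estimate is 1.0; a similarity measure that assigns the same score to all pairs would yield 0.5.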
However, it deserves attention that the proposed RESG index also has limitations: it achieves a better link prediction effect on graph data with small clustering coefficients. For graph data with large clustering coefficients, the effect of this measure needs to be further improved and optimized.

Conclusion
Measuring the similarity of patterns in graph data is significant work in many fields. In this paper, to overcome the shortcomings and limitations of existing similarity measures, a relative entropy-based similarity for patterns in graph data, abbreviated as RESG index, is constructed. Our main work is divided into three aspects. Firstly, the strength set is given by degree and weight, which yields four variables that contain the information of topological relationships among first-order neighbors. Then, in order to generate the probability set, a pattern with fewer neighbors is redefined by appending empty neighbors until it has as many neighbors as the other pattern. Finally, relative entropy is computed, and the pattern's similarity is calculated. In addition, two sets of comparison experiments with several classic similarity measures are used to show the effectiveness and feasibility of the proposed RESG index algorithm. Experiments indicate that by taking a pattern's degree, weight, and second-order neighbors into consideration, the RESG index algorithm can better identify the similarity between patterns. To some extent, our proposed approach can enrich the research in the area of pattern similarity in graph data.
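The last two phases summarized above (padding with empty neighbors, then computing relative entropy) can be illustrated with a generic sketch. It is not the paper's exact formulation: the function names are hypothetical, empty neighbors are assumed to carry probability zero, the sets are assumed to be sorted in descending order before alignment, and terms in which either probability is zero are skipped so that the sum stays finite.

```python
import math


def pad_to_same_length(p, q):
    """Sort both probability sets in descending order (assumption) and
    append zeros, standing in for empty neighbors, to the shorter one."""
    p, q = sorted(p, reverse=True), sorted(q, reverse=True)
    size = max(len(p), len(q))
    return p + [0.0] * (size - len(p)), q + [0.0] * (size - len(q))


def relative_entropy(p, q):
    """Sum of p_i * ln(p_i / q_i) over aligned entries.

    Assumption: terms where either probability is zero are skipped.
    """
    return sum(pi * math.log(pi / qi)
               for pi, qi in zip(p, q) if pi > 0 and qi > 0)


def symmetric_relative_entropy(p, q):
    """Add both directions so the result does not depend on argument order."""
    p, q = pad_to_same_length(p, q)
    return relative_entropy(p, q) + relative_entropy(q, p)


# Hypothetical probability sets of two patterns with 2 and 1 neighbors.
difference = symmetric_relative_entropy([0.5, 0.5], [1.0])
```

A smaller value indicates more similar patterns under this sketch, and two identical probability sets give a difference of zero.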

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.