Supergraph Topology Feature Index for Personalized Interesting Subgraph Query in Large Labeled Graphs

Interesting subgraph query aims to find subgraphs that are isomorphic to the given query graph from a data graph and rank the subgraphs according to their interestingness scores. However, the existing subgraph query approaches are inefficient when dealing with large-scale labeled data graph. (is is caused by the following problems: (i) the existing work mainly focuses on unweighted query graphs, while ignoring the impact of query constraints on query results. (ii) Excessive number of subgraph candidates or complex joins between nodes in the subgraph candidates reduce the query efficiency. To solve these problems, this paper proposes an intelligent solution. Firstly, an Isotype Structure Graph Compression (ISGC) strategy is proposed to compress similar nodes in a graph to reduce the size of the graph and avoid unnecessary matching. (en, an auxiliary data structure Supergraph Topology Feature Index (STFIndex) is designed to replace the storage of the original data graph and improve the efficiency of an online query. After that, a partition method based on Edge Label Step Value (ELSV) is proposed to partition the index logically. In addition, a novel Top-K interest subgraph query approach is proposed, which consists of the multidimensional filtering (MDF) strategy, upper bound value (UBV) (Size-c) matching, and the optimizational join (QJ) method to filter out as many false subgraph candidates as possible to achieve fast joins.We conduct experiments on real and synthetic datasets. Experimental results show that the average performance of our approach is 1.35 higher than that of the state-of-the-art approaches when the query graph is unweighted, and the average performance of our approach is 2.88 higher than that of the state-of-the-art approaches when the query graph is weighted.


Introduction
In recent years, along with the growing popularity of the Internet, a mass of data with closely related to real entity has been produced [1][2][3], which has grown explosively [4] in various fields. Taking social networks as an example, in 2018, the monthly active users of WeChat stabilized at about 1.12 billion, and 45 billion messages were sent every day [5]. In 2019, Facebook officially announced that the total number of active users has exceeded 2.45 billion [6]. As an efficient data structure, labeled graphs [7] can better express the internal connections of various entities [8] for dealing with such large-scale data [9].
Subgraph query aims to search all matching subgraphs in a data graph that are isomorphic to the given query graph and have correct labels [9][10][11]. It is widely used to describe, mine, and analyze various knowledge graphs [12]. For example, in the anti-fraud knowledge graph for credit cards, we can monitor fraud gangs by defining fraud gangs (subgraph schema) and query them. By using a subgraph query, various social relations of an enterprise can be captured and analyzed from an enterprise knowledge graph. In addition, by using subgraph query, excellent teams can be mined in a military management knowledge graph, interest groups can be searched in a social knowledge graph, and target groups can be found in an e-commerce knowledge graph for a personalized recommendation.
In practice, users are more interested in a few matching subgraphs with higher-ranking scores, rather than the whole query results of a subgraph query. Usually, a subgraph with a ranking score is called an interesting subgraph, and the ranking score is calculated by the interestingness score [13] of a subgraph, which is the sum of weights. In order to reduce the negative impact of information overload, the concept of Top-K interesting subgraph query [13] is proposed, that is, selecting K subgraphs with the highestranking scores from the results of a subgraph query. For example, Figure 1(a) shows a data graph G abstracted from DBLP, in which nodes A, J, and F represent author, journal, and research field, respectively; the weight on each edge denotes the degree of correlation between two entities. Figure 1(b) shows a query graph Q for the Top-K interesting subgraph query. Assuming K � 2, then the Top-K interesting subgraphs isomorphic to Q in the data graph G are P 1 (interestingness score is 3.4) and P 2 (interestingness score is 3.0).
Recently, various approaches were proposed for Top-K interesting subgraph query. However, most of the existing subgraph query approaches are inefficient when dealing with large-scale graphs. is is caused by the following problems: (i) the existing work mainly focuses on unweighted query graphs, while ignoring the impact of query constraints on query results. (ii) Excessive number of subgraph candidates or complex joins between nodes in the subgraph candidates reduce the query efficiency.
Specifically, in practice, users often want to get more specific and accurate query results. is can be achieved by adding constraints of filters to a query graph. Taking Figure 1(a) as an example, for the given query graph Q and data graph G, we can find all the isomorphic subgraphs of Q in G. en, if adding constraints or filters to the query graph, we can find more accurate subgraph matching results in the query results, such as to find teams (i.e., subgraphs) with frequent cooperation between members. is can be realized by adding edge weights between two entities with label A in the query graph. In other words, it is to find the isomorphic subgraphs of the query graph with the highest-ranking scores on the data graph, and the edge weight between any two author nodes on the isomorphic subgraphs is greater than or equal to the edge weight between two author nodes on the query graph. Finding isomorphic subgraphs with edge weights greater than or equal to those on the query graph is called a constraint or filter. us, in this paper, the process of obtaining subgraph query results by constraining or filtering a query graph is called a personalized Top-K interesting subgraph query. It refers to the K matching subgraphs with the top-ranking scores among the whole query results that are isomorphic to the query graph and satisfying constraints or filters.
To solve the above problems, this paper proposes an intelligent solution. Specifically, an Isotype Structure Graph Compression (ISGC) strategy is first proposed to compress similar nodes in a graph to reduce the size of the graph and avoid unnecessary matching. en, an auxiliary data structure Supergraph Topology Feature Index (STFIndex) is designed to replace the storage of the original data graph and improve the efficiency of an online query. After that, a partition method based on Edge Label Step Value (ELSV) is proposed to partition the index logically. In addition, a novel Top-K interest subgraph query approach is proposed, which consists of the multidimensional filtering (MDF) strategy, UBV (Size-c) matching, and the optimizational join (QJ) method to filter out as many false subgraph candidates as possible to achieve fast joins.
We conduct experiments on real and synthetic datasets. Experimental results show that the average performance of our approach is 1.35 higher than that of the state-of-the-art approaches when the query graph is unweighted, and the average performance of our approach is 2.88 higher than that of the state-of-the-art approaches when the query graph is weighted. e contributions of this paper can be summarized as follows: (i) We propose an Isotype Structure Graph Compression (ISGC) strategy. Specifically, in a data graph, it compresses adjacent nodes that have the same label into a folding node that contains multiple raw data nodes. en, the edges connected from a node to a folding node are compressed to form a critical compression edge. e compressed folding nodes and critical compression edges can not only reduce the size of the data graph but also filter multiple invalid nodes and edges in batch. (ii) We design an index called Supergraph Topology Feature Index (STFIndex), which represents all topology information in the generated compressed graph. As an auxiliary data structure, it can replace the original data graph storage. STFIndex will be created offline to improve the efficiency of online queries. It consists of Node Compression Topology Index (NCTIndex) and Edge Compression Feature Index (ECFIndex), which can be used to filter false candidates of nodes and edges. (iii) To improve the computational efficiency of subgraph query in a large labeled graph, we propose a partition method based on Edge Label Step Value (ELSV). It logically divides STFIndex into multiple partitions, enabling parallel queries for each partition. (iv) By using STFIndex, we propose a personalized Top-K interesting subgraph query approach in large labeled graphs, called STF_ISQtop-k. It contains query graph preprocessing, multidimensional filtering (MDF) strategy, UBV (Size-c) matching, and an optimizational join (OJ) method. MDF is used to filter out as many false candidates as possible, and OJ implements the fast joins of subquery results.
Organization. e rest of this paper is organized as follows: in Section 2, we introduce the background of this paper, which contains the problem statement and the related works. ISGC strategy and STFIndex structure are discussed in Section 3. We provide the details of the proposed Top-K interesting subgraph query approach in Section 4. Experimental results and analysis are discussed in Section 5. Section 6 is the conclusion.
is the set of edges; L (V ⟶ Σ) presents a function labeling the nodes; Σ is the sets of labels that can appear on the nodes; and W is the set of edge weights. In this paper, the data graph we studied is an undirected labeled graph, in which an undirected edge between nodes u and v is denoted indifferently by (u, v) or (v, u). Conventional subgraph query problem is usually defined as given a query graph Q and a data graph G, to find all the isomorphic subgraphs of Q in G. With the explosion of data volume and data complexity, researchers turned to research the approaches to search K optimal results by using edge weights, i.e., Top-K interesting subgraph query [13]. However, in practice, users hope to add one or more constraints of edge weights to the query graph to satisfy more accurate (or personalized) queries. is query can be called a personalized Top-K interesting subgraph query in this paper.
We use the interestingness score [14] to identify the quality of a matching subgraph, which is the sum of the weights of a matching subgraph in the data graph G. It is generally believed that the higher the interestingness score, the better the matching result. us, the personalized Top-K interesting subgraph query is used to search the top K matching subgraphs with the highest interestingness score under the condition that the query constraint is satisfied.

Related Work.
Subgraph query has a long research history. e earliest and most classical subgraph query algorithm, named Ullmann algorithm [15], was based on state-space search, which laid a foundation for the subsequent research on subgraph query. However, the Ullmann algorithm adopted a recursive exhaustive method without defining the matching order. As the size of data graphs increases, the query efficiency of the Ullmann algorithm decreased significantly.
As early as 1979, subgraph query has been proved to be an NP-complete problem [16], and it is difficult to find the optimal results in a limited time. us, VF2 [17] improved the pruning strategy of the Ullmann algorithm. It proposed a set of feasibility decision rules based on the original statespace representation and then defined the matching order of query nodes. However, both Ullmann and VF2 were limited by the superlinear complexity of large-scale graphs.
Some algorithms [18][19][20] constructed indexes based on frequent substructures to quickly match the substructures contained in a query graph. Such approaches are efficient when the query graph is frequently structure. Since the same node or edge may exist in multiple frequent subgraphs, the improper mining of frequent structures can lead to excessive redundant storage when building indexes in a large-scale graph. In addition, if the query graph does not contain frequent substructures, only plain query approaches can be made. In other words, frequent subgraph indexes limit the generality of the algorithms.
GraphQL [21] and SPath [22] constructed indexes based on reachable neighbors, and they pruned the candidate sets by capturing the information of neighbor nodes or edges and defining the corresponding optimization strategies. GraphQL [21] used the information of neighbor nodes and breadth-first search to minimize candidate sets, which filtered out invalid nodes and edges. SPath [22] constructed the index by the shortest path of nodes within the D-hop range. e index contained the number of neighbor labels per hop and the ID of the corresponding node for each label. e former was stored in faster memory, while the latter was stored in memory or out of memory, depending on the size  CFLMatch [23] delayed Cartesian product operation to reduce the number of intermediate results for subgraph matching. It built a Compact Path-Index (CPI) according to the query graph in real-time to achieve query matching. CFLMatch used an adjacency matrix representation (with size |V| × |V|) of the data graph, which limited that it can only be used to process small-scale data graphs. CECI [24] utilized the BFS-based filtering and reverse-BFS-based refinement to prune the unpromising candidates and then replaced the edge verification with set intersection to speed up the candidate verification. In addition, similar algorithms include the literature [25][26][27][28].
Most index-based algorithms adopted the principle of filtering the nodes in the query graph one by one, and only considered the structural relationships between the current node and its adjacent nodes. ey ignored the relationships between nodes of the same label type, leading to a lot of double counting. TurboISO [29] proposed the Comb/Perm strategy, which used neighborhood equivalence classes to generate a combination for each NEC during a subgraph isomorphism search. e similar nodes in the query graph are then combined into a single supernode. Instead of arranging all possible enumerations, it only filtered the supernodes.
BoostISO [30] focused on the characteristics between nodes. It not only considered the relationships between nodes in a query graph but also extended the compression techniques to a data graph. It also merged similar nodes in the data graph. When TurboISO and BoostISO applied graph compression techniques to subgraph queries, they focused on the redundancy of similar nodes, while ignoring the redundancy of dissimilar nodes. A subgraph query algorithm based on the ASSG model [31] was proposed in [31] to compress the data graph into a graph that can adapt to the structure of the query graph. It reduced the computation in the matching process. However, for the subgraph query of weighted graphs, none of these algorithms contained a weight filtering strategy. is resulted in an inefficient late filtering process that negated the benefits of earlier queries.
SumISO framework [32] performed graph compression with modular decomposition. e compressed graph was parsed into several regions containing query graphs, using supervertex selection to find the modules that may match the query graph. en, the backtracking algorithm [32] was used to find a successfully matching query graph in each region. Multiple iterations of the backtracking algorithm made it difficult to extend to large graphs.
e Top-K interesting subgraph query mainly includes two parts: one is to find all matching results from a data graph, which are isomorphic to the given query graph. e other is to extract the first K subgraphs according to the sum of weights on the data graph. e personalized Top-K interesting subgraph query adds the query constraints by the given weights in the query graph.
RAM [33] constructed the SPath index structure and proposed a filter-validation strategy, which can obtain all the matching subgraphs. And then it extracted the K matching results with the greatest interestingness. Since the process of obtaining all the matching results was relatively time-consuming, it was inefficient to query on large-scale data graphs. Different from RAM, RWM [13] proposed the idea of ranking with matching. e definition of Top-K interesting subgraph query was first standardized in RWM. It proposed two index structures, graph topology index and maximum metapath weight index. An optimized filtering strategy was also proposed to filter the obvious unsatisfied results. RWM did not require sorting and screening, which effectively improved the query efficiency.
A Top-K search framework STAR was proposed in [34], which consisted of two parts: a fast Top-K algorithm for star queries and an assembling algorithm for general graph queries. It was effective for processing query graphs with no weights. In [35], a new graph similarity measure was introduced based upon subtree patterns for the Top-K graph similarity search. It created hierarchical representative matrices for fast search. DSQL [36] was a level-based algorithm with an approximation guarantee, which was proposed to find K subgraphs isomorphic to a given query graph with maximum coverage. However, it was only applied to unweighted graphs. TKG [37] was proposed to find Top-K frequent subgraphs. It started searching for patterns using an internal MinSup threshold set to 0 and gradually raised the threshold as patterns were found. It had a certain limitation that it only searched the frequent subgraphs.
PBSM [38] introduced graph compression techniques into the Top-K interesting subgraph query to reduce the enumeration. In the filtering stage, it reduced the memory cost by only considering the direct neighbors. In the verification stage, it took the vertex with the minimum number of candidate vertices in the query graph as the start vertex of the matching order and used the idea of ranking while matching (RWM) to terminate the execution of the algorithm as early as possible by estimating the upper bound of the weights. is can reduce redundant verification and improve the overall performance. In [39], it proposed two types of interestingness measurements IE-Match and C-Match based on weighted query semantics. Moreover, it devised a space-efficient indexing scheme DVI to improve the indexing and candidate generation process.
CSEA pattern [40] provided to the user a set of attributes that had exceptional values throughout a subset of vertices. e strength of the proposed pattern language lied in its independence to a notion of support to assess the interestingness of a pattern. It was only applied to the attributed subgraphs without weights. W-Gaston [41] was proposed to modify the support in the Gaston algorithm. e proposed interesting approach integrated the frequency, page duration, and entropy parameters to mine frequent interesting subgraphs.

Data Graph Compression and STFIndex Creation
is section discusses the techniques to preprocess the data graph offline. We design an Isotype Structure Graph 4 Complexity Compression (ISGC) strategy to compress the connected nodes with the same labels and edges. en, based on the generated compression graph, we create STFIndex to capture the candidates of query graphs. In order to extend the method to the large graphs, STFIndex is divided into several partitions for parallel processing based on the ELSV partition principle.

Isotype Structure Graph Compression.
Structure profile is an important data compression model for XML documents at the earliest, which aims to simplify the description of XML document tree structures. e XML structure profile is also a tree structure, and it has no nodes with the same label and path. To narrow the size of the graph and speed up a query, this paper introduces the idea of XML structure profile into data graph compression and proposes an Isotype Structure Graph Compression (ISGC) method. Nodes with the same label and connected to each other are called equivalent nodes, and the edge between them is called isotype edge. Other nodes in the graph are called nonequivalent nodes, and other edges are known as heterotype edges. ISGC aims to merge equivalent nodes of a data graph by using equivalent relations. In this paper, the basis for judging whether nodes are equivalent or not is based on the structure of the graph and the labels of nodes, that is, whether nodes have the same labels and been connected in the graph. In the process of merging equivalent nodes, the merged node is called the folding node and the internal connections in a folding node are internal edges. With the merging of nodes, edges connected between some nodes to internal nodes within the folding node are merged. e merged edge is called critical compression edge, and its weight is the maximum weight of the internal edge. To distinguish the folding node from the nonfolding node, we set the first bit of the folding node id at 0. We call the compressed graph at supergraph, and there are no isotype edges inside. Figure 2 shows the example of graph compression. ereinto, u 1 and u 2 are compressions into folding node u 01 , and the critical compression edge is (u 01 , u 6 , 0.3). Figure 3 shows the supergraph G C of the data graph G in Figure 1, in which the folding nodes are marked red.

STFIndex Creation.
In order to improve the efficiency of invalid nodes and edges filtering, this paper presents a multilayer index structure, namely, Supergraph Topology Feature Index (STFIndex). STFIndex consists of the Node Compression Topology Index (NCTIndex) and the Edge Compression Feature Index (ECFIndex). e NCTIndex can retrieve the information of nodes to filter invalid nodes whose topological structures do not meet the constraints.
e ECFIndex index can be used to filter false candidates edges based on the types of edges and weights.

NCTIndex.
It is found that topological properties such as types of labels and information of adjacent nodes with different labels are symbolic and distinguishable in labeled graphs. erefore, NCTIndex is designed based on them in this paper, which is composed of three layers. e bottom layer index is created through the topological relationship of nodes in the supergraph, which is a multituple of <n_id, num lable1 , num lable2 , . . ., num lablen >.
ereinto, n_id is the id of each node that is the subscript of each node u, and num lable1 , num lable2 , . . ., num lablen represent the number of adjacent nodes with each label, respectively. e middle layer index is created according to the topological relationship of nodes contained in the folding nodes of the supergraph. It is composed of a 4-tuple <f_id, count, max-w, child []>; among them, f_id is the id of the folding node that is the subscripts of the folding node u, count represents the number of nodes compressed by the folding node, max-w is the maximum weight of the internal edge, and the child [] is the pointers to the bottom layer. e top layer index is created according to the label types and the number of nodes, which consists of <label type of node, the number of nodes with the current label, and pointer to the middle layer or bottom layer>. e pointer to the bottom layer will point to the unmerged nodes.
For example, Figure 4 shows the NCTIndex of supergraph G C in Figure 3. In the top layer, we can see that G C has three kinds of labels A, J, and F, which contain 3, 4, and 4 nodes, respectively. In the middle layer, for label A, it contains two folding nodes, whose ids are 03 and 04. e folding node 03 is merged by four nodes, and the max weight is 0.9. e bottom layer stores the original nodes and the number of their adjacent nodes for different labels.

ECFIndex.
Analyzing the characteristics of the labeled graph, it is found that not only do the nodes contain a lot of information, but also the information of edges Complexity such as edge type (represented by the combination of the node labels) and weight is useful to filter the candidates. us, in order to further filter the invalid structure, this paper proposes the ECFIndex. It will be divided into Isotype Edge Compression Feature Index (I-ECFIndex) and Heterotype Edge Compression Feature Index (H-ECFIndex) according to whether the label types of endpoints are the same.
I-ECFIndex contains a two-level index. e bottom layer index represents the information of edges for each kind of types, and it is composed of <id 1 , id 2 , w>, where id 1 and id 2 are ids of endpoints. To improve the filtering speed, the bottom layer index is stored in descending order of weights. e top layer index is created according to the label types of edges.
H-ECFIndex consists of a three-level index. e bottom layer index is used to establish the edge information contained in each label type, which is represented by a 3-tuple <id 1 , id 2 , w>. And it is also stored in descending order of weights. According to the features of compressed edges in folding nodes, the middle layer index contains the endpoints of the edge, the max weight of the folding node, and the pointer to the bottom layer. It is represented by <id 1 , id 2 , max-w, child []>. e top layer index is created in accordance with label types of edges. As shown, Figures 5(a) and 5(b) are I-ECFIndex and H-ECFIndex of supergraph G C in Figure 3.
Since all the information of the data graph can be read in the bottom indexes of STFIndex (consists of NCTIndex and ECFIndex), which is built created offline. To avoid redundant storage, the original data graph can be dumped or deleted as required, which reduces the memory overhead effectively.
e space complexity of STFIndex is O(|E|), which E is the set of edges. STFIndex incurs a higher space cost due to the replication of the information of nodes, but it accelerates the filtering process. e time complexity of STFIndex creation is O(|E|).

STFIndex Partition.
In order to fully improve the computing efficiency of subgraph query in a large labeled graph, we divide the STFIndex into several partitions, which partitions the original data graph indirectly, to realize the parallel query in each partition.
Firstly, we introduce the concept Edge Label Step Value (ELSV), and the standardized definition is defined as follows.

Definition 1. Edge Label
Step Value (ELSV): given a group label of graph G, L � {L 1 , L 2 , . . ., L n } (n is a positive integer), custom setting the order of labels in L, such as L 1 > L 2 . . . > L n thereinto, ">" represents the sequential order of labels, and such L 1 > L 2 indicates that L 1 comes before L 2 . en, ELSV of either edge label L i L j (i and j are positive integers) is as follows: (1) In the phase of subgraph matching, the number of edge label types contained in the data graph directly determines iterations of matching layers. In order to make matching layers distribute to each partition evenly, a partition method based on the ELSV principle is proposed. In real complex networks such as military management networks, social networks, and so on, the edge size is usually several times the node size, or even an order of magnitude higher than that. erefore, considering the memory overhead of replication, we divide the index based on nodes, which needs to replicate partial node information of the NCTIndex rather than a huge number of edge information.
e main ideas of the ELSV partition method are to calculate the sorted ELSVs in the data graph and set a partition for each label. e NCTIndex is divided according to the label types of nodes logically. For example, if there are four labels a, b, c, and d in the data graph, then we will divide NCTIndex into four partitions, respectively, NCTIndex_a, NCTIndex_b, NCTIndex_c, and NCTIndex_d. For

Complexity
ECFIndex, I-ECFIndex is divided into the partition that the endpoints belong. H-ECFIndex is divided vertically into partitions according to the parity of the ELSVs. If the ELSV is odd, it is stored on the partition of the endpoint with a smaller label value and if it is even, it is stored on the partition of the endpoint with a larger label value, and vice versa. Such as if the order of labels is a > b > c > d, then the ELSV of edge label ab is 1, which is odd. us, we store the ECFIndex_ab in partition b. In order to guarantee the correctness of the filter, each partition should replicate the information of "special" nodes from NCTIndex without needing to replicate the whole NCTIndex. e "special" nodes have the same labels as the current partition in the ECFIndex. Lemma 1 states the quantity relationship of the number of index items between NCTIndex and each partition. We exploit this relationship for verifying the effectiveness of the proposed partition method. Proof. According to the ELSV partition rule, assume there are N labels in the data graph that Num (NCTIndex) is N, and label X ranks K in the label order. e NCTIndex_X that is part of NCTIndex stored in partition X contains three categories: (1) the index items of NCTIndex with label X; (2) the index items of NCTIndex with labels sorting before K, in which the ELSV is odd and there are index items of adjacent nodes with label X; (3) the index items of NCTIndex with labels sorting after K, in which the ELSV is even and there are index items of adjacent nodes with label X. So, in case (1), there is only one kind of node label X. In case (2), the number of labels is K/2, and in case (3), that is (N − K)/2, then Num (NCTIndex_X) � 1 + K/2 + (N − K)/2 � N/2 + 1. In conclusion, Num (NCTIndex_X) � Num (NCTIndex) /2 + 1.
When replicating the NCTIndex, we will filter and only replicate index items of the NCTIndex having the same label with the current partition, rather than the whole NCTIndex. erefore, the size of the index items corresponding to each label in partition X is less than or equal to that in the original data graph. Since Num (NCTIndex_X) � Num (NCTIndex)/ 2 + 1, the replication quantity of per partition is much less than the whole NCTIndex. erefore, the ELSV partition method avoids duplicate storage of ECFIndex and greatly reduces the overhead of NCTIndex index redundant storage.
In addition, the label frequency of nodes is normally power-law distributed and the edges follow a similar pattern. So, the size of one label partition may be significantly larger than the other. To manage the imbalances between partitions, we adopt the combination of the logical and physical partition. We use the proposed strategy based on ELSV to partition indexes logically, and then, the partitions are sorted by the number of nodes and edges. According to the number of machines in the cluster, we assign the high-ranked (the partition with more nodes and edges) and low-ranked partitions into the same physical partition, to solve the problem of imbalance as far as possible.
Taking Figure 3 as an example, there are three types of nodes A, J, and F and six types of edges AA, AJ, AF, JJ, JF, and FF. us, we set three partitions A, J, and F to store the nodes of three types in NCTIndex, respectively, NCTIn-dex_A, NCTIndex_J, and NCTIndex_F. We assume label order is A > J > F, and isotype edges AA, JJ, and FF are stored in partition A, J, and F, respectively. And we calculate the ELSVs of heterotype edges, which are ELSV(AJ) � 1, ELSV(AF) � 2, and ELSV(JF) � 1. ELSVs of heterotype edges with AJ and JF are odd, so the index items of AJ in ECFIndex    Table 1, S(X Y ) represents the information of nodes with label Y in NCTIndex_X, and S(X) is the NCTIndex of label X. e example (NCTIndex_J and ECFIndex_J) of partition J is as shown in Figure 6.

Query Graph Preprocessing.
Before the interesting subgraph query, we preprocess the query graph includes query graph compression and partition, which is executed by the same methods as the data graph. For example, the process of compression and partition of query graph Q is shown in Figure 7. First of all, nodes v 1 , v 2 , and v 3 have the same label A, we compress them into folding node v 01 , and it means that the query graph Q is converted into the supergraph Q c . en, we construct STFIndex for Q c with the same index items of the data graph to filter candidates by the index. Finally, in accordance with the partition principle based on ELSV, Q c is divided into different partitions to realize the subsequent query.

Multidimensional
Filter. e scale and accuracy of candidates will directly influence the efficiency of matching. Creating the effective index and filter strategy can exclude inconsistency with the query graph, which can reduce the unnecessary verification to improve the efficiency of subgraph matching.
In this paper, a multidimensional filter strategy (MDF) is proposed based on NCTIndex and ECFIndex to pruning nodes and edges. To acquire the candidate of nodes, NCTIndex is used in this paper to effectively prune the nodes that do not meet the requirements. We take into account such factors as the label type of nodes, the size of folding nodes, and the number of adjacent nodes.
Based on the acquired candidate node set (CNS), we utilize ECFIndex to excavate the information of nodes and edges. We consider the factors such as weights and the effectiveness of endpoints to prune the query graph further. Given the above filters, we can obtain candidate edge set (CES), which is the intermediate set to subsequent subgraph matching. We know most constraints of the query are limited by weights, and we subdivide edges of the query graph into common edges and special edges (weights is not zero). MDF strategy applies five filters as follows, in which (1) and (2) are used to filter the nodes and edges are filtered by (3)-(5): (1) Folding node location filter (FNLF), for a folding node, the number of compressed nodes and a maximum weight of internal edges must be no less than that of the query graph. erefore, NCTIndex is used to prune folding nodes that do not meet the constraints and filter their bottom layer data in the index. At the same time, the nodes without compressing in NCTIndex are all filtered. is is because they are single nodes. For example, the query graph is Q C _J of partition J in Figure 7, the index of data graph is shown in Figure 6, we find only a folding node v 01 with label A in Q C _J, and the middle layer of NCTIndex with label A is retrieved to acquire u 03 and u 04 . Since the number of compressed nodes in u 04 is 2, which is less than the folding node v 01 in the query graph, u 04 and its bottom index items are filtered out, and u 11 is filtered as a single node. e number of compressed nodes and a maximum weight of u 03 are both more than v 01 , so the u 03 is the candidate node. (2) Adjacent node label frequency filter (ANLFF), the candidate nodes should satisfy that the number of their adjacent nodes with different labels are not less than that of the query graph. us, based on the FNLF strategy, we utilize NCTIndex to prune the nodes, which do not satisfy the occurrence frequency of the adjacent nodes with each label. After that, we can acquire a smaller CNS. For example, we can acquire the candidate sets of v 1 , v 2 , v 3 , and v 4 in query graph are {u 6 , u 7 , u 14 , u 15 }, {u 7 , u 14 }, {u 6 , u 7 , u 14 , u 15 }, and {u 12 , u 13 , u 8 , u 20 , u 10 }.
(3) Critical compression edge filter (CCEF), for a critical compression edge, which has been introduced in 3.1 that multiple edges connected from a node to a folding node are merged, its maximum weight should not be less than the weight of the query graph. us, the critical compression edges whose weights do not meet the constraints are pruned by ECFIndex, and the bottom layer data have been filtered indirectly. Since the bottom of ECFIndex is sorted in descending order according to weights, we do not need to traverse over the items that are less than the constraint. us, filter in batches is realized. In our example, since (v 1 , v 4 , 0.5) and (v 2 , v 4 , 0) are compressed into critical compression edge (v 01 , v 4 , 0.5), and the type of edge is AJ, we filter out edges (u 04 , u 8 , 0.3) and (u 03 , u 20 , 0.4) and their corresponding bottom index items by the middle layer of ECFIndex, whose weights dissatisfy the constraint. (4) Bottom special edge filter (BSEF), for a special edge, we retrieve the bottom index of the ECFIndex and Table 1: Temperature and wildlife count in the three areas covered by the study.
Complexity select the edges that have the same label with query graph, and weights are not less than the current special edge. en, we prune the false candidate edges, whose endpoints are not in CNS.
us, it can filter the CES further. Likewise, (v 1 , v 4 , 0.5) is a special edge with the edge label AJ. We retrieve the bottom of ECFIndex_J to get its edge candidate set {(u 14 , u 13

UBV (Size-c) Matching.
In this section, matching is performed on the candidate edge set CES. We select the minimum candidate edge set as the initial matching edge for breadth-first search (BFS) to reduce iterations and calculation overhead effectively. Before introducing the method of matching, the concepts of Size-c matching and its upper bound value (UBV) are introduced as follows.
Definition 2. Size-c matching: it represents a partial growth matching when the c edges of the query graph are instantiated in the process of subgraph matching, where c ∈ (1, n) and n is the number of edges in the query graph. And the interestingness score is the sum of weights that have been instantiated.  Figure 7: e compression and partition of query graph Q. Q is the query graph with edge weights, Q C is the compressed query graph of Q, and Q C is divided into three partitions Q C _A, Q C _J, and Q C _F.  equal to the sum of weights that both have been instantiated and uninstantiated with the maximum candidate edge weight in the process of the Size-c matching. e specific steps of matching are shown in Algorithm 1. e algorithm maintains two heaps, Top-K heap and candidate matching (CM) heap.
e Top-K heap stores K matching results in descending order. e CM heap stores the matching results in ascending order. Some matches have been inserted into the Top-K heap but may be overwritten by subsequent matching. Other matches have not been inserted into the Top-K heap but have been verified. e array CP is used to store the matching results temporarily, and O [|E Q |] is the order of matching (line 2). e starting edge and matching order are determined in lines 3-4. Query graph Q_J is instantiated to get the Top-K heap and CM heap in lines 5-15. If the interestingness score of Size-c () is equal to or less than the bottom of the Top-K heap, then the result will be stored in the CM heap (lines 7-9). Else, the bottom of the Top-K heap will be replicated to the CM heap and deleted from the Top-K heap, and at the same time, the result will be stored in the Top-K heap (lines [11][12][13][14][15]. e time complexity of the matching algorithm is O (n), in which n is the number of edges in CES.
Take the candidate edge set CES_J in Figure 8 as an example, and K is assumed 2. Firstly, we traverse CES_J to determine the initial matching edge (v 1 , v 4 , 0.5). en, BFS is performed from this edge to get the matching order (v 1 , v 4 , 0.5) ⟶ (v 2 , v 4 , 0). According to the matching order, (u 14 , u 13 , 0.8) is instantiated (v 1 , v 4 , 0.5) to obtain the Size-1 candidate matching (u 14 , v 2 , u 13 ): 0.8. By that analogy, Size-c part growth matching (c � 2. . . n and n is the number of edges in query graph Q_J) is performed to obtain K matching results (u 14

Query
Join. An optimizational join (OJ) algorithm is proposed to join the matching results of each partition and obtain the complete K matching subgraphs of query graph Q. It proceeds as few communications times as possible to reduce the communication cost in the join stage. Before describing the optimizational join algorithm, Lemma 2 is introduced first.

Lemma 2. If the top K matching results have been joined and the minimal interestingness score is μ, the interestingness score of the lowest possible matching result in the Top-K heap of partition i (Top-K_i) is μ i . Eventually, the top K matching subgraphs can be acquired by joining the matching results
in each partition, whose interestingness scores are greater than or equal to μ i � μ − n j�1,j≠i INT(r j,1 ). Among them, r j,1 represents the first result in Top-K heap of partition j, INT (r j, 1 ) is the interestingness score of matching result r j,1 (since the Top-K heap is in descending order, the interestingness score of the first result is largest.), and n is the number of partitions.
Proof. Assume that the minimum interestingness score of the final Top-K matching subgraphs is μ k , thus μ k ≥ μ, and then μ k ≥ [μ − n j�1,j≠i INT(r j,1 )] + n j�1,j≠i INT(r j,1 ). Make μ i � μ − n j�1,j≠i INT(r j,1 ), and thus, μ k ≥ μ i + n j�1,j≠i INT(r j,1 ). us, join the matching results in the Top-K_i heap of partition i whose interestingness scores are less than μ i with the matching results of other partitions. e interestingness score of generated matching result will be less than μ k ; thus, it is impossible to be the final matching result.
We know from Lemma 2 that if the minimum interestingness score of matching results in partition i is greater than the lower bound value μ i � μ − n j�1,j≠i INT(r j,1 )of possible matching in this partition, then it illustrates that the current partition may acquire better results by joining with other partitions. erefore, the matching results that are not less than μ i in partition i will be joined with other partitions to update complete Top-K results.
In the OJ algorithm, the query subgraph contains most nodes in the initial matching, because it is easier to instantiate. e query order is obtained by BFS of the initial matching. e specific process of the OJ algorithm is shown in Algorithm 2. Take Figure 9 as an example. First, traverse CES and determine the join order as Partition A ⟶ Partition J ⟶ Partition F. en, the initial matching results of the query graph (v 1 , v 2 , v 3 , v 4 , v 5 , v 6 ) are (u 14 , u 7 , u 6 , u 8 , u 1 , u 9 ): 3.4 and (u 14 , u 7 , u 6 , u 8 , u 1 , u 2 ):3.4, which get from joining the Top-K heaps of each partition, as shown in Figure 10.
en, calculate the lower bound values μ i of each partition, respectively, μ A � 3.4-1.2-1.0 � 1.2, μ J � 1.1, and μ F � 0.9. Since the lower bound values are all not greater than the minimum interesting of the Top-K heap about the corresponding partition, the first K matching results are returned and made join again. Since there is no matching that the interestingness score is greater than 3.4 and each partition no longer generates a new result, the final matching results are (u 14 , u 7 , u 6 , u 8 , u 1 , u 9 ):3.4 and (u 14 , u 7 , u 6 , u 8 , u 1 , u 2 ):3.4.

Experimental Settings.
All experiments are performed on five Intel Pentium(R) CPU G3220@3.00GHZ machines with 8 GB of RAM and 1T hard disk drive. All experiments are implemented in JAVA. We repeat each experiment 5 times and report the average results.

Datasets.
We use five different real datasets and multiple synthetic datasets to evaluate our approach. Real datasets include WordNet, DBLP, YouTube, DBpedia, and LiveJournal. WordNet, DBLP, and YouTube were used in [15,36], DBpedia was used in [36], and LiveJournal was used in [27]. We obtained the DBLP, YouTube, and LiveJournal datasets from Stanford Large Network Dataset Collection (http://snap.stanford.edu/).
WordNet represents relationships between English words, and labels of nodes represent different parts of speech. DBLP provides a comprehensive list of research papers in computer science. We chose the information of authors, fields, and journals to build the graph. DBpedia is crawled from Wikipedia. We cluster DBpedia into a data graph including the person dataset and their links, as in [36]. YouTube is extracted from a video-sharing website called YouTube. We chose the video IDs, uploaders, and categories to build the graph. LiveJournal is a free online blogging community where users declare friendship with each other, in which users have journals, individual blogs, and shared blogs. e details of the datasets above are given in Table 2. For DBpedia and LiveJournal, there are no given labels; as in [36], we have assigned a label for each vertex from a synthetic label set of sizes 50 and 100, respectively. We generate four synthetic datasets using the R-MAT graph generator [42] with G 1 , G 2 , G 3 , and G 4 , and 10 4 , 10 5 , 10 6 , and 10 7 are the number of vertices, respectively. e number of edges for each synthetic graph is 10|V|. Labels of  (13) for each cm in CM_i do (14) If INT(cm) < μ i (15) delete cm in CM_i; (16) Top-K_i, CM_i ← continue UBV (Size-c()) in i partition and INT > μ i ; (17) if Top-K_i NOT NULL do (18) CES ← Top-K_i; (19) Top   12 Complexity nodes are randomly assigned from A ∼ E, and edge weights are randomly generated from [0, 1], which is referenced in [13].

Query Sets.
We ensure that each query has at least one matching in the data graph. We traverse the data graph by depth-first search (DFS) to generate the connected query graphs of size from 2 to 50 nodes similar to existing works [15,24,38,39]. en, we assign labels and add query constraints randomly. Unless specified explicitly, we are interested in computing the top 10 interesting subgraphs (K � 10).

Preprocessing Time and Memory
Cost. Preprocessing is crucial to the subgraph query as it can significantly reduce the memory consumption, as well as the total runtime. Figures 11(a) and 11(b) show the comparison of preprocessing time on different sizes of real datasets and synthetic datasets. e necessary preprocessing tasks for CECI include finding the root query node, generating the query tree, determining the matching order, breaking automorphism, and finding the cluster pivots. Since preprocessing is mainly aimed at the query graph, we choose a compromised query graph size of 25 nodes. CECI requires less processing time than performing the data graph due to the smaller scale. e preprocessing of PBSM includes data graph compression, in which two vertices are equivalent. It traverses the data graph to find the equivalent vertices and merges them to reduce the size of the data graph. Since it does not involve a large amount of computation, it has the shortest preprocessing time compared with other methods. e method in [39] includes the creation of the DVI index, which includes destination vertex count (DVC) and sorted edge lists. To obtain all DVC, it uses breadth first search (BFS) from one vertex to another at distance D. In supplement to DVC, it keeps sorted edge lists. e preprocessing time of DVI is affected by D.
e preprocessing time of the method proposed in this paper includes data graph compression, STFIndex creation and STFIndex partition. us, compared with PBSM and DVI, it needs more preprocessing time. Since the preprocessing operations in the last three methods are independent of the query graph, only the data graph is preprocessed. us, the operations can be completed offline without affecting the efficiency of the online query. Figure 12 shows the comparison of memory cost among STIndex, PBSM, DVI, and CECI on different datasets. In this paper, STIndex will replace the storage of the original data graph, and the memory cost only STIndex without extra storage. e memory cost of PBSM includes the information of the data graph and the compressed graph. e memory cost of it is affected by the compression ratio. DVI includes the data graph and the information of the DVI index, which are distention vertex count and sorted edge lists. According to the increase of D, more information of neighbors needs to be stored, and the memory cost of DVI will increase exponentially with the increase of D. CECI needs to store data graphs, the information of preprocessed query graph as well as the candidates of the query graph. We choose the query graph with 5 nodes in our experiments. It still needs more memory space.

Parallel Processing Efficiency.
In this section, we evaluate the efficiency of the STFIndex partition. We compare the parallel processing and single machine solution with the increasing number of query graph size on two real datasets DBLP and LiveJournal, and a synthetic dataset G 3 .
As shown in Figure 13, the abscissa is the number of nodes in the query graph, and the ordinate is the acceleration ratio between the query efficiency after partition and the single-machine query. We can see the larger the data graph and the more labels in the data graph, the more obvious the speedup is. e main reason for the speedup is that the singlemachine processing solution needs to be verified one by one in the complete index according to the matching principle to obtain candidate nodes and edge sets. However, when the index is partitioned, we can search the candidate nodes and edges from different partitions in parallel. Since the partition method in this paper is partitioned according to the node type, the relevant edge information can be copied to the partition at the same time. is can reduce the communication overhead between different partitions effectively, and the parallel efficiency is relatively high. e trend slightly of the speedup ratio flats out sometimes. is is because, with the increasing of the query graph, parallel processing needs more time to join the subresults to obtain the final subgraphs.

Running Time of Personalized Top-K Interesting
Subgraph Query. We have run the experiment to find the top ten (K � 10) subgraphs of the query graph without weight (Q) as in Figure 1(b), and the query graph with weight (Q′) as in Figure 7 on five different real datasets and large synthetic dataset G 4 . Figure 14 shows the speedup ratio of running time for the proposed method and IEM, PBSM, and CECI.
As shown in Figure 14, STF_ISQtop-k outperforms IEM by an average of 1.75 and 5.02 on Q and Q′, respectively. e main reason for the speedup is that IEM utilizes the DVI index to filter candidates one by one, which does not consider the potential value of isotype nodes. e speedup on Q′ is larger than on Q. is is because IEM needs to search all isomorphism subgraphs in the data graph and rank the results to choose the top ten, which takes more time. However, STF_ISQtop-k has pruned edges whose weights do not meet the constraints in the filtering stage, which avoids repeated calculation. It is more suitable for weighted graph query. Both PBSM and STF_ISQtop-k take into account the problem of isotype nodes and introduce graph compression, which can reduce the size of the data graph and filter the invalid candidates effectively. However, STF_ISQtop-k implements parallel calculating based on the partition STFIndex in the phase of filter, rather than traverse the entire supergraph as in PBSM.
In addition, since PBSM needs to filter subgraphs whose existing weights do not meet the query constraints from all query results, the filtering is repeated. us, STF_ISQtop-k is better than PBSM for both Q and Q′, especially on the weighted query graph Q′. e running time of CECI includes the time for preprocessing, CECI creation, and the enumeration of the subgraphs. Although the preprocessing of CECI is performed online, the parallel technique speeds up its query.

Impact of Query Graph Size and K.
We evaluate the impact of query graph size on the running time of interesting subgraph query. We restrict the comparison with two real datasets WordNet and YouTube and assign weights to the query graphs randomly. e running time of each method increases with the growth of the query graph size.
As shown in Figure 15, STF_ISQtop-k has the most obvious advantages over IEM, and this is especially true in the large graph. is is because IEM needs to verify the result set again and filter the subgraphs that the edge weights  satisfy the query conditions. Meanwhile, as the size of the query graph increases, the number of iterations increased, which leads to the greater query time increases significantly. Compared with IEM, PBSM has better query performance utilizing compression technology. It also lacks the optimal query strategy for the weighted query graph, which only filters the results again to obtain the satisfying final subgraphs. CECI adopts parallel processing technology to effectively improve the query time, but it requires creating a CECI index for each query graph. However, the method in this paper only creates the index of the data graph offline, which can be used many times after the construction is completed. CECI does not consider homotype node-compression, which requires filtering for each query node to extract candidate nodes. e speedup of our STF_ISQtop-k comes from the reduction of redundancy in filtering, which prunes the unsatisfied subgraphs. e larger the size of the query graph, the more nodes and edges are compressed out, and the advantages of STF_ISQtop-k are more obvious. Figure 16 shows the impact of changed K for our STF_ISQtop-k, which is verified on real dataset DBLP. It can be seen from the figure that the query time increases with the increase of query graph size, but the increase is relatively gradual. And when the query graph Q is fixed and K changes, the fluctuation of the query time is small. So, it is verified that the STF_ISQtop-k has good scalability.

Conclusions
In this paper, we propose a novel intelligent solution for the Top-K interesting subgraph query. Specifically, the proposed ISGC strategy can compress the data graph to a smaller size and filter multiple invalid nodes and edges in batch. e proposed auxiliary data structure STFIndex represents all topology information in the compressed graph, which can replace the storage of the original data graph. Meanwhile, STFIndex will be created offline to improve the efficiency of an online query. We also propose a partition method based on ELSV to divide the STFIndex into several partitions, which realizes the parallel query in each partition. With the help of STFIndex, we provide a personalized Top-K interesting subgraph query approach in large labeled graphs named STF_ISQtop-k. It contains MDF strategy, UBV (Size-c) matching, and the OJ method, which are used to filter out as many false candidates as possible to achieve fast joins of subquery results. e experimental results show that the proposed approach can perform Top-K interesting subgraph query in large labeled graphs efficiently, and the query results have practical significance in practical applications. In the future, we will apply our solution to solve the problems in other areas and improve our Top-K interesting subgraph query approach.
Data Availability e data used to support the findings of this study are available from visiting http://snap.stanford.edu/.

Conflicts of Interest
e authors declare that they have no conflicts of interest. 16 Complexity