Dominance-Partitioned Subgraph Matching on Large RDF Graph



Introduction
The problem of subgraph matching is a fundamental issue in graph search and is NP-complete [1]. Specifically, given a query graph q and a large data graph G, subgraph matching extracts all subgraphs of G that are isomorphic to q. Two aspects make the problem hard in practice. First, the complex structure of a query graph degrades query accuracy and performance on large data graphs as the scale of real-world data grows explosively. Second, social network data tend to be organized in semantically rich structures. In this paper, we study the subgraph matching problem on large, semantically rich RDF graphs.
Despite the complexity of knowledge structures and the NP-completeness of subgraph matching, recent studies have made significant advances in improving the performance of subgraph matching on large knowledge graphs in distributed environments.
One line of work encapsulates RDF data into triple-based relational tables [2,3], which preserves the completeness of triple-based indivisible knowledge. Since relational methods ignore the inherent graph structure of RDF data, they incur an expensive cost from excessive join operations over the relational tables. Another line of work manages RDF data in native graph formats, typically employing adjacency lists to index RDF data [4][5][6]. Since minimum edge-cut strategies on large graphs break the structure of indivisible knowledge, the enormous intermediate results make it difficult to balance the load across the partitioned RDF subgraphs.
To ensure the completeness of indivisible knowledge in graph-based formats, most researchers have decomposed the pattern graph into special-shaped subgraphs. StarMR [7,8] decomposed query graphs into a set of star-shaped subgraphs and then employed two optimization techniques to filter invalid input data and reduce the star data. CFLMatch [9] postponed the aggregate operations on a tree-shaped index constructed from the core-forest-leaf query partition model. [...] guide the partition of the large RDF graph. Finally, a subgraph matching algorithm is designed to conduct all isomorphic subgraphs on the partitioned RDF subgraphs. Our contributions are as follows:
(1) We propose a dominant connected pattern graph to extract the dominating relationships of the pattern graph, including the node denotative relationship and the node connotative relationship. The node denotative and connotative relationships reveal the dominant and semidominant nodes in the pattern graph. Then, fish-shaped pattern subgraphs are obtained through dominant node-centered expansion.
(2) We design a dominance-partitioned pattern hypergraph to model the fish-shaped pattern subgraphs. Each hypernode refers to a fish-shaped pattern subgraph, and each hyperedge denotes the common subgraph between fish-shaped pattern subgraphs.
(3) We employ a dominance-driven spectrum clustering strategy to gather the fish-shaped pattern subgraphs into multiple clusters. A dominance-partitioned weighted matrix is first constructed from the dominance-partitioned pattern hypergraph. Then, the spectrum clustering strategy gathers the hypernodes into multiple clusters based on the weighted matrix.
(4) We design a state transition model to describe the transition states of changed candidates, which consists of three states and six transition rules. Based on the state transition model, we analyze the influence of changed candidates on the adjacent region and design our incremental maintenance strategy.
(5) We propose a dominance-partitioned subgraph matching algorithm to conduct all isomorphic subgraphs on a cluster-partitioned RDF graph.
The rest of this paper is organized as follows: Section 2 introduces the preliminaries, including problem definitions and related works. A framework of the dominance-partitioned RDF graph is provided in Section 3, including the dominant-connected pattern graph, the dominance-partitioned pattern hypergraph, and the dominance-driven spectrum clustering strategy. Section 4 presents a dominance-partitioned subgraph matching algorithm. Experimental results are reported in Section 5. A conclusion is given in Section 6.

Preliminaries
In this section, the definitions of RDF graph and subgraph matching are first given. Then, the related research is introduced.

Problem Definitions.
Resource Description Framework (RDF) [10] is a standard semantic model designed by a W3C group (https://www.w3.org/community/kgconstruct/), in which data are represented as a set of triples 〈S, P, O〉. Each triple 〈s, p, o〉 consists of three components: a subject, a predicate, and an object. Further, a triple 〈s, p, o〉 is drawn from I × I × (I ∪ L), where I denotes the set of IRIs (Internationalized Resource Identifiers) and L represents the set of literals.
Definition 1 (RDF graph). An RDF graph is a directed labeled graph, denoted G(V, E, L, φ). Here, V is a set of vertices, E ⊆ V × V is a set of directed edges, L is a set of vertex and edge labels, and φ: V ∪ E ⟶ L is a labeling function that assigns instantiated labels to vertices and edges. The labels of an RDF graph are classified as instance-labels, relation-labels, attribute-labels, and type-labels according to the resources and inter-resource relationships of the RDF data. For an RDF triple 〈s, p, o〉, o is a type-label if and only if both s and o are IRIs and p is a typed predicate, e.g., rdf:type or rdfs:subClassOf. The s and o are instance-labels and p is a relation-label if and only if both s and o are IRIs and p is not a typed predicate; p is an attribute-label if and only if o is a literal.
Consider the RDF graph in Figure 1: each vertex is labeled by an instance-label, a type-label, an attribute-label, or a literal, and each edge is labeled by a relation-label. The set of literal-labeled vertices is {123@163.com}. All edge labels are mapped to relation-labels.
Definition 2 (pattern graph). A pattern graph is a directed labeled graph, denoted Q(V, E, L, ψ), where V is a set of vertices, E ⊆ V × V is a set of directed edges, L is a set of vertex and edge labels, and ψ: V ∪ E ⟶ var is a labeling function that assigns conceptual labels to vertices and edges.
Consider the pattern graph in Figure 2: each vertex is mapped to a type-label or an attribute-label, and each edge is mapped to a relation-label. The difference between an RDF graph and a pattern graph is that the pattern graph does not contain instance-labels. Further, the pattern graph is a conceptual network, and each query graph is a subgraph of the pattern graph.
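The label classification of Definition 1 can be illustrated with a minimal Python sketch. The `Term` type, the `TYPED_PREDICATES` set, and the example IRIs are assumptions made for illustration only; they are not part of the original system.

```python
from typing import NamedTuple, Tuple

# Assumed set of "typed" predicates; the text names rdf:type and rdfs:subClassOf.
TYPED_PREDICATES = {"rdf:type", "rdfs:subClassOf"}


class Term(NamedTuple):
    """A hypothetical RDF term: an IRI unless is_literal is set."""
    value: str
    is_literal: bool = False


def classify_triple(s: Term, p: str, o: Term) -> Tuple[str, str]:
    """Return (kind of p, kind of o) for the triple <s, p, o>."""
    if o.is_literal:
        # p is an attribute-label if and only if o is a literal.
        return ("attribute-label", "literal")
    if p in TYPED_PREDICATES:
        # o is a type-label when s and o are IRIs and p is a typed predicate.
        return ("typed-predicate", "type-label")
    # Otherwise p is a relation-label between two instance-labels.
    return ("relation-label", "instance-label")
```

For example, a triple whose object is the literal 123@163.com yields an attribute-label predicate, while a triple with predicate rdf:type yields a type-label object.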

Subgraph Matching.
The problem of subgraph matching is to find all subgraphs of a data graph G that are isomorphic to a query graph q. Subgraph matching is formally defined as a subgraph isomorphism problem, described in Definition 3. A query graph q is subgraph isomorphic to a data graph G if there exists a subgraph isomorphic mapping (subgraph mapping for short) of q on G. Simply, considering data and [...]
Definition 4 (K-partition problem on RDF graph (RG-KP)). Given an RDF graph G(V, E, L, φ), the K-partition problem on the RDF graph is to divide G into k subgraphs $G = \{G'_1, \ldots, G'_k\}$ such that the overlapped cost $|G_{kp}| = \sum_{1 \le i,j \le k} (G'_i \wedge G'_j)$ is minimum and the subgraph costs satisfy $|G'_1| \approx \cdots \approx |G'_k|$.
In this paper, our study of the RG-KP problem focuses on a dominance-partitioned strategy that divides the pattern graph into topological and tree-shaped structures. Then, dominance-driven spectrum clustering gathers the dominance-partitioned pattern subgraphs into multiple clusters. Finally, a dominance-partitioned subgraph matching algorithm is designed to conduct all isomorphic subgraphs on a cluster-partitioned RDF graph.
In this paper, we focus on directed labeled graphs. Both q and G are directed labeled graphs, and whether edges are directed or undirected does not affect the execution scheduling of subgraph matching. Thus, the dominant connected pattern subgraph and the dominance-partitioned pattern hypergraph are defined as undirected graphs without graph labels. The detailed notations and meanings are described in Table 1.
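The two quantities that Definition 4 trades off can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each subgraph is represented as a set of triples, the overlap is summed over distinct unordered pairs, and the helper names `overlapped_cost` and `size_spread` are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def overlapped_cost(parts: Iterable[Set[Triple]]) -> int:
    """Number of triples shared by each pair of subgraphs, summed over pairs i < j."""
    return sum(len(a & b) for a, b in combinations(list(parts), 2))


def size_spread(parts: Iterable[Set[Triple]]) -> int:
    """Difference between the largest and smallest subgraph (0 means balanced)."""
    sizes = [len(p) for p in parts]
    return max(sizes) - min(sizes)
```

A good k-partition in the sense of Definition 4 keeps both values small: few replicated triples and subgraphs of similar size.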

Related Works.
In this section, we mainly review related work on triple-based relational strategies and graph-based traversal strategies in distributed environments.
[...] triples into multiple attribute tables. RDF-3X [2,12] and Hexastore [3] implemented index-based query schemes by redundantly storing multiple permutations of the triples directly in B+-trees. Peng et al. [13] designed an RDF graph storage scheme to optimize graph division and balance the query load, in which query processing was divided into two stages: scanning and joining. During the scanning phase, the query engine decomposed the SPARQL query into a set of triple patterns. In the joining phase, the scanned intermediate results were first bound into a left-joining tree, and the query results were then produced through the left-joining tree.
The distributed systems H-RDF-3X [14] and SHARD [15] horizontally divided RDF data across multiple computing nodes and used Hadoop as the communication layer for cross-node queries. H-RDF-3X divided the RDF graph into a specified number of partitioned data subgraphs using the minimum edge-cut method METIS [16]. Then, 1-hop or 2-hop replication was employed to extend the boundary of each partitioned data subgraph, which ensured that small-diameter queries could obtain complete answers within a single partitioned data subgraph. The query processing of both systems used a reduce-side strategy: RDF triples are scanned in the map phase, and the intermediate results are combined iteratively into the final results in the reduce phase. However, the iterative map and reduce operations over RDF triples are time-consuming on query graphs with complex topological structures.
To reduce the complexity of topological query graphs, a query decomposition model called TwinTwig [17] was designed as an efficient subgraph enumeration algorithm on distributed undirected graphs. S2RDF [18] converted SPARQL queries into RDD operations on the Spark distributed computing framework. Even though offline indexes were built to speed up online subgraph matching, the expensive cost of index construction must be paid when matching large-scale data graphs. TriAD [4] combined join-ahead pruning via RDF graph summarization with a locality-based horizontal partitioning of RDF triples into a grid-like distributed index structure.
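The idea of n-hop replication described above can be sketched as follows. This is not the H-RDF-3X implementation; it is a minimal Python illustration that assumes triples are `(s, p, o)` tuples, treats edges as undirected when expanding, and uses the hypothetical function name `n_hop_replicate`.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def n_hop_replicate(core: Set[str], triples: Iterable[Triple], hops: int = 1):
    """Extend a partition's vertex set by `hops` hops and collect the touched triples."""
    triples = list(triples)
    seen = set(core)        # vertices already in the extended partition
    frontier = set(core)    # vertices whose incident triples are added next
    kept: Set[Triple] = set()
    for _ in range(hops):
        nxt = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                kept.add((s, p, o))
                for v in (s, o):
                    if v not in seen:
                        nxt.add(v)
        seen |= nxt
        frontier = nxt
    return seen, kept
```

With a larger `hops` value, more queries can be answered inside one partition, at the price of more replicated triples.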

Graph-Based Traversal Strategy.
The graph-based traversal strategies store RDF data in native graph format and focus on the construction of data indexes and on pruning rules for redundant intermediate results.
The indexes constructed over large data graphs are used to shrink the search space of candidate intermediate results.
BitMat [19] proposed a compressed bit-matrix structure to store huge RDF graphs, together with a variable-binding-matching algorithm that produces the final results directly without indexing the intermediate results. TripleBit [20] presented a fast and compact system for storing and accessing RDF data, which designed two auxiliary index structures to minimize the cost of index selection during query evaluation. gStore [21] proposed a signature technique: it stored RDF data in disk-based adjacency lists and transformed an RDF graph into a data signature graph by encoding each entity and class vertex. Then, the VS*-tree was built over the data signature graph with light maintenance overhead. To enhance gStore, a redesigned gStore [22] was given a new query plan generation module that generated different query plans according to the structures of query graphs. Furthermore, it redesigned the vertex encoding strategy to achieve more pruning power and added a new multijoin algorithm to speed up the subgraph matching process.

Table 1: Notations and meanings.
Notation    Meaning
dom(u)      A set of nodes dominated by u
u ≺ u′      u dominates u′
u ≼ u′      u semidominates u′
P(u)        A circular-pattern subgraph of dominating node u
T(u)        A tree-shaped pattern subgraph of dominating node u
PT(u)       A dominance-partitioned pattern subgraph of node u, PT(u) = P(u) ∪ T(u)

Pruning rules are employed to cut redundant intermediate results during subgraph matching, as in Trinity.RDF [23] and WuKong [5]. Trinity.RDF was a distributed memory-based graph engine for web-scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, the engine stored RDF data in its native graph format to support graph-based operations on RDF graphs, e.g., random walks, reachability, and community discovery. However, Trinity.RDF only used one-hop pruning rules to avoid redundant path-based intermediate results, and a master machine was needed to aggregate all positive intermediate results.
The studies [4,5] found that single-machine aggregate operations easily become the bottleneck for a large query graph, because the huge intermediate results can cause memory overflow on the single master machine. Further, the experiment in [5] shows that the aggregate operations consumed more than 90% of the total matching time. Based on this experimental verification, WuKong adopted a full-history pruning strategy to reduce redundant intermediate results in advance. However, its cost model of the aggregate operation was designed to guide the matching order on a relational database. The expensive full-join Cartesian products limit the time efficiency of producing the final results, yet it only uses the cost model based on predicate connections in the relational method to guide the query execution.
Most existing studies are devoted to ensuring the completeness of indivisible knowledge. Xu et al. [24] studied the problem of multiobjective spatial keyword queries with semantics and designed the LIR-tree index to integrate the spatial and semantic information of all objects in a balanced way. Wang et al. [25] created a new social network with more complete knowledge and proposed a k-Dcore framework to retrieve effective communities in directed social networks. Chen et al. [26] proposed a pivot-based hierarchical indexing structure, the S2R-tree, to integrate spatial and semantic information seamlessly; it carefully designed a space mechanism that transforms high-dimensional semantic vectors into a low-dimensional space so that more effective pruning can be achieved. Cheng et al. [27] studied automatic graph repair with repairing rules and designed a decomposition-and-join strategy to address the cost of finding isomorphic subgraphs of graph data under graph-repairing rules. DistR [28] was an efficient distributed strategy for reachability queries over large uncertain graphs: it found all maximal subgraphs of the original graph in the distributed graph reduction step and transformed the problem into a relational join process in the distributed consolidation step. Deep NBCN [29] used a homogeneous, multibranch architecture to model the complex internal relationship between amino acid sequences and protein secondary structure sequences.
In this paper, we decompose the pattern graph into partial subgraphs to reduce the time consumed by aggregate operations. Our motivation comes from previous research on discovering special structures in the pattern graph. The first study was StarMR [7], which decomposed query graphs into a set of attribute stars to filter redundant star-shaped input data. The second line of work [30,31] demonstrated that a topological structure can be discovered by analyzing anchored and followed relationships to reduce discontinuous intermediate results. The third study [32] is our previous work on subgraph matching over static knowledge graphs, which constructed a flow-based subgraph index to reduce redundant RDF data.
Building on this previous research, a dominance-partitioned pattern subgraph is designed to encapsulate the topological and attribute structures of the pattern graph. Then, the large RDF graph can be partitioned with the aid of our dominance-partitioned pattern subgraphs; the framework of the partitioned RDF graph is introduced in detail in Section 3.

Framework of Dominance-Partitioned RDF Graph
In this section, a dominance-partitioned subgraph matching framework (DP-SM) is proposed to conduct the subgraph mappings of query graphs on large data graphs. First, a dominant-connected pattern graph is acquired from a pattern graph. Second, the large data graph is partitioned by pattern-driven spectrum clustering. Finally, the subgraph mappings of query graphs on the partitioned data graphs are conducted iteratively. The pseudocode of DP-SM is described in Algorithm 1. A dominant-connected pattern subgraph (DCPG for short), denoted Q_dc, is acquired from a pattern graph Q. First, a flow graph model is employed to extract the dominated vertices and dominating relationships of the pattern graph. Then, Q_dc is constructed from the dominated vertices and dominating relationships of the pattern graph (Line 1 and Section 3.1). Second, spectrum clustering is used to divide the large data graph based on the hypergraph of Q_dc (Line 2 and Section 3.2). Finally, the subgraph mappings of query graph q on the partitioned data graph G_k are conducted iteratively (Line 3 and Section 4).

Dominant Connected Pattern Graph.
The dominant-connected pattern subgraph Q_dc is a subgraph of the pattern graph Q, satisfying Q_dc ⊆ Q, and extends the theory of the dominator tree [33]. Consider a pattern graph Q(V, E) and vertices u, u′ ∈ V: given an artificially designated root u_r ∈ V, if u is a necessary vertex on every path from u_r to u′, then u is a dominant node of u′, denoted u ≺ u′. Similarly, if an edge e ∈ E is a necessary edge on every path from u_r to u′, then e ≺ u′. Based on the dominating relations in Q, the dominant connected pattern graph is defined in Definition 5.

Complexity
Definition 5 (dominant connected pattern subgraph). Given a pattern graph Q(V, E) and a root node u_r ∈ V, a pattern subgraph Q_dc(V_dc, E_dc) is a dominant-connected pattern subgraph if and only if it satisfies the following conditions: (1) for any node u′ ∈ (V − V_dc), there is a node u ∈ V_dc such that u ≺ u′ on the paths 〈u_r, u′〉; (2) for any edge e′ ∈ (E − E_dc) with end-node u′, e′ ≺ u′ on the paths 〈u_r, u′〉. Here, 〈u_r, u′〉 denotes the paths from the root node u_r to the query vertex u′. The dominating relationships u ≺ u′ and e ≺ u′ on 〈u_r, u′〉 mean that u and e are a necessary vertex and a necessary edge, respectively, on the path from u_r to u′. We denote the collection of dominant nodes in V_dc as V_d, which satisfies V_d ⊆ V_dc ⊆ V, and the set of nodes dominated by u is called the dominant set of u, denoted dom(u).
Consider the pattern graph Q and the artificial root node u_1 in Figure 4(a); the Q_dc of Q is shown in Figure 4(b). The dominating relationship is used to acquire the topological and attribute structures of the pattern graph; thus, the direction of edges in the pattern graph is not considered in the calculation of Q_dc. For the Q_dc in Figure 4(b), if deleting any node of Q_dc makes Q_dc invalid, then Q_dc is minimum. For example, if node u_3 is deleted, the edges (u_6, u_3) and (u_2, u_3) are also deleted from Q_dc, and Q_dc becomes invalid because no edge dominating u_3 can be found in V − Q_dc.
In this paper, we explore the characteristics of Q_dc to analyze the node denotative and connotative relationships in a pattern graph Q. The node denotative relationship is discovered from the dominating relationships of vertices in the pattern graph, as described in Theorem 1.
Theorem 1 (node denotative relationship). Given a DCPG Q_dc(V_dc, E_dc), if there exists a dominant node u ∈ V_dc such that dom(u) is nonempty, then the nodes of dom(u) form a tree rooted at u.
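The dominating relationship u ≺ u′ used in Definition 5 can be computed with the classical iterative dominator algorithm. The sketch below is a minimal Python illustration, not the construction used in this paper: it assumes a directed successor map in which every node is reachable from the root, and the function names and the small example graph are hypothetical.

```python
from typing import Dict, List, Set


def dominators_of(succ: Dict[str, List[str]], root: str) -> Dict[str, Set[str]]:
    """For every node v, return the set of nodes lying on all paths root -> v."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    pred: Dict[str, Set[str]] = {v: set() for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {v: set(nodes) for v in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for v in nodes - {root}:
            ps = [dom[p] for p in pred[v]]
            new = ({v} | set.intersection(*ps)) if ps else {v}
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom


def dominated_by(dom: Dict[str, Set[str]], u: str) -> Set[str]:
    """The dominant set dom(u) in the paper's sense: nodes that u dominates."""
    return {v for v, ds in dom.items() if u in ds and v != u}
```

In a graph where u_3 is reached from u_1 through both u_2 and u_6 and has children u_4 and u_5, neither u_2 nor u_6 dominates u_3, while u_3 dominates u_4 and u_5.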
Proof (for Theorem 1). Consider a dominant set dom(u) in which u is a necessary vertex on the path 〈u_r, u′〉; then dom(u) and u can be combined into a flow graph originating from u. Consider different edges e, e′ ∈ E − E_dc: if they share a common end-node u′, then neither e nor e′ is a dominant edge of u′. This nondominating relationship between vertex and edge contradicts condition (2) in Definition 5. □
The node connotative relationship is discovered from the vertex semidominant relation in a pattern graph. Consider a DCPG Q_dc(V_dc, E_dc) and a dominated vertex u′ ∈ V_dc: if there exists a dominant vertex u on the path 〈u_r, u′〉, satisfying u ≺ u′, such that d(u_r, u′) is the minimum distance in D(u_r, dom(u_r, u′)), then u is a semidominant node of u′, denoted u ≼ u′. Here, dom(u_r, u′) is the set of vertices dominating u′ on the paths 〈u_r, u′〉, and d(u_r, u′) denotes the distance from u_r to u′, which is collected into the set D(u_r, dom(u_r, u′)), satisfying u′ ∈ dom(u_r, u′) and d(u_r, u′) ∈ D(u_r, dom(u_r, u′)). The node connotative relationship is stated in Theorem 2.
Theorem 2 (node connotative relationship). Given a minimal DCPG Q_dc(V_dc, E_dc), if there exists a dominant node u ∈ V_dc semidominated by u′ ∈ V_dc, then the paths from u′ to u combine into a single node, a single-circular graph, or a multicircular graph.
Proof (for Theorem 2). A minimal DCPG is the DCPG generated from the fewest dominant nodes of a pattern graph. Consider a DCPG Q_dc(V_dc, E_dc): if there exists a dominant node u ∈ V_dc semidominated by u′ ∈ V_dc with only a single path from u′ to u, then there must exist a smaller DCPG that does not contain u, so Q_dc is not a minimal DCPG. Therefore, if u′ and u are different dominant nodes, there must exist multiple paths from u′ to u, which combine into a single-circular or multicircular graph; otherwise, they are the same dominant node. Then, dominance-driven spectrum clustering is employed to divide a pattern graph into k subgraphs. The dominance-partitioned pattern hypergraph is defined in Definition 6.
Both the hypernodes and the hyperedges of the DPPG indicate geometries of a pattern graph Q, which are derived on the basis of the node denotative and connotative relationships in Q. The geometry of a node is a fish-shaped subgraph of Q. Considering a dominating set dom(u), we define the denotative and connotative relationships of u as a tree-pattern subgraph T(u) and a circular-pattern subgraph P(u), respectively. The combination of the tree-pattern subgraph and the circular-pattern subgraph of a dominating node u is denoted as PT(u). Thus, PT(u) is a fish-shaped graph with the semidominant node of PT(u) as the fish head and the leaves of PT(u) as the fish tail. The geometry of an edge indicates the circular-pattern common subgraph, which is composed of the multiple paths between any dominant nodes in V_dc.
Consider the pattern graph Q and the dominant-connected pattern subgraph Q_dc in Figure 4; the dominance-partitioned pattern subgraphs are illustrated in Figure 5, where the circles filled with diagonal lines denote the dominant nodes and the circles filled with vertical lines indicate the semidominant nodes. Each dominance-partitioned pattern subgraph is composed of the multiple paths from a semidominant node to the dominant node and the dominated tree-shaped structure of the dominant node. For example, the dominance-partitioned pattern subgraph PT(u_3) is composed of the multiple paths {〈u_1, u_2, u_3〉, 〈u_1, u_6, u_3〉, 〈u_1, u_7, u_6, u_3〉} from u_1 to u_3 and the dominated tree-shaped structure {u_3, u_4, u_5}. Thus, the k-partition problem of a pattern graph can be converted into dividing the hypergraph Q_dc into k subgraphs, as stated in Definition 7.
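The fish-shaped structure PT(u) = P(u) ∪ T(u) from the example above can be assembled with a short Python sketch. It is an illustration under stated assumptions rather than the paper's Algorithm 2: edge directions are assumed to follow the listed paths from u_1 toward u_3, the dominated set is supplied by the caller, and the helper names are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Edge = Tuple[str, str]


def simple_paths(succ: Dict[str, List[str]], src: str, dst: str,
                 path: Optional[List[str]] = None) -> Iterator[List[str]]:
    """Enumerate all simple paths src -> dst by depth-first search."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in succ.get(src, ()):
        if nxt not in path:
            yield from simple_paths(succ, nxt, dst, path)


def fish_shaped(succ: Dict[str, List[str]], head: str, u: str, dominated: Set[str]):
    """Return (P(u), T(u), PT(u)) as edge sets for dominant node u."""
    # P(u): every edge on some path from the semidominant head to u.
    p_edges: Set[Edge] = {(a, b) for p in simple_paths(succ, head, u)
                          for a, b in zip(p, p[1:])}
    # T(u): edges of the tree rooted at u that reach only dominated nodes.
    body = {u} | dominated
    t_edges: Set[Edge] = {(a, b) for a in body for b in succ.get(a, ()) if b in dominated}
    return p_edges, t_edges, p_edges | t_edges
```

Called with head u_1, dominant node u_3, and dominated set {u_4, u_5}, the sketch returns the six edges of the three listed paths as P(u_3) and the two tree edges as T(u_3).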
Definition 7 (dominance-driven k-partition problem). Given a dominance-partitioned pattern hypergraph Q_d(V_d, E_d) with $V_d = \{u_1, \ldots, u_n\}$, the dominance-driven k-partition problem is to divide V_d into k clusters $C = \{C_1, \ldots, C_k\}$ such that the overlapped cost $|C_{kp}| = \sum_{1 \le i,j \le k} (C_i \wedge C_j)$ is minimum and the subgraph costs satisfy $|C_1| \approx \cdots \approx |C_k|$, where $|C_i|$ denotes the cost of the pattern-clustered subgraph PT(C_i) and $\sum_{1 \le i,j \le k} (C_i \wedge C_j)$ indicates the overlapped cost of the pattern-clustered subgraphs PT(C_i), ..., PT(C_j).
In this paper, we abbreviate PT(C_i) as PT_i. Then, the overlapped cost of the pattern-clustered subgraphs is redefined as $|C_{kp}| = \sum_{1 \le i,j \le k} (PT_i \wedge PT_j)$, and the subgraph cost condition is written as $|PT_1| \approx \cdots \approx |PT_k|$. Here, PT_i corresponds to the i-th cluster C_i. In fact, the overlapped cost $|PT_i \wedge PT_j|$ is equivalent to $|P_i \wedge P_j|$, as proven in Lemma 1.
Lemma 1 (node connotative relationship). Given pattern-clustered subgraphs PT_i and PT_j, the overlapped cost $PT_i \wedge PT_j$ is equivalent to $P_i \wedge PT_j$.
Proof (for Lemma 1). By the node denotative relationship in Theorem 1, for dominant nodes u ∈ PT_i and u′ ∈ PT_j, if T(u) ∧ T(u′) ≠ ∅, then u ≺ u′ or u′ ≺ u. For the dominating relationship u′ ≺ u, there must exist a semidominant node u″ ≺ u such that u″ ≺ u′ or u″ = u′. By the node connotative relationship in Theorem 2, u′ must be contained in P_i. Thus, [...]
The subgraph cost of a pattern-partitioned subgraph is evaluated by the number of data triples mapped to its pattern triples. For a pattern-partitioned subgraph PT(u), the subgraph cost is the number of data triples mapped to all pattern triples in PT(u), denoted |PT(u)|. For pattern-partitioned subgraphs PT(u_i) and PT(u_j), the overlapped cost is the cost of the common subgraph of the two pattern subgraphs, denoted |PT(u_i) ∧ PT(u_j)|. The subgraph and overlapped costs are described in the following formulas:

K-Partition on Dominance-Pattern Weighted Graph.
The k-partition cuts the DCPG into k subgraphs that are not connected to each other. We define the sets of k dominant nodes in V_d as C_1, ..., C_k, satisfying [...] For the dominant node sets C_i and C_j of any pattern subgraphs PT_i and PT_j, we define the weight of the cut-graph between C_i and C_j as the following formula: [...] Thus, for the dominant node sets C_1, ..., C_k of the k pattern subgraphs PT_1, ..., PT_k, the weight of the cut-graph on the k pattern graphs is defined as the following formula: [...] where C_i is the complement of C_i′, satisfying [...]

Dominance-Partitioned Algorithm on Large RDF Graph.
In this section, the k-partition algorithms are designed to divide the large data graph into multiple distributed subgraphs. We first give the construction of the DPPG in Algorithm 2, and then the dominance-driven k-partition is designed in Algorithm 3. The first core task of DPPG construction is to select a root node of the pattern graph. Intuitively, we tend to choose the node with the fewest local matching results and the largest degree as the root node. A query vertex with the fewest matching results implies the minimal network transmission cost, and the one with the greatest degree implies the maximal probability of pruning negative node pairs. The evaluation formula of the root node is described in the following equation: [...] where M(u) denotes the entities typed and attributed by u, satisfying M(u) ∈ V. The W3C provides a set of vocabularies (as part of the RDF standard) to encode rich semantic information in RDF graphs. For example, type predicates (rdf:type) group the vertices of RDF graphs into different categories. Unlike general labeled graphs, the vertices of RDF graphs identify entities or text information, and the semantic characteristics of entities mean that vertices of the same type usually have similar predicate combinations, which is convenient for collecting statistics. Thus, the entities of G can be easily mapped to pattern nodes of Q through the typed and attributed predicates on the entities themselves. The construction of the DPPG is shown in Algorithm 2. The input is a pattern graph Q(V, E), and the output is a dominance-partitioned pattern hypergraph Q_d(V_d, E_d). An artificial root node is first selected by formula (5). The root node u_r is encoded with an initial sequential number DFS[u_r] = 0 and added to the dominant set V_d (Line 1). Then, four modules are executed sequentially in the construction of the DPPG. The first module encodes a sequential unique number for each vertex of G in depth-first search order (Lines 2-6).
Each node u of G is encoded iteratively by searching the successors of the visited node u_r (Line 2). If u is unvisited, u is encoded and the successors of u are pushed into the encoder (Lines 3-4). If u is visited, u is taken as a dominant node and added to the dominant set V_d (Lines 5-6). The second module identifies the semidominant nodes of G in descending depth-first search order (Lines 7-12). Each node u′ of G is traced iteratively by searching the precursors of the encoded node u (Line 8). If DFS[u] < DFS[u′], the edge (u, u′) is collected into the circular-pattern subgraph P(u) and the precursors of u′ are pushed into the tracer (Lines 9-10). If DFS[u] > DFS[u′] and sdom[u] > sdom[u′], the semidominant node of u is replaced by u′ and the edge (u, u′) is collected into the circular-pattern subgraph P(u) (Lines 11-12). The third module minimizes the dominant set V_d (Lines 13-16). The fourth module acquires the DPPG and the dominance-partitioned pattern subgraphs (Lines 17-21). The nodes dominated by a node u are added to the tree-pattern subgraph T(u), and the union of P(u) and T(u) is updated to PT(u) (Lines 17-18). For dominance-partitioned pattern subgraphs PT(u) and PT(u′), if PT(u) ∧ PT(u′) ≠ ∅, a hyperedge (u, u′) is constructed in E_d.

Example for DPPG Construction.
An example of DPPG construction is illustrated in Figure 6, where the circles filled with diagonal lines denote the dominant nodes and the circles filled with vertical lines indicate the semidominant nodes. A root node u_1 is first selected by formula (5); it should have the fewest local matching results and the largest degree. Then, the four modules are executed sequentially in the construction of the DPPG. The first module encodes a sequential unique number for each vertex of G in depth-first search order. The sequential unique numbers are encoded as the subscripts of the nodes, and the nodes visited twice are collected into the dominant set V_d, satisfying V_d = {u_3, u_6, u_9, u_15}.
The second module identifies the semidominant nodes of G in descending depth-first search order, and the semidominating relationships are acquired as {u_1 ≼ u_3, u_1 ≼ u_6, u_7 ≼ u_9, u_1 ≼ u_15}. While the semidominating relationships are acquired, the circular-pattern edges are inserted into the circular-pattern subgraphs. For the dominant node u_3, the precursor u_6 of u_3 satisfies DFS[u_3] < DFS[u_6]; thus, the precursors of u_6 are expanded further and (u_6, u_3) is inserted into P(u_3). After the semidominant node of u_3 is found to be u_1, the ascending path 〈u_1, u_3〉 is inserted into P(u_3). The dominance-partitioned pattern subgraphs are illustrated in Figure 6. The third module minimizes the dominant set. [...] then u′ is removed from V_d. The fourth module acquires the DPPG and the dominance-partitioned pattern subgraphs. The dominated tree-shaped structures are combined with the circular-pattern subgraphs to produce compact dominance-partitioned pattern subgraphs.
The dominance-driven k-partition is described in Algorithm 3. The inputs are a dominant-connected pattern graph Q_d(V_d, E_d) and an RDF graph G(V, E, L, φ). The output is a set of k-partitioned data subgraphs $G_k = \{G'_1, \ldots, G'_k\}$. The dominance-driven k-partition algorithm consists of three modules. The first module constructs a similarity matrix of the dominance-partitioned pattern hypergraph (Lines 1-4). The dominance-pattern weighted matrix initializes the similarity of the pattern-partitioned subgraphs through its bidirectional adjacency matrix W of size |V_d| × |V_d| (Line 1). W[i, i] represents the subgraph cost of PT_i (Line 3), and W[i, j] represents the overlapped cost between PT_i and PT_j (Line 4). The second module calculates the degree matrix by accumulating the elements in each row of the similarity matrix (Lines 5-6). Then, the graph Laplacian method is employed to calculate the feature matrix, satisfying [...] Thus, the similarity matrix W can be abstracted as a feature matrix F of size k × |V_d| (Line 7). Further, a k-means clustering method is used to gather the rows of the feature matrix into k clusters (Line 8).
Finally, the third module generates the k-partitioned data subgraphs by mapping the data graph onto the k clusters (Lines 9-11). For the pattern-clustered subgraph PT(C i ) of a cluster C i , if a data triple t ∈ G satisfies t ∈ PT(C i ), then t is a mapped data triple of PT(C i ) (Line 11).
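The first two modules of Algorithm 3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that each pattern subgraph PT i is given as a set of edges, that the cost of a subgraph or of an overlap is simply its number of edges, and that the unnormalized Laplacian D − W is meant. The function names and the three example subgraphs are hypothetical. The eigen-decomposition and k-means steps (Lines 7-8) are omitted.

```python
def similarity_matrix(subgraphs):
    """W[i][i] = |PT_i| and W[i][j] = |PT_i & PT_j| (assumed edge-count cost)."""
    n = len(subgraphs)
    return [[len(subgraphs[i] & subgraphs[j]) for j in range(n)]
            for i in range(n)]


def degree_matrix(W):
    """Diagonal matrix whose i-th entry is the sum of row i of W."""
    n = len(W)
    return [[sum(W[i]) if i == j else 0 for j in range(n)] for i in range(n)]


def laplacian(W):
    """Unnormalized graph Laplacian D - W (an assumption; the paper's
    exact Laplacian variant is not recoverable from the text)."""
    D = degree_matrix(W)
    n = len(W)
    return [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]


# Hypothetical pattern subgraphs, each given as a set of directed edges.
pt = [
    {("u1", "u3"), ("u1", "u6"), ("u6", "u3")},
    {("u1", "u6"), ("u6", "u7")},
    {("u7", "u9")},
]
W = similarity_matrix(pt)
L = laplacian(W)
```

In this toy input the first two subgraphs share one edge, so they are connected in the hypergraph, while the third is isolated; every row of L sums to zero, as expected for an unnormalized Laplacian.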

Example for Dominance-Driven K-Partition.
Each dominance-partitioned pattern subgraph is taken as a hypernode of Q d , and each edge indicates the common part between dominance-partitioned pattern subgraphs. Considering the circular-pattern subgraphs P(u 3 ) and P(u 6 ) in Figure 6, the edge (u 1 , u 6 ) is the common part of P(u 3 ) and P(u 6 ). For a dominance-driven 3-partition strategy, the clustered pattern subgraphs are illustrated in Figure 5. For the RDF graph and the pattern graph in Figures 1 and 2, the divided pattern graphs and the partitioned RDF graphs are illustrated in Figure 7. The divided pattern subgraphs Q 1 (= PT(Department)) and Q 2 (= PT(GraduateCourse)) are shown in Figure 7(a), where Q 1 and Q 2 are fish-shaped pattern subgraphs semidominated by the common node ResearchAssistant. For a dominance-driven 2-partition strategy, the divided pattern subgraphs do not need to be clustered, because the clustered pattern subgraphs are the same as the divided ones. The large RDF graph is partitioned onto different servers based on the divided pattern subgraphs Q 1 and Q 2 , as illustrated in Figures 7(b) and 7(c).
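The mapping of data triples onto pattern clusters (Lines 9-11 of Algorithm 3) can be illustrated with a small sketch. It assumes that a pattern triple is a (subject type, predicate, object type) tuple and that a data triple belongs to a cluster when its types and predicate coincide with one of the cluster's pattern triples. The helper name, the instance names, and the predicates memberOf and takesCourse are illustrative only and are not taken from the figures.

```python
def partition_triples(triples, type_of, cluster_patterns):
    """Assign each data triple to every cluster whose pattern it matches.

    triples          : list of (subject, predicate, object) instance triples
    type_of          : dict mapping an instance to its type-label
    cluster_patterns : list of sets of (subject_type, predicate, object_type)
    """
    parts = [[] for _ in cluster_patterns]
    for s, p, o in triples:
        key = (type_of.get(s), p, type_of.get(o))
        for i, patterns in enumerate(cluster_patterns):
            if key in patterns:
                parts[i].append((s, p, o))
    return parts


# Hypothetical instances typed with the labels used in the example above.
type_of = {
    "RA_1": "ResearchAssistant",
    "Dept_A": "Department",
    "Course_A": "GraduateCourse",
}
triples = [
    ("RA_1", "memberOf", "Dept_A"),
    ("RA_1", "takesCourse", "Course_A"),
]
clusters = [
    {("ResearchAssistant", "memberOf", "Department")},
    {("ResearchAssistant", "takesCourse", "GraduateCourse")},
]
parts = partition_triples(triples, type_of, clusters)
```

Here the shared node RA_1 appears in both partitions, which mirrors how a common semidominating node is replicated across the servers that hold the two pattern subgraphs.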

Subgraph Matching on k-Partitioned RDF Graph
In this section, we introduce the subgraph matching algorithm on a k-partitioned RDF graph. Given a query graph q and a data graph G, the subgraph matching problem is to find all subgraphs of G that are isomorphic to q. Here, the query graph q is defined as a subgraph of Q, written as q(V ′ , E ′ ).

Feasibility of Dominating Relationship.
To match the flow characteristics of a pattern graph in this paper, we attach to a query graph a virtual root u r ′ that is isotopic to the artificial root u r of the pattern graph. The root-attached query graph is redefined as q(u r ′ , V ′ , E ′ ), where u r ′ is the virtual root node. Given a pattern graph Q(V, E) and an artificial root u r , if there exists a refined query graph q(u r ′ , V ′ , E ′ ) satisfying u r ∈ V ′ , then u′ is a nonempty node and u r = u r ′ ; otherwise, u′ is an empty node. Considering the refined query graphs q 1 and q 2 in Figure 8(a), node u 1 of q 1 is a real root node of the pattern graph in Figure 9, and node u r of q 2 is a virtual root node that is isotopic to node u 1 . To ensure the reachability of q 2 , virtual edges are constructed from u 2 to the nearest nodes u 1 and u 7 on the reachable paths from u 1 in Figure 9. The dominating relationships in a pattern graph remain valid for the refined query graph, which is proved in Theorem 3.
nodes u, u′ ∈ V ∩ V ′ , satisfying u ≺ u′ in Q, such that the dominating relationship u ≺ u′ does not hold in q. Then, there must exist another node u″ that is a reachable node on the path |u r ′ , u ′ 〉. Since u r ′ is isotopic to u r of Q, u″ must be a reachable node on the path |u r ′ , u ′ 〉. However, u then does not dominate u′ in Q, because there exist two nodes u and u″ that are reachable nodes on the path |u r ′ , u ′ 〉.
For the refined query graphs q 1 and q 2 in Figure 8, the query graphs inherit the dominating relationships of the pattern graph because, for any query graph, a query root node (real or virtual) that is isotopic to the root node of the pattern graph can always be found.
Benefiting from the feasibility of the dominating relationship in Theorem 3, combining a computational paradigm of subgraph matching with the concept of DCPG yields several characteristics that accelerate subgraph matching on an RDF graph. The smallest computational unit of subgraph matching is a data-query vertex pair (node pair for short), which describes a mapping from a query vertex u to a data vertex v, written as 〈v, u〉. A solution of subgraph matching is defined as a subgraph mapping, described as M = (〈v 1 , u 1 〉, . . . , 〈v n , u n 〉), n = |V q |. A DCPG-based characteristic is given in Lemma 2.
where L V and L E denote the labeling function of vertices and edges, respectively, L V (u) represents the vertex-label set coupled with vertex u and L E (u, u′) represents the edge-label set coupled with edge (u, u′).

Lemma 3.
The node pairs dominated by 〈v, u〉 cannot produce any subgraph mapping containing 〈v, u〉 if 〈v, u〉 is negative.
Proof (for Lemma 3). Consider a negative node pair 〈v, u〉. It does not satisfy the constraints of subgraph isomorphism in Definition 2; thus M(u) ≠ v. Further, the node pairs dominated by 〈v, u〉 cannot produce any subgraph mapping containing 〈v, u〉.
Therefore, we employ a circular-pattern first matching order to guide the iterative processing of subgraph matching. For the refined query graph q 1 containing a real root u 1 in Figure 8(a), the nodes of the circular-pattern P(u 15 ) are ordered first in the process of subgraph matching. For the refined query graph q 2 containing a virtual root u r in Figure 8(b), the nodes of the circular-patterns P(u 9 ) and P(u 3 ) are ordered first. Since q 2 contains two circular-patterns, a priority must be selected for executing P(u 9 ) and P(u 3 ). The priority among multiple circular-patterns is selected by the density of the circular-patterns, defined as the ratio of the number of edges to the number of vertices in a circular-pattern. For the circular-patterns P(u 9 ) and P(u 3 ), the densities are calculated as 1 and 2/3, and P(u 3 ) is executed before P(u 9 ). Note that, for a query graph coupled with a virtual root such as q 2 , the calculation of density does not consider the virtual root and the virtual edges.
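The density used to prioritize circular-patterns can be computed as in the sketch below. It follows the stated definition (edges divided by vertices, ignoring the virtual root and its edges); the function name and the two example shapes are hypothetical and do not reproduce the actual patterns of Figure 8.

```python
from fractions import Fraction


def density(edges, virtual=frozenset()):
    """Ratio |E| / |V| of a pattern, skipping edges incident to virtual nodes."""
    real_edges = [(a, b) for a, b in edges
                  if a not in virtual and b not in virtual]
    vertices = {v for edge in real_edges for v in edge}
    return Fraction(len(real_edges), len(vertices))


# Hypothetical shapes: a 3-cycle and a 3-vertex path.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
path = [("a", "b"), ("b", "c")]

d_triangle = density(triangle)
d_path = density(path)
# Attaching a virtual root "r" leaves the density unchanged.
d_path_rooted = density(path + [("r", "a")], virtual={"r"})
```

Exact fractions are used so that densities such as 2/3 can be compared without floating-point rounding.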

Physical Storage.
We design the physical storage of the patterned RDF graph to accelerate the acquisition of triple-based RDF data. Before introducing our physical storage, a dictionary encoding mapping table is first designed to encode the strings of RDF triples as integers. The dictionary encoding mapping table consists of a data dictionary and a semantic dictionary. The data dictionary maps the instance-labels of RDF triples to unique integers, and the semantic dictionary maps the predicates, type-labels, and attribute-labels of RDF triples to unique integers.
Considering the RDF graph in Figure 1, the semantic and data dictionaries are illustrated in Table 2. Both the data and semantic dictionaries consist of two parts: the first part records the unique integer codes (e.g., 1, 2, . . .), and the second part records the instance-label, predicate, type-label, or attribute-label (e.g., Person_A and advisorBy) corresponding to the integer codes in the first part.
The table v-id denotes the data dictionary, where each line encapsulates a unique integer and an instance-label (e.g., 〈1, Person_A〉). The semantic dictionary is shown in the table p/type/attribute, where each line encapsulates a unique integer and a predicate, a type-label, or an attribute-label (e.g., 〈5, workfor〉, 〈8, rdf:type〉, and 〈16, hasEmail〉).
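A dictionary encoding of this kind can be sketched as follows. The class and function names are hypothetical, codes are assigned in first-seen order (so they do not reproduce the numbering of Table 2), and only instance-to-instance and rdf:type triples are handled; how attribute values are encoded is not specified in the text and is left out.

```python
class Dictionary:
    """Bidirectional mapping between labels and integer codes starting at 1."""

    def __init__(self):
        self._ids = {}
        self._labels = []

    def encode(self, label):
        # Assign the next free integer the first time a label is seen.
        if label not in self._ids:
            self._labels.append(label)
            self._ids[label] = len(self._labels)
        return self._ids[label]

    def decode(self, code):
        return self._labels[code - 1]


def encode_triple(triple, data_dict, semantic_dict):
    """Encode (s, p, o); the object of rdf:type is a type-label, so it is
    looked up in the semantic dictionary instead of the data dictionary."""
    s, p, o = triple
    object_dict = semantic_dict if p == "rdf:type" else data_dict
    return (data_dict.encode(s), semantic_dict.encode(p), object_dict.encode(o))


data_dict, semantic_dict = Dictionary(), Dictionary()
encoded = [
    encode_triple(t, data_dict, semantic_dict)
    for t in [
        ("Person_B", "advisorBy", "Person_A"),
        ("Person_A", "rdf:type", "FullProfessor"),
    ]
]
```

Keeping two separate dictionaries means an instance and a predicate may share the same integer; the position of the code inside the triple tells which dictionary to use for decoding.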
For the divided RDF graph in Figures 7(b) and 7(c), our physical storage is displayed in Figure 10, which employs the structural layout of a hash mapping table. The Key is formed as [v-id|p/type|dir], and the Value is assigned as p/type, where v-id is a unique integer code, p/type represents a predicate, a type-label, or an attribute-label, and dir indicates the direction of the RDF triple on the graph. For example, a key [1|0|0] on server 1 refers to the labels of the in-edges of the vertex numbered 1, that is, the predicate advisorBy marked as 1. A key [1|8|1] indicates the type-label of the vertex numbered 1, which is the label FullProfessor marked as 10. A key [0|8|0] denotes the vertices coupled with an in-edge label rdf:type marked as 8, which are Person_A, Person_B, Course_A, and Course_B, marked as 1, 2, 6, and 7, respectively.
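The key layout [v-id|p/type|dir] can be approximated with a plain hash table, as in the sketch below. It is a simplification under stated assumptions: dir is taken to be 0 for in-edges and 1 for out-edges, 0 in the p/type position is treated as a wildcard, the stored sets hold either edge labels or neighbour codes, and the vertex-wildcard form [0|p|dir] is not modeled. The function name is hypothetical; the integer codes are the ones quoted in the example above.

```python
from collections import defaultdict

IN, OUT = 0, 1   # assumed meaning of the dir field
ANY = 0          # assumed wildcard in the p/type field


def build_store(triples):
    """Index encoded (s, p, o) triples under (v-id, p/type, dir) keys."""
    store = defaultdict(set)
    for s, p, o in triples:
        store[(s, p, OUT)].add(o)     # neighbours reached from s via p
        store[(o, p, IN)].add(s)      # neighbours reaching o via p
        store[(s, ANY, OUT)].add(p)   # labels of the out-edges of s
        store[(o, ANY, IN)].add(p)    # labels of the in-edges of o
    return store


# Person_B(2) --advisorBy(1)--> Person_A(1); Person_A(1) --rdf:type(8)--> FullProfessor(10)
store = build_store([(2, 1, 1), (1, 8, 10)])
in_labels_of_1 = store[(1, ANY, IN)]
type_of_1 = store[(1, 8, OUT)]
```

With this layout, the labels around a vertex and the neighbours along a given label are both obtained by a single hash lookup.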

Subgraph Matching Algorithm.
The subgraph matching on a k-partitioned RDF graph is described in Algorithm 4. The input is a query graph q(u r ′ , V ′ , E ′ ), and the outputs are the subgraph mappings M of q on G k . The subgraph matching algorithm starts from u r ′ and runs until all subgraph mappings are produced. The number i counts the positive node pairs, and a subgraph mapping is output to M when i equals |V ′ | (Lines 1-2). A circular-pattern first matching order, which benefits from Lemmas 2 and 3, guides the iterative processing of subgraph matching. An isotopic virtual root u r ′ is attached to the original query graph and assigned the initial number i = −1 (Line 4). Since u r ′ is a virtual node, the successors of u r ′ are expanded to explore the real node pairs (Line 5).
An iterative process inserts the positive node pairs into M (Lines 7-11). For a selected query vertex u i , the node pairs of u i are first extracted from G k (Line 7). If the node pair 〈v, u i 〉 is positive and satisfies the partial subgraph isomorphism, 〈v, u i 〉 is added to M and the successor of u i is expanded to explore the real node pairs. Otherwise, the other node pairs are verified in sequence by partial subgraph isomorphism and candidate verification (Lines 8-10). If all node pairs of a query vertex u i are negative or do not satisfy the partial subgraph isomorphism, the algorithm backtracks to the precursor of u i and repeats the extension until all subgraph mappings are added to M (Line 11).
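The extend-and-backtrack loop described above can be illustrated with a generic backtracking matcher. This is a simplified sketch, not Algorithm 4 itself: it runs on a single in-memory edge list instead of a partitioned G k , it has no virtual root, it takes every data vertex as a candidate, and it checks only that each query edge is present in the data graph. The matching order (for example, circular-pattern first) is supplied by the caller; all names and the toy data are hypothetical.

```python
def sub_matching(query_edges, data_edges, order):
    """Enumerate injective mappings of query vertices to data vertices that
    preserve every labeled, directed query edge (s, p, o)."""
    data = set(data_edges)
    data_vertices = sorted({x for s, _, o in data_edges for x in (s, o)})
    mappings = []

    def consistent(mapping, u, v):
        # Partial isomorphism check: v is unused and every query edge whose
        # endpoints are both mapped has a counterpart in the data graph.
        if v in mapping.values():
            return False
        trial = dict(mapping)
        trial[u] = v
        for a, p, b in query_edges:
            if a in trial and b in trial and (trial[a], p, trial[b]) not in data:
                return False
        return True

    def extend(i, mapping):
        if i == len(order):            # all query vertices are mapped
            mappings.append(dict(mapping))
            return
        u = order[i]
        for v in data_vertices:
            if consistent(mapping, u, v):
                mapping[u] = v         # extend with the node pair <v, u>
                extend(i + 1, mapping)
                del mapping[u]         # backtrack to try the next candidate

    extend(0, {})
    return mappings


# Toy circular query: x is advised by y, x takes z, and y teaches z.
query = [("x", "advisor", "y"), ("x", "takes", "z"), ("y", "teaches", "z")]
data = [
    ("s1", "advisor", "p1"), ("s1", "takes", "c1"), ("p1", "teaches", "c1"),
    ("s2", "advisor", "p1"), ("s2", "takes", "c2"),
]
result = sub_matching(query, data, order=["x", "y", "z"])
```

In the toy data, the partial mapping for s2 is abandoned at the last vertex because the cycle cannot be closed, which is the kind of early rejection that matching the circular part first is intended to provoke.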

Example for Subgraph Matching Algorithm.
In the subgraph matching algorithm, we employ a circular-pattern first matching order to guide the iterative processing of subgraph matching. An ordered query graph is shown in Figure 11, where the circles filled with left-diagonal lines denote the first executed region, the circles filled with right-diagonal lines indicate the second executed region, and the unfilled circles refer to the final executed region. The filled circles are included in circular-pattern subgraphs, and u 7 is the regional juncture. Then, the subgraph matching is iteratively conducted following our circular-pattern first matching order.

Experimental Evaluation
In this section, we verify the effectiveness and scalability of our algorithms on synthetic and real datasets, and we mainly compare against current memory-based distributed SPARQL query processing strategies.
Input: a pattern graph Q, a query graph q, and a data graph G
Output: the set M of all subgraph mappings of q on G k
(1) -extraction from;
(2) G k ⟵ k-Partition of Q dc on G;
(3) M ⟵ subMatching(q, G k );
(4) Return M;
ALGORITHM 1: Dominance-partitioned subgraph matching algorithm (DP-SM).

Experimental Settings.
All experiments are conducted on a distributed cluster of six identical computing nodes. Each computing node uses an Intel(R) Core(TM) i7-7700 @ 3.60 GHz 8-core processor, and the nodes communicate over 1000 Mbps Ethernet. The physical memory is 16 GB, and the hard disk size is 1 T.
The experimental evaluation employs four generated scales of the synthetic dataset LUBM (Lehigh University Benchmark) and the real dataset YAGO2 (Yet Another Great Ontology 2). The related information of the datasets is shown in Table 3, where #T, #S, #O, and #P represent the numbers of triples, subjects, objects, and different predicates, respectively.

Datasets.
The synthetic dataset LUBM [34] was developed by Lehigh University and is a standard and systematic semantic Web repository evaluation benchmark for a university ontology. This benchmark aims to evaluate extended queries of a single real ontology on a large dataset. The two datasets of different sizes are generated with the data generator UBA 1.7 (http://swat.cse.lehigh.edu/projects/lubm). YAGO2 (http://yago-knowledge.org) is a linked data knowledge base that mainly integrates data from three sources: Wikipedia, WordNet, and GeoNames. It contains 120 million triples and more than 10 million entities (such as individuals, organizations, and cities).

Analysis of Experimental Results.
We compare the query performance of our DP-SM algorithm with TriAD [4] and Wukong [5]. The algorithms are deployed on six computing nodes (including a master node) and evaluated on the LUBM-2560 dataset. The query graphs of different scales are generated from a benchmark that is used in the research of many distributed RDF systems and is published in [12].
The experiments are divided into two groups of query graphs, illustrated in Figure 12.
The first group of query graphs, L1, L2, and L3, corresponds to Q1, Q3, and Q7 in [19], respectively; on this group, our DP-SM algorithm is 1.4-2.2 times faster than the Wukong algorithm. The final results of L2 are in fact empty: even though a large number of predicate relationships are mapped to the query graph L2, the verified candidates of the query nodes are empty. The candidate verification of our algorithm finds the candidates of all query nodes before the matching process is executed, while Wukong and TriAD need to find the candidates with a time-consuming traversal over large intermediate results. The final results of L1 and L3 contain 65,000 and 1,000 data subgraphs, respectively. The experiments verify that the graph-based exploration method is nearly one order of magnitude faster than the relationship-based joining model. Our algorithm employs a circular-pattern first matching order to guide the iterative processing of subgraph matching, which prunes the redundant intermediate results in advance. Thus, our algorithm obtains a greater improvement in matching performance.
The second group of query graphs, L4, L5, and L6, corresponds to the extended Q2, Q1, and Q7 in [19] and employs more complex and denser topological structures than L1, L2, and L3. The query graph L4 has a noncircular topological structure, and its intermediate results are larger without the verification of partial subgraph isomorphism. Thus, our DP-SM algorithm shows only a small improvement over TriAD. Compared with Wukong, the improved matching performance of our algorithm benefits from the strategy of postponing the cluster-connected calculations of Cartesian products. The query graph L6 contains denser circular topological structures than L5, and the circular-pattern first matching strategy speeds up the acquisition of subgraph results.
The average matching time on the YAGO2 dataset is evaluated in Figure 13, where the simple and complex query graphs are those given in [8]. The matching performance of the complex query graphs Y4, Y5, and Y6, illustrated in Figure 13(b), is similar to that of the complex query graphs on the LUBM dataset. Our DP-SM algorithm is 1.5-2.5 times faster than Wukong and TriAD. The difference is that the simple query graphs Y1, Y2, and Y3 are matched faster than the simple ones on the LUBM datasets, because the nodes of the simple query graphs are restricted by constant values, which yields a smaller search space of intermediate results, as shown in Figure 13(a). Compared with Wukong and SDSM, our algorithm spends additional time on orchestrating the matching order, but this cost is negligible relative to the overall running time of the matching algorithms.

Experimental Scalability.
The scalability of the algorithms is evaluated with respect to the number of machines and the size of the dataset.
The scalability with respect to the number of machines is evaluated in Figure 14, where the number of machines is gradually increased from 2 to 6. The experimental results show that the matching times of the query graphs L1, L3, L4, L5, and L6 gradually improve as the number of machines increases. This trend indicates that our DP-SM algorithm can effectively produce the subgraph results in distributed environments. Since the candidates of L2 are verified as empty in the preceding candidate verification, its matching time remains constant. For the complex queries L4, L5, and L6, the decreases in matching time are slightly smaller than those of L1 and L3, because query graphs crossing multiple partitioned pattern subgraphs increase the transmission time on the partitioned RDF graphs.
The scalability with respect to the size of the LUBM dataset is evaluated in Figure 15, where the number of machines is fixed at 6. Different scales of the LUBM dataset, ranging from 5.3 M to 346 M, are generated to evaluate the matching times of the algorithms. The matching time of our DP-SM algorithm maintains a nearly linear growth without the complex topological structures of the query graph. Our algorithm employs a circular-pattern first matching strategy to prune the redundant RDF data in advance and postpone the subgraph-connected calculation of Cartesian products. Then, the partial intermediate results are linked without a large matching time cost on noncircular pattern subgraphs.
Input: a query graph q(u r ′ , V ′ , E ′ )
Output: the subgraph mappings M of q on G k
(1) If i = |V ′ | then
(2) Output M ⟶ M;
(3) Else
(4) If i = −1 then
(5) Continue to u i .successor;
(6) Else
(7) Foreach 〈v, u i 〉 ∈ M(u i , G k ) do
(8) If candidateValid(v, u i ) then
(9) 〈v, u i 〉 ⟶ M;
(10) subMatching(q(u i .successor));
(11) subMatching(q(u i .precursor));
ALGORITHM 4: Subgraph matching algorithm on k-partitioned RDF graph (subMatching()).

Conclusions
In this paper, we propose a dominance-partitioned subgraph matching method on a large RDF graph. Firstly, a dominance-connected pattern graph is extracted from a pattern graph to construct a dominance-partitioned pattern hypergraph, which divides the pattern graph into multiple fish-shaped pattern subgraphs. Secondly, a dominance-driven spectrum clustering strategy is used to gather the pattern subgraphs into multiple clusters. Thirdly, a dominance-partitioned subgraph matching algorithm is designed to produce all isomorphic subgraphs on a cluster-partitioned RDF graph. Finally, the experimental evaluation verifies that our strategy achieves higher time-efficiency on complex queries and better scalability across multiple machines and different data scales.

Data Availability
The LUBM data used to support the findings of this study have been deposited in the web repository (http://swat.cse.lehigh.edu/projects/lubm). The YAGO2 data used to support the findings of this study have been deposited in the web repository (http://yago-knowledge.org).

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.