Incremental Graph Pattern Matching Algorithm for Big Graph Data

Graph pattern matching is widely used in big data applications. However, real-world graphs are usually huge and dynamic, and a small change in the data graph or pattern graph can incur a high computing cost. Incremental graph matching algorithms avoid recomputing on the whole graph and reduce the computing cost when the data graph or the pattern graph is updated. The existing incremental algorithm PGC IncGPM can effectively reduce matching time when no more than half of the edges of the pattern graph are updated. However, as the number of changed edges increases, the improvement of PGC IncGPM gradually decreases. To solve this problem, an improved algorithm, iDeltaP IncGPM, is developed in this paper. For multiple insertions (resp., deletions) on pattern graphs, iDeltaP IncGPM determines the detection sequence of the nodes' matching states and processes them together. Experimental results show that iDeltaP IncGPM has higher efficiency and a wider application range than PGC IncGPM.


Introduction
Graph pattern matching is to find all the subgraphs of a data graph G that are the same as or similar to a given pattern graph P. It is widely used in a number of applications, for example, web document classification, software plagiarism detection, and protein structure detection [1][2][3].
With the rapid development of the Internet, huge amounts of graph data emerge every day. For example, the Linked Open Data Project, which aims to connect data across the Web, had published 149 billion triples by 2017 [4]. In addition, real-world graphs are dynamic [5]. It is often cost-prohibitive to recompute matches from scratch when P or G is updated. An incremental matching algorithm is needed, which aims to minimize unnecessary recomputation by analyzing and computing the changes of the matching result in response to updates ΔP (resp., ΔG) to P (resp., G).
For example, Figure 1(a) is a pattern graph P and Figure 1(b) is a data graph G. The subgraph composed of A1, B1, C1, D1, E1, and the edges between them (for simplicity, denoted as {A1, B1, C1, D1, E1}) is the only matching subgraph. Assuming that (B, E) and (C, D) are removed from the pattern graph, the traditional recomputing algorithm will compute the matches for the new pattern graph on the whole data graph, which is time consuming. The incremental algorithm will check only a part of the nodes in G, that is, B2, B3, C2, C3, A2, and A3, and add the new matching subgraphs ({A2, B2, C2, D2, E2}, {A3, B3, C3, D3, E3}) to the original matching result.
At present, the study of incremental graph pattern matching is still in its infancy, and existing work [6][7][8][9][10][11][12] mainly focuses on the updates of data graphs. In our previous study, we proposed an incremental graph matching algorithm named PGC IncGPM, which can be used in scenarios where data graphs are constant and pattern graphs are updated [13]. PGC IncGPM can effectively reduce the runtime of graph matching as long as the number of changed edges is less than the number of unchanged edges in P. However, the improvement of PGC IncGPM gradually decreases as the number of changed edges increases. In this paper, the bottleneck of PGC IncGPM is further analyzed. An optimization method for the detection sequence of the nodes' matching states is proposed, and a more efficient algorithm called iDeltaP IncGPM is designed and implemented. Using Figure 1 as an example, suppose (B, E) and (C, D) are deleted from the pattern graph. The PGC IncGPM algorithm will first consider the deletion of (B, E), that is, checking B2, A2, B3, and A3, and then consider the deletion of (C, D), that is, checking C2, B2, A2, C3, B3, and A3. Thus B2, A2, B3, and A3 are all checked twice. iDeltaP IncGPM considers the two deletions together, so C2, B2, A2, C3, B3, and A3 are each checked only once.
The remainder of this paper is organized as follows. In Section 2, related work is reviewed. The model and definitions are described in Section 3. In Section 4, our algorithm is presented. Section 5 presents the experimental results and comparison, and Section 6 concludes the paper.

Related Work
We surveyed related work in two categories: graph pattern matching models and incremental algorithms for graph matching on massive graphs.
Graph pattern matching is typically defined in terms of subgraph isomorphism [14,15]. However, subgraph isomorphism is an NP-complete problem [16]. In addition, subgraph isomorphism is often too restrictive because it requires that the matching subgraphs have exactly the same topology as the pattern graph. These limitations hinder its applicability in emerging applications such as social networks and crime detection. Thus, graph simulation [17] and its extensions [18][19][20][21][22] are adopted for pattern matching. Graph simulation preserves the labels and the child relationship of a graph pattern in its match. In practical applications, graph simulation is so loose that it may produce a large number of useless matches, which can drown out useful information. Dual simulation [18] enhances graph simulation by imposing an additional condition, to preserve both child and parent relationships (downward and upward mappings). Because dual simulation offers a good balance between response time and effectiveness and has high practical value, graph pattern matching is defined as dual simulation in this paper.
At present, the study of incremental graph pattern matching is still in its infancy; existing work [6][7][8][9][10][11][12] mainly focuses on the updates of data graphs. Fan [9]. Wang and Chen proposed an incremental approximate graph matching algorithm, which transforms approximate subgraph search into vector space relation detection [10]. When an insertion or deletion is made on the data graph, the vectors of the relevant nodes are modified, and whether the new vectors still contain the vector of the pattern graph is rechecked. Choudhury et al. developed StreamWorks, a fast matching system for dynamic graphs [11]. The system can detect suspicious pattern graphs in real time and give early warning of high-risk data transfer modes on constantly updated network graphs. Semertzidis and Pitoura proposed an approach to find the most durable matches of an input graph pattern on graphs that evolve over time [12]. In [13], an incremental graph matching algorithm was proposed for updates of pattern graphs.
In the big data era [23], graph computing is widely used in different fields such as social networks [24], sensor networks [25,26], internet-of-things [27,28], and cellular networks [29]. Therefore, there is an urgent demand for improving the performance of big graph processing, especially graph pattern matching.

Model and Definition
For graph pattern matching, pattern graphs and data graphs are directed graphs with labels. Each node in a graph has a unique label, which defines the attribute of the node (such as keywords, skills, class, name, and company).
Definition 1 (graph). A node-labeled directed graph (or simply a graph) is defined as G = (V, E, L), where V is a finite set of nodes, E ⊆ V × V is a finite set of edges, and L is a function that maps each node u in V to a label L(u); that is, L(u) is the attribute of u.
For any P and G, there exists a unique maximum matching relation. Graph pattern matching is to find this relation, and the result graph is a subgraph of G that represents it.
Consider a real-life example: a recruiter wants to find a professional software development team in a social network. Figure 2(a) is the basic organization graph of a software development team. The team consists of staff with the following roles: project manager (PM), database engineer (DB), software architect (SA), business process analyst (BA), user interface designer (UD), software developer (SD), and software tester (ST). Each node in the graph represents a person, and the label of the node is the role of that person. The edge from node A to node B means that B works well under the supervision of A. A social network is shown in Figure 2(b). In this example, the maximum matching relation is {(DB, DB1), (PM, PM1), (SA, SA1), (BA, BA1), (UD, UD1), (SD, SD1), (SD, SD2), (ST, ST1), (ST, ST2)}. Because BA2 does not have a child matching UD and SA2 does not have a parent matching DB, PM2 does not keep the child relationship of PM. For the same reason, SD3 (resp., ST3) does not match SD (resp., ST).
Definition 3 (incremental graph pattern matching for pattern graph changing). Given a data graph G and a pattern graph P, the matching result in G for P is M(P, G). Assuming that P changes by ΔP, the new pattern graph is expressed as P ⊕ ΔP. As opposed to batch algorithms that recompute matches from scratch, an incremental graph matching algorithm aims to find the changes ΔM to M(P, G) in response to ΔP such that M(P ⊕ ΔP, G) = M(P, G) ⊕ ΔM.
When ΔP is small, ΔM is usually small as well, and it is much less costly to compute ΔM than to recompute the entire set of matches. In other words, this suggests that we compute matches once on the entire graph via a batch-matching algorithm and then incrementally identify new matches in response to ΔP without paying the high cost of full graph pattern matching.
In order to get ΔM quickly, indexes can be prebuilt based on selected data features of the graphs to reduce the search space during incremental matching. The more indexes there are, the shorter the time to get ΔM and the larger the space needed to store them. For large-scale data graphs, both response time and storage cost need to be reduced. To balance storage cost and response time, three kinds of sets generated in the process of graph matching are used as the index in this paper. (1) The first are candidate matching sets cand(⋅); for each node u in P, cand(u) includes all the nodes in G which only have the same label as u. The nodes in cand(⋅) are called c-nodes. (2) The second are child matching sets sim(⋅); for each node u in P, sim(u) includes all the nodes in G which preserve the child relationship of u. The nodes in sim(⋅) are called s-nodes. (3) The third are complete matching sets mat(⋅); for each node u in P, mat(u) includes all the nodes in G which preserve both the child and parent relationships of u. The nodes in mat(⋅) are called m-nodes.
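The three index sets can be illustrated with a small Python sketch. This is not the GPMS algorithm of [13]; it is a naive fixpoint computation written only to show how cand(⋅), sim(⋅), and mat(⋅) relate to each other. The function name index_sets, the edge-list inputs, and the choice to make the three sets disjoint (a node is stored only at the strongest level it reaches) are assumptions made for this illustration.

```python
from collections import defaultdict

def index_sets(p_edges, p_labels, g_edges, g_labels):
    """Return (cand, sim, mat), each mapping a pattern node to a set of data nodes.

    Hypothetical helper: the sets are kept disjoint, so a data node sits in
    cand(u), sim(u), or mat(u) according to the strongest condition it meets.
    """
    p_child, p_parent = defaultdict(set), defaultdict(set)
    for a, b in p_edges:
        p_child[a].add(b)
        p_parent[b].add(a)
    g_child, g_parent = defaultdict(set), defaultdict(set)
    for a, b in g_edges:
        g_child[a].add(b)
        g_parent[b].add(a)

    # Data nodes that carry the same label as each pattern node.
    same_label = {u: {v for v in g_labels if g_labels[v] == p_labels[u]}
                  for u in p_labels}

    # Child relationship: every pattern child of u is matched by a child of v.
    down = {u: set(vs) for u, vs in same_label.items()}
    changed = True
    while changed:
        changed = False
        for u in p_labels:
            for v in list(down[u]):
                if any(not (g_child[v] & down[c]) for c in p_child[u]):
                    down[u].discard(v)
                    changed = True

    # Dual simulation: additionally every pattern parent of u is matched by a parent of v.
    dual = {u: set(vs) for u, vs in down.items()}
    changed = True
    while changed:
        changed = False
        for u in p_labels:
            for v in list(dual[u]):
                keeps_children = all(g_child[v] & dual[c] for c in p_child[u])
                keeps_parents = all(g_parent[v] & dual[q] for q in p_parent[u])
                if not (keeps_children and keeps_parents):
                    dual[u].discard(v)
                    changed = True

    cand = {u: same_label[u] - down[u] for u in p_labels}   # c-nodes
    sim = {u: down[u] - dual[u] for u in p_labels}          # s-nodes
    mat = dual                                              # m-nodes
    return cand, sim, mat
```

For a pattern with one edge A → B and a data graph with the edge a1 → b1 plus two isolated nodes a2 and b2, the sketch places a2 in cand(A), b2 in sim(B), and a1, b1 in mat(A), mat(B).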
The symbols used in this paper are listed in the Notations section.

iDeltaP_IncGPM Algorithm
In this section, we propose the improved incremental graph pattern matching algorithm for pattern graph changes (ΔP).

The Idea of PGC IncGPM Algorithm.
The basic framework of PGC IncGPM [13] is shown in Figure 3.
The graph pattern matching algorithm (GPMS) is first performed on the entire data graph G for the pattern graph P. It computes the matching result graph and creates the index needed for subsequent incremental matching. ΔP may include edge insertions and edge deletions. The incremental graph pattern matching algorithm PGC IncGPM first calls the subalgorithm AddEdges for the edge insertions to get an updated result graph and index, and then calls the subalgorithm SubEdges for the edge deletions to get the final result graph and index. The final result graph is the new matching result M(P ⊕ ΔP, G), and the final index is the new index that can be used for subsequent incremental matching if the pattern graph changes again.
Edge insertions (resp., edge deletions) in ΔP are processed one by one by AddEdges (resp., SubEdges). For example, when deleting multiple edges from P, the processing of PGC IncGPM is as follows.
In the first step, the following operations are performed for each deleted edge (u, u′): for each v ∈ cand(u), whether v keeps the child relationship of u in P ⊕ ΔP is checked. If v keeps the child relationship of u, then v is moved from cand(u) to sim(u), and the parents of v in cand(⋅) are also processed.
In the second step, each node in sim(⋅) is repeatedly filtered according to its parents and children; the newly generated m-nodes are added to mat(⋅).
In the first step, when deleting (u, u′) from P, some nodes in cand(u) and cand(a) (where a is an ancestor of u) may change from c-nodes to s-nodes. So when a c-node becomes an s-node, a bottom-up approach is used to find its parents and ancestors in cand(⋅). If (u1, u1′) and (u2, u2′) are deleted, and u1 and u2 have a common ancestor a, then cand(a) will be visited twice. In summary, PGC IncGPM has a bottleneck when multiple edges are deleted. The same problem exists for multiple inserted edges.

Optimization for Matching State Detection Sequence.
Since PGC IncGPM deals with edge insertions (resp., deletions) one by one, its efficiency gradually decreases as the number of changed edges increases. To overcome this bottleneck, multiple edge insertions (resp., deletions) should be considered together. In this paper, an optimization method for the detection sequence of the nodes' matching states is proposed. The optimization can be applied to both insertions and deletions on P. Taking SubEdges as an example, the optimization method is as follows.
First, analyze all edges deleted from P to determine which nodes' candidate matching sets may change. If cand(u) may change, then u is added to the set filtorder−.
Secondly, filtorder− is sorted by the inverse topological sequence of P. There may be some strongly connected components in P. In this case, we first find all the strongly connected components in P, then contract each strongly connected component into a single node to get a directed acyclic graph and find the inverse topological sequence of that graph; finally, we replace each contracted node with its original node set. Thus, an approximate inverse topological sequence of P is obtained.
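The ordering step can be sketched in Python as follows. The sketch uses Tarjan's strongly connected components algorithm, which is a substitution chosen here for brevity and is not named in the paper: Tarjan's algorithm emits components so that every component appears after all components reachable from it, which directly gives a children-before-parents order without building the contracted graph explicitly. The function name inverse_topological_order is hypothetical, and the recursive form assumes a small pattern graph.

```python
def inverse_topological_order(nodes, edges):
    """Return the nodes so that, outside of cycles, children precede parents.

    Members of one strongly connected component are emitted consecutively,
    which corresponds to expanding a contracted component back into its nodes.
    """
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)

    index, low = {}, {}
    stack, on_stack = [], set()
    order = []
    counter = 0

    def visit(n):
        nonlocal counter
        index[n] = low[n] = counter
        counter += 1
        stack.append(n)
        on_stack.add(n)
        for m in succ[n]:
            if m not in index:
                visit(m)
                low[n] = min(low[n], low[m])
            elif m in on_stack:
                low[n] = min(low[n], index[m])
        if low[n] == index[n]:
            # n is the root of a component: pop all of its members together.
            while True:
                m = stack.pop()
                on_stack.discard(m)
                order.append(m)
                if m == n:
                    break

    for n in nodes:
        if n not in index:
            visit(n)
    return order
```

Restricting the returned list to the nodes in filtorder− keeps their relative order, so the full pattern graph can be ordered once and filtered afterwards.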
Finally, for each u in filtorder−, cand(u) is processed in turn. Depending on whether an out-edge of u is deleted, two different filtering methods are used: (1) if u has at least one out-edge to be deleted, then each node in cand(u) may now keep the child relationship of u, so all of them are checked; (2) if u has no out-edge deleted, then only part of the nodes in cand(u) need to be checked. That is, a node in cand(u) is checked only if it has at least one child that changes from a c-node to an s-node.
This optimization reduces the number of times some candidate matching sets are visited.
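A minimal Python sketch of the filtering pass is given here under several stated assumptions; it is not a reproduction of Algorithm 1. It assumes that the pattern graph after the deletions is acyclic, that order already holds filtorder− in inverse topological sequence, and that down[u] stands for all data nodes currently known to keep the child relationship of u (the s-nodes and m-nodes together). The function name sub_edges_pass and all parameter names are hypothetical.

```python
def sub_edges_pass(order, p_child, deleted_sources, cand, down, g_child):
    """Process each candidate set once, children before parents.

    order           -- pattern nodes to process, in inverse topological sequence
    p_child         -- children of each pattern node after the deletions
    deleted_sources -- pattern nodes that lost at least one out-edge
    cand, down      -- index sets per pattern node (updated in place)
    g_child         -- children of each data node
    Returns the data nodes promoted out of cand(u) for each pattern node u.
    """
    promoted = {u: set() for u in order}
    for u in order:
        if u in deleted_sources:
            # Case (1): an out-edge of u was deleted, so check every c-node of u.
            to_check = set(cand[u])
        else:
            # Case (2): check only c-nodes with a child that was just promoted.
            moved = set().union(*(promoted.get(c, set()) for c in p_child[u]))
            to_check = {v for v in cand[u] if g_child.get(v, set()) & moved}
        for v in to_check:
            # v keeps the child relationship if every pattern child of u
            # is matched by at least one child of v.
            if all(g_child.get(v, set()) & down[c] for c in p_child[u]):
                cand[u].discard(v)
                down[u].add(v)
                promoted[u].add(v)
    return promoted
```

Because parents are processed after their children, each candidate set is scanned once per batch of deletions. The second filtering step over parents and children, which produces new m-nodes, is outside the scope of this sketch.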

iDeltaP IncGPM Algorithm.
Based on the optimization method proposed in Section 4.2, iDeltaP IncGPM is proposed. It uses the optimized method for both multiple inserted edges and multiple deleted edges. The optimization algorithm for edge deletions is shown in Algorithm 1. In Algorithm 1, nodes− contains all the nodes that have an out-edge deleted. For a node u in P, if the changes of P may result in some nodes in cand(u) becoming s-nodes, then u ∈ filtorder−. filtorder− is sorted by the inverse topological sequence of P (lines (1)-(5)). If u has an out-edge removed, that is, u ∈ nodes−, then all the nodes in cand(u) need to be checked for whether they keep the child relationship of u (lines (7)-(12)). If u ∈ filtorder− and u is not in nodes−, then only part of the nodes in cand(u) are checked. That is, if a node v in cand(u) has a child v′ and v′ is moved from cand(u′) to sim(u′), where u′ is a child of u, then whether v keeps the child relationship of u is checked (lines (14)-(20)).
Here we use an example to illustrate the implementation process of PGC IncGPM and iDeltaP IncGPM. The pattern graph P is shown in Figure 4; assume that (E, H), (G, I), and (C, G) are deleted from P.
The process of PGC IncGPM is as follows. (1) The deletion of (E, H) is processed: each v in cand(E) is checked for whether it keeps the child relationship of E in P ⊕ ΔP. If v keeps the child relationship of E, then its parents found in cand(B) (resp., cand(C)) are checked. If these nodes keep the child relationship of B (resp., C), then they are moved to sim(B) (resp., sim(C)). After that, their parents found in cand(A) are checked. (2) The deletion of (G, I) is processed, and the nodes in cand(G), cand(C), cand(D), and cand(A) are checked in turn. (3) The deletion of (C, G) is processed, and the nodes in cand(C) and cand(A) are checked in turn. From the above steps, it can be seen that cand(C) and cand(A) are visited three times, while cand(G), cand(D), cand(E), and cand(B) are visited once.
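The visit counts in this example can be reproduced with a short Python sketch. Figure 4 is not reproduced in the text, so the edge list used in the usage note is an assumption inferred from the parent relationships mentioned in the steps above; the function names are hypothetical. The counts are worst-case numbers: a parent set is only really visited when some node below it changes state.

```python
from collections import Counter

def ancestors(node, edges):
    """All pattern nodes from which `node` is reachable."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        for p in parents.get(n, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def one_by_one_visits(edges, deletions):
    """Count candidate-set visits when each deleted edge is handled separately."""
    current = set(edges)
    visits = Counter()
    for u, child in deletions:
        current.discard((u, child))
        # The source of the deleted edge and all its current ancestors are visited.
        visits.update({u} | ancestors(u, current))
    return visits

def batched_visits(edges, deletions):
    """Count candidate-set visits when all deletions are analyzed together."""
    remaining = set(edges) - set(deletions)
    affected = set()
    for u, _ in deletions:
        affected |= {u} | ancestors(u, remaining)
    return Counter(affected)  # each affected set is visited once
```

With the assumed edges A→B, A→C, A→D, B→E, C→E, C→G, D→G, E→H, G→I and the deletions (E, H), (G, I), (C, G), one_by_one_visits counts three visits for cand(C) and cand(A) and one for the other four sets, while batched_visits counts one visit for each of the six sets.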
For multiple edges inserted into the pattern graph, a similar optimization method is adopted. nodes+ contains all the source nodes of inserted edges. If some nodes in sim(u) may become c-nodes because of edge insertions, then u is in filtorder+. filtorder+ is ordered by the reverse topological sequence of the pattern graph. nodes+ and filtorder+ are used to reduce the number of visits to sim(⋅) and mat(⋅).

Experiments and Results Analysis
The following experiments evaluate our proposed algorithm. Runtime is used as the key measure. In addition, in order to show the effectiveness of incremental algorithms visually, the improvement ratio (IR) is proposed, which is the ratio of the runtime saved by an incremental matching algorithm to the runtime of the ReComputing algorithm. Two real data sets (Epinions and Slashdot [30]) are used for the experiments. The former is a trust network with 75879 nodes and 508837 edges. The latter is a social network with 82168 nodes and 948464 edges. In previous work, we experimented with normal size and large size pattern graphs, respectively, and the results show that the complexity and effectiveness of the incremental matching algorithm are not affected by the size of the pattern graph. Therefore, in this paper, by default, the number of nodes in P is 9, and the original number of edges in P is 8 (resp., 16) for insertions (resp., deletions) and 9 for both insertions and deletions.
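Read literally, the improvement ratio is the saved runtime divided by the runtime of ReComputing. A minimal Python sketch of that reading follows; the function and parameter names are hypothetical, and the numbers in the usage note are made-up inputs, not measurements from the experiments.

```python
def improvement_ratio(t_recompute, t_incremental):
    """IR = (runtime of ReComputing - runtime of the incremental algorithm)
    divided by the runtime of ReComputing."""
    if t_recompute <= 0:
        raise ValueError("the ReComputing runtime must be positive")
    return (t_recompute - t_incremental) / t_recompute
```

For instance, improvement_ratio(10.0, 6.0) corresponds to an IR of 40%, and the value becomes negative whenever the incremental algorithm is slower than recomputing.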
In order to evaluate the improvement of our proposed algorithm, iDeltaP IncGPM, PGC IncGPM, and ReComputing are all performed on Epinions and Slashdot under different settings. Each experiment was performed 5 times with different pattern graphs, and the average results are reported here. The experimental results are shown in Figure 5. The histogram represents the runtime of each algorithm, and the line chart represents the improvement ratio of iDeltaP IncGPM and PGC IncGPM over ReComputing. For insertions to P, the figure tells us the following: (a) when insertions are no more than 10, the runtime of PGC IncGPM and iDeltaP IncGPM is significantly shorter than that of ReComputing, and iDeltaP IncGPM has the shortest runtime; (b) when insertions are 12 (newly inserted edges account for 60% of the edges in P ⊕ ΔP), the runtime of PGC IncGPM is longer than that of ReComputing, while iDeltaP IncGPM still has the shortest runtime; (c) the improvement ratio of iDeltaP IncGPM and PGC IncGPM decreases as the number of inserted edges increases, but the decrease of iDeltaP IncGPM is smaller. The more edges are inserted, the larger the advantage of iDeltaP IncGPM over PGC IncGPM. When 12 edges are inserted to P, the IR of iDeltaP IncGPM is 40% on average, and the IR of PGC IncGPM is 33% on average. Therefore, iDeltaP IncGPM is better than PGC IncGPM. The reason is that PGC IncGPM processes the inserted edges one by one, so its runtime grows almost linearly as the number of insertions increases. In contrast, iDeltaP IncGPM integrates all the inserted edges, analyzes which matching sets are affected, and processes them in the appropriate order. This prevents some matching sets from being processed repeatedly, which shortens the running time.
Figure 5(c) (resp., Figure 5(d)) shows the runtime of the three algorithms over Epinions (resp., Slashdot) for deletions on the pattern graph. The x-axis represents the number of deletions on P: "−2" represents that two edges are deleted from P, "−4" represents that four edges are deleted from P, and so on. It can be seen that (a) when the number of deletions grows from 2 to 12, the runtime of all three algorithms increases, and iDeltaP IncGPM always has the shortest runtime; (b) as the number of deletions increases, the IR of PGC IncGPM decreases and the IR of iDeltaP IncGPM slowly increases. For 12 deletions, the IR of PGC IncGPM decreases to 7% on average, while the IR of iDeltaP IncGPM increases to 78% on average. The reason is that as the number of deletions increases, the runtime of ReComputing increases dramatically, while the runtime of iDeltaP IncGPM increases only slightly. iDeltaP IncGPM is better than PGC IncGPM because it processes the deleted edges together, and its runtime does not increase linearly with the number of deleted edges.
Figure 5(e) (resp., Figure 5(f)) shows the runtime of the three algorithms over Epinions (resp., Slashdot) for both insertions and deletions on the pattern graph. The x-axis represents the number of insertions and deletions on P: "+2−2" means that two edges are inserted to P and two other edges are removed from P, and so on. As shown in the figure, iDeltaP IncGPM always has a shorter runtime than the others.
In conclusion, iDeltaP IncGPM effectively improves on the efficiency of PGC IncGPM through the optimization strategy. For the same ΔP, the runtime of iDeltaP IncGPM is shorter, and as |ΔP| increases, its runtime increases less and its IR decreases more moderately. Therefore, iDeltaP IncGPM can be applied to larger changes of the pattern graph, and it has a wider range of applications.

Conclusion
In this paper, we analyze PGC IncGPM to find its efficiency bottleneck and propose a more efficient incremental matching algorithm iDeltaP IncGPM.Multiple insertions (resp., deletions) are considered together and the optimization method for nodes' matching state detection sequence is used.Experimental results on real data sets show that iDeltaP IncGPM has higher efficiency and wider application range than PGC IncGPM.
Next, we will study distributed incremental graph matching algorithms. Real-life graphs grow rapidly in size, and hyper-massive data graphs cannot be stored centrally in one data center; they need to be distributed across multiple data centers. It is well worth studying how to perform efficient incremental matching on distributed large graphs.

Figure 1: An example of incremental graph pattern matching.

Figure 2: An example of graph pattern matching.

Figure 5: The runtime of different algorithms when pattern graph changed.