Optimal Representation of Large-Scale Graph Data Based on Grid Clustering and K²-Tree

The application of appropriate graph data compression technology to store and manipulate graph data with tens of thousands of nodes and edges is a prerequisite for analyzing large-scale graph data. The traditional K²-tree representation scheme mechanically partitions the adjacency matrix, which splits dense regions and results in additional storage overhead. As the size of the graph data increases, the query time of K²-tree also increases. To address these problems, we propose a compact representation scheme for graph data based on grid clustering and K²-tree. First, we divide the adjacency matrix into several grids of the same size. Then, we repeatedly filter and merge these grids until the grid density satisfies the given density threshold. Finally, each large grid that meets the density threshold is compactly represented with a K²-tree. On this basis, we further give the corresponding node neighbor query algorithm. The experimental results show that, compared with the current best K²-BDC algorithm, our scheme achieves a better time/space tradeoff.


Introduction
As a basic structure for representing relationships between data, graphs are widely used in web network analysis [1], biometric analysis [2], social group analysis [3], and other fields. With the continuous generation and accumulation of graph data, traditional graph representation methods can no longer support the storage and manipulation of graphs with tens of thousands of nodes and edges [4]. According to Global Web statistics [5], the number of Facebook users exceeded 2.5 billion in 2019, and the average number of friends per person exceeds 300. If an adjacency list is used for storage, close to 900 TB is needed. According to CNNIC statistics [6], the number of Chinese web pages reached 2816 billion in 2019, and the number of hyperlinks is estimated to exceed 10^16. If an adjacency list is used for storage, 10^6 TB of space is required. To support fast querying, the entire adjacency list is usually loaded into memory. In practice, however, this strategy requires an excessive amount of storage space, and with the rapid growth in the number of users the storage problem will only become more severe. In recent years, many researchers have designed data structures for the compressed storage of graphs and have proposed algorithms that support operations on these compressed graphs.
There are several existing methods that are noteworthy for their good performance. Adler and Mitzenmacher [7] proposed a web graph compression scheme based on finding nodes with similar sets of neighbors. Randall et al. [8] first proposed using a lexicographical ordering of web page URLs to compress web graphs. Their method exploits the fact that many web pages on a common host have many similar neighbors. Boldi and Vigna [9] continued to exploit the properties of web graphs under lexicographical ordering and proposed gap coding and differential compression. In 2009, Chierichetti et al. modified Boldi and Vigna's compression method to compress social networks. Their approach used the locality and similarity of web pages and the existence of a large number of reciprocal edges in social networks, and it involved backlink compression [10]. In 2010, Maserrat and Pei [11] proposed a compression method for social networks that can query neighbors in sublinear time. They achieve this by using an Eulerian data structure and multiple position linearization. Considering the similarity of neighbor nodes in the web graph, LZ78 [12] and Repair [13] achieve compression by replacing frequent pairs of symbols in the adjacency list. Exploiting the sparsity and clustering of web pages, Brisaboa et al. proposed K²-tree [14], which uses a bit string to store the adjacency matrix of the original web graph. Since most of the elements in the adjacency matrix are zero, this method effectively saves storage space. Although K²-tree achieves a satisfactory time/space tradeoff, many isomorphic subtrees remain. To address this problem, Gu et al. applied an MDD (multivalued decision diagram) [15] to the K²-tree representation and proposed the K²-MDD [16] compression scheme.
This method can compress web graphs efficiently and compactly, but the query time is relatively long. Delta-K²-tree [17] is an improved K²-tree algorithm that overcomes shortcomings of the original K²-tree representation. Claude et al. proposed K²-partitioning [18], which exploits the domain structure of the web graph. Exploiting the distribution of nodes in the web graph, Chang et al. [19] proposed an improved K²-tree algorithm, K²-BDC, which can effectively represent the web graph and achieves the best time/space tradeoff. The algorithm divides the adjacency matrix into different squares along the main diagonal. Each square contains edges satisfying a certain density threshold and is represented by a K²-tree. However, it still has room for improvement in the following areas: (1) because the adjacency matrix is divided only along the main diagonal, dense regions away from the main diagonal cannot be well captured, and the dense structure may be destroyed. (2) K²-BDC uses the DAC coding technique [20] to further compress the T vector and the L vector, which may increase the node neighbor query time. (3) K²-BDC cannot easily compress other types of graph data. The method depends on the structure of the graph data, and real clustering is not achieved. The quadtree [21] can also effectively represent graph data, and its construction is similar to the K²-tree representation in that it recursively divides the adjacency matrix. However, the division rule of the quadtree depends heavily on the distribution of the submatrices in the adjacency matrix. In real network graphs, relatively few submatrices of the adjacency matrix have a distribution that satisfies the division rule. When dealing with large-scale graph data, this approach therefore increases the required storage space.
In this paper, we continue to exploit the distribution characteristics of the adjacency matrix of the web graph and further optimize K²-BDC and K²-tree. We find that if the dense structure in the adjacency matrix can be accurately obtained, two benefits follow. On the one hand, the scheme avoids both the splitting of dense regions caused by the mechanical partitioning of the adjacency matrix in the K²-tree scheme and the poor clustering of dense regions away from the main diagonal in the K²-BDC scheme. On the other hand, the new scheme reduces the height of the tree traversed in the query operation. The main contributions of this paper are as follows: (1) a new grid clustering algorithm is proposed that can exploit any dense area in the web graph, making up for the shortcomings of the original K²-BDC. (2) A node neighbor query algorithm that operates on the compressed structure is given. The experimental results are compared with those of existing methods, and our method is found to achieve a superior time/space tradeoff.

Related Work
In this section, we introduce the related concepts of graphs and the construction principles of K²-tree, and we analyze the edge distribution of large web graphs to provide theoretical support for the subsequent clustering and K²-tree representation.

Graph and Related Concepts. Consider a graph G = (V, E), where V represents the set of nodes, E represents the set of edges, n (n = |V|) represents the number of nodes, and m (m = |E|) represents the number of edges. The adjacency matrix and the adjacency list are usually used as the storage structure of the graph. Figure 1(a) is an undirected graph topology, Figure 1(b) is the adjacency matrix corresponding to the graph, and Figure 1(c) is the adjacency list of the graph. The adjacency list allows one to easily and quickly obtain the neighbors of any node in the graph and to add new nodes. However, the list is not suitable for detecting connectivity between nodes. With the adjacency matrix, one can quickly add or delete the edges of a node and quickly detect the connectivity between nodes, but the storage space of the adjacency matrix depends only on the number of graph vertices, so space is wasted when storing sparse graphs. In addition, adding a new node requires space reallocation. Table 1 shows the space complexity of the adjacency matrix and the adjacency list for directed and undirected graphs, respectively. As can be seen from Table 1, for both the adjacency matrix and the adjacency list, the problem of excessive storage space becomes increasingly severe when a network graph with millions of nodes and edges is stored.

K²-Tree. Brisaboa et al. proposed K²-tree [14] using the sparseness and clustering of the web graph and achieved a satisfactory time/space tradeoff. The construction of K²-tree is divided mainly into the following two steps:
(i) For an n × n adjacency matrix, evaluate whether n is a power of k (k is usually equal to 2). If the condition is met, go to (ii) to divide. If n is not a power of k, add rows and columns to the adjacency matrix such that n = k^s (s is a positive integer), where the elements of the added rows and columns are padded with "0," and then go to (ii) for partitioning.
(ii) According to the MX-Quadtree rule [22], the matrix is divided into k² submatrices of the same size. Each submatrix is marked 1 if at least one of its elements is "1" and 0 otherwise. These values, taken from left to right and top to bottom, serve as the children of the root node (four children when k = 2), and the first layer of the K²-tree is constructed. Then, the submatrices labeled 1 are recursively processed, and their values are used as the second-layer nodes of the K²-tree. This is repeated until the elements of a partitioned submatrix are all 0 or the matrix has been divided down to single elements of the original adjacency matrix.
As shown in Figure 2, the adjacency matrix and K²-tree correspond to a web graph with 16 vertices.
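The two construction steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of [14]: the function name build_k2tree, the dense list-of-lists input, and the level-by-level traversal are choices made here for clarity, and a practical implementation would build the tree from an edge list instead of a dense matrix.

```python
def build_k2tree(matrix, k=2):
    """Build the level-order bit lists of a K^2-tree (illustrative sketch).

    Returns (T, L, size): T holds the bits of all levels except the last,
    L holds the bits of the last level, size is the padded matrix width.
    """
    n = len(matrix)
    # Step (i): pad with zeros so that the side length is a power of k.
    size = k
    while size < n:
        size *= k
    padded = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(len(matrix[r])):
            padded[r][c] = matrix[r][c]

    # Step (ii): split every non-empty block into k*k sub-blocks, visiting
    # sub-blocks left to right and top to bottom, one tree level at a time.
    T, L = [], []
    level = [(0, 0, size)]  # (first row, first column, side) of blocks to split
    while level:
        next_level = []
        for r0, c0, side in level:
            child = side // k
            for i in range(k):
                for j in range(k):
                    rr, cc = r0 + i * child, c0 + j * child
                    bit = int(any(padded[r][c]
                                  for r in range(rr, rr + child)
                                  for c in range(cc, cc + child)))
                    if child == 1:
                        L.append(bit)      # last level: single matrix cells
                    else:
                        T.append(bit)      # internal level
                        if bit:
                            next_level.append((rr, cc, child))
        level = next_level
    return T, L, size


example = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
T, L, size = build_k2tree(example)
print(T, L, size)
```

For this 4 × 4 example only the top-left and bottom-right quadrants contain edges, so T has two 1 bits and L stores just the eight cells of those two quadrants instead of all sixteen cells.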
After the adjacency matrix is compressed, the structure information in the web graph is represented by a T vector and an L vector. The T vector stores all 0 and 1 values of the K²-tree except those of the last layer, and the L vector stores the 0 and 1 values of the last layer of the K²-tree. On the T vector and the L vector, the authors of the K²-tree representation proposed a rank operation to obtain the direct neighbors and reverse neighbors of any node. However, because K²-tree divides the adjacency matrix mechanically, the original dense structure in the adjacency matrix is broken up, and the storage cost increases. As the number of graph nodes increases, the height of the K²-tree increases, which requires more query time.
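As an illustration of how a rank operation drives the neighbor query, the following Python sketch retrieves the direct neighbors of one row from given T and L vectors. It is a sketch under assumptions: rank1 is a naive linear scan (real implementations use a precomputed rank structure), the names rank1 and direct_neighbors are hypothetical, and the T and L values are hand-built for a 4 × 4 matrix with edges (0,1), (2,2), and (3,3) and k = 2.

```python
def rank1(bits, pos):
    """Number of 1 bits in bits[0..pos], inclusive (naive linear scan)."""
    return sum(bits[:pos + 1])


def direct_neighbors(T, L, n, k, row):
    """Columns j such that cell (row, j) is 1 in the n x n matrix."""
    result = []

    def descend(size, p, q, z):
        # size: side of the current submatrix, p: row inside it,
        # q: column offset of the submatrix, z: position in T followed by L.
        if z >= len(T):                      # last level: read the cell from L
            if L[z - len(T)] == 1:
                result.append(q)
            return
        if z == -1 or T[z] == 1:             # root, or a non-empty internal node
            child = size // k
            # The children of the node at z start at rank1(T, z) * k^2;
            # k * (p // child) selects the band of children containing row p.
            y = rank1(T, z) * k * k + k * (p // child)
            for j in range(k):
                descend(child, p % child, q + child * j, y + j)

    descend(n, row, 0, -1)
    return result


T = [1, 0, 0, 1]
L = [0, 1, 0, 0, 1, 0, 0, 1]
for row in range(4):
    print(row, direct_neighbors(T, L, 4, 2, row))
```

Each query visits one band of children per level, so the number of levels visited equals the height of the tree. This is why a smaller matrix per dense area shortens the traversal.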

The Structural Characteristics of the Web Graph.
Web graphs are often used for modeling large networks, where each web page is viewed as a node in the graph and each link between web pages is treated as an edge. The study by Broder et al., in 2008, showed that most of the web graph feature functions follow a power law distribution [23] and that the corresponding adjacency matrix exhibits sparsity and clustering [24]. To show this more intuitively, we visualize the data sets CNR-2000 and EU-2005 in Figures 3 and 4, respectively. The x-axis and the y-axis are the node numbers in the adjacency matrix, and we map each edge in the adjacency matrix to a point in the two-dimensional plane. We can conclude that, locally, the distribution of edges is relatively concentrated, whereas the two-dimensional plane as a whole is relatively sparse. Based on this analysis, if we can quickly capture the dense areas in the web graph and then use the K²-tree representation to compactly represent each dense area, we can save storage space and reduce the query time.

Large-Scale Web Graph Storage and Operation Scheme Based on Grid Clustering and K²-Tree
The representation scheme proposed for graph data based on grid clustering and K²-tree is divided into the following three steps. First, we introduce the grid clustering algorithm, which finds the dense regions in the web graph. Second, we introduce a compression algorithm to compress the dense areas. Finally, we describe how to query the neighbors of a given node.

Grid Clustering
(i) For a given graph G = (V, E) with N nodes and M edges, we divide its adjacency matrix into grids of side length d (here, we take d = 2), producing N²/d² grids. We map each edge in G to its grid and count the density of each grid. Then, each grid is traversed in turn, and every grid whose density reaches the grid density threshold is added to the List.
(ii) Traverse each grid in the List and calculate the Euclidean distance between the current grid and the other grids. Merge all grids whose distance to the current grid is less than or equal to the given distance threshold, mark these grids and the current grid as accessed, and count the density of the merged large grid. If the density is greater than or equal to the density threshold, the large grid is added to cluster_list. Repeat (ii) until all the grids have been accessed.
(iii) Repeat (ii) on the partition recorded in cluster_list until cluster_list no longer changes. At this point, cluster_list records the location of each dense area and the clustering algorithm terminates. The pseudocode is shown in Algorithm 1.
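A minimal Python sketch of step (i) and a single pass of step (ii) is given below. Several details are assumptions made for illustration and are not fixed by the text: density is taken to be the number of edges in a region divided by its area, the distance between two grids is measured between their grid coordinates, the merged large grid is taken to be the bounding box of its member grids, and all function names are hypothetical. The repetition in step (iii) is omitted.

```python
from collections import Counter
from math import dist


def dense_cells(edges, d, density_threshold):
    """Step (i): map each edge to its d x d grid and keep the dense grids."""
    counts = Counter((u // d, v // d) for u, v in edges)
    return {cell: n for cell, n in counts.items()
            if n / (d * d) >= density_threshold}


def merge_once(cells, distance_threshold):
    """One pass of step (ii): group each unaccessed grid with nearby ones."""
    order = sorted(cells)
    accessed = set()
    groups = []
    for g in order:
        if g in accessed:
            continue
        accessed.add(g)
        group = [g]
        for h in order:
            if h not in accessed and dist(g, h) <= distance_threshold:
                accessed.add(h)
                group.append(h)
        groups.append(group)
    return groups


def bounding_box(group, d):
    """(start row, end row, start column, end column) of a merged grid."""
    rows = [r for r, _ in group]
    cols = [c for _, c in group]
    return (min(rows) * d, (max(rows) + 1) * d - 1,
            min(cols) * d, (max(cols) + 1) * d - 1)


def box_density(edges, box):
    """Edges inside the box divided by the box area."""
    r0, r1, c0, c1 = box
    inside = sum(1 for u, v in edges if r0 <= u <= r1 and c0 <= v <= c1)
    return inside / ((r1 - r0 + 1) * (c1 - c0 + 1))


edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (9, 9)]
cells = dense_cells(edges, d=2, density_threshold=0.5)
for group in merge_once(cells, distance_threshold=1.5):
    box = bounding_box(group, d=2)
    print(group, box, box_density(edges, box))
```

In this toy input the isolated edge (9, 9) falls in a grid below the density threshold and is filtered out, while the two dense grids near the diagonal are merged into one candidate region whose density is then compared with the threshold.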

Dense Area Compression. After grid clustering, cluster_list records the different dense regions in the adjacency matrix, each of which can also be called a cluster. In addition, boundary_list records the starting row, the ending row, the starting column, and the ending column of each dense area. Each dense area is compressed with the K²-tree representation. The pseudocode is shown in Algorithm 2.

To describe the clustering and node neighbor query process more intuitively, we present a graph G1 with 16 nodes and 17 edges. We use the adjacency matrix to describe the structure information of the graph, as shown in Figure 5.
For comparison, G1 in Figure 5 is also represented with the traditional K²-tree. As shown in Figure 7, the T and L vectors together occupy 112 bits, whereas our method requires only 64 bits. The storage space occupied by the boundary list can be neglected when processing web graphs with millions of nodes and edges. Our method not only saves 43% of the storage space but also reduces the height of the K²-tree, so the query time is also reduced. In summary, our approach achieves a good time/space tradeoff.
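The dense area compression step can be illustrated with the following Python sketch, which cuts each dense area out of the adjacency matrix so that it can be encoded separately. It is a sketch under assumptions: each entry of boundary_list is taken to be a tuple (start row, end row, start column, end column) with inclusive bounds, the areas are assumed not to overlap, and the function name extract_dense_areas is hypothetical. The K²-tree encoding of each extracted area and of the remainder is not shown.

```python
def extract_dense_areas(matrix, boundary_list):
    """Return (clusters, remainder).

    clusters[i] is the submatrix covered by boundary_list[i]; remainder is a
    copy of the matrix in which every extracted area has been set to zero.
    """
    remainder = [row[:] for row in matrix]
    clusters = []
    for r0, r1, c0, c1 in boundary_list:
        # Copy the dense area out of the original matrix.
        clusters.append([row[c0:c1 + 1] for row in matrix[r0:r1 + 1]])
        # Clear it in the remainder so that no edge is stored twice.
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                remainder[r][c] = 0
    return clusters, remainder


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
clusters, remainder = extract_dense_areas(matrix, [(0, 1, 0, 1)])
print(clusters)
print(remainder)
```

Because each extracted area is much smaller than the full matrix, a K²-tree built over it has fewer levels than one built over the whole adjacency matrix.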

Experiment
To verify that our method can achieve a better time/space tradeoff, we compared it with K²-tree, LZ78, Repair, and K²-BDC.
These algorithms are used to compactly represent large-scale web graphs and offer a satisfactory time/space tradeoff, with K²-BDC achieving the best time/space balance. Our experimental environment is configured with an Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz and 4 GB of running memory. The data sets are described in Table 2, including the number of nodes, the number of edges, the average number of edges per node, the density of the graph, and the size required to store the graph as an adjacency list.
We use two indicators to evaluate the algorithm. The first is the average number of bits needed per edge, which is obtained by dividing the total storage space after compression by the number of edges of the data set. The second is the time required to query the neighbors of a node. For each node, we calculate the time to obtain all the neighbors of the node, and then divide the sum of these times by the total number of edges of the data set. The unit of time we use is μs.
As shown in Figure 8, in the data set Enron, compared with the results of LZ78 and Repair, our storage space is reduced by 57.1% and 32.6%, respectively. Compared with the results of traditional K²-tree, our storage space is reduced by 30.1%, and the corresponding node neighbor query consumption is reduced by 15.6%. Relative to the result of the current best algorithm K²-BDC, our storage space is reduced by 3.2%, and the corresponding node neighbor query consumption is reduced by 6%.
As shown in Figure 9, in the data set CNR-2000, compared with the results of LZ78 and Repair, our storage space is reduced by 78% and 61%, respectively. Our node neighbor query consumption is also reduced by 16.9% compared with Repair. Compared with the results of traditional K²-tree, our storage space is reduced by 44.6%, and the corresponding node neighbor query consumption is reduced by 37.1%. Relative to the result of the current best algorithm K²-BDC, our storage space is reduced by 5%, and the corresponding node neighbor query consumption is reduced by 23%.
As shown in Figure 10, in the data set EU-2005, compared with LZ78 and Repair, our storage space is reduced by 71% and 57.3%, respectively. Compared with the result of traditional K²-tree, our storage space is reduced by 46.4%, and the corresponding node neighbor query consumption is reduced by 49.3%. Compared with the results of the current best algorithm K²-BDC, our storage space is reduced by 15.8%, and the corresponding node neighbor query consumption is reduced by 10.8%. The experimental results show that our method can achieve a better time/space tradeoff.

Algorithm 1: Grid clustering.
Input: an adjacency matrix M, a density threshold that specifies the minimum density, and a distance threshold that specifies the maximum distance between grids.
Output: boundary_list, which contains the cluster boundary information; an adjacency matrix M0 from which the clusters have been removed; and cluster_list, which contains the positions of the small grids in each cluster.
  Divide the adjacency matrix M into N²/d² grids;
  n := number of grids to be filtered; List := empty queue; Boundary := empty queue;
  for (i = 1 to n)
    if (g_i >= density threshold) then
      add g_i to List;
    end if
  end for
  Flag := 1;
  while (Flag == 1)
    m := List.size();
    for (i = 1 to m)
      for (j = 2 to m)
        if (Distance(g_i, g_j) <= distance threshold && Isaccessed(g_j) == false) then
          classify g_i and g_j into one class and mark g_j as accessed;
        end if
      end for
      mark g_i as accessed, count the density of all grids belonging to the same class as g_i, and merge them into a large grid, recorded as g_i^1;
      if (g_i^1 >= density threshold) then
        add g_i^1 to cluster_list;
      end if
      ...

Algorithm: Node neighbor query.
Input: n is the number of the query vertex in the graph, boundary_list contains the boundary values of the clusters, and cluster_list contains the actual position of each cluster in the adjacency matrix.
Output: the direct neighbor set List for node n.
  m := boundary_list.size()/2;
  List := empty set;
  for (...)
    if (...) then
      find the T vector and L vector corresponding to the cluster satisfying the boundary condition, and add the queried neighbors to List;
    end if
  end for
  Find the T vector and the L vector of M0. If the node has a neighbor there, add the queried neighbor to List.
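The node neighbor query over the clustered representation can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: each cluster is held as a plain 0/1 submatrix instead of a K²-tree, each entry of boundary_list is assumed to be a tuple (start row, end row, start column, end column) with inclusive bounds, remainder stands for the matrix M0 left after the clusters are removed, and the function name cluster_direct_neighbors is hypothetical.

```python
def cluster_direct_neighbors(node, boundary_list, clusters, remainder):
    """Direct neighbors of `node`, gathered from the clusters and from M0."""
    neighbors = []
    for (r0, r1, c0, c1), sub in zip(boundary_list, clusters):
        # Only clusters whose row range contains the node can hold its edges.
        if r0 <= node <= r1:
            local_row = sub[node - r0]
            # Translate local column indices back to global node numbers.
            neighbors.extend(c0 + j for j, bit in enumerate(local_row) if bit)
    # Edges that were not absorbed by any cluster remain in M0.
    neighbors.extend(j for j, bit in enumerate(remainder[node]) if bit)
    return sorted(neighbors)


boundary_list = [(0, 1, 0, 1)]
clusters = [[[0, 1],
             [1, 0]]]
remainder = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
for node in range(4):
    print(node, cluster_direct_neighbors(node, boundary_list, clusters, remainder))
```

In a full implementation, the row lookup inside each cluster and inside M0 would be replaced by a traversal of the corresponding T and L vectors.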

Conclusion
This paper proposes a large-scale graph data representation method based on clustering and K²-tree, which adopts a grid clustering algorithm to exploit the dense regions in the adjacency matrix so that a large number of "1" values are included in the dense regions. Compared with the original adjacency matrix, the edge length of each dense region is greatly reduced, which reduces the number of recursions required from the top layer to the leaf nodes in the K²-tree query operation and increases the storage space utilization.
This method can efficiently and compactly represent graph data with millions of nodes and edges and also supports node neighbor query operations.
Compared with the current best K²-BDC, our method achieves a better time/space tradeoff. In future research, we plan to use the multivalued decision diagram to further reduce the isomorphic subtrees that arise in the K²-tree representation and to support more graph data operations on the compressed structure. Another planned direction is to use this algorithm to compactly represent additional large-scale graph data with various distribution characteristics.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Mathematical Problems in Engineering