Research of Improved FP-Growth Algorithm in Association Rules Mining

Association rules mining is an important technology in data mining. The FP-Growth (frequent-pattern growth) algorithm is a classical algorithm in association rules mining, but it needs to scan the database twice, which reduces its efficiency. Through the study of association rules mining and the FP-Growth algorithm, we worked out two improved algorithms: the Painting-Growth algorithm and the N (not) Painting-Growth algorithm, which removes the painting steps and achieves the same result in another way. We compared the two improved algorithms with the FP-Growth algorithm. Experimental results show that when the data volume is more than 1050 transactions for the Painting-Growth algorithm and less than 10000 transactions for the N Painting-Growth algorithm, the performance of the two improved algorithms is better than that of the FP-Growth algorithm.


Introduction
Data mining is the process of obtaining potentially useful, previously unknown, and ultimately understandable knowledge from data [1]. Association rules mining is one of the important parts of data mining and is used to find interesting associations or correlations between item sets in mass data [2]. Discovering frequent item sets is a key technology and step in the applications of association rules mining [3]. Among the algorithms for discovering frequent item sets, the most famous is Apriori, put forward by Agrawal [4]. The Apriori algorithm repeatedly joins item sets and scans the database, removing infrequent item sets, until it has found all the frequent item sets in the data. However, it scans the database repeatedly during mining and produces a large number of candidate item sets, which slows down mining [5].
The FP-Growth (frequent-pattern growth) algorithm is an improvement on the Apriori algorithm put forward by Jiawei Han et al. [6]. It compresses the data set into an FP-tree, scans the database twice, produces no candidate item sets during mining, and greatly improves mining efficiency [7]. However, the FP-Growth algorithm needs to create an FP-tree that contains the whole data set, and this FP-tree places high demands on memory space [8]. Scanning the database twice also limits the efficiency of the FP-Growth algorithm.
In this paper, we worked out two improved algorithms: the N Painting-Growth algorithm and the Painting-Growth algorithm. The N Painting-Growth algorithm builds two-item permutation sets to find the association sets of all frequent items and then mines all the frequent item sets according to the association sets. The Painting-Growth algorithm builds an association picture based on the two-item permutation sets to find the association sets of all frequent items and then mines all the frequent item sets according to the association sets. Both improved algorithms scan the database only once, avoiding the overhead of the two scans in the traditional FP-Growth algorithm, and complete the mining from the two-item permutation sets alone. They therefore have the advantages of running faster, taking up little memory, having low complexity, and being easy to maintain. The improved algorithms thus provide a reference for further research on association rules mining.

The System Model of Association Rules Mining
Nature. All nonempty subsets of a frequent item set must also be frequent.

FP-Growth Algorithm.
The FP-Growth algorithm [10] compresses the database into a frequent pattern tree (FP-tree) while still maintaining the association information between item sets. The compressed database is then divided into a set of conditional databases (a special type of projected database). Each conditional database is associated with one frequent item and is mined separately. The transaction database is shown in Table 1 (the minimum support count is 2); the mining process using the FP-Growth algorithm on Table 1 is as follows.
Scanning the database for the first time, we obtain the set of frequent items and their support counts. The collection of frequent items is sorted in decreasing order of support count. The resulting list is denoted L. In this way, we have L = [C:4, D:3, E:3, A:2, B:2].
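The first scan can be illustrated with a minimal Python sketch. Table 1 is not reproduced in this text, so the database below is a hypothetical stand-in: transactions 001 and 002 are the ones quoted in the text, while 003 and 004 are assumptions chosen only so that the item counts match the list given above. The function name `frequent_item_list` is likewise illustrative.

```python
from collections import Counter

# Hypothetical stand-in for Table 1. Transactions 001 and 002 are quoted in
# the text; 003 and 004 are ASSUMED so that the counts match the stated list.
transactions = {
    "001": ["A", "B", "C", "D", "E"],
    "002": ["B", "C", "E"],
    "003": ["A", "C", "D"],  # assumed
    "004": ["C", "D", "E"],  # assumed
}
MIN_SUPPORT_COUNT = 2


def frequent_item_list(db, min_count):
    """One pass over the database: count items, keep frequent ones, sort."""
    counts = Counter(item for items in db.values() for item in items)
    frequent = [(item, n) for item, n in counts.items() if n >= min_count]
    # Descending support count; ties broken alphabetically for determinism.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


item_list = frequent_item_list(transactions, MIN_SUPPORT_COUNT)
print(item_list)  # [('C', 4), ('D', 3), ('E', 3), ('A', 2), ('B', 2)]
```

The alphabetical tie-break is a choice made for this sketch; the source does not state how items with equal support counts are ordered.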
Building FP-Tree. First, the algorithm creates the root node of the tree, tagged "null." Then it scans the database a second time. The items in each transaction are sorted in the order of L, and a branch is created for each transaction. For example, the first transaction "001: A, B, C, D, E" contains five items, {C, D, E, A, B} in the order of L, and generates the first branch ⟨(C:1), (D:1), (E:1), (A:1), (B:1)⟩ of the FP-tree. The branch has five nodes: C is a child of the root, D links to C, E links to D, A links to E, and B links to A. The second transaction "002: B, C, E" contains three items, {C, E, B} in the order of L, and generates a branch in which C links to the root, E links to C, and B links to E. This branch shares the prefix ⟨C⟩ with the existing path of transaction "001." The algorithm therefore increases the count of node C by 1 and creates two new nodes ⟨(E:1), (B:1)⟩ linked under (C:2). In general, when a branch is added for a transaction, the count of each node along the common prefix increases by 1, and new nodes are created and linked for the items that follow the prefix. For convenience of tree traversal, the algorithm creates an item header table in which each item points, through a node link, to its occurrences in the FP-tree. After scanning all transactions, we get the FP-tree displayed in Figure 1.
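The tree construction described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the class `FPNode`, the function `build_fp_tree`, and the last two transactions are assumptions (only transactions 001 and 002 appear in the text), so the resulting tree need not match Figure 1.

```python
class FPNode:
    """One node of the FP-tree: an item, its count, and its child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction as a branch, sharing common prefixes."""
    root = FPNode(None)  # the "null" root
    header = {}          # item header table: item -> nodes carrying that item
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:  # item leaves the shared prefix: new node
                child = FPNode(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1   # every node on the path is counted once
            node = child
    return root, header


ORDER = ["C", "D", "E", "A", "B"]  # frequent items in descending support
raw = [
    ["A", "B", "C", "D", "E"],  # 001 (from the text)
    ["B", "C", "E"],            # 002 (from the text)
    ["A", "C", "D"],            # assumed
    ["C", "D", "E"],            # assumed
]
# Drop infrequent items (none here) and sort each transaction by ORDER.
ordered = [sorted((i for i in t if i in ORDER), key=ORDER.index) for t in raw]

# After the first two transactions the root has a single child (C:2),
# with branches D and E below it, as described in the text.
partial_root, _ = build_fp_tree(ordered[:2])
root, header = build_fp_tree(ordered)
```

The header table is kept as a plain list of nodes per item rather than a chain of node links; both give the same traversal, and the list is simply shorter to write.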
FP-tree Mining Processing. The algorithm starts from each frequent pattern of length 1 (the initial suffix pattern) and builds its conditional pattern base (a "subdatabase" consisting of the set of prefix paths that co-occur with the suffix pattern). Then, the algorithm builds a (conditional) FP-tree for the conditional pattern base and mines the tree recursively. Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree. The mining of the FP-tree is summarized in Table 2.
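The recursion over conditional pattern bases can be sketched as below. To stay short, this sketch is a simplified variant: it keeps each conditional pattern base as a plain list of (prefix path, count) pairs instead of materializing a conditional FP-tree, which yields the same frequent patterns on small inputs but without the tree's compression. The function name `mine` and the last two transactions are assumptions, so the output is not a reproduction of Table 2.

```python
from collections import Counter


def mine(pattern_base, min_count, suffix=()):
    """Pattern growth over a list of (ordered_items, count) prefix paths."""
    counts = Counter()
    for items, n in pattern_base:
        for item in items:
            counts[item] += n
    result = {}
    for item, c in counts.items():
        if c < min_count:
            continue
        new_suffix = (item,) + suffix
        result[frozenset(new_suffix)] = c
        # Conditional pattern base of new_suffix: the prefix path that
        # precedes `item` in every path containing it.
        cond = [(items[:items.index(item)], n)
                for items, n in pattern_base if item in items]
        cond = [(prefix, n) for prefix, n in cond if prefix]
        result.update(mine(cond, min_count, new_suffix))
    return result


# Transactions already sorted in the order C, D, E, A, B.
# The first two come from the text; the last two are assumed.
base = [
    (("C", "D", "E", "A", "B"), 1),
    (("C", "E", "B"), 1),
    (("C", "D", "A"), 1),  # assumed
    (("C", "D", "E"), 1),  # assumed
]
patterns = mine(base, min_count=2)
for itemset, count in sorted(patterns.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

For the suffix B, for example, the conditional pattern base is {(C, D, E, A):1, (C, E):1}, from which the patterns (C, B), (E, B), and (C, E, B) grow.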

System Model.
Algorithms for frequent pattern mining have been applied in many fields. Studying their system model helps in understanding them better. Figure 2 shows the system model of the improved algorithms in this paper.
Through the data mining platform, the user obtains the needed knowledge by data mining. The data mining platform includes data definition, a mining designer, and a pattern filter. Through data definition, we can preprocess the data and make incomplete data usable; through the mining designer, we can use the improved algorithms to mine the data and obtain useful patterns (here, frequent item sets); through the pattern filter, we can select the interesting patterns from those obtained.

Improved Algorithms Based on the FP-Growth Algorithm
The FP-Growth algorithm requires scanning the database twice, so its efficiency is not high. This paper puts forward two improved algorithms, the Painting-Growth algorithm and the N Painting-Growth algorithm, which mine using two-item permutation sets. Both algorithms scan the database only once to obtain the mining results.

Painting-Growth Algorithm.
Taking the transaction database in Table 1 as an example, the mining process with Painting-Growth algorithm is as follows.
(1) The algorithm scans the database once, obtains the two-item permutation sets of all transactions, and paints the peak set (the peak set is the set of all different items in the transaction database). Here we take the first transaction as an example.

Two-item permutation sets after scanning the first transaction are
Other transactions are handled in the same way as the first transaction. The peak set after scanning the database is {A, B, C, D, E}.
(2) After obtaining the peak set and the two-item permutation sets of all transactions, the algorithm paints the association picture according to the two-item permutation sets and the peak set. It links the two items appearing in each two-item permutation. When a permutation appears again, the link count increases by 1. The association picture is shown in Figure 3.
(3) According to the association picture, the algorithm uses the support count to remove infrequent associations. We get the frequent item association sets as follows: {A(C:2, D:2); B(C:2, E:2); C(A:2, B:2, D:3, E:3); D(A:2, C:3, E:2); E(B:2, C:3, D:2)}. Here we take the item A as an example. A(C:2, D:2) shows that the support count of the two-item set (A, C) is 2 and the support count of the two-item set (A, D) is 2. Other items are similar to item A.
(4) According to the frequent item association sets, we can get all two-item frequent sets of this transaction database: {(A,C):
(6) At this point, we get all frequent item sets. The algorithm pseudocode is as follows.
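The authors' pseudocode is not included in this text. As a stand-in, the following minimal Python sketch covers steps (1) to (4) only and should not be read as the authors' implementation. It assumes that a "two-item permutation" is an ordered pair of distinct items from one transaction, so each link is recorded in both directions, and it represents the association picture as a nested dictionary of link counts rather than as a drawn figure. The function names and the last two transactions are hypothetical; the latter were chosen so that the result agrees with the association sets listed in step (3).

```python
from itertools import permutations

transactions = [
    ["A", "B", "C", "D", "E"],  # 001 (from the text)
    ["B", "C", "E"],            # 002 (from the text)
    ["A", "C", "D"],            # assumed
    ["C", "D", "E"],            # assumed
]
MIN_SUPPORT_COUNT = 2


def paint_association_picture(db):
    """Single scan: collect the peak set and count every two-item permutation."""
    peaks = set()
    links = {}  # links[a][b] = how many times the permutation (a, b) was seen
    for items in db:
        peaks.update(items)
        for a, b in permutations(items, 2):
            links.setdefault(a, {})
            links[a][b] = links[a].get(b, 0) + 1
    return peaks, links


def frequent_associations(links, min_count):
    """Drop links whose count is below the support count."""
    pruned = {}
    for a in sorted(links):
        kept = {b: n for b, n in sorted(links[a].items()) if n >= min_count}
        if kept:
            pruned[a] = kept
    return pruned


first_transaction_pairs = list(permutations(transactions[0], 2))
peaks, links = paint_association_picture(transactions)
assoc = frequent_associations(links, MIN_SUPPORT_COUNT)
# Two-item frequent sets: each unordered pair once, with its support count.
two_item_sets = {frozenset((a, b)): n
                 for a, nbrs in assoc.items() for b, n in nbrs.items()}

print(sorted(peaks))
for item, nbrs in assoc.items():
    print(item, nbrs)
```

Under these assumptions the first transaction alone yields 20 ordered pairs, and `assoc["A"]` comes out as `{"C": 2, "D": 2}`, matching A(C:2, D:2) in step (3).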
N Painting-Growth Algorithm.
The idea of the N Painting-Growth algorithm is similar to that of the Painting-Growth algorithm, but the implementation method differs: the N Painting-Growth algorithm removes the painting steps. The mining process of N Painting-Growth is as follows.
(1) The algorithm scans the database once and gets the two-item permutation sets of all transactions.
(2) Then, the algorithm counts each permutation in the two-item permutation sets, getting all item association sets.
(3) Later, the algorithm removes infrequent associations according to the support count and gets the frequent item association sets.
(4) Finally, it gets all frequent item sets according to the frequent item association sets. Mining ends.
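Steps (1) to (3) above can be sketched in Python with a flat counter in place of the association picture. For step (4) the text does not spell out how multi-item sets are derived, so the sketch stops at a weaker result that follows from the nature stated earlier (every subset of a frequent item set is frequent): it lists candidate three-item sets whose three pairs are all frequent. These are candidates only; their support counts are not determined by pair counts alone. All function names and the last two transactions are assumptions.

```python
from collections import Counter
from itertools import combinations, permutations

transactions = [
    ["A", "B", "C", "D", "E"],  # 001 (from the text)
    ["B", "C", "E"],            # 002 (from the text)
    ["A", "C", "D"],            # assumed
    ["C", "D", "E"],            # assumed
]


def count_permutations(db):
    """Steps (1)-(2): one scan, counting every ordered pair of items."""
    counts = Counter()
    for items in db:
        counts.update(permutations(items, 2))
    return counts


def association_sets(pair_counts, min_count):
    """Step (3): keep only the associations that reach the support count."""
    assoc = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            assoc.setdefault(a, {})[b] = n
    return assoc


def candidate_triples(assoc):
    """Triples whose three pairs are all frequent (necessary, not sufficient)."""
    items = sorted(assoc)
    return [t for t in combinations(items, 3)
            if all(y in assoc[x] for x, y in combinations(t, 2))]


pair_counts = count_permutations(transactions)
assoc = association_sets(pair_counts, min_count=2)
triples = candidate_triples(assoc)
print(triples)  # [('A', 'C', 'D'), ('B', 'C', 'E'), ('C', 'D', 'E')]
```

Compared with the nested-dictionary picture, the flat `Counter` holds the same information; the difference is only in how the counts are stored and looked up.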
From the above it can be seen that the N Painting-Growth algorithm is the Painting-Growth algorithm with the painting steps removed. The implementation methods differ: the Painting-Growth algorithm imports java.awt and javax.swing and implements mining by calling super.paintComponents(g); the N Painting-Growth algorithm is implemented simply by instantiating a class in the main function.

Experimental Results Analysis
For the improved algorithms, Painting-Growth and N Painting-Growth, the biggest advantage is that database scanning is reduced to a single pass. Compared with the two scans of the FP-Growth algorithm, this improves time efficiency.
Another advantage is that the improved algorithms are simple: all mining is completed using only the two-item permutation sets of the transactions. Although the FP-Growth algorithm likewise completes mining from its FP-tree, the FP-tree is complex to build and requires a large memory overhead. By comparison, the two-item permutation sets can be obtained easily.
Of course, the improved algorithms have disadvantages. The Painting-Growth algorithm needs to build the association picture, which leads to a large memory overhead. The implementation of the N Painting-Growth algorithm is less vivid than that of the Painting-Growth algorithm. When mining multi-item frequent sets, both improved algorithms scan the frequent item association sets repeatedly to obtain counts, which reduces time efficiency.
To verify the advantages of the two improved algorithms over the FP-Growth algorithm, we implemented the Painting-Growth algorithm, the N Painting-Growth algorithm, and the FP-Growth algorithm in Java, using the Eclipse development environment on a Windows 7 64-bit operating system. The data in the experiments come from the Data Tang research sharing platform. The numbers of transactions in the databases are 1050, 5250, 10500, 21000, 31500, 42000, and 52500, respectively.
In the experiments, the three algorithms receive the same original input data and the same support parameter. Each algorithm is run 20 times per test, and the mean is taken as the result.
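The measurement protocol (same input, same support parameter, mean of 20 runs) can be sketched as follows. The experiments in this paper were implemented in Java; this Python sketch only illustrates the protocol, and `mean_runtime` and `count_items` are hypothetical names, with `count_items` serving as a placeholder for any of the three miners.

```python
import time
from collections import Counter
from statistics import mean


def mean_runtime(algorithm, database, min_count, runs=20):
    """Run `algorithm` on the same input `runs` times; return mean seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        algorithm(database, min_count)
        samples.append(time.perf_counter() - start)
    return mean(samples)


def count_items(db, min_count):
    """Placeholder workload standing in for a real mining algorithm."""
    counts = Counter(item for items in db for item in items)
    return {item: n for item, n in counts.items() if n >= min_count}


database = [["A", "B", "C", "D", "E"], ["B", "C", "E"]] * 500  # synthetic input
elapsed = mean_runtime(count_items, database, min_count=2)
print(f"mean over 20 runs: {elapsed:.6f} s")
```

Timing each run with a monotonic clock and averaging reduces the influence of one-off delays, although it does not remove warm-up effects in a managed runtime such as the JVM.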
Figure 4 compares the execution time of the Painting-Growth algorithm, the N Painting-Growth algorithm, and the FP-Growth algorithm for different numbers of transactions. From the figure, on the one hand, starting from 1050 transactions, the execution time of the N Painting-Growth algorithm is less than that of the FP-Growth algorithm; at 31500 transactions, the execution times of the N Painting-Growth algorithm and the FP-Growth algorithm are very close. Beyond that point, the time efficiency of the N Painting-Growth algorithm is not as good as that of the FP-Growth algorithm.
On the other hand, from 1050 transactions, the execution
Finally, for the FP-Growth algorithm, although the overall trend of its increase rate is similar to that of the improved algorithms, it changes more sharply than they do in stage 2 and stage 5. So, the FP-Growth algorithm is less stable than the improved algorithms.
From the above it can be concluded that our Painting-Growth algorithm offers a clear improvement in data analysis. When the data size is suitable, the improved algorithms are worth adopting for better performance. Specifically, when there are fewer than 10000 transactions, the N Painting-Growth algorithm can be considered. In other cases, the Painting-Growth algorithm performs better and can be adopted instead.

Conclusions
In this paper, we put forward two improved algorithms: the Painting-Growth algorithm and the N Painting-Growth algorithm. Both algorithms obtain all frequent item sets from the two-item permutation sets of the transactions alone; they are simple in principle, easy to implement, and scan the database only once. So, for a suitable number of transactions, the improved algorithms are worth considering. But we also see their problems: on large data, the performance of the N Painting-Growth algorithm is disappointing. Making the performance of the improved algorithms more stable, making the removal of infrequent item associations efficient, and making the mining of multi-item frequent sets quick will be our future work.

Figure 5: The increase rate of three algorithms in different transaction stages.
Frequent Item Sets.
Let I = {I1, I2, ..., Im} be the collection of all different items in the database. Each transaction T is a subset of I, that is, T ⊆ I [1], and the database D is a collection of transactions. For a given transaction database D, the total number of transactions it contains is |D|. Define the support count count(X) of an item set X (X ⊆ I) as the number of transactions T in D such that X ⊆ T, and the support support(X) of item set X as count(X)/|D| [9]. The number of items in an item set is called the dimension or length of the item set; an item set of length k is called a k-item set [1]. When the length of item set X is k and support(X) ≥ minsup, X is called a k-item frequent set. If k ≥ 3, X can be called a multi-item frequent set.
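The definitions of support count and support translate directly into code. The following Python sketch uses a hypothetical four-transaction database (the first two transactions are quoted in the text, the last two are assumed); the function names are illustrative.

```python
def support_count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    target = set(itemset)
    return sum(1 for transaction in db if target <= set(transaction))


def support(itemset, db):
    """Support count divided by the total number of transactions."""
    return support_count(itemset, db) / len(db)


def is_frequent(itemset, db, minsup):
    """An item set is frequent when its support reaches minsup."""
    return support(itemset, db) >= minsup


db = [
    ["A", "B", "C", "D", "E"],  # 001 (from the text)
    ["B", "C", "E"],            # 002 (from the text)
    ["A", "C", "D"],            # assumed
    ["C", "D", "E"],            # assumed
]

print(support_count({"C", "D"}, db))           # 3
print(support({"C", "D"}, db))                 # 0.75
print(is_frequent({"A", "C", "D"}, db, 0.5))   # True
print(is_frequent({"A", "B"}, db, 0.5))        # False
```

With four transactions, a minimum support of 0.5 corresponds to a support count of 2, so {A, C, D} is a three-item frequent set and therefore also a multi-item frequent set under the definition above.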