An Incremental High-Utility Mining Algorithm with Transaction Insertion

Association-rule mining is commonly used to discover useful and meaningful patterns in very large databases. It considers only the occurrence frequencies of items to reveal the relationships among itemsets. Traditional association-rule mining is, however, poorly suited to real-world applications, since purchased items also carry other factors, such as profit or quantity. High-utility mining was designed to overcome these limitations by considering both quantity and profit measures. Most high-utility mining algorithms are designed to handle static databases. Few studies handle dynamic high-utility mining with transaction insertion, and those that do require database rescans and suffer from the combinatorial explosion of the pattern-growth mechanism. In this paper, an efficient incremental algorithm with transaction insertion is designed, based on the utility-list structures, to reduce computations without candidate generation. The enumeration tree and the relationships between 2-itemsets are also adopted in the proposed algorithm to speed up the computations. Several experiments are conducted to show the performance of the proposed algorithm in terms of runtime, memory consumption, and number of generated patterns.


Introduction
Association-rule mining (ARM) [1][2][3] from a transactional database is a fundamental task for revealing the relationships among items. Apriori [4] was the first algorithm to mine association rules in a level-wise way. It uses a generate-and-test mechanism to find the candidate itemsets and then derives the frequent itemsets based on the minimum support threshold. The association rules are then derived from the discovered frequent itemsets based on the minimum confidence threshold. The FP-growth algorithm [5] was the first algorithm to efficiently mine the frequent itemsets without candidate generation. It uses the FP-tree structure to compress the original database into a tree. An index Header Table is also used by the FP-growth mining algorithm to find the paths of the items for deriving the frequent itemsets. Many algorithms have been proposed to efficiently mine association rules based on either the level-wise or the pattern-growth mechanism [2,3]. Both approaches, however, can only handle a static database in batch mode. When transactions in the database change, new information may arise and old information may become invalid. The whole updated database must then be reprocessed in batch mode to mine the updated information, which is not suitable in practical applications.
The Scientific World Journal
To solve the above limitations of batch-mode algorithms [6,7], Cheung et al. proposed the Fast-UPdated (FUP) algorithm [8] to maintain and update the discovered information with transaction insertion. It divides the frequent itemsets discovered from the original database and all itemsets in the inserted transactions into four cases, and a separate procedure is designed for each case to maintain and update the discovered frequent itemsets. When an itemset is small in the original database (its support ratio is lower than the minimum support threshold) but large in the new database (its support ratio is larger than or equal to the minimum support threshold), the original database must be rescanned to find the actual occurrence frequency of that itemset in the original database.
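The four-case split can be sketched as follows. This is a minimal illustration of the idea described above, not the FUP implementation itself; the names `FupDecision` and `fup_decide` and the convention of passing `None` for an itemset whose original count was not kept are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FupDecision:
    case: int    # 1 to 4, following the four-case split described above
    status: str  # "large", "small", or "rescan"


def fup_decide(orig_count, new_count, n_orig, n_new, min_sup):
    """Decide the status of one itemset after inserting n_new transactions.

    orig_count is the support count stored from the original database, or
    None when the itemset was small there (assumed: its count was not kept).
    """
    large_in_new = new_count >= min_sup * n_new
    if orig_count is not None:
        # Cases 1 and 2: the original count is known, so no rescan is needed.
        total = orig_count + new_count
        still_large = total >= min_sup * (n_orig + n_new)
        case = 1 if large_in_new else 2
        return FupDecision(case, "large" if still_large else "small")
    if large_in_new:
        # Case 3: small before, large in the inserted part. The true count in
        # the original database is unknown, so it has to be rescanned.
        return FupDecision(3, "rescan")
    # Case 4: small in both parts, so the itemset stays small.
    return FupDecision(4, "small")
```

Only the third case touches the original database, which is the cost that the incremental approaches discussed in this paper try to avoid.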
ARM reveals only the binary relationships among items; implicit factors such as profit or quantity are not considered. A highly frequent pattern may not be interesting if it does not bring a high profit to the retailer. For example, diamonds may be sold less frequently than clothing in a department store, but the former give a much higher profit per unit sold than the latter. Occurrence frequency alone is therefore insufficient to identify highly profitable items in traditional ARM.
High-utility mining (HUM) [9,10] was thus proposed to partially solve the limitations of association-rule mining. It may be thought of as an extension of frequent-itemset mining that considers the sold quantities and profits of the items. The utility of an itemset can be measured in terms of quantity and profit, which can be defined by user preference. For example, one user may be interested in finding the itemsets with good profits, while another may focus on the itemsets that cause low pollution during manufacturing. When the utility of an itemset is larger than or equal to the minimum utility count, the itemset is considered a high-utility itemset (HUI). Several algorithms have been proposed to mine HUIs in a static database [11][12][13][14].
As previously mentioned in ARM, it is also an important issue to design an algorithm to efficiently maintain and update the HUIs when data or transactions are frequently changed in the original database. Some HUM algorithms have been proposed with transaction insertion [15][16][17]. The original database is still, however, required to be rescanned for maintaining and updating the HUIs in some cases. The problem of combination explosion based on level-wise approach is also a critical issue to be solved.
In this paper, a memory-based incremental approach is proposed for maintaining and updating the discovered HUIs with transaction insertion. The proposed algorithm builds on the HUI-Miner algorithm [18] and uses its utility-list structures for incremental mining. Since the utility-list structure is a condensed way to keep the information needed for high-utility mining, all itemsets are kept, whether they are high transaction-weighted utilization itemsets (HTWUIs) or small in the original database. An estimated utility cooccurrence structure (EUCS) [19] is also applied to speed up the proposed approach. The designed algorithm outperforms the two-phase algorithm [12] and the state-of-the-art FHM algorithm [19] in batch mode, as well as previous incremental mining algorithms [16,17].
The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The preliminaries and problem statement are described in Section 3. The proposed incremental algorithm with transaction insertion is given in Section 4. An illustrated example explaining the proposed algorithm step by step is given in Section 5. Experiments are provided in Section 6. The conclusion is given in Section 7.

Review of High-Utility Mining
Traditional ARM considers only the binary values of the itemsets in a transactional database. The frequent itemsets reveal only the occurrence frequencies of the itemsets in the transactions, which is not sufficient in real-world applications. Other factors such as price, quantity, or cost can also serve as important measurements to analyze and predict the purchase behaviors of customers. Besides, highly profitable products with lower frequencies may not be discovered in traditional ARM. For example, in basket analysis, jewels and diamonds are highly profitable items but may not be frequent compared to food or drink products.
High-utility mining (HUM) [9,10] is regarded as an extension of frequent-itemset mining that considers both the quantities and the profits of items to discover itemsets that are more valuable than the frequent ones. An itemset is regarded as a HUI if its utility value is larger than or equal to the minimum utility count. Chan et al. first proposed top-K objective-directed data mining to mine the top-K closed utility patterns based on a business objective [9]. Both the frequent itemsets and the HUIs can thus be discovered by the designed approach. Yao and Hamilton proposed a utility model that was the first to consider both the quantities and the profits of the items to mine the HUIs [10]. Several mathematical properties of utility constraints and two pruning strategies were also designed to efficiently mine HUIs. Liu et al. proposed the two-phase model [12] to mine HUIs based on the developed transaction-weighted downward closure (TWDC) property. Based on the two-phase model, the number of candidates can be greatly reduced and the high-utility itemsets can be precisely obtained.
Many algorithms have been proposed to mine HUIs based on the two-phase model. Lin et al. designed a high-utility pattern (HUP)-tree algorithm [11] to compress the original database into a tree structure. A pattern-growth HUP-growth mining algorithm was also designed to mine HUIs. Tseng et al. then proposed the UP-tree structure with the UP-growth and UP-growth+ mining algorithms to efficiently mine HUIs [13]. Since the pattern-growth approach requires computations to trace the nodes in the tree structure, Liu and Qu then proposed the HUI-Miner algorithm [18] to compress the database into utility-list structures. Each entry in the utility-list structure stores the transaction ID (TID), the utility of the itemset in the transaction (Iutility), and the remaining utility of the items that follow the itemset in the transaction (Rutility). Based on the HUI-Miner algorithm and the designed pruning strategy of the enumeration tree, the HUIs can be easily discovered. Fournier-Viger et al. then modified the HUI-Miner algorithm and designed an estimated utility cooccurrence structure (EUCS) to keep the relationships between 2-itemsets, thus speeding up the computations compared to the HUI-Miner algorithm [19].
Most algorithms process a static database to mine HUIs. In real-world applications, however, the transactions in the original database change dynamically. Ahmed et al. proposed the IHUP algorithm with three tree structures for mining HUIs with transaction insertion [15]. The proposed tree-based algorithm avoids the generate-and-test mechanism for HUM. The IHUP-tree algorithm still needs to generate numerous HTWUIs based on the pattern-growth approach. Lin et al. proposed an incremental (FUP-HUI-INS) algorithm [17] for updating the discovered HUIs with transaction insertion based on the FUP concept [8] and the two-phase model [12]. The HTWUIs in the original database and all itemsets in the inserted transactions are divided into two parts with four cases. Each case is then processed by the designed procedure to maintain and update the discovered HUIs. Although the FUP-HUI-INS algorithm performs better than the two-phase model, the original database must still be rescanned when an itemset is small in the original database but is a HTWUI in the inserted transactions. To solve the limitations of the FUP-HUI-INS algorithm, Lin et al. then proposed an improved algorithm based on the prelarge concept for mining high-utility itemsets with transaction insertion (PRE-HUI-INS) [16]. Based on the property of the prelarge concept [20], prelarge transaction-weighted utilization itemsets (PTWUIs) are kept to avoid database rescans until the cumulative total utility of the inserted transactions reaches the safety bound. Since the FUP-HUI-INS and PRE-HUI-INS algorithms are based on the two-phase model, an additional database rescan is still necessary to find the actual HUIs. Besides, computations are required to find the HTWUIs based on the pattern-growth approach.

Preliminaries and Problem Statement
In this section, the preliminaries related to HUM are given below.

Notations
$D$: original quantitative database, $D = \{T_1, T_2, \ldots, T_n\}$, in which $n$ is the number of transactions and each transaction includes a subset of items with quantities;
$db$: set of new transactions, $db = \{t_1, t_2, \ldots, t_m\}$, in which each transaction includes a subset of items with quantities;
$U$: entire updated database, that is, $D \cup db$;
$I$: set of items, $I = \{i_1, i_2, \ldots, i_r\}$, each item $i_j$ with a profit value $p_j$;
TID: each transaction $T_k \in D$ has a unique transaction identification;
$iu$: utility value of each item in each transaction;
$tu$: accumulated utility value of the items in each transaction;
$q$: quantity of an item in each transaction;
$\delta$: predefined minimum high-utility threshold;
TWU$(i_j)$: transaction-weighted utility of an item $i_j$ in the original database $D$.

Definition 6.
The total utility of $D$ is denoted by $TU^D$, which is defined as $TU^D = \sum_{T_k \in D} tu(T_k)$.
For example, the transaction utilities for $T_1$ to $T_{10}$ are, respectively, calculated as [...]
For example, suppose the minimum utility threshold is set at 35%, so that the minimum utility count is (0.35 × 2921) (= 1022.35). An item whose utility is 1050 is considered a HUI, since 1050 is larger than or equal to the minimum utility count (1050 > 1022.35). An itemset whose utility is 950 is not considered a HUI, since 950 is smaller than the minimum utility count (950 < 1022.35). After the above definitions, the problem statement of HUM is described below.
Problem Statement. Given a transactional database $D$ with total utility $TU^D$ and a minimum utility threshold $\delta$ with $0 < \delta \le 1$, HUM is to find the complete set of itemsets whose utilities are larger than or equal to the minimum utility count $\delta \times TU^D$.
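The problem statement can be made concrete with a brute-force sketch that enumerates every itemset and compares its utility with the minimum utility count. The toy database, the profit table, and the function names below are hypothetical and are not taken from the paper's tables; the exhaustive enumeration is only meant to show what is being computed, not how the proposed algorithm computes it.

```python
from itertools import combinations

# Hypothetical toy data: unit profit per item and purchased quantities.
PROFIT = {"a": 5, "b": 10, "c": 1, "d": 4}
DATABASE = {
    1: {"a": 2, "b": 1},
    2: {"a": 1, "c": 4, "d": 1},
    3: {"b": 2, "c": 1},
    4: {"a": 1, "b": 1, "d": 2},
}


def item_utility(item, quantities):
    """Utility of one item in one transaction: quantity times unit profit."""
    return quantities[item] * PROFIT[item]


def transaction_utility(quantities):
    """Sum of the item utilities in one transaction."""
    return sum(item_utility(item, quantities) for item in quantities)


def itemset_utility(itemset, database):
    """Sum the utility of `itemset` over the transactions that contain it."""
    total = 0
    for quantities in database.values():
        if all(item in quantities for item in itemset):
            total += sum(item_utility(item, quantities) for item in itemset)
    return total


def mine_huis_brute_force(database, threshold):
    """Return {itemset: utility} for all itemsets reaching the utility count."""
    total_utility = sum(transaction_utility(q) for q in database.values())
    min_count = threshold * total_utility
    items = sorted(PROFIT)
    huis = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, database)
            if utility >= min_count:
                huis[itemset] = utility
    return huis


TOTAL_UTILITY = sum(transaction_utility(q) for q in DATABASE.values())
HUIS = mine_huis_brute_force(DATABASE, 0.4)
```

With a 40% threshold on this toy database, the total utility is 77 and the minimum utility count is 30.8, so only the itemsets (b) and (a, b) qualify. Note that (a) alone is not a HUI although its superset (a, b) is, which is why utility itself cannot be used for downward-closure pruning.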
Based on TWDC property of two-phase model, numerous candidates and combinational computations can be greatly reduced.
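A minimal sketch of how the transaction-weighted utility (TWU) supports this pruning is given below. The toy data and the 45% threshold are hypothetical and chosen only so that one item is pruned; the function names are assumptions for illustration.

```python
# Hypothetical toy data: unit profit per item and purchased quantities.
PROFIT = {"a": 5, "b": 10, "c": 1, "d": 4}
DATABASE = {
    1: {"a": 2, "b": 1},
    2: {"a": 1, "c": 4, "d": 1},
    3: {"b": 2, "c": 1},
    4: {"a": 1, "b": 1, "d": 2},
}


def transaction_utility(quantities):
    """Sum of quantity times unit profit over the items of a transaction."""
    return sum(qty * PROFIT[item] for item, qty in quantities.items())


def compute_twu(database):
    """TWU of an item: total utility of the transactions that contain it."""
    twu = {}
    for quantities in database.values():
        tu = transaction_utility(quantities)
        for item in quantities:
            twu[item] = twu.get(item, 0) + tu
    return twu


TOTAL_UTILITY = sum(transaction_utility(q) for q in DATABASE.values())
MIN_COUNT = 0.45 * TOTAL_UTILITY  # hypothetical 45% threshold

TWU = compute_twu(DATABASE)
# Items whose TWU is below the minimum utility count cannot appear in any
# HUI, because the TWU is an upper bound on the utility of every superset.
PROMISING = sorted(item for item, value in TWU.items() if value >= MIN_COUNT)
```

Here item c has a TWU of 34, which is below the minimum utility count of 34.65, so every itemset containing c can be discarded without computing its utility.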

Proposed Incremental Algorithm for Transaction Insertion
In this paper, the HUI-Miner algorithm [18] is adopted to design the incremental algorithm for HUM. Before transactions are inserted into the original database, the utility-list structures are built in advance to keep not only the HTWUIs but also the itemsets that are not HTWUIs in the original database, which avoids rescanning the database when transactions are inserted. Since the utility-list structure is a condensed structure for keeping the related information from the original database, little memory is required by the proposed algorithm.

Utility-List Structure.
Each entry in the utility-list structure of an itemset $X$ keeps the TID of a transaction that contains $X$ (TIDs), the utility of $X$ in that transaction (Iutility), and the remaining utility of $X$ in that transaction (Rutility).
Definition 10. An entry of $X$ in the utility-list structure consists of the TID of a transaction $T_k$ with $X \subseteq T_k \in D$, the utility of $X$ in $T_k$ (Iutility), and the remaining utility of $X$ in $T_k$ (Rutility), in which Rutility is the sum of the utilities of the items that follow $X$ in $T_k$ under the processing order.
The construction procedure of the utility-list structures is applied recursively to $k$-itemsets whenever the depth-first search of the search space requires it. The construction algorithm is shown in Algorithm 1.
In the construction process, the itemsets are sorted in ascending order of their transaction-weighted utility (TWU). The Rutility of an itemset in a transaction keeps the utilities of the remaining items in the transaction, excluding the processed itemset. Since the TWU values of the itemsets change with transaction insertion, the sorted order of the utility-list structures and the Rutility values would in principle change as well. The number of inserted transactions is, however, very small compared to the original database. In the proposed algorithm, the itemsets in the inserted transactions therefore follow the initial TWU-ascending order of the itemsets in the original database. An example of the utility-list structures of 1-itemsets is shown in Figure 1.
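The construction of the utility-list structures of 1-itemsets can be sketched as follows. This is an illustrative sketch rather than Algorithm 1 itself: the toy data, the `Entry` tuple, the function name, and the alphabetical tie-break between equal TWU values are assumptions. The optional `order` argument mirrors the choice described above of reusing the initial TWU-ascending order for inserted transactions.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["tid", "iutility", "rutility"])

# Hypothetical toy data: unit profit per item and purchased quantities.
PROFIT = {"a": 5, "b": 10, "c": 1, "d": 4}
DATABASE = {
    1: {"a": 2, "b": 1},
    2: {"a": 1, "c": 4, "d": 1},
    3: {"b": 2, "c": 1},
    4: {"a": 1, "b": 1, "d": 2},
}


def build_utility_lists(database, profit, order=None):
    """Build the utility lists of all 1-itemsets.

    If `order` is None, items are sorted by ascending TWU (ties broken
    alphabetically, an assumption of this sketch). Passing an existing
    order reuses it, as is done for inserted transactions.
    """
    utilities = {
        tid: {item: qty * profit[item] for item, qty in items.items()}
        for tid, items in database.items()
    }
    if order is None:
        twu = {}
        for item_utils in utilities.values():
            tu = sum(item_utils.values())
            for item in item_utils:
                twu[item] = twu.get(item, 0) + tu
        order = sorted(twu, key=lambda item: (twu[item], item))
    rank = {item: pos for pos, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid in sorted(utilities):
        item_utils = utilities[tid]
        remaining = sum(item_utils.values())
        for item in sorted(item_utils, key=rank.__getitem__):
            # Rutility: utilities of the items after `item` in this order.
            remaining -= item_utils[item]
            lists[item].append(Entry(tid, item_utils[item], remaining))
    return order, lists


ORDER, ULS = build_utility_lists(DATABASE, PROFIT)
```

For the inserted transactions, the same function could be called with `order=ORDER` so that the Rutility values of old and new entries are computed under one consistent order.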
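Definition 11. $X.Iutility.sum$ is the sum of the utilities of an itemset $X$ in database $D$, that is, the sum of the Iutility values over all entries in the utility-list structure of $X$.
Definition 12. $X.Rutility.sum$ is the sum of the remaining utilities of an itemset $X$ in database $D$, that is, the sum of the Rutility values over all entries in the utility-list structure of $X$.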

An Enumeration Tree.
The search space to mine HUIs is based on the enumeration tree to decide whether the supersets of the processed node are required to be determined. If the summation of the Iutility and Rutility of the current processed node is larger than or equal to the minimum utility count, the supersets of the processed node will be generated and determined. This criterion is based on the TWDC property of the two-phase model [12]. The enumeration tree is shown in Figure 2.
Definition 13. Any extension of an itemset $X$ is a combination of $X$ with the item(s) that come after $X$ in the processing order.
Figure 2: The enumeration tree.
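The depth-first traversal of the enumeration tree can be sketched in the style of HUI-Miner [18]. The utility lists below are hypothetical toy values for four items that are assumed to be already sorted in TWU-ascending order, and the names `join` and `search` are illustrative; the sketch shows the extension criterion (Iutility sum plus Rutility sum against the minimum utility count), not the paper's Algorithm 2.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["tid", "iutility", "rutility"])

# Hypothetical utility lists of 1-itemsets, assumed TWU-ascending: c, d, a, b.
ULS = [
    (("c",), [Entry(2, 4, 9), Entry(3, 1, 20)]),
    (("d",), [Entry(2, 4, 5), Entry(4, 8, 15)]),
    (("a",), [Entry(1, 10, 10), Entry(2, 5, 0), Entry(4, 5, 10)]),
    (("b",), [Entry(1, 10, 0), Entry(3, 20, 0), Entry(4, 10, 0)]),
]


def join(prefix_ul, ul_x, ul_y):
    """Utility list of Pxy built from those of Px and Py (P may be empty)."""
    y_by_tid = {e.tid: e for e in ul_y}
    p_by_tid = {e.tid: e for e in prefix_ul} if prefix_ul else {}
    joined = []
    for ex in ul_x:
        ey = y_by_tid.get(ex.tid)
        if ey is None:
            continue  # x and y do not co-occur in this transaction
        iutility = ex.iutility + ey.iutility
        if prefix_ul:
            # The prefix utility is counted in both Px and Py; remove one copy.
            iutility -= p_by_tid[ex.tid].iutility
        joined.append(Entry(ex.tid, iutility, ey.rutility))
    return joined


def search(prefix_ul, siblings, min_count, found):
    """Depth-first search over the enumeration tree."""
    for pos, (itemset, ul) in enumerate(siblings):
        iutility_sum = sum(e.iutility for e in ul)
        rutility_sum = sum(e.rutility for e in ul)
        if iutility_sum >= min_count:
            found[itemset] = iutility_sum
        # Extend only if the upper bound can still reach the utility count.
        if iutility_sum + rutility_sum >= min_count:
            extensions = []
            for other, other_ul in siblings[pos + 1:]:
                ext_ul = join(prefix_ul, ul, other_ul)
                if ext_ul:
                    extensions.append((itemset + other[-1:], ext_ul))
            search(ul, extensions, min_count, found)
    return found


HUIS = search(None, ULS, 0.4 * 77, {})
```

On these toy lists the total utility is 77, so a 40% threshold gives a minimum utility count of 30.8 and the search reports (b) with utility 40 and (a, b) with utility 35. Itemsets are written as tuples in processing order.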

Pruning Strategy.
Based on the HUI-Miner [18], a pruning strategy can also be adopted to obtain a tighter bound for this determination than the TWDC property.
In addition, the estimated utility cooccurrence pruning (EUCP) strategy [19] is also adopted in the proposed algorithm to further keep the relationship of 2-itemsets, thus eliminating the extension itemsets with lower utility without reconstructing the utility-list structures. The constructed EUCS is shown in Table 3.
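A sketch of how an EUCS can be built, updated, and consulted is given below, following the idea in [19] of storing the TWU of every pair of co-occurring items. The dictionary layout keyed by sorted item pairs, the function names, and the toy data are assumptions made for this example.

```python
from itertools import combinations

# Hypothetical toy data: unit profit per item and purchased quantities.
PROFIT = {"a": 5, "b": 10, "c": 1, "d": 4}
DATABASE = {
    1: {"a": 2, "b": 1},
    2: {"a": 1, "c": 4, "d": 1},
    3: {"b": 2, "c": 1},
    4: {"a": 1, "b": 1, "d": 2},
}
INSERTED = {5: {"b": 1, "c": 3}}


def update_eucs(eucs, transactions, profit):
    """Add the utility of each transaction to every item pair it contains."""
    for items in transactions.values():
        tu = sum(qty * profit[item] for item, qty in items.items())
        for pair in combinations(sorted(items), 2):
            eucs[pair] = eucs.get(pair, 0) + tu
    return eucs


def may_extend(eucs, x, y, min_count):
    """False when the 2-itemset {x, y} (and its supersets) can be skipped."""
    return eucs.get(tuple(sorted((x, y))), 0) >= min_count


EUCS = update_eucs({}, DATABASE, PROFIT)
BEFORE = dict(EUCS)  # snapshot of the structure for the original database
MIN_COUNT = 0.4 * 77

# Transaction insertion only adds to the affected pairs; no rebuild is needed.
update_eucs(EUCS, INSERTED, PROFIT)
```

Because the check is a dictionary lookup, an unpromising extension is rejected before any utility-list structure is joined for it.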

Proposed Incremental Algorithm.
Based on the above properties inherited from the HUI-Miner algorithm and the EUCS structure, the proposed incremental algorithm is described in Algorithm 2.
In the designed incremental algorithm with transaction insertion, the original database is first scanned to construct the utility-list structures of all 1-itemsets and the EUCS structure (Lines 2-8). Similarly, the inserted transactions are scanned to construct the utility-list structures of all 1-itemsets, and the related TWU values of the items in the built EUCS are updated by the inserted transactions (Lines 9-15). The designed merge-list algorithm is used to combine the utility-list structures from the original database and the inserted transactions into the updated utility-list structures (Line 16). After that, the 1-extensions of an itemset are recursively processed (Lines 17-28) by a depth-first procedure. Each itemset is checked by the designed condition to determine whether it is a HUI (Lines 18-20). If an itemset is not a HUI, its extensions are then examined by the designed condition based on the two-phase model (Line 21) for the depth-first search. The updated EUCS structure is also used to prune the unpromising itemsets, thus reducing the search space for mining high-utility itemsets (Lines 24-26). The construction algorithm of the utility-list structure is then performed to construct the extULs of the itemset. The proposed HUIlist-INS algorithm is then recursively performed to mine HUIs (Lines 21-29). The algorithm terminates when no more itemsets are generated. The merge-list algorithm that combines the utility-list structures of the original database and the incremental one is described in Algorithm 3.
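The merge step can be sketched as a per-item concatenation of utility-list entries. This is a simplified illustration under stated assumptions, not Algorithm 3 itself: the lists are hypothetical toy values, both sets of lists are assumed to have been built under the same item order, and items that occur only in the inserted transactions simply receive a new list.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["tid", "iutility", "rutility"])

# Hypothetical utility lists from an original database (total utility 77).
ORIGINAL = {
    "c": [Entry(2, 4, 9), Entry(3, 1, 20)],
    "d": [Entry(2, 4, 5), Entry(4, 8, 15)],
    "a": [Entry(1, 10, 10), Entry(2, 5, 0), Entry(4, 5, 10)],
    "b": [Entry(1, 10, 0), Entry(3, 20, 0), Entry(4, 10, 0)],
}
# Hypothetical lists from two inserted transactions (utilities 13 and 14),
# built with the same item order c, d, a, b as the original lists.
INSERTED = {
    "c": [Entry(5, 3, 10)],
    "d": [Entry(6, 4, 10)],
    "a": [Entry(6, 10, 0)],
    "b": [Entry(5, 10, 0)],
}


def merge_utility_lists(original, inserted):
    """Return merged lists; the input dictionaries are left unchanged."""
    merged = {item: list(entries) for item, entries in original.items()}
    for item, entries in inserted.items():
        merged.setdefault(item, []).extend(entries)
    return merged


def sums(entries):
    """Return (Iutility sum, Rutility sum) of one utility list."""
    return (sum(e.iutility for e in entries),
            sum(e.rutility for e in entries))


MERGED = merge_utility_lists(ORIGINAL, INSERTED)
# The minimum utility count grows with the total utility of the insertion.
UPDATED_MIN_COUNT = 0.4 * (77 + 13 + 14)
B_IS_HUI = sums(MERGED["b"])[0] >= UPDATED_MIN_COUNT
```

Since the TIDs of inserted transactions are larger than those of the original ones, appending keeps each list sorted by TID, and the original database never has to be read again.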
Assume the minimum high-utility threshold is also set at 35%; the updated minimum utility count for mining HUIs is calculated as (2921 + 1671) × 0.35 (= 1607.2). First, the utility-list structures for the incremental database are also constructed for all 1-itemsets. After the construction process, the results of utility-list structures in the incremental database are shown in Figure 3.
After that, the utility-list structures from the original database and the incremental ones are merged together. [...] (2,300,203), (3,450,6), (15,450,12)}. The other items are processed in the same way. The final updated utility-list structures are shown in Figure 4.
In this example, since the utility-list structures are sorted in ascending order of their TWU values, the first item in this order is processed first to mine its related HUIs. The total utility of this item can be derived directly from the Iutility values in its utility-list structure, as (5 + 3 + 2 + 3 + 4 + 3 + 2) (= 22). Its Rutility is calculated as (69 + 143 + 576 + 150 + 409 + 359 + 309) (= 2015). The Iutility sum is smaller than the updated minimum utility count, so the item is not a HUI, but the summation of Iutility and Rutility is larger than the minimum utility count (22 + 2015 > 1607.2). Thus, the depth-first search mechanism is performed to find the supersets of the item in the enumeration tree. The item is then combined with the next item. Both of them appear in transactions 3, 4, and 7, as can be observed from Figure 3, and these transactions are used to construct the utility-list structure of the resulting 2-itemset. The other items are processed in the same way. After that, the supersets of the first item are shown in Figure 5.
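The arithmetic of this step can be checked with a few lines. The Iutility and Rutility values and the totals 2921 and 1671 are the ones quoted in the example; the variable names are assumptions for illustration.

```python
# Values quoted in the worked example for the first processed item.
IUTILITIES = [5, 3, 2, 3, 4, 3, 2]
RUTILITIES = [69, 143, 576, 150, 409, 359, 309]

# Updated minimum utility count: 35% of the original plus inserted utility.
MIN_COUNT = (2921 + 1671) * 0.35

IUTILITY_SUM = sum(IUTILITIES)
RUTILITY_SUM = sum(RUTILITIES)

# The item itself is a HUI only if its own utility reaches the count.
IS_HUI = IUTILITY_SUM >= MIN_COUNT
# Its supersets are explored only if the upper bound reaches the count.
EXPLORE_SUPERSETS = IUTILITY_SUM + RUTILITY_SUM >= MIN_COUNT
```

The item is therefore not reported as a HUI, but its extensions are still visited, in agreement with the text.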

Experimental Evaluation
Several experiments in terms of execution time, memory consumption, and the number of patterns are conducted to show the performance of the proposed algorithm on four databases, including three real-life databases [21] and a synthetic database [22]. The two-phase algorithm [12], the state-of-the-art FHM algorithm [19], and the two incremental algorithms FUP-HUI-INS [17] and PRE-HUI-INS [16] are used for comparison with the proposed algorithm. The experiments were implemented in Java and performed on an Intel Core 2 Duo 2.8 GHz processor with 4 GB of main memory, running the Microsoft Windows 7 operating system. The values of quantities and profits were assigned to the purchased items in all databases except the Foodmart database. The two-phase simulation model [12] is adopted to set the quantity range from 1 to 5 and the profit range from 1 to 200 by log operation. Parameters and characteristics of the four databases are, respectively, described in Tables 6 and 7.

Algorithm 3 (merge-list algorithm), steps (2) to (12):
(2) FOR each itemset X and X.UL ∈ D.UL DO
(3)   IF X.UL ≠ null THEN
(4)     search itemset X ∈ D.UL in db.UL
(5)     IF ∃ (X ∈ D.UL and X ∈ db.UL) THEN
(6)       FOR each element E ∈ X.UL and X.UL ∈ db.UL DO
(7)         X.Iutility.sum ← X.Iutility.sum + E.Iutility;
(8)         X.Rutility.sum ← X.Rutility.sum + E.Rutility;
(9)         X.UL ← [...]
(10)      END FOR
(11)    END IF
(12)    [...].UL ← [...].UL

Runtime.
Experiments were made to show the runtime of the proposed algorithm compared to the two-phase and FHM algorithms in batch mode and to the other two incremental algorithms. The runtime includes the construction and mining phases. Experiments are first conducted to show the comparisons under various minimum utility thresholds (MUs) with a fixed insertion ratio (IR). The results are shown in Figure 6. From Figure 6, it can be observed that the proposed algorithm has better performance than the two-phase and [...] Figure 7. From Figure 7, it can also be observed that the proposed algorithm outperforms the other algorithms under various IRs. Taking Figure 7(b) as an example, the MU is set at 0.15%, and the IRs are, respectively, set from 2% to 10%, with 2% [...]

From Figure 8, it can be observed that the FHM and the proposed algorithms require steady memory as the MU increases, compared to the other algorithms. This is because the FHM and the proposed algorithms both need to build the utility-list structures for keeping the itemsets. When the MU is set lower, the proposed algorithm requires less memory than the other algorithms, which can be observed from Figure 8(a). Experiments are then conducted to show the comparisons under various IRs with a fixed MU. The results are shown in Figure 9.
From Figure 9(a), it can be observed that the proposed algorithm requires less memory than the other incremental algorithms as the IR increases. From [...] Table 8. From Table 8, it can be observed that the two-phase, FUP-HUI-INS, and PRE-HUI-INS algorithms, which follow a level-wise approach, necessarily generate a huge number of candidates for deriving the actual HUIs. Besides, the prelarge concept is adopted in the PRE-HUI-INS algorithm, which therefore keeps more candidates in order to reduce the database rescans. Although the TWDC property is adopted in the two-phase model to prune the unpromising candidate itemsets, it still requires computations to generate a large number of candidates in a level-wise way. Experiments are then conducted to show the comparisons under various IRs with a fixed MU. The results are shown in Table 9.
From Table 9, it can be observed that the number of candidates or HUIs does not increase dramatically as the IR increases. It can be concluded that different IRs do not seriously influence the number of patterns. From the experiments, it can also be found that few candidates or HUIs are generated from the incremental database. Thus, it is inefficient to rescan the original database and re-mine the HUIs with the batch-mode mechanism of the two-phase and FHM algorithms. The designed algorithm is thus acceptable for real-world applications.

Conclusion
In the past, many algorithms have been proposed to efficiently mine HUIs from a static database. When transactions are inserted into the original database, the database must be rescanned to re-mine the HUIs in batch mode. Few studies have addressed dynamic databases with transaction insertion in incremental mining, and most of them follow an Apriori-like approach that generates and tests HTWUIs in a level-wise way. In this paper, a novel incremental algorithm is proposed to maintain and update the built utility-list structures for mining HUIs with transaction insertion. Based on the utility-list structures, the related information in the original database can be kept in a compressed form. The proposed algorithm also applies the estimated utility cooccurrence structure (EUCS) to keep the information between 2-itemsets, thus speeding up the computations. Without the level-wise generate-and-test approach for candidates, HUIs can be easily discovered by the designed algorithm in the incremental database. Experimental results show that the proposed algorithm outperforms the other algorithms.