Improved Strategy for High-Utility Pattern Mining Algorithm

High-utility pattern mining is a research hotspot in the field of pattern mining, and one of its main research topics is how to improve the efficiency of the mining algorithm. Based on a study of state-of-the-art high-utility pattern mining algorithms, this paper proposes an improved strategy that removes noncandidate items from the global and local header tables as early as possible, thus reducing the search space and improving the efficiency of the algorithm. The proposed strategy is applied to the algorithm EFIM (EFficient high-utility Itemset Mining). Experimental verification was carried out on nine typical datasets (including two large datasets); the results show that our strategy can effectively improve the temporal efficiency of mining high-utility patterns.


Introduction
The main challenge of data mining is to find meaningful information in massive amounts of data. The technique of finding interesting, unexpected, and useful data patterns in large databases is called pattern mining. Many studies have focused on traditional frequent pattern mining, which concerns only the occurrence of itemsets/patterns in the database, without considering the internal utility values (i.e., quantity) and external utility values (e.g., importance, profit, and price) of each item in an itemset [1]. To address this issue, the utility information of each item or itemset is introduced into frequent pattern mining, hence the emergence of high-utility pattern/itemset (HUP/HUI) mining. HUP mining has unfolded its commercial value in many application fields, such as website clickstream analysis [2,3], mobile commerce environments [4], cross-marketing in retail stores [5,6], and gene regulation and biomedical applications [7]. Utility patterns have also been applied to sequential data, as in the algorithm HUSP-ULL [8], and to uncertain data, as in the algorithm MUHUI [9].
Yao et al. [10] proposed the definition and mathematical model of the high-utility pattern (HUP): the utility value U(X) of an itemset X in a dataset D is defined as the sum of the utility values U(X, Td) of X over all transactions Td containing X (see Definition 3). The task of high-utility pattern mining is to find all patterns whose utility value is not less than a minimum utility value (threshold). The pruning strategy of traditional frequent pattern mining algorithms no longer works in HUP mining, because a superset of a non-HUP might be an HUP; this makes the search space of the mining algorithm even larger. Improving the spatial-temporal efficiency of the mining algorithm has therefore been a long-standing challenge [11][12][13].
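As a small illustration of the mining task and of why frequent-pattern pruning fails, the following brute-force sketch enumerates every itemset of a tiny hypothetical database (the items, quantities, and profits below are invented for illustration, not taken from the paper's tables) and keeps those whose utility reaches the threshold.

```java
import java.util.*;

// Brute-force illustration of the mining task defined by Yao et al.:
// enumerate every nonempty itemset X, compute U(X) by scanning the
// database, and keep those with U(X) >= minU. The tiny dataset below
// is hypothetical. Real algorithms (EFIM, HUI-Miner, ...) avoid this
// exponential enumeration via upper-bound pruning.
public class BruteForceHui {
    // external utilities (unit profits), hypothetical
    static final Map<Character, Integer> PROFIT = Map.of('A', 4, 'B', 2, 'C', 10);

    // transactions as item -> quantity maps, hypothetical
    static final List<Map<Character, Integer>> DB = List.of(
        Map.of('A', 4, 'C', 3),
        Map.of('A', 2, 'B', 1),
        Map.of('B', 3, 'C', 2));

    // U(X, Td): utility of X in one transaction; 0 if Td does not contain X
    static int u(Set<Character> x, Map<Character, Integer> t) {
        if (!t.keySet().containsAll(x)) return 0;
        return x.stream().mapToInt(i -> t.get(i) * PROFIT.get(i)).sum();
    }

    // U(X): sum over all transactions containing X
    static int u(Set<Character> x) {
        return DB.stream().mapToInt(t -> u(x, t)).sum();
    }

    static Map<Set<Character>, Integer> mine(int minU) {
        List<Character> items = new ArrayList<>(PROFIT.keySet());
        Map<Set<Character>, Integer> huis = new HashMap<>();
        for (int mask = 1; mask < (1 << items.size()); mask++) { // all nonempty subsets
            Set<Character> x = new HashSet<>();
            for (int b = 0; b < items.size(); b++)
                if ((mask & (1 << b)) != 0) x.add(items.get(b));
            int ux = u(x);
            if (ux >= minU) huis.put(x, ux);
        }
        return huis;
    }

    public static void main(String[] args) {
        System.out.println(mine(40));
    }
}
```

With minU = 40 this toy database yields {C} (utility 50) and {A, C} (utility 46), while {A} alone has utility 24: a superset of a non-HUP is an HUP, which is exactly why support-style downward-closure pruning cannot be applied directly.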
Existing HUP mining algorithms may be categorized into two-phase and one-phase approaches. Two-phase mining algorithms need two phases to find all HUPs: in the first phase, candidate itemsets are generated by estimating the utility value of each candidate itemset; in the second phase, the true utility value of each candidate itemset is calculated by scanning the dataset. This two-phase approach is adopted by the algorithms Two-Phase [11], IHUP [2], UP-Growth [14], and MU-Growth [15]. These algorithms often generate a large number of candidate itemsets in the first phase, which not only requires a great deal of memory but also drastically increases the computation cost of the second phase.
To avoid the problems caused by candidate itemsets, newly proposed HUP mining algorithms tend to use a one-phase, candidate-free approach, such as HUI-Miner [16] and d2HUP [17]; they find HUPs directly without generating candidate itemsets. Compared with the two-phase HUP mining algorithms, d2HUP and HUI-Miner are much faster. Based on HUI-Miner, two improved algorithms, HUP-Miner [12] and FHM [18], were developed. Zida et al. proposed the fast mining algorithm EFIM [19], in which two new upper bounds on the utility value are used to reduce the search space and consequently boost performance greatly. The algorithm HMiner [20] introduced two pruning techniques, LA-prune and C-prune, to reduce the search space for mining HUPs. The algorithm ULB-Miner [21] extended FHM [18] and HUI-Miner [16] with a utility list buffer structure, which improved the memory usage and runtime of FHM.
Although the spatial-temporal efficiency of HUP mining algorithms has been greatly improved, the time cost is still relatively high, and improving algorithm performance, especially temporal efficiency, remains a challenge in this field. In this paper, an improved strategy is proposed to boost the temporal efficiency of HUP mining and is applied to the algorithm EFIM. The rest of this paper is organized as follows. Section 2 introduces the problem description and relevant definitions. Section 3 introduces the improved strategy and the improved algorithm EFIM-IMP. Section 4 reports the experimental results. Section 5 draws the conclusions.

Problem Description and Definitions
A utility-valued transaction dataset D = {T1, T2, T3, ..., Tn} contains n transactions and m unique items I = {i1, i2, ..., im}. Each transaction Td (d = 1, 2, 3, ..., n) is called a transaction itemset and includes a subset of the unique items in I, for example, T1 = {(A,4) (C,3) (F,1)}. Each item ij in each transaction Td is attached with a quantity called its internal utility, denoted q(ij, Td); for example, the first transaction in Table 1 includes the 3 items "A," "C," and "F," whose quantities are 4, 3, and 1, respectively, denoted q(A,T1) = 4, q(C,T1) = 3, and q(F,T1) = 1. Each item ij has a unit profit p(ij), called its external utility, for example, p(A) = 4 in Table 2. |D| denotes the number of transactions in the dataset D, and |Td| denotes the number of items in the transaction Td.

Definition 1.
The utility value of the item ij in a transaction Td is denoted as U(ij, Td) and is defined as U(ij, Td) = p(ij) × q(ij, Td). For example, in Tables 1 and 2, U(A,T1) = 4 × 4 = 16, U(C,T1) = 10 × 3 = 30, and U(F,T1) = 1 × 1 = 1.

Definition 2.
The utility value of the itemset X in a transaction Td is denoted as U(X, Td) and is defined as U(X, Td) = Σ ij∈X U(ij, Td), for X ⊆ Td. For example, in Tables 1 and 2, U({A,C}, T1) = U(A,T1) + U(C,T1) = 16 + 30 = 46.

Definition 3.
The utility value of the itemset X in a dataset D is denoted as U(X) and is defined as U(X) = Σ Td∈D ∧ X⊆Td U(X, Td), that is, the sum of the utilities of X in all transactions containing X.

Definition 4.
The utility value of the transaction Td in a dataset D is denoted as TU(Td) and is defined as TU(Td) = Σ ij∈Td U(ij, Td). For example, in Tables 1 and 2, TU(T1) = 16 + 30 + 1 = 47.

Definition 5.
The utility value of the dataset D is denoted as TU and is defined as TU = Σ Td∈D TU(Td). For example, in Tables 1 and 2, TU = 47 + 58 + 54 + 46 + 30 + 49 = 284.

Definition 6.
The transaction-weighted utility value of the itemset X is denoted as TWU(X) (also called the TWU value) and is defined as TWU(X) = Σ Td∈D ∧ X⊆Td TU(Td).

Definition 7.
The minimum utility threshold δ is a user-specified percentage of the total utility value TU of the given dataset D; so the minimum utility value MinU in D is defined as MinU = δ × TU.

Definition 8.
An itemset X is called a high-utility pattern/itemset (HUP/HUI) if its utility value is not less than the minimum utility value, that is, U(X) ≥ MinU.
Definition 9. An itemset/item X is called a candidate itemset/item for a high-utility itemset if TWU(X) ≥ MinU, and it is then also called a promising itemset/item; otherwise it is an unpromising itemset/item.
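Definitions 1-9 can be traced in a few lines of code. In the sketch below, transaction T1 and the unit profits of A, C, and F follow the examples from Tables 1 and 2 above; transactions T2 and T3 and the threshold δ are invented for illustration, since the full tables are not reproduced here.

```java
import java.util.*;

// Toy illustration of Definitions 1-9: TU of each transaction,
// TWU of each item, and the promising/unpromising split.
// T1 and the profits of A, C, F follow the paper's examples;
// T2, T3, and delta are hypothetical.
public class TwuExample {
    // external utilities p(i)
    static final Map<Character, Integer> PROFIT = Map.of('A', 4, 'C', 10, 'F', 1);

    // transactions as item -> quantity q(i, Td)
    static final List<Map<Character, Integer>> DB = List.of(
        Map.of('A', 4, 'C', 3, 'F', 1),   // T1 from Table 1
        Map.of('A', 2, 'C', 1),           // T2 (hypothetical)
        Map.of('C', 2, 'F', 5));          // T3 (hypothetical)

    // Definition 4: TU(Td) = sum of U(i, Td) over items of Td
    static int tu(Map<Character, Integer> t) {
        return t.entrySet().stream()
                .mapToInt(e -> e.getValue() * PROFIT.get(e.getKey()))
                .sum();
    }

    // Definition 6: TWU(i) = sum of TU(Td) over transactions containing i
    static int twu(char item) {
        return DB.stream().filter(t -> t.containsKey(item))
                 .mapToInt(TwuExample::tu).sum();
    }

    public static void main(String[] args) {
        int totalTu = DB.stream().mapToInt(TwuExample::tu).sum(); // Definition 5
        double delta = 0.8;                                       // hypothetical threshold
        int minU = (int) Math.ceil(delta * totalTu);              // Definition 7
        for (char i : PROFIT.keySet()) {
            boolean promising = twu(i) >= minU;                   // Definition 9
            System.out.println(i + ": TWU=" + twu(i) + " promising=" + promising);
        }
    }
}
```

Here TU(T1) = 47 as in the paper's example; with the hypothetical T2 and T3, TWU(A) = 65, TWU(C) = 90, and TWU(F) = 72, so at MinU = 72 the item A is unpromising while C and F remain candidates.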

Theorem 1. Transaction-weighted downward closure property [4]: any nonvoid subset of a promising itemset is a promising itemset, and any superset of an unpromising itemset is an unpromising itemset.
Definition 10. Assume that the items in each transaction Td of dataset D are ordered (e.g., in lexicographic order). For an item ij, the remaining itemset of ij in Td, denoted (ij, Td), is the set consisting of ij and all items that appear after ij in Td.

Definition 11.
The utility value of the remaining itemset (ij, Td) in transaction Td is denoted as RU(ij, Td) and is defined as RU(ij, Td) = Σ i∈(ij,Td) U(i, Td). For example, with the items of each transaction in Table 1 ordered lexicographically, RU(C, T1) = U(C,T1) + U(F,T1) = 30 + 1 = 31.

Definition 12.
The utility value of the remaining itemset of item ij in dataset D is denoted as RU(ij) and is defined as RU(ij) = Σ Td∈D ∧ ij∈Td RU(ij, Td). For example, for Tables 1 and 2, RU(C) = 31 + 58 + 16 = 105.

Theorem 2. For an itemset/item {ij}, if RU(ij) < MinU, then {ij} is not an HUP, and any superset Y of {ij} obtained by extending ij with items that follow it in the given order is not an HUP either.

Proof. According to Definitions 3 and 12, RU(ij) ≥ U(ij) and RU(ij) ≥ U(Y); therefore, if RU(ij) < MinU, then U(ij) < MinU and U(Y) < MinU.

Algorithm EFIM.
The algorithm EFIM uses a pattern-growth approach to find HUPs: it first finds candidate items by scanning the dataset and then iteratively generates new candidate items by scanning the local dataset of each candidate. The fewer candidates there are, the smaller the search space of the iterative process and the more efficient the algorithm.
EFIM therefore proposes two upper bounds on the utility value of an HUP and applies them to the global and local datasets, respectively, to tighten the candidate criteria and reduce the number of candidate items generated, resulting in a more efficient mining algorithm. The algorithm EFIM is shown as Algorithms 1 and 2.
Lines 1-8 in Algorithm 1 process the global dataset to find all candidates whose remaining utility value is not less than the minimum utility threshold. Lines 1-3 calculate the TWU value of each item (lu(α,i)); candidates (items whose TWU values are not less than the minimum threshold) are stored in the list Secondary(α). Line 4 sorts the items in Secondary(α) in ascending order of TWU values. Line 5 deletes noncandidates from each transaction, sorts the items in each transaction according to the order of Secondary(α), and removes empty transactions. Line 6 sorts all transactions and merges transactions with the same items. Line 7 calculates the remaining utility value of each item in the transactions. In Line 8, items whose remaining utility value is not less than the minimum utility are stored in the list Primary(α). Line 9 iteratively processes each candidate in Primary(α) and determines whether it (and its extended itemsets) is an HUP.
The detailed procedure of Line 9 (in Algorithm 1, for iteratively processing the local dataset) is shown in Algorithm 2, as a subroutine named Search. Lines 1-9 deal with each item in Primary(α). Line 3 scans the dataset α-D (the transactions containing itemset α), calculates the utility value of itemset β, and obtains the dataset β-D (the transactions containing itemset β). Line 4 outputs β as an HUP if its utility value is not less than the minimum threshold. Lines 5-7 scan β-D and obtain Primary(β) and Secondary(β) using the same mechanism as Algorithm 1 uses for Primary(α) and Secondary(α). Line 8 recursively calls Search on Primary(β).
By utilizing the two upper bounds on utility value, EFIM effectively reduces the number of candidate items and boosts the performance of the mining algorithm. However, the number of candidates in algorithms like EFIM can still be reduced further, so we propose two additional strategies to do so.

Improvement of Algorithm EFIM.
There is a fact that EFIM does not take into consideration: when noncandidate items are deleted from the dataset, the TWU values of the candidate items may decrease, turning some of the candidate items into noncandidates.
This is an iterative process that continues until the dataset becomes stable for a specified minimum utility threshold. We harness this fact in an improved strategy (Strategy 1) that effectively reduces the number of candidate items, applying it to the candidate-generating process of EFIM on both the original (global) dataset and the local datasets. Lines 1-3 of EFIM-IMP (Algorithm 3) function the same way as in EFIM: scan the dataset once, obtain each item's TWU value, and put the items whose TWU values are not less than the minimum threshold into Secondary(α).
Lines 4-12 (Algorithm 3) implement our improved strategy: Line 4 counts the number of unique items in the original dataset into count0; Line 5 counts the number of candidates into count1; Lines 6-12 repeatedly delete noncandidates as long as the number of candidates changes after a deletion; Line 7 deletes all noncandidate items from the dataset; Line 8 recalculates the TWU value of each item; Line 9 recollects the candidates into Secondary(α); Line 10 saves the candidate count of the previous iteration; and Line 11 obtains the count of the remaining candidates.
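The loop of Lines 4-12 can be sketched as follows. This is a minimal sketch assuming a simple map-based dataset layout (item → quantity per transaction) rather than EFIM's utility-bin arrays; the dataset, profits, and threshold in the usage example are hypothetical.

```java
import java.util.*;

// Minimal sketch of Strategy 1 (Lines 4-12 of Algorithm 3):
// repeatedly delete unpromising items and recompute TWU values
// until the candidate set is stable. Transactions must be mutable
// maps, since noncandidate items are removed in place.
public class Strategy1 {
    static Set<Character> stableCandidates(
            List<Map<Character, Integer>> db, Map<Character, Integer> profit, int minU) {
        Set<Character> candidates;
        int before;
        do {
            // recompute TU of each (possibly shrunken) transaction
            // and the TWU of each remaining item
            Map<Character, Integer> twu = new HashMap<>();
            for (Map<Character, Integer> t : db) {
                int tu = t.entrySet().stream()
                          .mapToInt(e -> e.getValue() * profit.get(e.getKey())).sum();
                for (char i : t.keySet()) twu.merge(i, tu, Integer::sum);
            }
            before = twu.size();                     // item count before this filtering
            candidates = new HashSet<>();
            for (var e : twu.entrySet())
                if (e.getValue() >= minU) candidates.add(e.getKey());
            // delete noncandidates from every transaction; drop empty ones
            final Set<Character> keep = candidates;
            db.forEach(t -> t.keySet().retainAll(keep));
            db.removeIf(Map::isEmpty);
        } while (candidates.size() < before);        // repeat while items were dropped
        return candidates;
    }

    public static void main(String[] args) {
        Map<Character, Integer> profit = Map.of('A', 1, 'B', 1, 'C', 10);
        List<Map<Character, Integer>> db = new ArrayList<>(List.of(
            new HashMap<>(Map.of('C', 3)),
            new HashMap<>(Map.of('A', 2, 'B', 1)),
            new HashMap<>(Map.of('A', 1, 'C', 1))));
        // A single TWU pass (plain EFIM) would keep A (TWU = 14);
        // after B is deleted, A's TWU drops to 13 and A is pruned too.
        System.out.println(stableCandidates(db, profit, 14)); // prints [C]
    }
}
```

The usage example shows the cascading effect the strategy exploits: deleting one noncandidate lowers the TWU of another item below the threshold, so a second pass prunes a candidate that a single pass would have kept.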
This iterative deletion strategy reduces the number of items in Primary(α) and Secondary(α) and hence reduces the search space of the algorithm EFIM.

Improved Strategy on Local Candidate Itemset.
The algorithm in Algorithm 1 mainly deals with the original dataset, resulting in the two lists Primary(α) and Secondary(α); these are called the global candidate lists. Algorithm 2 deals with the subset of transactions containing a certain item/itemset β (called a local dataset, denoted β-D) and generates the two lists Primary(β) and Secondary(β), called the local candidate lists. The global candidate lists generated by EFIM in Algorithm 1 may contain noncandidate items; so may the local candidate lists generated by Algorithm 2, which reduces mining efficiency. We therefore propose Strategy 2 and apply it to the local dataset processing of EFIM; the improved algorithm is shown in Algorithm 4.

Strategy 2. Repeatedly recalculate the TWU values and delete noncandidate items from the local header table, until the recalculation generates no new noncandidate items.

Algorithm 1: EFIM.
Input: D: a transaction database; MinU: a user-specified threshold.
Output: the set of high-utility itemsets.
(1) α = ∅;
(2) Calculate lu(α, i) for all items i ∈ I by scanning D, using a utility-bin array;
(3) Secondary(α) = {i | i ∈ I ∧ lu(α, i) ≥ MinU};
(4) Let ≻ be the total order of ascending TWU values on Secondary(α);
(5) Scan D to remove each item i ∉ Secondary(α) from the transactions, sort the items in each transaction according to ≻, and delete empty transactions;
(6) Sort the transactions in D according to ≻T;
(7) Calculate the subtree utility su(α, i) of each item i ∈ Secondary(α) by scanning D, using a utility-bin array;
(8) Primary(α) = {i | i ∈ Secondary(α) ∧ su(α, i) ≥ MinU};
(9) Search(α, D, Primary(α), Secondary(α), MinU);
Strategy 2 is added to the local data processing of EFIM (between Lines 7 and 8 of Algorithm 2); the revised procedure (named Search-IMP) is shown in Algorithm 4. Lines 9-10 record the numbers of candidates before and after recalculation in count0 and count1, respectively. If the two counts differ, indicating that the local candidate set has changed, the strategy deletes the newly identified noncandidates (Line 12), recalculates the lists Secondary(β) and Primary(β) (Lines 13-15), recounts the items in these lists, and repeats the process while the counts differ (Lines 16-17 and Line 11), until Secondary(β) is stable.
Algorithm 2 needs an additional scan of dataset α-D when processing each item to obtain the subdataset β-D. To optimize this step, Algorithm 4 maintains an index of the candidate items in each transaction of α-D (Line 1), enabling fast retrieval of the transactions containing a given item and of the position of that item within each transaction (Line 4).
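The index of Line 1 can be sketched as follows. The data layout (transactions as ordered character arrays) and all names are illustrative assumptions, not the paper's exact structures.

```java
import java.util.*;

// Sketch of the per-item index kept by Search-IMP (Line 1 of
// Algorithm 4): for every item, remember which transactions contain
// it and at which position, so projecting the dataset on an item
// needs no full scan. Transactions are modeled as ordered char
// arrays; the layout is illustrative.
public class ItemIndex {
    // item -> list of {transactionId, positionInTransaction}
    static Map<Character, List<int[]>> build(List<char[]> db) {
        Map<Character, List<int[]>> index = new HashMap<>();
        for (int tid = 0; tid < db.size(); tid++) {
            char[] t = db.get(tid);
            for (int pos = 0; pos < t.length; pos++)
                index.computeIfAbsent(t[pos], k -> new ArrayList<>())
                     .add(new int[] { tid, pos });
        }
        return index;
    }

    public static void main(String[] args) {
        List<char[]> db = List.of("ACF".toCharArray(), "AC".toCharArray(), "CF".toCharArray());
        Map<Character, List<int[]>> idx = build(db);
        // Transactions containing 'F', located without scanning the dataset:
        for (int[] hit : idx.get('F'))
            System.out.println("T" + (hit[0] + 1) + " at position " + hit[1]);
    }
}
```

Building the index costs one pass over α-D, after which obtaining β-D for any item β is a direct lookup; the stored position also marks where the suffix (remaining itemset) of the transaction begins.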

Algorithm Analysis.
Both of our improved strategies adopt the approach of deleting newly identified noncandidates to reduce the number of candidate items in the global/local header table. The criterion for screening noncandidates is whether the TWU value of an item is less than the minimum threshold; according to Theorem 1, the strategies therefore do not affect the mining result, that is, they cause no loss of high-utility patterns during the mining process.

The time complexity of EFIM is O(l·n·w), where l is the number of candidate items, n is the total number of transactions in the dataset, and w is the average transaction length. The time complexity of the part added by our proposed strategy is O(k·r·w), where k is the number of loop iterations, r is the number of transactions containing noncandidates, and w is the average transaction length. The core of our improved strategy is to keep deleting newly identified noncandidates, that is, to reduce the number of candidates (l), the average transaction length (w), and even the total number of transactions (n); so the more noncandidates it deletes, the greater the efficiency boost. The time complexity of the revised algorithm EFIM-IMP is still O(l1·n1·w), where l1 is the number of candidate items, n1 is the average number of transactions containing candidate items, and w is the average transaction length. By strategies 1 and 2, l1 is not larger than l and n1 is not larger than n, so O(l1·n1·w) is not larger than O(l·n·w).
However, if there are not many noncandidates to purge, the time cost of the strategy itself may hamper the overall performance boost, and the mining process may even become less efficient than the original EFIM.

Experiments
The improved algorithm (hereafter called EFIM-IMP) integrates EFIM with our proposed strategies. We compared the performance of EFIM-IMP with three algorithms: EFIM, D2HUP, and ULB-Miner. The source code of EFIM, D2HUP, and ULB-Miner can be downloaded from the SPMF website, http://www.philippefournier-viger.com/spmf/, and EFIM-IMP is a direct revision of the EFIM source. All programs are written in Java. The experimental platform is a Windows 7 operating system with 16 GB of memory and an Intel(R) Core i7-6500 CPU @ 2.50 GHz; the Java heap space is set to 1.5 GB. The four algorithms were compared on nine standard datasets, including two high-volume datasets (Chainstore and Kosarak). These datasets can also be downloaded from the above website. Table 3 shows the characteristics of the nine datasets.
We ran the four algorithms on the datasets while decreasing the minimum utility threshold. On the dataset Pumsb, D2HUP and ULB-Miner were slow (e.g., D2HUP took 930 s and ULB-Miner took 650 s), so we did not run these two algorithms on Pumsb under the other thresholds. Figure 1 compares the time costs of the four algorithms on the different datasets, and Figure 2 compares their memory costs. Multiple runs were conducted, and the recorded numbers were averaged as the final experimental results.
We can see from Figure 1 that the revised algorithm EFIM-IMP achieves the best running time on all datasets except Foodmart, and it is faster than EFIM on every dataset. The improvement of EFIM-IMP is most obvious on the datasets with more distinct items, for example, Chainstore and Kosarak. On the dense datasets, EFIM-IMP came close to D2HUP, but D2HUP used more memory than EFIM-IMP, as shown in Figure 2.
EFIM calculates TWU values only once, deletes noncandidate items accordingly, and keeps all remaining items as candidates; EFIM-IMP iteratively repeats the TWU calculation and noncandidate deletion until the number of candidates no longer changes.
Among the nine datasets, five (Accident, BMS, Chess, Connect, and Mushroom) contain fewer distinct items (e.g., 75 distinct items in Chess), and four of them (all except BMS) are dense. As a result, there are not many candidate items for strategies 1 and 2 to cut, and the improvement of EFIM-IMP is not obvious. In contrast, for the datasets Chainstore and Kosarak, the improvement is obvious: these two datasets contain many more distinct items and transactions (e.g., Chainstore includes 46,086 distinct items and 1,112,949 transactions) and are sparse, so our strategies can efficiently reduce the number of candidate items on them.

Conclusion
This paper focuses on optimization approaches for high-utility pattern mining algorithms and proposes an improved strategy that iteratively removes newly identified noncandidate items, thereby reducing the search space of the mining process and boosting mining efficiency. The proposed strategy was applied to the algorithm EFIM. Nine standard datasets, including two high-volume datasets, were used for verification. The experimental results show that the improved algorithm can effectively reduce the number of candidates and outperform EFIM in time efficiency; the improvement is significant on the high-volume datasets.

Data Availability
The data used to support the findings of this study can be downloaded from the SPMF website (http://www.philippefournier-viger.com/spmf/).

Conflicts of Interest
The authors declare that they have no conflicts of interest.