High-utility sequential pattern mining (HUSPM) is an emerging topic in data mining, where utility is used to measure the importance or weight of a sequence. However, the underlying informative knowledge of hierarchical relation between different items is ignored in HUSPM, which makes HUSPM unable to extract more interesting patterns. In this paper, we incorporate the hierarchical relation of items into HUSPM and propose a two-phase algorithm MHUH, the first algorithm for high-utility hierarchical sequential pattern mining (HUHSPM). In the first phase named Extension, we use the existing algorithm FHUSpan which we proposed earlier to efficiently mine the general high-utility sequences (

Sequential pattern mining (SPM) [

Firstly, frequency does not fully reveal the importance (i.e., interest) in many situations [

Secondly, in sequential pattern mining, the hierarchical relation (e.g., product relation and semantic relation) between different items is ignored, so some underlying knowledge may be missed. In general, the individual items of the input sequences are naturally arranged in a hierarchy [

An example of a taxonomy of biology.

However, to the best of our knowledge, no existing work addresses both of these limitations together. In this paper, given a quantitative sequence database

To address the above issues, we propose a new algorithm called MHUH (mining high-utility hierarchical sequential patterns), which mines high-utility hierarchical sequential patterns (to be defined later) with the help of several strategies. The major contributions of this paper are as follows.

Firstly, we introduce the concept of hierarchical relation into high-utility sequential pattern mining and formulate the problem of high-utility hierarchical sequential pattern mining (HUHSPM). In particular, we define the key concepts and components of HUHSPM.

Secondly, we propose a two-phase algorithm named MHUH, the first algorithm for high-utility hierarchical sequential pattern mining. To ensure that the underlying informative knowledge of the hierarchical relation between different items is not missed, and to improve the efficiency of extracting HUHPs, we propose several strategies (i.e., FGS, PBS, and Reduction) and a novel upper bound, TSWU.

Thirdly, we conducted extensive experiments on both real and synthetic datasets to assess the performance of the two-phase algorithm MHUH in terms of runtime, number of patterns, and scalability. In particular, the experimental results demonstrate that MHUH can efficiently extract more interesting patterns that carry the underlying informative knowledge in HUHSPM.

The rest of this paper is organized as follows. Related work is briefly reviewed in Section

In this section, we discuss related work. We briefly review (1) the main approaches to sequential pattern mining, (2) previous work on high-utility sequential pattern mining, and (3) state-of-the-art algorithms for sequential pattern mining with hierarchical relation.

Agrawal et al. [

It is known that database scans will be time-consuming when discovering sequential patterns. For this reason, a set of pattern growth sequential pattern mining algorithms that are able to avoid recursively scanning the input data were proposed. For example, Han et al. [

There are some drawbacks to pattern growth sequential pattern mining algorithms. Obviously, it is time-consuming to build projected databases. Consequently, some algorithms with early pruning strategies were developed to improve efficiency. Chiu et al. [

To address the problem that frequency does not fully reveal the importance in many situations, utility-oriented pattern mining frameworks, for example, high-utility itemset mining (HUIM), have been proposed and extensively studied [

The two-phase algorithms mentioned above have two important limitations, especially for low

Yin et al. then enriched the related definitions and concepts of high-utility sequential pattern mining. Two algorithms, USpan [

Sequential pattern mining with hierarchical relation can be traced back to article [

Plantevit et al. [

There are also several hierarchical frequent itemset mining algorithms, which are more or less similar to sequential pattern mining with hierarchical relations. For example, Kiran et al. [

Let

A

An example of a hierarchical utility sequence database.

Quantitative-sequence database DE

Taxonomies

External utility table

The hierarchical relation of different items is represented in the form of taxonomy which is a tree consisting of items in different abstraction levels. We assume that each item is only associated with one taxonomy. Figure
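As an illustration of the taxonomy structure described above, the following is a minimal Java sketch, not the authors' implementation. It stores each taxonomy as child-to-parent links, which enforces the assumption that an item belongs to exactly one taxonomy. The class name `TaxonomySketch`, its methods, and the item names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: a forest of taxonomies stored as child -> parent links. */
final class TaxonomySketch {
    private final Map<String, String> parentOf = new HashMap<>();

    /** Registers child under parent; an item may have at most one parent. Input is assumed acyclic. */
    void addEdge(String parent, String child) {
        if (parentOf.containsKey(child)) {
            throw new IllegalArgumentException(child + " already has a parent");
        }
        parentOf.put(child, parent);
    }

    /** Ancestors of an item, ordered from its direct parent up to the root. */
    List<String> ancestorsOf(String item) {
        List<String> result = new ArrayList<>();
        String current = parentOf.get(item);
        while (current != null) {
            result.add(current);
            current = parentOf.get(current);
        }
        return result;
    }

    /** Root of the single taxonomy that contains the item (the item itself if it has no parent). */
    String rootOf(String item) {
        String current = item;
        while (parentOf.containsKey(current)) {
            current = parentOf.get(current);
        }
        return current;
    }

    /** True if general is a proper ancestor of specific. */
    boolean isAncestor(String general, String specific) {
        return ancestorsOf(specific).contains(general);
    }

    public static void main(String[] args) {
        TaxonomySketch t = new TaxonomySketch();
        t.addEdge("animal", "mammal");
        t.addEdge("mammal", "dog");
        t.addEdge("animal", "bird");
        System.out.println(t.ancestorsOf("dog")); // [mammal, animal]
        System.out.println(t.rootOf("bird"));     // animal
    }
}
```

Storing only the parent link keeps ancestor lookups linear in the taxonomy depth, which is small for shallow taxonomies.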

Given two itemsets

Each item

The utility of a
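The utility definitions can be made concrete with a short Java sketch. It follows the convention commonly used in high-utility sequential pattern mining: the utility of a q-item is its quantity multiplied by the external utility of the item, and the utility of a q-itemset or q-sequence is the sum over its parts. The class `UtilitySketch`, the type `QItem`, and the sample values are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

/** Sketch of the usual HUSPM utility definitions; all names are illustrative. */
final class UtilitySketch {
    /** A q-item: an item paired with its quantity (internal utility). */
    static final class QItem {
        final String item;
        final int quantity;

        QItem(String item, int quantity) {
            this.item = item;
            this.quantity = quantity;
        }
    }

    /** Quantity times external utility; the item must be present in the table. */
    static int qItemUtility(QItem q, Map<String, Integer> externalUtility) {
        return q.quantity * externalUtility.get(q.item);
    }

    /** Sum of the utilities of the q-items in one q-itemset. */
    static int qItemsetUtility(List<QItem> itemset, Map<String, Integer> externalUtility) {
        int sum = 0;
        for (QItem q : itemset) {
            sum += qItemUtility(q, externalUtility);
        }
        return sum;
    }

    /** Sum of the utilities of the q-itemsets in one q-sequence. */
    static int qSequenceUtility(List<List<QItem>> sequence, Map<String, Integer> externalUtility) {
        int sum = 0;
        for (List<QItem> itemset : sequence) {
            sum += qItemsetUtility(itemset, externalUtility);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> externalUtility = Map.of("a", 2, "b", 5, "c", 1);
        List<List<QItem>> sequence = List.of(
                List.of(new QItem("a", 3), new QItem("b", 1)),
                List.of(new QItem("c", 4)));
        // 3*2 + 1*5 + 4*1 = 15
        System.out.println(qSequenceUtility(sequence, externalUtility));
    }
}
```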

Given an itemset

Given a sequence

Obviously,

Given a minimum utility

Given a minimum utility

In this section, we present the proposed algorithm MHUH for HUHSPM. By incorporating the hierarchical relation of items into high-utility sequential pattern mining, MHUH can recover the informative knowledge carried by that relation, which conventional high-utility sequential pattern mining ignores. In other words, MHUH can extract more interesting patterns. The mining process of MHUH consists of two phases, named Extension and Replacement. In the first phase, MHUH finds high-utility sequences with the existing algorithm FHUSpan (also named HUS-UT), which we proposed earlier and which is based on the prefix-extension approach. For a

Without loss of generality, in this section, we formalize the theorems under the context of a minimum utility

Before mining the sequential patterns, MHUH adopts the Reduction strategy in the data preprocessing procedure, which removes useless items to reduce search space in advance. It mainly consists of two points, removing the unpromising items from the

An item is unpromising if any sequence containing this item is not high-utility. Here, we propose a novel upper bound TSWU (Taxonomy Sequence-Weighted Utility) based on SWU [
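To illustrate how an upper bound identifies unpromising items, the sketch below computes the plain sequence-weighted utility (SWU) of each item, that is, the sum of the utilities of all q-sequences containing it, and flags the items whose SWU falls below the minimum utility. This shows only the SWU baseline on which TSWU builds; it does not reproduce TSWU, which additionally takes the taxonomy into account. The class `SwuSketch`, the flattened sequence representation, and the sample data are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of SWU-based pruning (plain SWU, not the taxonomy-aware TSWU). */
final class SwuSketch {
    /**
     * sequences.get(s) lists the items of the s-th q-sequence (flattened);
     * utilities.get(s) is the precomputed utility of that q-sequence.
     */
    static Map<String, Integer> swu(List<List<String>> sequences, List<Integer> utilities) {
        Map<String, Integer> result = new HashMap<>();
        for (int s = 0; s < sequences.size(); s++) {
            // Count each sequence once per item, even if the item occurs several times.
            Set<String> distinct = new HashSet<>(sequences.get(s));
            for (String item : distinct) {
                result.merge(item, utilities.get(s), Integer::sum);
            }
        }
        return result;
    }

    /** Items whose upper bound is below the threshold cannot occur in any high-utility sequence. */
    static Set<String> unpromising(Map<String, Integer> swu, int minUtility) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : swu.entrySet()) {
            if (e.getValue() < minUtility) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> sequences = List.of(
                List.of("a", "b", "a"),
                List.of("b", "c"),
                List.of("c"));
        List<Integer> utilities = List.of(10, 4, 3);
        Map<String, Integer> swu = swu(sequences, utilities);
        // SWU: a = 10, b = 14, c = 7
        System.out.println(unpromising(swu, 8)); // [c]
    }
}
```

Because the utility of a sequence never exceeds the utility of a q-sequence that contains it, an item whose SWU is below the threshold can be removed from the database before mining starts.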

Given an item

For example, in Figure

Given a

Let

For any sequence

From Theorem

For a given

We say that an item is redundant if it (1) appears in taxonomy but does not appear in

In the first phase named Extension, we use the existing algorithm FHUSpan [

In fact, no

Given two sequences

We first prove that

Given a

Theorem

All items contained in a g-sequence are root items.

Given a

We then introduce how to find

Here, we briefly introduce the mining process of FHUSpan, which finds high-utility sequences based on the prefix-extension approach. It first finds all appropriate items (only the sequence starting with these items may be high-utility). Then, for each appropriate item, it constructs a sequence containing only this item and extends the sequence recursively until all sequences starting with the item are checked. In particular, two extension approaches are used,

In the second phase named Replacement, we mine the special high-utility sequences with the hierarchical relation (

For a

Algorithm

Search for specific (recursive procedure of 13 steps: step 3 returns, and steps 9 and 11 invoke SearchForSpecific recursively).
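The search space of the Replacement phase can be illustrated with a simplified Java sketch. It enumerates every sequence obtained by keeping each item of a general sequence or replacing it with one of its descendants in the taxonomy. This is not the authors' SearchForSpecific procedure: it performs no utility check and no pruning, and it treats a sequence as a flat list of items. The class `ReplacementSketch`, the `children` map, and the item names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: enumerate all specializations of a sequence by replacing items with descendants. */
final class ReplacementSketch {
    /** The item itself followed by all of its descendants, in depth-first order. */
    static List<String> selfAndDescendants(String item, Map<String, List<String>> children) {
        List<String> out = new ArrayList<>();
        out.add(item);
        for (String child : children.getOrDefault(item, List.of())) {
            out.addAll(selfAndDescendants(child, children));
        }
        return out;
    }

    /** All sequences at least as specific as the input; the input itself comes first. */
    static List<List<String>> specializations(List<String> sequence, Map<String, List<String>> children) {
        List<List<String>> out = new ArrayList<>();
        expand(sequence, 0, children, new ArrayList<>(), out);
        return out;
    }

    private static void expand(List<String> sequence, int pos, Map<String, List<String>> children,
                               List<String> current, List<List<String>> out) {
        if (pos == sequence.size()) {
            out.add(new ArrayList<>(current)); // copy, because current is reused
            return;
        }
        for (String candidate : selfAndDescendants(sequence.get(pos), children)) {
            current.add(candidate);
            expand(sequence, pos + 1, children, current, out);
            current.remove(current.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> children = Map.of(
                "A", List.of("a1", "a2"),
                "B", List.of("b1"));
        List<List<String>> result = specializations(List.of("A", "B"), children);
        System.out.println(result.size()); // 6 = 3 choices for A times 2 choices for B
        System.out.println(result);
    }
}
```

The number of candidates is the product, over the positions of the sequence, of one plus the number of descendants of the item at that position, so it grows quickly with sequence length. This growth is why reducing the search space before the search matters.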

We also use a strategy, PBS (Pruning Before Search), to reduce search space before Algorithm

We illustrate this strategy through an example under the context of Figure

Reduce the size of

In the above example, the max count of sequences that are more specific than

In the rest of this subsection, we prove the claims that were stated earlier without proof. We first prove that removing redundant items has no effect on correctness.

For a sequence

Then, we prove the correctness of the algorithm which finds

Firstly, the PBS strategy does not ignore the underlying

We performed experiments to evaluate the proposed MHUH algorithm, which was implemented in Java. All experiments were carried out on a computer with a 3.2 GHz Intel Core i7 CPU, 8 GB of memory, and Windows 10.

Five datasets, three real and two synthetic, were used in the experiments. DS1 is a conversion of the Bible in which each word is an item. DS2 is a conversion of the classic novel Leviathan. DS3 is a click-stream dataset called BMSWebView2. These three datasets can be obtained from the SPMF website [

Characteristic of datasets.

Dataset | #Sequences | #Items | Max length | Avg length | Type |
---|---|---|---|---|---|
DS1 | 36,369 | 13,905 | 100 | 21.64 | Real (text) |
DS2 | 5,834 | 9,162 | 100 | 33.81 | Real (text) |
DS3 | 77,512 | 6,120 | 161 | 4.62 | Real (click stream) |
DS4 | 10,000 | 4,000 | 40 | 20.54 | Synthetic |
DS5 | 60,000 | 5,000 | 20 | 10.50 | Synthetic |

Note that these datasets do not contain taxonomies, so for each dataset we generated taxonomies based on the items it contains. The maximum depth and the maximum degree of these taxonomies are both 3, so a taxonomy contains at most 3^3 = 27 leaf items. The datasets and source code will be released on the author’s GitHub after acceptance for publication.

We evaluated the performance of the proposed algorithm on different datasets when varying

The execution times of MHUH and MHUH_base on DS1 to DS3 are shown in Figure

Execution time on three datasets when varying

Figure

Distribution of discovered patterns on three datasets when varying

We conducted this experiment to evaluate the utility difference between the patterns discovered by MHUH and those discovered by the existing algorithm FHUSpan [

Figure

Sum utility of top # patterns discovered from three datasets.

Average utility per length of top # patterns.

We conducted experiments to evaluate MHUH’s performance on large-scale datasets. For each dataset, we increased its size through duplication and ran the MHUH algorithm with different

Scalability test on two datasets.

In this paper, we incorporate the hierarchical relation of items into high-utility sequential pattern mining and propose a two-phase algorithm MHUH, the first algorithm for high-utility hierarchical sequential pattern mining (HUHSPM). In the first phase named Extension, we use the existing algorithm FHUSpan which we proposed earlier to efficiently mine the general high-utility sequences (

In the future, we will generalize the proposed algorithm based on more complete concepts. In addition, several extensions of the proposed MHUH algorithm can be considered, such as improving its efficiency through better pruning strategies, efficient data structures [

The data used to support the findings of this study are available from the corresponding author upon request.

The authors have declared that no conflict of interest exists.

This work was supported by the Natural Science Foundation of Guangdong Province, China (Grant No. 2020A1515010970) and Shenzhen Research Council (Grant No. GJHZ20180928155209705).