Finding flexible periodic patterns in a time series database is nontrivial because unimportant events occur irregularly, which makes the task intractable or computationally intensive for large datasets. Various solutions based on Apriori, projection, tree, and other techniques exist to mine these patterns. However, keeping a constant-size tree structure, i.e., a suffix tree, with extra information in memory throughout the mining process, generating redundant and invalid patterns, mining only a limited set of flexible periodic pattern types, and repeatedly traversing the tree data structure for pattern discovery result in unacceptable space and time complexity. To overcome these issues, we introduce an efficient approach called HOVA-FPPM, based on the Apriori approach with hashed occurrence vectors, to find all types of flexible periodic patterns. We do not rely on a complex tree structure; instead, we manage the necessary information in a hash table for efficient lookup during the mining process. We measured the performance of our proposed approach and compared the results with the baseline approach, i.e., FPPM. The results show that our approach requires less time and space, regardless of the data size or period value.
Data mining allows us to discover useful information from data that would otherwise be very hard to uncover. The process of data mining involves various tasks, including classification, clustering, pattern mining, and several others [
Time series datasets are common these days, with application areas such as economics, social sciences, epidemiology, medicine, and physical sciences; examples include measuring a person’s heart rate every minute, recording air temperature or wind readings every hour, and capturing stock rates at midday and at the end of every day. Mining patterns from time series data is usually referred to as periodic pattern mining [
The use of the suffix tree (prefix trie, or trie) data structure is prevalent among state-of-the-art approaches to flexible periodic pattern mining. Rasheed et al. [
In this paper, we introduce an efficient strategy based on hashed occurrence vectors and the Apriori approach, which does not rely on the suffix tree data structure [ The contributions of our work are as follows. We propose HOVA-FPPM to simplify the flexible periodic pattern mining process through an efficient key-value pair data structure. We maintain the necessary information (patterns and their occurrence vectors) and compute efficiently (through basic arithmetic on occurrence vectors, reducing redundant and invalid patterns) to significantly improve the time and space complexity of our approach. Our approach requires a single pass to discover symbol, partial, and full-cycle periodic patterns at any periodicity level. A comprehensive performance analysis on real-world datasets against the baseline algorithm shows the run-time effectiveness and space efficiency of our approach while maintaining the same accuracy.
The rest of the paper is organized as follows. We discuss related works, followed by the preliminaries section. The proposed methodology, with a detailed explanation of the algorithm, is discussed afterwards. Subsequently, we outline the experimental analysis of the proposed methodology in comparison to the baseline approach on real-world datasets. We also provide a detailed discussion at the end. Lastly, we conclude the paper along with its future directions.
In this section, we briefly discuss various approaches in the context of periodicity mining, which is a subfield of pattern mining. Pattern mining has other subfields depending on the type of mined patterns, such as frequent item sets [
After an extensive literature review of the periodic pattern mining algorithms, we noticed that all of these algorithms can be categorized based on the approaches they used in their algorithmic solutions (refer to the taxonomy of frequent pattern mining in Figure
Taxonomy of research in frequent pattern mining.
The earlier works based on Apriori approaches were not able to mine all types of patterns at once. However, some important properties, the monotone and anti-monotone properties, were introduced later, which helped improve the performance of mining algorithms [
Many researchers have utilized an important notion called “closed patterns” in the context of periodic pattern mining, which reduces the search space by avoiding the mining of redundant patterns [
In this category, unlike other approaches, researchers used a projection database to mine the patterns with multiple pruning techniques. They introduced the concepts of flexible periodic pattern mining in the domain of closed pattern mining [
There are other approaches in the literature for pattern mining that include dynamic time warping [
It is evident from the above discussion that existing approaches either employ a complex data structure for the pattern mining process or generate excessive redundant or invalid patterns. For instance, in Apriori-based approaches, excessive generation of redundant or invalid patterns reduces time and space efficiency. In tree- and projection-based approaches, excessive pattern generation produces a complex tree structure, which further increases memory usage as well as traversal time. An approach with an effective pattern generation strategy and efficient handling of repeated traversals over the data has the potential to overcome the aforementioned limitations of existing studies.
In this section, we explain existing terminologies to understand the problem and its solutions.
(periodicity). In a time series
For example, in time series
Daily activity schedule of a working individual at office.
(perfect periodicity). In a time series
Perfect periodicity is denoted by PP and defined as follows:
For example, in time series
(periodic pattern). When a pattern repeatedly occurs for a specific period of time and satisfies the support threshold, we call it a periodic pattern.
For example, in time series
(flexible periodic pattern (FPP)). Any periodic pattern that contains “do not care” events is called a flexible periodic pattern.
For example, in time series
(occurrence vector). An occurrence vector is a list that represents the index positions of a unique element (event) in the data. Every unique element (event) has one occurrence vector.
For example, in time series
(difference vector). Given a pair of occurrence vectors, their difference vector yields a pattern along with its frequency count.
The objective of a difference vector is to obtain a pattern along with its frequency. It is computed by comparing each value of the first occurrence vector with every value of the second occurrence vector. For a given transaction in Figure
The proposed mining process for one length flexible periodic patterns.
The proposed mining process for two length flexible periodic patterns.
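To make the difference-vector computation described above concrete, the following Python sketch (our own minimal illustration; the function names and the toy series are ours, not the paper's) builds occurrence vectors and counts all pairwise differences between two of them:

```python
from collections import Counter
from itertools import product

def occurrence_vector(series, event):
    """Index positions at which `event` appears in the series."""
    return [i for i, e in enumerate(series) if e == event]

def difference_vector(vec_a, vec_b):
    """All pairwise differences between two occurrence vectors.
    A positive value means the second event occurs after the first;
    the count of a given difference tells how often the two events
    co-occur at that distance."""
    return Counter(b - a for a, b in product(vec_a, vec_b))

series = list("abcabcab")                 # toy series (assumed example)
vec_a = occurrence_vector(series, "a")    # [0, 3, 6]
vec_b = occurrence_vector(series, "b")    # [1, 4, 7]
print(difference_vector(vec_a, vec_b))    # Counter({1: 3, 4: 2, -2: 2, 7: 1, -5: 1})
```

In this toy series, the difference 1 occurring three times indicates that the pattern “ab” appears three times.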
(confidence). The confidence of a pattern is the ratio of its actual periodicity to its perfect periodicity, as shown in the following equation:
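The equation referenced here did not survive extraction; a reconstruction consistent with the definition in the text (with p, st, and X denoting the period, starting position, and pattern, a notation we assume here) would be:

\[
\mathrm{conf}(p, st, X) = \frac{\text{Actual Periodicity}(p, st, X)}{\text{Perfect Periodicity}(p, st, X)}
\]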
For example, in time series
The goal of HOVA-FPPM is to overcome the limitations of existing approaches by eliminating the need for a complex data structure such as a suffix tree, using an Apriori-like approach over occurrence vectors maintained in a hash table to mine flexible periodic patterns. This not only reduces the space requirements but also avoids temporary patterns and limits calls to the periodicity detection module, which eventually reduces the overall time requirements of the mining process. In the following passages, we discuss the components of our strategy and the proposed algorithm for pattern mining.
The key components involved in our proposed approach are discretization, event lookup table construction, periodicity detection, and pattern mining as depicted in Figure
The sequentially dependent components of our proposed strategy.
The discretization process converts each unique element of the data series into a simplified unique symbol. The sole purpose of discretization is to make the pattern mining process easier, since it is easier to work with a series of unique characters than with a series of alphanumeric product IDs. For instance, if we have a data series
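The inline example is elided above; as a hedged illustration (the toy product IDs and the character-assignment scheme are ours), discretization can be as simple as assigning one character per distinct raw value:

```python
import string

def discretize(series):
    """Map each distinct raw value (e.g., an alphanumeric product ID)
    to a single-character symbol; return the simplified series and the
    mapping for decoding the mined patterns later."""
    mapping = {}
    symbols = iter(string.ascii_lowercase + string.ascii_uppercase)
    simplified = []
    for item in series:
        if item not in mapping:
            mapping[item] = next(symbols)
        simplified.append(mapping[item])
    return simplified, mapping

raw = ["P-1043", "P-2210", "P-1043", "P-0099", "P-2210", "P-1043"]
simple, mapping = discretize(raw)
print("".join(simple))   # abacba
print(mapping)           # {'P-1043': 'a', 'P-2210': 'b', 'P-0099': 'c'}
```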
The events table consists of all unique events and their occurrence vectors. Occurrence vectors are of significant importance because they hold the index positions of each unique element in the input data, and our proposed algorithm relies heavily on them. Each unique element in the input data has its own occurrence vector, whose values are the index positions at which that element is located in the input data. For example, if we have an input data
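A minimal sketch of the events table, assuming a plain Python dictionary (which already provides the hashed key-value lookups the approach relies on; the helper name and the toy series are ours):

```python
from collections import defaultdict

def build_events_table(series):
    """Single pass over the discretized series: for every unique event,
    record the index positions where it occurs (its occurrence vector)."""
    table = defaultdict(list)
    for index, event in enumerate(series):
        table[event].append(index)
    return dict(table)

print(build_events_table(list("abcabbabc")))
# {'a': [0, 3, 6], 'b': [1, 4, 5, 7], 'c': [2, 8]}
```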
Once we have gone through the input data, we have the occurrence vectors of all the elements in the data. We then calculate the periodicity of each element by passing the element and its occurrence vector to the periodicity detection algorithm. Our periodicity detection algorithm is slightly modified compared with FPPM’s [
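The paper does not spell out the modified detection routine here, so the following is only a generic stand-in (the name, signature, and perfect-periodicity convention are our assumptions) showing how a periodicity confidence can be computed from an occurrence vector alone:

```python
def periodicity_confidence(occurrences, period, start, series_length):
    """Generic periodicity check (a simplified stand-in, not the paper's
    exact routine): count how many of the expected positions
    start, start + period, start + 2*period, ... actually appear in the
    occurrence vector, divided by the maximum possible number of hits."""
    positions = set(occurrences)
    expected = range(start, series_length, period)
    perfect = len(expected)
    actual = sum(1 for pos in expected if pos in positions)
    return actual / perfect if perfect else 0.0

# 'a' occurring at 0, 3, 6 in a series of length 9 is perfectly periodic for p = 3
print(periodicity_confidence([0, 3, 6], period=3, start=0, series_length=9))  # 1.0
```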
The pattern mining module of our proposed strategy reduces the time required to discover flexible periodic patterns compared with its counterpart. The mining process is based on the Apriori approach and involves basic arithmetic operations among occurrence vectors, i.e., the difference vector. The occurrence vectors, along with the events, are managed effectively using key-value pairs with an inherent hashing strategy to provide quick lookups for time efficiency. The pattern enumeration step is guided by the unique events and the polarity of the difference vector values, while the frequencies help decide whether a pattern is frequent given the threshold.
We illustrate the process of our pattern mining algorithm with the help of a toy example. For this illustration, we assume the data
The proposed mining process for three length flexible periodic patterns.
Since we have already passed through the data once, to discretize it, we have the occurrence vectors of all the unique elements in the data. The unique elements are
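A hedged sketch of this filtering step (our own helper, continuing the toy events table from above), where every event whose occurrence count reaches the support threshold becomes a one-length pattern:

```python
def one_length_patterns(events_table, min_support):
    """Keep each unique event whose occurrence count meets the support
    threshold; these one-length patterns seed the next Apriori level."""
    return {event: positions
            for event, positions in events_table.items()
            if len(positions) >= min_support}

events = {'a': [0, 3, 6], 'b': [1, 4, 5, 7], 'c': [2, 8]}
print(one_length_patterns(events, min_support=3))
# {'a': [0, 3, 6], 'b': [1, 4, 5, 7]}
```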
For two-length patterns, we compare the occurrence vector of a mined one-length pattern with the occurrence vectors of all the other unique elements in our events table. It may generate all the patterns involving elements
The difference operation reveals important information related to frequency, position, and occurrence of patterns. In each difference vector, we have positive and negative values. Since we are performing “
If we find the positive difference value 1 occurring 4 times in the difference vectors obtained through the difference operation, then we can conclude that
The above process is repeated for all successive patterns to be mined from data without redundant and unnecessary computations (see Figure
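The following Python sketch illustrates the idea of this level-wise extension (it is our own simplified rendering, not the exact HOVA-FPPM routine; star_limit bounds the number of “do not care” symbols, and the toy data are assumed): positive gaps between a pattern's occurrence vector and an event's occurrence vector become candidate patterns, with the gap filled by '*' symbols, and only candidates that reach the support threshold survive.

```python
from collections import defaultdict

def extend_patterns(events_table, base_patterns, star_limit, min_support):
    """Apriori-style extension: pair each mined pattern's occurrence vector
    with every other event's occurrence vector, turn the admissible positive
    gaps into candidate patterns (padding with '*'), and keep those that
    reach the support threshold."""
    candidates = defaultdict(list)
    for key, key_positions in base_patterns.items():
        for event, event_positions in events_table.items():
            if event == key:
                continue
            for kp in key_positions:
                for ep in event_positions:
                    gap = ep - kp - len(key)   # symbols skipped between pattern and event
                    if gap < 0 or gap > star_limit:
                        continue               # overlapping or too many skips
                    candidates[key + "*" * gap + event].append(kp)
    return {p: pos for p, pos in candidates.items() if len(pos) >= min_support}

events = {'a': [0, 3, 6], 'b': [1, 4, 7], 'c': [2, 5, 8]}
print(extend_patterns(events, {'a': [0, 3, 6]}, star_limit=1, min_support=3))
# {'ab': [0, 3, 6], 'a*c': [0, 3, 6]}
```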
Input: events hash table with occurrence vectors, previous-level pattern keys, Max_Pattern_Length, star_limit
Output: NP, a list of mined patterns
1:  NP ← {}
2:  New_Events ← slice_events(HashTable, Previous_Keys)
3:  for each Key in Previous_Keys do
4:    for each Item in OccurrenceVector[Key] do
5:      for each event E in New_Events do
6:        for each Periodic in OccurrenceVector[E] do
7:          Difference ← Periodic − Item
8:          if |Difference| > Max_Pattern_Length then
9:            continue
10:         end if
11:         if Item > Periodic then                ▷ negative difference: E occurs before Key
12:           if |Difference| > KeySize then
13:             star_count ← |Difference| − KeySize
14:             if star_count > star_limit then
15:               continue
16:             end if
17:             stars ← calculate_stars(star_count)
18:             pattern_key ← E + stars + Key
19:           else
20:             pattern_key ← E + Key
21:           end if
22:         else                                   ▷ positive difference: Key occurs before E
23:           if Difference > KeySize then
24:             star_count ← Difference − KeySize
25:             if star_count > star_limit then
26:               continue
27:             end if
28:             stars ← calculate_stars(star_count)
29:             pattern_key ← Key + stars + E
30:           else
31:             pattern_key ← Key + E
32:           end if
33:         end if
34:         NP ← NP ∪ {pattern_key}
35:       end for
36:     end for
37:   end for
38:   NL_Patterns ← calc_next(New_Events, NP, Max_Pattern_Length)
39:   NP ← NP ∪ update(NL_Patterns)
40: end for
41: return NP
In this section, we discuss the details of our proposed algorithm (see Algorithm ), whose main steps are as follows:
(1) Discretize the events while passing through the time series data, and count the total occurrences, occurrence vectors, and maximum pattern length for the entire time series; pairs are also made and their values hashed in the same pass.
(2) Find the differences in the occurrence vectors of each unique event.
(3) Make two-length pairs of the frequent pairs that surpass the user-given support threshold.
(4) Add the unimportant symbol to each pattern based on the difference of occurrences.
(5) Find the differences in the occurrence vectors of each consecutive event pair.
(6) Carry the frequent events with the highest difference count values forward to the successive levels.
(7) Add the unimportant symbol to each pattern based on the difference of occurrences.
(8) Repeat for the next level of patterns.
In the subsequent paragraphs, we explain the steps of our proposed algorithm. In line 1, we initialize an empty list to save the mined patterns, and line 2 generates a list of events that we have to mine. Line 2 calls the slice_events function, which takes the hash table and the previous keys as input and slices the previous key hashes along with their occurrence vectors; this helps us reduce the mining time. The repetition construct at lines 4–6 scans the entire hash table and its occurrence vectors to compare the events' occurrences. Line 7 computes the difference of the elements, and lines 8 and 9 check the difference value against the maximum pattern length. At this stage, we eliminate any difference that is greater than the maximum allowable number of skippable unimportant events.
Since we expect different patterns depending on the polarity of the difference value, we need separate pattern-name construction sections for each polarity. We perform the operation
We discuss the algorithmic complexity of both approaches based on their key steps. The baseline approach involves three steps that include suffix tree construction, periodicity check, and pattern generation towards mining. The authors of the FPPM algorithm claim that the pruned suffix tree construction has the same time complexity as the suffix tree construction, i.e.,
On the other hand, the proposed approach consists of events table construction, periodicity detection, and pattern mining. The events table construction scans the input data and generates events with their corresponding occurrence vectors, which is linear in time complexity
In this section, we perform experiments to evaluate the efficiency of our proposed approach against the baseline approach, i.e., FPPM. In the experimental analysis, we focus on the time and space efficiency of both algorithms (the proposed HOVA-FPPM and the baseline) by varying the period value and the data size. Since we use the same settings/parameters (i.e., confidence 60% and support 50%, inspired by the baseline approach) during our experiments, the obtained results support our claim in this article about the superiority of the proposed approach. However, we understand that different values of these parameters will affect the mining results; for instance, increasing the support threshold will reduce the number of mined patterns. We assume that both algorithms produce the same results (no difference in result accuracy) and differ only in the time and space they require during execution.
We briefly discuss the datasets used in our experiments and describe the results followed by discussion.
In our experiments, we use two datasets, namely, the diabetes and bike sharing datasets. Both datasets are publicly available in the UCI Machine Learning Repository under the same names. The diabetes dataset contains a total of roughly 8000 records of diabetic patients and their health-related results. A total of 20 unique codes are used throughout the dataset to represent measurements of patient health; each code has a value for each patient, along with a date column giving the time of the patient history record. On the other hand, the bike sharing dataset contains roughly 17,000 records at the hour level and 730 records for the same bikes at the day level. The specification of the system used for the experiments is an i5-3320M 2.6 GHz processor with 8 GB RAM and the Windows 10 operating system. The algorithms were implemented in Python.
We analyze the performance of the proposed algorithm with the baseline approach, i.e., FPPM, by measuring execution time and memory space usage.
We examine the performance of the proposed algorithm against the baseline approach by varying the size of the data. If we look at the performance of our proposed algorithm (Figure
FPPM vs. proposed algorithm: time and space results on diabetes dataset of varying size.
Now we look at the data from varying period value perspective (see Figure
FPPM vs. proposed algorithm: time and space results on diabetes dataset of varying period value.
We also performed experiments on the second dataset, i.e., bike sharing dataset, which is comparatively bigger than the first one. It contains 17000 rows of data containing hourly updates of bike sharing information. We performed experiments on this dataset to understand the scalability aspect of both algorithms. Figure
FPPM vs. proposed algorithm: time and space results on bike sharing dataset with varying data size.
The results of space allocated to both algorithms are quite interesting in the case of varying dataset size, as shown in Figure
Our discussion revolves around selected related questions, through which we aim to analyze the effects of the Apriori approach on flexible periodic pattern generation with varying starting positions, to identify the factors affecting the generation of invalid or redundant patterns, and to understand the relationship between varying dataset length and result accuracy.
In order to analyze the effects of the Apriori approach, we performed experiments on two datasets with varying properties and compared the results with the performance of FPPM, which is a tree-based approach for flexible periodic pattern mining. To the best of our knowledge, no other algorithm is designed to mine flexible periodic patterns with a variable starting position while using an Apriori-based approach. We already discussed that it is possible to mine patterns just from the occurrence vectors of the events, without using a tree structure, to simplify the process. Based on the experiments and the analysis of the results, we conclude that the Apriori approach is preferable when (i) the available space for mining patterns is not an issue and (ii) there are no strict restrictions on space usage. An Apriori-based approach keeps the occurrence vectors and patterns in memory, which can fluctuate depending on the data and the unique elements in the data.
The Apriori approach has a prominent positive effect on the time required to mine patterns because it involves no tree traversal process. The existing Apriori-based algorithms capable of mining flexible periodic patterns were costly in their space and time requirements. We overcome the excessive time requirements of Apriori-based approaches by introducing an efficient algorithm to mine the patterns. We achieve this by tweaking our algorithm to take advantage of the pattern generation process, which can generate patterns from multiple passes. Therefore, our proposed algorithm is able to mine both
In order to identify the factors affecting the generation of invalid or redundant patterns, we first analyzed the suffix tree behavior of FPPM and its effect on pattern generation. We did so by running FPPM on varying data sizes and varying numbers of unique values. The reason for focusing on the suffix tree is the way the FPPM mining process works: FPPM mines patterns level by level by visiting each immediate descendant node from the root node. We used datasets of varying size and numbers of unique values to change the size of the suffix tree and analyze its effect on the performance of the algorithm and on redundant pattern generation. Increasing the data size increases the depth of the suffix tree, whereas a larger total number of unique elements increases its breadth. The breadth of the suffix tree has no significant effect on redundant pattern generation; however, more redundant patterns were generated as the length of the data, and hence the depth of the tree, increased. Since FPPM holds the redundant patterns for each branch, the cost of redundant patterns is much higher for deeper suffix trees. On the other hand, more unique values did not worsen redundant pattern generation because FPPM keeps patterns in memory only for a short period of time; one possible reason is that a shallow tree allows FPPM to switch to a new branch and discard the previous nonfrequent patterns.
It is evident from our experiments that there is a relationship between varying data size and the accuracy of the results. The obtained results have an accuracy of 33–50%, reaching the lower limit on the bigger dataset. This shows that the algorithms generate large patterns based on the initially generated patterns: any mismatched pattern at the beginning leads to a further reduction in accuracy. After careful observation, it was revealed that unmatched patterns were produced by both algorithms, and any unmatched pattern in the initial phase lowers the accuracy of the successive phases. However, in very few cases, the patterns generated in the later phases did match. This reveals that, in order to improve the accuracy of the proposed algorithm, the patterns generated in the initial phase require more attention. If the initial patterns of both algorithms are similar, the accuracy of the mined patterns would increase significantly because both algorithms use the initial patterns to generate the next-phase patterns.
We presented an efficient strategy, HOVA-FPPM, to mine flexible periodic patterns from a time series database without using complex data structures. We identified the limitations of tree-based approaches and discovered various factors that cause redundant or invalid pattern generation during the mining process. We overcame those issues with the help of a hashing-based data structure while minimizing the number of redundant and invalid patterns through manipulation of occurrence vectors. The proposed solution outperformed the baseline algorithm in terms of the time needed to mine the same patterns. We empirically justified that Apriori-based approaches are effective for the mining process without excessive pattern generation. For small datasets, our algorithm is space efficient compared with FPPM; on larger datasets, the space requirement of our strategy fluctuates due to the incremental growth of data accumulated in the hash table, in contrast to the baseline approach. We aim to improve the space complexity of our proposed algorithm on very large datasets as an extension of this work. The analysis of combining the hashing strategy with a tree-like structure is another aspect to explore in the future.
The datasets used in this research can be downloaded freely from the UCI Machine Learning Repository (diabetes:
The authors declare that there are no conflicts of interest regarding the publication of this paper.
MFJ was solely responsible for the data curation, software development, and initial report. WN handled the original draft preparation, project administration, and funding acquisition. MFJ and KUK were responsible for the conceptualization of the idea, methodology, and formal analysis. WN and KUK reviewed and edited the manuscript.
This study was partially supported by Deanship of Research at Islamic University of Madinah (IUM), Saudi Arabia (Tamayuz-1 program of academic year 1439–1440 AH; research project number: 24/40).