Discovering Significant Sequential Patterns in Data Stream by an Efficient Two-Phase Procedure

One essential topic in mining sequential patterns from data streams is optimizing the time and space computations. More importantly, however, attention should be paid to the significance of the mining results, as a large portion of them satisfy the user-defined constraints purely by accident and may have no statistical significance. In this paper, we propose FSSPDS, an efficient two-phase algorithm to discover significant sequential patterns (SSPs) in a data stream with typical sliding windows, a problem which has not been considered in existing work. First, to generate high-quality SSP candidates, FSSPDS takes testable support and pattern length constraints into account, and insignificant patterns are removed in a timely manner by a pattern-growth method. In the second phase, permutation testing is used to test the significance of the SSP candidates. Exact permutation p values are obtained in a novel combinatorial way based on the unconditional Barnard's test statistic, which better reflects the process of data generation and collection. Experimental evaluations show that FSSPDS allows the discovery of SSPs in data streams and outperforms state-of-the-art approaches under control of the family-wise error rate (FWER), especially in time efficiency, which was approximately an order of magnitude higher.


Introduction
Mining sequential patterns [1] in a transaction data stream means outputting patterns that satisfy user-defined constraints, such as support (the number of transactions in which a pattern appears), utility, and length. Many efficient algorithms have been proposed to deal with this kind of problem, including CM-Spam [2], HUSP-ULL [3], and NegPSpan [4]. However, when transactions carry class labels, many of the results of traditional mining algorithms may lack statistical significance, i.e., they appear just by chance; our aim is to draw patterns that are statistically significantly associated with one of the labels.
Statistically significant pattern mining (SSPM [5]) algorithms are used to solve such problems and have many applications; for example, in medical treatment, doctors are interested in sequences of treatments that are statistically significant in adverse drug reaction (ADR) signal detection (responsive vs. unresponsive). SSPM is typically a two-phase method: first produce significant pattern candidates and then test their significance. The significance of a pattern is assessed by hypothesis testing: if the p value of a tested pattern is less than the test level threshold, it is flagged as a significant pattern. Since a large number of patterns may be waiting for testing, this becomes a multiple hypothesis testing problem under FWER or the false discovery rate (FDR [6]).
The sliding window [2] is a typical data generation mode for data streams. In this paper, we focus on mining SSPs in a data stream with sliding windows under FWER control. The challenge in achieving this goal is twofold.
The first is producing SSP candidates. Most current SSPM methods take frequent sequential patterns (FSPs) as SSP candidates, and many efficient FSP mining algorithms are used as candidate-producing methods [7]. However, most of these algorithms do not consider significance factors during candidate generation. In fact, some low-support patterns (although they may be FSPs) do not meet the minimal testable support requirements [8, 9] and should be removed during candidate generation so as to reduce the number of tests in the second phase. We therefore incorporate such requirements into the candidate mining phase by introducing pattern length and testable support constraints; insignificant patterns are removed in a timely manner by a pattern-growth method.
The second is testing the candidates. The amount of data in a sliding window of a stream is usually small, and such data may not be representative. Permutation p values are an effective strategy for calculating approximate or exact p values on small datasets [10, 11]. In [10-12], permutation testing is used to obtain better results than traditional methods. Nevertheless, this approach has two disadvantages when used to find SSPs among the candidates produced in the first phase. One is that the existing literature produces p values based on a random swapping strategy [11], which incurs a high computational cost. The other is that p values based on a random sampling strategy [10] may be equal to 0, and such approximate p values may yield a bad estimation.
Additionally, in the second (testing) phase, Fisher's test statistic [7, 13] is frequently used for data evaluation; it assumes that the data-generating process resembles the observed sample and that the supports are fixed. In practice, however, data generation and collection do not follow such rules; in particular, the support of a pattern may change over time. Take, for example, mining the click patterns of members from two districts (two classes) on an e-commerce website, in order to test whether a certain click behavior occurs significantly more often for one class. Fisher's test assumes that behaviors are collected until a certain number of members is reached (overall) and that a repeated experiment maintains the tested pattern's support. However, the data may instead be collected over a fixed period of time; under this scheme, the frequency of a click behavior is not fixed and varies between experiments. In such a scenario, the latter method better reflects the process of data generation and collection. An unconditional test statistic such as Barnard's [9] does not fix the support of the tested pattern and can therefore be more appropriate than the calculation rule of the traditional Fisher's test statistic.
In this study, we propose FSSPDS for mining SSPs in a novel way. Our contributions are as follows.
We produce SSP candidates under a length constraint by a pattern-growth method, subject to the testable support requirement of Barnard's unconditional test statistic; insignificant candidates are removed in a timely manner so as to increase the test level and find more SSPs.
We introduce the use of Barnard's unconditional test statistic. To reduce the computational cost, a close approximate upper bound is proposed.
We discover SSPs in the data stream based on exact permutation p values obtained by a new combinatorial calculation approach. By producing p values quickly, it shows superiority over state-of-the-art exact permutation p value algorithms.
We run experiments on real-world datasets to demonstrate the effectiveness of FSSPDS. Our method achieves higher efficiency than its counterparts.
The rest of this article is organized as follows. Section 2 reviews related work. Section 3 describes the problem and defines related terms. Section 4 presents our algorithm FSSPDS. Theoretical analysis is given in Section 5. Section 6 shows experimental results on real datasets, and Section 7 concludes.

Related Works
Many efficient algorithms have been introduced for sequential pattern mining with frequency constraints, including Spade [14], PrefixSpan [15], CM-Spam [2], and Lapin [16]. In recent years, constraint-based sequential pattern mining has received much attention. Sequential association rule mining [17, 18] looks for association rules in transactional data; it does not consider the order of items but focuses on the intersection between the antecedent and consequent itemsets. Episode mining [19] looks for patterns in a single sequence rather than a collection of sequences. Periodic sequential pattern mining [20, 21] finds patterns that occur frequently and periodically in long sequences. Subgraph mining [22, 23] is another related field, which aims to discover all frequent subgraphs in graph databases. Corresponding algorithms based on different data structures (such as list structures [24], pattern trees [25], and optimization algorithms [19]) have been proposed to solve related pattern mining problems over sequence databases; all of these approaches are based on a sequential database and a user-selected constraint threshold. Recently, Wang proposed the Miner-K algorithm [25] to mine patterns with length constraints, and Nader proposed NEclat-Closed [24] to mine closed patterns based on a vertical structure; both obtain their pattern results in a short time.
When transactions carry class labels, the results returned by the above algorithms may lack significance, and some patterns are not statistically meaningful. SSPM algorithms look for significant patterns and have been widely used in e-commerce search [26], essential protein recognition [27], and community detection [28, 29]. Hämäläinen [30] first proposed the SSPM model and regarded significant pattern mining as a multiple-hypothesis testing problem. Webb [31] controlled the error rate by introducing FWER and FDR into significant pattern discovery. Bonferroni correction [32] is a traditional method for FWER control.
The test statistic is an important part of SSPM. Fisher's conditional test statistic is a popular measure of significance [7, 10, 31], and the LAMP [8] strategy reduces calculation time by imposing testable support requirements on the tested patterns based on Fisher's test statistic. Barnard's test statistic [9, 14] is another effective method for calculating p values. Leonardo et al. [9] proposed a novel structure, UT, for evaluating the significance of a pattern, which gives the testable support requirements based on Barnard's test statistic. Jiang et al. [33] put forward an unconditional test to obtain the p values of two different distributions and concluded that the p value produced by Barnard's test statistic carries less risk, but the computation is usually expensive when the test sample is large.
Permutation-based methods are an excellent technology for mining significant patterns from small data samples. He et al. [10] used a permutation test that returns all exact p values of the patterns. Llinares-López and Sugiyama [34] proposed a permutation test for mining significant sequential patterns, and Pellegrina and Vandin [35] applied a permutation test to mine the top-k significant sequential patterns in a database. Recently, Tonon and Vandin [11] proposed the algorithm PROMISE with two strategies, itemset swapping and random permutations; it can be regarded as a state-of-the-art method for drawing significant sequential patterns from a transactional database under FWER control.
There are also other approaches to studying significant patterns. Riondato et al. [36] proposed a significant pattern mining algorithm based on progressive sampling. Tien et al. [37] applied significant pattern testing to utility dataset analysis. Zihayat et al. [38] extracted significant patterns from gene sequences. Fournier-Viger et al. [39] output significant subgraphs in large graphs. Cheng et al. [40] proposed the algorithm LTC to look for significant patterns in a data stream that are not only frequent but also persistent.
To the best of our knowledge, no SSPM study has hitherto considered significance factors during candidate mining or focused on producing permutation p values based on unconditional test statistics for mining statistically significant sequential patterns in a data stream. This paper demonstrates the feasibility and advantages of our efficient two-phase algorithm.

Significant Sequential Patterns.
A transaction data stream DS can be written as DS = {t1, t2, ..., tn}, where ti is the ith transaction. Let I = {x1, x2, ..., xm} be a set of items. Each transaction is assigned a label, G0 or G1. A sliding window W is defined as the transactions from the ith to the jth arrival for a pregiven sliding length. We mine SSPs based on the following definitions.
Definition 1. Pattern X is flagged as an SSP candidate if its support is greater than or equal to MinS.
X is a sequence of items in I, and the number of occurrences of X in W is its support S(X). Given a support threshold λ (0 < λ < 1), MinS = λ|W|; if S(X) ≥ MinS, X is said to be an SSP candidate.

Definition 2. An FSP X is flagged as an SSP if its p value is less than a test level threshold.
Hypothesis testing is used to assess the significance of a pattern in a sliding window. Let π(X, Gi) be the probability that X occurs with label Gi (i ∈ {0, 1}). The null hypothesis is H0: π(X, G0) = π(X, G1); our goal is to assess the significance of X based on the observed contingency table [15] and evaluate whether it supports H0. If the p value P_X of X is known, H0 is rejected when P_X < α, where α is the significance level, and X is then considered a significant pattern. P_X is the p value of X based on the observed data sample.
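As a concrete illustration of Definition 2 and the test of H0, the following sketch computes a two-sided Fisher-style p value for a pattern X from its 2 × 2 contingency table; the function name and the toy counts are hypothetical, not taken from the paper.

```python
from math import comb

def fisher_p_value(s1, s0, n1, n0):
    """Two-sided Fisher-style p value for a 2x2 contingency table with
    fixed margins: sum the hypergeometric probabilities of every table
    no more likely than the observed one."""
    s, n = s1 + s0, n1 + n0
    def table_prob(a):               # probability that S1(X) = a
        return comb(n1, a) * comb(n0, s - a) / comb(n, s)
    p_obs = table_prob(s1)
    lo, hi = max(0, s - n0), min(s, n1)   # feasible values of S1(X)
    return sum(table_prob(a) for a in range(lo, hi + 1)
               if table_prob(a) <= p_obs + 1e-12)

# X occurs in 8 of 10 G1 transactions but only 2 of 10 G0 transactions
p = fisher_p_value(8, 2, 10, 10)
```

Here a small p value would lead to rejecting H0 and flagging X as significant at level α.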

Permutation Testing and Test Statistic Selection.
Owing to its excellent performance on small data samples, we produce the p values in Definition 2 by permutation testing. Permutation testing judges whether the observed patterns are significant through the null distribution of the patterns. The general process of testing the significance of a pattern X is shown in Figure 1.
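The process in Figure 1 can be sketched as a generic label-permutation test; the statistic, the dataset, and all names below are hypothetical stand-ins rather than the paper's exact procedure.

```python
import random

def permutation_p_value(labels, has_pattern, statistic, n_perm=1000, seed=0):
    """Approximate permutation p value: shuffle the class labels, recompute
    the test statistic, and count permutations at least as extreme as the
    observed value."""
    rng = random.Random(seed)
    observed = statistic(labels, has_pattern)
    extreme = 0
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        if statistic(perm, has_pattern) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)   # add-one to avoid zero p values

def freq_gap(labels, has_pattern):
    """Toy statistic: absolute difference of per-class pattern frequency."""
    c1 = sum(1 for l, h in zip(labels, has_pattern) if l == 1 and h)
    c0 = sum(1 for l, h in zip(labels, has_pattern) if l == 0 and h)
    n1 = sum(labels)
    n0 = len(labels) - n1
    return abs(c1 / n1 - c0 / n0)

labels = [1] * 10 + [0] * 10
has_pattern = [True] * 10 + [False] * 10   # X perfectly tracks the label
p = permutation_p_value(labels, has_pattern, freq_gap)
```

Because the number of permutations is capped, the result is an approximation; the exact combinatorial computation used later in the paper avoids this.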
Fisher's test statistic is a frequently used statistic in traditional permutation testing [15-17]. Its calculation is based on the 2 × 2 contingency table shown in Table 1.
S1(X) and S0(X) are the supports of X in G1 and G0, n1 and n0 are the numbers of transactions with each label, and n = n1 + n0 is the total number of transactions. Writing C(a, b) for the binomial coefficient, given the support of X, P_F(S1(X)) is calculated as

P_F(S1(X)) = C(n1, S1(X)) · C(n0, S0(X)) / C(n, S(X)). (1)

The final test statistic value for X is the sum of the probabilities of all tables as or more extreme than the observed one:

P_F(X) = Σ_{a : P_F(a) ≤ P_F(S1(X))} P_F(a). (2)

Barnard's test statistic is another test statistic, as previously mentioned. Unlike Fisher's test statistic, it does not fix the row or column totals of the contingency table, which better reflects data generation and collection; for X, the nuisance parameter π is the common occurrence probability assumed under H0. Define

P(S(X), S1(X), π) = C(n1, S1(X)) π^{S1(X)} (1 − π)^{n1 − S1(X)} · C(n0, S0(X)) π^{S0(X)} (1 − π)^{n0 − S0(X)}, (3)

where n0 and n1 are fixed and S(X) acts as a random variable according to the support of the pattern. Given the value of π, define

P_S(S(X), π) = Σ_{(x, y) : P(x, y, π) ≤ P(S(X), S1(X), π)} P(x, y, π). (4)

P_S(S(X), π) is the sum of the probabilities of observing a contingency table for X that is as or more extreme than the observed one if H0 is true. The nuisance parameter π ranges over (0, 1), and the final test statistic value of X is

P_B(X) = max_{0 < π < 1} P_S(S(X), π). (5)

This maximum is taken over all possible values of the nuisance parameter. The resulting value is usually smaller than that produced by Fisher's test statistic.
For a given test level α, a testable support requirement is derived from Barnard's test statistic [24], yielding a threshold x_s given by formula (6). Based on formula (6), if a tested pattern X is to be significant, its support must satisfy S(X) ≥ x_s.
In step 2 of Figure 1, generating permuted datasets incurs a high computational cost; in most practical cases, the number of permutations is fixed to reduce the running time, but the p value produced by such a strategy is only an approximation of the exact distribution, which may lead to a bad estimation.

The Method
We mine SSPs in a novel way under the FWER framework. In mining SSP candidates, we introduce pattern length and testable support constraints, and insignificant patterns are removed in a timely manner. In the testing process, a close upper bound of Barnard's test statistic is proposed to reduce calculation time, and permutation testing with a combination strategy is introduced to obtain exact p values. The proposed algorithm is shown to achieve a significant improvement over state-of-the-art methods.

Mining SSP Candidates.
Longer patterns tend to have low support and are more likely to be insignificant under the testable support requirement. In mining SSP candidates, we introduce a user-specified length to reduce the search computations. Motivated by the excellent mining performance of tree structures with length constraints in [37, 38], we build a pattern tree and mine all SSP candidates from the tree under pattern length and testable support control; the mining process consists of two phases, as shown in Algorithm 1.
To draw SSP candidates more efficiently, two pruning strategies are proposed to optimize the mining process; they are used in CreateTree and SSPs_Candidates of Algorithm 1, respectively.
Theorem 1. If S(X) < MinS, the supersets of X are not SSP candidates.

Proof. Let Xe be a superset of X; then S(Xe) ≤ S(X). If S(X) < MinS, then X is not a candidate by Definition 1, so Xe cannot be an SSP candidate either.

Theorem 2. If S(X) < x_s in formula (6), the supersets of X are not SSP candidates.

Proof. Let Xe be a superset of X. As in Theorem 1, S(Xe) ≤ S(X). When S(X) < x_s, then S(Xe) < x_s; based on formula (6), Xe cannot be an SSP candidate.
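Theorems 1 and 2 justify cutting whole subtrees out of the pattern-growth search. A minimal sketch of this pruning follows; the toy window, the subsequence-based support definition, and the helper names are hypothetical simplifications, not the paper's exact algorithm.

```python
def support(pattern, window):
    """Number of transactions containing `pattern` as an order-preserving
    (not necessarily contiguous) subsequence."""
    def occurs(seq):
        it = iter(seq)                      # `in` consumes the iterator,
        return all(item in it for item in pattern)  # enforcing order
    return sum(occurs(t) for t in window)

def grow(prefix, items, window, x_s, k, out):
    """Pattern growth with Theorems 1-2: stop extending any prefix whose
    support is below the testable support x_s (no superset can qualify),
    and cap the pattern length at k."""
    for item in items:
        cand = prefix + [item]
        if support(cand, window) < x_s:     # prune the whole subtree
            continue
        out.append(cand)
        if len(cand) < k:
            grow(cand, items, window, x_s, k, out)

window = [list("ACF"), list("ACDBE"), list("CE"), list("ACE")]  # toy data
items = sorted({x for t in window for x in t})
candidates = []
grow([], items, window, x_s=3, k=2, out=candidates)
```

Every emitted candidate meets both the support and the length constraint, so nothing below the testable support ever reaches the testing phase.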
Based on these two efficient pruning strategies, we use an efficient tree structure with a pattern-growth method to produce the candidates. Taking the data in Figure 2(a) as an example, with λ = 0.5 and k = 3:
(1) Calculate the testable support value x_s = 3 based on formulas (4) and (6), and remove the unpromising items whose support is less than 3; therefore, delete "G."
(2) The header table consists of two parts, the support and the link pointer. In one scan, create the header table H and add T1 to the tree. There are two types of nodes, as shown in Figure 2(b): ordinary nodes, such as "A" and "C," and leaf nodes, such as "F.1," which means that "F" is a leaf node and the support of its sequential path is 1.
(3) Add the remaining transactions in the same way; if a transaction's items already exist in the tree, only the corresponding supports need to be increased. The tree after adding all transactions is shown in Figure 2(e). To describe the sequences more clearly, the pointer links are hidden in Figure 2(e).
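The tree construction steps above can be sketched as follows; the Node class, the toy window, and the use of a plain support filter (instead of the full testable support computation) are hypothetical simplifications of the paper's CreateTree procedure.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.support = 0
        self.children = {}   # child item -> Node

def create_tree(window, min_support):
    """Count item supports, drop items below the threshold, then insert each
    transaction as a path, sharing prefixes and accumulating node supports."""
    counts = {}
    for t in window:
        for x in set(t):
            counts[x] = counts.get(x, 0) + 1
    header = {x: c for x, c in counts.items() if c >= min_support}
    root = Node(None)
    for t in window:
        node = root
        for x in (i for i in t if i in header):   # keep only promising items
            node = node.children.setdefault(x, Node(x))
            node.support += 1
    return root, header

# toy window of 6 transactions; "G" falls below the threshold and is dropped
window = [list("ACF"), list("ACE"), list("CE"), list("AC"), list("AB"), ["G"]]
root, header = create_tree(window, 3)
```

Shared prefixes collapse into a single path, which is what makes the subsequent bottom-up pattern-growth mining over the header table cheap.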
The tree creation procedure is shown in Algorithm 2. We first calculate the testable support value and remove from the header table those items whose supports do not satisfy it. The above process constructs the global tree and maintains the data in a tree and a header table; algorithm SSPs_Candidates then uses the pattern-growth method to mine all SSP candidates, processing the items in the header table from bottom to top. Based on the final tree and header table in Figure 2(e), we demonstrate the mining of SSP candidates with items "F" and "E" as tail nodes; the process is shown in Figure 3.
The support of item "F" satisfies the support requirement, so we can create a subtree and subheader table for the base item ("F"). Following the node pointers, we analyze the paths containing "F": the path <"A," "C"> is obtained from <"A," "C," "F"> with leaf node "F," and <"A," "C," "D," "B," "E"> and <"C," "E"> are obtained from the leaf nodes "C" and "B." The subtree and subheader table of item ("F") are shown in Figure 3(a). Delete the items that do not meet the support requirement; only item "C" is left. Then, the subheader table and subtree with base <"C," "F"> are established, as shown in Figure 3(b). However, since the support of the remaining item "A" is less than 3, this branch terminates. The candidates of item "E" are searched with the same steps: the subtree and subheader table with base "E" are shown in Figure 3(c), and the tree with base <"B," "E"> is shown in Figure 3(d). Patterns whose last item is "E" are likewise removed because their supports are not satisfied, and the search continues with the next item until all items in header table H are processed.
The specific process of SSPs_Candidates is shown in Algorithm 3. For the current item being processed, if the length and support constraints are satisfied, it is added to the base item (Lines 1-5). If the length constraint is not yet reached but the support requirement is met, a subheader table and a subtree are established and the search for candidates continues recursively (Lines 6-9). When the current item has been fully processed, it is removed and the search moves to the next item in H (Line 10). Line 12 returns the final candidates.

To compute formula (5), one has to carry out the heavy computation of summing all values with probability equal to or lower than the observed one, so we look for a close upper bound of Barnard's test statistic.
Lemma 1. For fixed S(X) and S1(X), P(S(X), S1(X), π) attains its maximum over π at π = S(X)/n.

Proof. In formula (3), when the nuisance parameter π is the only variable and the other quantities are fixed, P(S(X), S1(X), π) is a function of π (0 < π < 1) [24], and its derivative with respect to π can be calculated.

The first derivative vanishes at π = S(X)/n. At this point, the second derivative can be calculated and shown to be always negative, so π = S(X)/n is a maximum and Lemma 1 holds.

For any nuisance parameter π, consider formula (3). The mode of the binomial term B(n1, S1(X), π) is at q1 = ⌊(n1 + 1)π⌋, and the mode of B(n0, S0(X), π) is at q0 = ⌊(n0 + 1)π⌋. Assuming S1(X) ≤ q1 and S0(X) ≤ q0 (the other cases are analogous), the probability of the data points in the ranges (S1(X), q1) and (S0(X), q0) is not less than that of the observed table. Defining D(q1, q0) as the minimal product of the numbers of data points in the two ranges, the value of D(q1, q0) can be calculated by Lemma 2.

Permutation-Based p Values.
Similar to [10], suppose there are N patterns (including X) in the observed data window waiting for testing. The p value of X is the fraction of test statistic values, over all D_p permuted datasets and N patterns, that do not exceed the observed value:

p_X = |{ i : P_i ≤ P_B }| / (D_p · N), (7)

where D_p is the number of permutation datasets, P_B is the test statistic value of the tested pattern X on the observed dataset, and the P_i are the test statistic values of the N patterns on the permuted datasets; all are calculated by Lemma 3. Each permutation dataset corresponds to a contingency table, so computing p values is transformed into enumerating contingency tables. A contingency table is determined by n1 and S1(X), and identical contingency tables produce identical p values; if permutation datasets were generated randomly, obtaining the final results would be very time-consuming. In fact, many permutation datasets share the same contingency table, so we produce permutation p values based on a combination strategy to improve calculation efficiency, in the following two steps.
Step 1. (permutation dataset generation). Unlike [10], our permutation strategy produces permuted datasets without changing the pattern supports or the lengths of the transactions. This strategy ensures that everything except the label assignment is fixed. Suppose a sliding window W consists of the 6 transactions in Figure 2(a); two random permutation datasets produced by this approach are shown in Figures 4(b) and 4(c). The transactions keep the same order as in W and the support of each pattern is unchanged; instead, the class-specific supports change: S0(<A, C>) is 2 in Figure 2(a) but 1 in Figure 4(a). Accordingly, we treat n1 in Table 1 as a variable quantity in the calculation of Barnard's test statistic, which is more in line with the process of data generation and collection.
Step 2. (dataset combination with the same test statistic). The literature [10] produced permutation p values based on a combination strategy; we follow this strategy based on Barnard's test statistic. Unlike [10], in our setting n1 is variable and can take any value in (0, n). For a testing pattern X, set L = max(0, n1 − (n − S(X))) and U = min(S(X), n1); S1(X) lies in the range [L, U]. The number of ways of selecting x transactions from the S(X) transactions containing X and n1 − x transactions from the n − S(X) transactions not containing X is

r(x, n1) = C(S(X), x) · C(n − S(X), n1 − x). (12)

Based on the contingency table, x takes values in [L, U]. Thus, summing over the different values of n1, the total number of distinct contingency tables is

Σ_{n1 = 0}^{n} (U(n1) − L(n1) + 1). (13)

Based on formulas (12) and (13), the p values of the SSP candidates are computed by Algorithm 4. For each testing pattern (Line 2), the number of permutation datasets is calculated, and the combination process is given in Lines 3-13. The original test statistic value of each pattern in the observed data window is calculated in Line 15; for each test statistic value in the list P produced by combination, the algorithm finds the number of values smaller than the observed one (Lines 16-20), and Lines 21-22 calculate the p value of each testing pattern among the candidates and add it to the final results. Algorithm 4 can thus be considered an efficient way to compute the p values.
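The combination idea of Step 2 can be sketched for a single pattern with n1 held fixed (the full algorithm also varies n1); the statistic passed in is a hypothetical stand-in. When that statistic is the table's hypergeometric probability, the sketch reproduces the two-sided Fisher p value, which makes a convenient correctness check.

```python
from math import comb

def exact_permutation_p(s_x, s1_obs, n, n1, stat):
    """Exact permutation p value over all C(n, n1) relabellings, grouped by
    contingency table: s = S1(X) ranges over its feasible interval and each
    table occurs r(s) = C(S(X), s) * C(n - S(X), n1 - s) times.
    `stat(s1, s0, n1, n0)` is p-value-like: smaller means more extreme."""
    n0 = n - n1
    obs = stat(s1_obs, s_x - s1_obs, n1, n0)
    lo, hi = max(0, n1 - (n - s_x)), min(s_x, n1)
    extreme = total = 0
    for s in range(lo, hi + 1):
        r = comb(s_x, s) * comb(n - s_x, n1 - s)   # multiplicity of this table
        total += r
        if stat(s, s_x - s, n1, n0) <= obs + 1e-12:
            extreme += r
    return extreme / total

def hyper_prob(s1, s0, n1, n0):
    """Check statistic: hypergeometric probability of the observed table."""
    return comb(n1, s1) * comb(n0, s0) / comb(n1 + n0, s1 + s0)

# S(X) = 10 of n = 20 transactions contain X; n1 = 10, observed S1(X) = 8
p = exact_permutation_p(10, 8, 20, 10, hyper_prob)
```

Enumerating the U − L + 1 distinct tables with their multiplicities gives the exact p value without generating any of the C(n, n1) permuted datasets explicitly.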

FSSPDS.
We draw significant sequential patterns in the data stream under the FWER framework based on the p values from Algorithm 4. FSSPDS is designed for m windows: W1 is the first window, and Wm is reached after m − 1 slides. For a data stream with m windows, the process by which FSSPDS returns SSPs under FWER control is given in Algorithm 5. For each sliding window, Line 3 returns the SSP candidates SP_i by Algorithm 1, and Line 4 outputs the p values R_i by Algorithm 4. If a pattern's p value does not exceed the corrected test level, it is added to the result set (Lines 5-12). Line 14 returns the final results.
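The per-window flow of Algorithm 5 can be sketched as follows. This is a simplification under stated assumptions: the Bonferroni correction here divides by the per-window candidate count, whereas Algorithm 5 divides α by the size of the accumulated SSP set, and the mining and p value routines are injected as hypothetical callables.

```python
def fsspds(windows, alpha, mine_candidates, p_value):
    """For each window: phase 1 mines SSP candidates, phase 2 computes their
    permutation p values, and only patterns clearing the Bonferroni-corrected
    level alpha / |candidates| are kept (FWER control)."""
    results = []
    for w in windows:
        candidates = mine_candidates(w)                   # phase 1
        pvals = {x: p_value(x, w) for x in candidates}    # phase 2
        level = alpha / max(1, len(candidates))           # corrected level
        results.append([x for x in candidates if pvals[x] <= level])
    return results

# toy stand-ins: three candidates with fixed p values, one window
cands = lambda w: ["a", "b", "c"]
pv = lambda x, w: {"a": 0.001, "b": 0.02, "c": 0.5}[x]
res = fsspds([None], 0.05, cands, pv)
```

With α = 0.05 and three candidates, the corrected level is 0.05/3 ≈ 0.0167, so only the pattern with p = 0.001 survives.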

Complexity Analysis
First, we study the computational complexity of producing SSP candidates. With existing algorithms [1, 2], SSP candidates are obtained in time O(|DS|² · L · M²), where DS is the dataset, L is the number of distinct items in DS, and M is the maximum sequence length in the dataset. By introducing the length constraint k, the mining is performed in time O(|DS|² · L · M · k). Since k is not larger than M, and often far smaller, the complexity of FSSPDS improves on traditional candidate mining methods. Of course, some effective SSP candidates with long lengths may be removed; in practice, however, few candidates are lost when a relatively large length constraint is chosen, as the experiments show. Instead, in most cases the mining algorithm with a length constraint accelerates candidate discovery and finds more SSP patterns.
Second, we study the complexity of the test statistic with permutation p values. According to Lemma 3, the test statistic can be obtained in O(1); such an upper bound is efficient to compute. In fact, in [24], a close upper bound of Barnard's test statistic was established as P_B(X) ≤ P(S(X), S1(X), S(X)/n) · (n1 + 1)(n0 + 1), which can be used as an effective value to assess the test statistic of significant patterns. Our new upper bound, proved in Lemma 3, is smaller than this value and therefore closer to the exact test statistic. Algorithm 4 produces permutation p values based on the combination strategy; if the permutation p values in a sliding window W were produced randomly and there were J FSSP candidates waiting for testing, there would be 2^|W| permutation datasets, and the complexity would be O(J · 2^|W|). With our efficient combination strategy, however, the complexity of Algorithm 4 is O(J · |W| · (U − L)) ≤ O(J · |W|²). It can be shown by simple mathematical induction that 2^|W| > |W|² when |W| ≥ 4. Thus, the running time of our permutation strategy is greatly reduced, and no significant pattern is lost.

Environment and Dataset.
The code used for the evaluation was developed in Python, and the experimental platform is configured as follows: Windows 10, 2 GB memory, and Intel(R) Core(TM) i3-2310 CPU @ 2.10 GHz. We evaluate the performance of FSSPDS on six datasets: mushrooms and a2a are from libSVM [41]; T10I4D100K, bms-web2, retail, and bms-pos are obtained from SPMF [42] and labeled as in [32]. All have two classes. Table 2 shows the details of the datasets, where |I| is the size of the alphabet, avgLength is the average transaction length, and |D| is the number of sequences. The window size is initialized to 100 transactions, and the length constraint is k = l × MaxLength, where MaxLength is the maximal transaction length and l is initialized to 0.85.

As shown in Figure 5, FSSPDS achieves the best time performance of the four algorithms. Miner-K takes less time than TSPIN and CM-Spam. FSSPDS spends significantly less time than the other three methods, benefiting from removing insignificant patterns early. With the two requirements of testable support and length constraints, FSSPDS always achieves better efficiency under different support thresholds. From Figure 6, we also conclude that FSSPDS completes all tasks with less memory consumption. Table 3 shows the number of candidates produced by the four algorithms. TSPIN and CM-Spam return the exact number without length constraints, and Miner-K returns a number with length constraints that is smaller than those of TSPIN and CM-Spam. The number returned by FSSPDS is the smallest of the four. For example, on mushrooms with threshold 0.6, FSSPDS produces 616 candidates (the number of real SSPs is 535), while Miner-K returns 786 candidates and TSPIN and CM-Spam return 1332. Insignificant patterns are removed in a timely manner during mining, so the number of tests is reduced. Overall, FSSPDS has clear advantages in producing fewer candidates in less running time, and FSSPDS returns SSPs smoothly.

Significance Evaluation.
To evaluate the significance of FSSPDS, it is compared with four algorithms. The first, denoted FSSPPROM, is the recent advanced permutation p value producing method of [11] with Fisher's test statistic; the second, which we denote FSSPEPAR, is based on Fisher's test statistic with the combined permutation p value strategy of [10]; the third, which we call PROMBD, uses the upper bound of Barnard's test statistic from SPuManTE [9] with the permutation p value producing method of [11]; and the last, FSSPDS*, focuses on SSP mining like FSSPDS but without the testable support and length constraints.

Input: data stream W, test threshold α, minimum support threshold λ, and length k
Output: SSPs with FWER ≤ α
(1) SSP = (), calculate testable support x_s by α
(2) For each sliding window W_i do
(3)   SP_i = SSPs_Candidates(W_i, λ, x_s, k)
(4)   R_i = p_values(λ, α, W_i)
(5)   For each pattern X in SP_i do
(6)     Add X to SSP
(7)   End For
(8)   For each pattern X in SSP do
(9)     If R_i.X.p value > α/|SSP| then
(10)      Remove X from SSP
(11)    End If
(12)   End For
(13) End For
(14) Return SSP
ALGORITHM 5: FSSPDS.

Figures 7(a)-7(f) show the running time on the six datasets with different support thresholds. FSSPDS spends the shortest time on each dataset, benefiting from the reduction of the search space by the length and testable support constraints; mining SSP candidates by FSSPDS saves a lot of time. Also, by using our effective combination strategy for permutation testing, the testing time is greatly reduced. For example, on a2a with support threshold 0.3, FSSPDS spent 52.03125 seconds, while FSSPPROM, FSSPDS*, PROMBD, and FSSPEPAR spent 667.53125 s, 465.4 s, 476.5 s, and 500 s, respectively. As the threshold increases, the number of SSPs decreases, so the running time decreases with larger support thresholds. However, as Figure 7 shows, the running time of FSSPDS always has a computational advantage. The running time of FSSPDS* is also smaller than that of the other algorithms except FSSPDS, confirming that our effective candidate mining and testing process reduces the running time even without length constraints. Additionally, from Figure 7, the time consumption of FSSPDS is relatively stable compared with the other algorithms as the support threshold increases. FSSPDS reduces the number of items by Theorems 1 and 2, and it combines equal p values in the calculation, so the storage space is also greatly reduced.
Benefiting from the pruning and combination strategies, the memory consumption of FSSPDS* is lower than that of every algorithm except FSSPDS. From Figure 8, FSSPDS is relatively stable compared with the other algorithms and has a clear advantage in memory utilization.
As previously mentioned, the p values produced by these algorithms may be 0, which can result in a poor estimation. Table 4 reports the percentage of SSPs whose p values are zero for the five algorithms: the proportion for FSSPDS is the smallest, and FSSPDS* performs better than every algorithm except FSSPDS. All p values can be obtained from formula (7), and FSSPDS achieves a better result than the others. On mushrooms, Figure 9(a) shows the p value distribution, Figure 9(b) gives the corresponding p values, and Figure 9(c) shows their variance; the distribution for FSSPDS is more concentrated, and more patterns are worth testing. Figures 10(a)-10(f) compare the pattern numbers of the five algorithms on the six datasets. The number found by FSSPDS is significantly larger than that of the other algorithms. For example, on the retail dataset with a support threshold of 0.16, FSSPDS produces 237 patterns, whereas FSSPPROM, FSSPDS*, PROMBD, and FSSPEPAR return 163, 124, 87, and 145, respectively. FSSPDS always mines the most patterns among all the algorithms. On most datasets, the number returned by FSSPDS* is close to that of FSSPDS and larger than that of the other algorithms, benefiting from the calculation advantages of FSSPDS.
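Since formula (7) is not reproduced in this section, the zero-p-value problem can be illustrated with the standard add-one permutation estimator, p = (1 + b)/(1 + m), where b counts permuted statistics at least as extreme as the observed one and m is the number of permutations; this estimate is never exactly 0. The sketch below is illustrative only and does not use the paper's Barnard-based statistic; `stat_fn` is a hypothetical test-statistic callback.

```python
import random

def permutation_p_value(observed_stat, stat_fn, labels, data,
                        n_perm=1000, rng=None):
    """Estimate a permutation p value with the add-one correction.

    Returns (1 + #{permuted stats >= observed}) / (1 + n_perm),
    which is strictly positive, avoiding the degenerate zero
    estimates discussed in the text.
    """
    rng = rng or random.Random(0)
    labels = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the pattern/label association
        if stat_fn(data, labels) >= observed_stat:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)
```

With this estimator the smallest attainable p value is 1/(1 + n_perm), so the resolution of the test is bounded by the permutation budget.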
We also analyze the impact of the length parameter l on the number of SSPs. Table 5 lists the number of SSPs obtained by FSSPDS with different length parameters. When l is greater than 0.8, the number returned by FSSPDS exceeds the number returned by FSSPDS*, and Figure 11 shows the ratio of the two numbers under different length parameters. As l increases, the ratio rises above 1, which means that the length constraint yields more SSPs than mining without it. This can be explained as follows: with a length constraint on the datasets, insignificant patterns are removed promptly during the mining process, so more significant patterns can be drawn under the FWER calculation rule.

Scalability Evaluation.
To test the scalability of FSSPDS, we evaluate its efficiency under different window sizes. With λ = 0.2, Figure 12(a) shows the running time on mushrooms, Figure 12(b) the memory consumption on T10I4D100K, and Figure 12(c) the number of tested patterns whose p values are 0 on bms-pos. According to our calculations on all six datasets, FSSPDS achieves the best result among the five algorithms in terms of running time, memory consumption, and effective p values, and it mines SSPs smoothly under different data sizes. Figure 13 shows the number of SSPs with different length parameters and window sizes: Figure 13(a) reports the number of SSPs with λ = 0.16 and l = 0.7 on retail, Figure 13(b) with λ = 0.005 and l = 0.8 on bms-web2, and Figure 13(c) the number of
SSPs with λ = 0.3 and l = 0.9 on a2a. We can draw the same conclusions: the pattern numbers of FSSPDS are higher than those of the other algorithms, FSSPDS is relatively stable, and varying the window size has little effect on it.

Conclusion and Future Work
We introduce FSSPDS, an efficient algorithm to mine statistically significant sequential patterns in the data stream. First, insignificant candidates are removed promptly by introducing testable support and length constraints into a pattern-growth method based on a tree structure. To better reflect the data generation process in the time-based sliding window, and unlike the traditional Fisher's test statistic, we mine SSPs based on an unconditional test statistic with permutation p values under the FWER framework. To overcome the computational drawbacks of p value production, we propose a close upper bound of the unconditional test statistic and use a combination strategy to find effective permutation p values. Experimental results on real datasets demonstrate the effectiveness of FSSPDS.
FSSPDS is still time-consuming on some datasets, especially dense datasets where the transactions are relatively long and the permutation process takes a lot of time. Additional pruning strategies could be used to accelerate the calculation. For example, the boundary pruning and static buffering techniques in [10] could reduce the computational cost, and the continuous computation technique could be applied to formula (8) as r(S₁(X), n₁ + 1) = r(S₁(X), n₁)(n₁ + 1)²/((n₁ − s + 1)(n − n₁)). Additionally, the data size is fixed in our experiments; efficiency evaluation on datasets of varying size could be studied. We will investigate these problems in future work.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the present study.