Time Series Classification by Shapelet Dictionary Learning with SVM-Based Ensemble Classifier

Time series classification is a fundamental task in time series data mining. Shape similarity methods, including Shapelet-based algorithms, have attracted growing attention because they extract discriminative subsequences from time series. However, most Shapelet-based algorithms discover Shapelets by searching candidate subsequences in the training dataset, which brings two drawbacks: high computational burden and poor generalization ability. To overcome these drawbacks, this paper proposes a novel algorithm named Shapelet Dictionary Learning with SVM-based Ensemble Classifier (SDL-SEC). SDL-SEC modifies the Shapelet algorithm in two aspects: the Shapelet discovery method and the classifier. First, Shapelet Dictionary Learning (SDL) is proposed as a novel Shapelet discovery method that generates Shapelets instead of searching for them. In this way, SDL has lower computational cost and higher generalization ability. Then, an SVM-based Ensemble Classifier (SEC) is developed as a novel ensemble classifier and adapted to the SDL algorithm. Unlike the classic SVM, which needs precise parameter tuning and appropriate feature selection, SEC can avoid the overfitting caused by a large number of features and parameters. Compared with the baselines on 45 datasets, the proposed SDL-SEC algorithm achieves competitive classification accuracy with lower computational cost.


Introduction
Time series classification (TSC) is a theoretical abstraction of many engineering problems, such as fault diagnosis, speech recognition, and electroencephalogram (EEG) identification, and it has become an active research field in recent years [1]. Several families of methods have been proposed in the literature to address TSC. An intuitive family, the time domain similarity methods, models time series by comparing their differences in the time domain and treats each series as a high-dimensional point. The basic framework uses a distance measure to quantify the similarity between time series and combines it with the 1-NN algorithm as the classifier. Several L1 and L2 norm distance measures are studied in [2]. To handle distortion along the time axis, elastic distance measures are employed in TSC; dynamic time warping (DTW) is a popular elastic distance measure, and several of its variants are compared in [2]. In addition, time series discretization techniques such as symbolic aggregate approximation (SAX) [3] have been proposed to reduce the dimension of time series and improve computational efficiency. Fuzzy similarity (FS) [4] has been applied to defect characterization and can process time series signals affected by uncertainty and inaccuracy. In the literature, time domain similarity methods have proved intuitive and efficient but are not suitable for complex dynamic systems.
The model similarity method is another well-known way to deal with TSC problems. These methods represent time series with statistical models, such as the Auto Regressive Moving Average (ARMA) model [5], the Hidden Markov Model (HMM) [6], and Gaussian Mixture Models (GMM) [7]. Time series of different classes are distinguished by comparing the parameters of the models.
In recent years, Deep Learning has become a common approach in machine learning and artificial intelligence, and more and more neural network-based models have been introduced into TSC [8,9]. Although model similarity methods achieve high accuracy, the models they use are highly abstract, so these methods are hard to interpret.
Recently, more researchers have become interested in the shape similarity method. It is human instinct to distinguish objects by their shapes, and shape similarity methods imitate this intuition by using shapes to distinguish different classes of time series. Shape similarity methods discover local shape features, while other methods discover global statistical features. Therefore, these methods can obtain high accuracy with better interpretability.
The base model of this article, the Shapelet algorithm, is a shape similarity method. A Shapelet is a discriminative subsequence of a time series [10]. Besides high accuracy, Shapelets also provide visual results, which can point out further research directions for domain experts [11].
Most of the existing Shapelet-based algorithms attempt to discover Shapelets by searching candidates in the training dataset. Such algorithms have two drawbacks. First, the search requires extensive computation. As an illustration, the complexity of the original Shapelet algorithm is $O(m^4 n^2)$, where $m$ is the length of the time series and $n$ is the number of instances in the dataset. Second, the searched Shapelet lacks generalization. Each Shapelet must be a segment of an existing instance, while the most discriminative Shapelet may never appear in the historical data. Thus, this paper proposes a novel algorithm named Shapelet Dictionary Learning with SVM-based Ensemble Classifier (SDL-SEC), which makes two contributions: a generative Shapelet discovery method, SDL, and an SVM-based ensemble classifier, SEC.
The rest of this paper is organized as follows: Section 2 reviews the research literature related to Shapelet algorithms and Dictionary Learning; Section 3 describes the structure of the proposed algorithm SDL-SEC; Section 4 presents the implementation of SDL-SEC and compares the results with the baselines on 45 datasets; Section 5 concludes the paper and discusses future research directions.

Shapelet Algorithm.
The original Shapelet algorithm was proposed in [10] for the time series classification problem. In brief, the training stage of the original version has three steps, as shown in Figure 1. First, a sliding window is used to extract time series segments, which are the candidate Shapelets, and the minimum distance between each candidate Shapelet and every time series in the dataset is computed.
Then, an orderline is built for each candidate Shapelet by arranging its minimum distances from small to large, from which the Information Gain (IG) and the Optimal Split Point (OSP) of the orderline are computed. A high IG indicates good discrimination. The third step chooses the k best Shapelets, those with the highest IG. At the inference stage, a decision tree is built from the OSPs of the k best Shapelets, and unlabeled time series are classified by this decision tree.
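The evaluation of a single candidate Shapelet described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [10]: the function names are hypothetical, the distance is assumed to be Euclidean, the entropy is measured in bits, and placing the split point halfway between two neighbouring distances on the orderline is an assumption.

```python
import math
from collections import Counter


def min_distance(series, candidate):
    """Smallest Euclidean distance between `candidate` and any equal-length window of `series`."""
    p = len(candidate)
    best = math.inf
    for start in range(len(series) - p + 1):
        window = series[start:start + p]
        dist = math.sqrt(sum((w - c) ** 2 for w, c in zip(window, candidate)))
        best = min(best, dist)
    return best


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(distances, labels):
    """Return (information gain, split point) for the orderline of one candidate."""
    order = sorted(zip(distances, labels))      # orderline: distances from small to large
    ordered_labels = [label for _, label in order]
    base = entropy(ordered_labels)
    best_gain, best_point = 0.0, None
    for i in range(1, len(order)):
        if order[i - 1][0] == order[i][0]:
            continue                            # no threshold fits between equal distances
        left, right = ordered_labels[:i], ordered_labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(order)
        gain = base - weighted
        if gain > best_gain:
            best_gain = gain
            best_point = (order[i - 1][0] + order[i][0]) / 2   # midpoint split (assumption)
    return best_gain, best_point


# Toy data: the candidate [1, 2, 1] occurs only in the two series of class "a".
dataset = [[0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0], [0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]]
labels = ["a", "a", "b", "b"]
distances = [min_distance(series, [1, 2, 1]) for series in dataset]
gain, split_point = best_split(distances, labels)
```

Repeating this evaluation for every window of every training series is what makes the search expensive, which motivates generating Shapelets instead of enumerating them.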
Following the principle of Shapelets, many modified Shapelet-based algorithms have been proposed in recent years. These approaches take two main directions: speeding up the running time and improving the accuracy. The original Shapelet algorithm described above is a recursive search method with high time complexity. Hence, speed-up techniques are indispensable for improving the efficiency of the Shapelet algorithm. Efficient pruning strategies have been suggested to avoid searching unproductive Shapelets [11]. Representation methods map time series from the source space to a reduced-dimensional space, which reduces the search time significantly [12]. Moreover, the Shapelet algorithm can be parallelized to run on GPUs [13]. However, the speed-up techniques do not substantially improve Shapelet discovery itself and therefore limit the accuracy of this family of methods.
On the other hand, some Shapelet-based algorithms focus on improving accuracy. Grabocka et al. [11] suggest obtaining the optimal Shapelet by optimizing a logistic loss objective function, which further improves the classification accuracy. Instead of IG, the Kruskal-Wallis and Mood's median tests are employed as quality measures for Shapelet selection [14]. Meanwhile, Hills et al. [14] suggest a feature transformation technique, which unbinds the Shapelet algorithm from the decision tree classifier. Generally, this family of methods obtains high accuracy, but the computation cost is still high.
In summary, most Shapelet-based algorithms do not achieve a good balance between efficiency and performance. A better Shapelet discovery mechanism is needed to improve the accuracy and reduce the runtime.

Dictionary Learning.
Dictionary Learning is a widely used machine learning algorithm. Its main idea is to assume that signals can be represented by a linear combination of dictionary atoms. Formally, Dictionary Learning is commonly written as
$$\arg\min_{\alpha, Z} \frac{1}{2}\lVert X - \alpha Z\rVert_F^2 + \lambda\lVert\alpha\rVert_1 \quad \text{s.t.}\ \lVert Z_k\rVert_2 \le c,\ k = 1, \ldots, K, \tag{1}$$
where $X$ collects the $n$ input signals row by row, $Z$ is the dictionary, $\alpha \in \mathbb{R}^{n \times K}$ is the sparse coefficient matrix, and each $Z_k \in Z$ is called an atom of the dictionary. Problem (1) is nonconvex when $\alpha$ and $Z$ are optimized together. However, it can be split into two steps: a coefficient updating step and a dictionary updating step. If $Z$ is fixed, the $\alpha$ updating subproblem is convex. Orthogonal Matching Pursuit (OMP) [15], Lasso [16], and the Alternating Direction Method of Multipliers (ADMM) [17] are common methods to update $\alpha$. If $\alpha$ is fixed, the $Z$ updating subproblem is a constrained least squares problem. Many optimization methods can solve this problem, such as the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [18], gradient descent [19], and K-SVD [20]. The two steps are performed alternately until convergence.
There are some similarities between Shapelets and dictionaries. Shapelets represent local shape features of time series; thus, a linear combination of Shapelets can represent a time series partially or totally. This behaviour is similar to the main idea of Dictionary Learning, in which a linear combination of dictionary atoms is considered to have high representation ability for raw data. From this perspective, the two kinds of methods can share the same underlying technique.
To this end, the Dictionary Learning technique is introduced to Shapelet discovery, which obtains a set of Shapelets by solving several convex optimization problems. This greatly reduces the amount of computation in the training phase. Unlike in search-based methods, an optimized Shapelet is a novel segment that does not appear in the existing data. This improves the generalization ability of Shapelets.

Shapelet Dictionary Learning with SVM-Based Ensemble Classifier
In this section, we introduce the structure of Shapelet Dictionary Learning with SVM-based Ensemble Classifier. SDL is a generative Shapelet discovery algorithm which combines Dictionary Learning (DL) and Shapelets. We train SDL in a supervised way and get a subdictionary for each class. Then, the transformation technique uses the subdictionaries to map time series from the time domain to the feature domain. Last, we present an SVM-based Ensemble Classifier to further improve accuracy.

Build SDL Atom.
The atoms of the original DL represent global features, while Shapelets represent local features. So, we must reconstruct the SDL atoms before building the overall model. As introduced in Section 2.2, the dimension of an original DL atom is equal to the dimension of the input signal. More specifically, consider an input signal $T$ of dimension $m$ and a DL atom of dimension $p$. In the original DL-based algorithms, $p$ is necessarily equal to $m$. In the SDL algorithm, however, an atom is a representation of a Shapelet; thus, $p$ needs to be smaller than $m$. To address this problem, we introduce the Shift-Invariant Dictionary Learning (SIDL) technique [21] into our work, which uses a sliding window to build the SDL atom and relaxes the constraint on $p$ to $p \le m$. Figure 2 explains how to generate the SDL atom. The SDL atom is denoted as $d' \in \mathbb{R}^m$, and the Shapelet is denoted as $d \in \mathbb{R}^p$, where $p \le m$. A Shapelet may match anywhere in the time series; thus, the sliding window that represents the Shapelet slides along the SDL atom to find the optimal location. In an SDL atom, only the Shapelet part has practical significance, and the rest is assigned 0. In Figure 2, a Shapelet is matched at location $q$; the values in the red blocks need to be optimized, while the blue blocks are assigned 0. The matched location $q$ is unique for each time series. $Q(d, q)$ is a function that describes the mapping between the Shapelet and the SDL atom. Given $q$ and $d$, the SDL atom can be generated as follows [21]:
$$d' = Q(d, q), \qquad d'_j = \begin{cases} d_{j-q}, & 1 + q \le j \le p + q,\\ 0, & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, m. \tag{2}$$
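The mapping $Q(d, q)$ amounts to zero-padding the Shapelet on both sides. A minimal Python sketch follows; the function name `build_atom` is hypothetical, and the shift is taken as a 0-based offset so that a shift of $q$ places the Shapelet at the 1-based positions $1 + q$ to $p + q$.

```python
def build_atom(shapelet, shift, length):
    """Embed `shapelet` (length p) in a zero vector of size `length` (m) at 0-based offset `shift`."""
    p = len(shapelet)
    if not 0 <= shift <= length - p:
        raise ValueError("shift must satisfy 0 <= shift <= length - len(shapelet)")
    left_padding = [0.0] * shift
    right_padding = [0.0] * (length - p - shift)
    return left_padding + [float(value) for value in shapelet] + right_padding


# A Shapelet of length 3 placed at shift 2 inside an atom of length 7.
atom = build_atom([1, 2, 3], shift=2, length=7)
```

Because only the $p$ Shapelet values are free, an inner product between a time series and the atom reduces to an inner product between the Shapelet and one window of the series.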

The SDL Model.
To build the complete SDL model, further constraints must be added. The complete SDL model is
$$\arg\min_{\alpha, q, d} \sum_{i} \frac{1}{2}\Big\lVert T_i - \sum_{k=1}^{K} \alpha_{ik}\, Q(d_k, q_{ik})\Big\rVert_2^2 + \lambda \sum_{i} \lVert \alpha_i \rVert_1 \quad \text{s.t.}\ \lVert d_k \rVert_2 \le c,\ 0 \le q_{ik} \le m - p,\ \alpha_{ik} \ge 0, \tag{3}$$
where $T_i$ is a time series from the input dataset; the sparse coefficients $\{\alpha_{ik}\}$, the corresponding shifts $\{q_{ik}\}$, and the Shapelets $\{d_k\}$ are the outputs of the model and need to be optimized; the length of the Shapelet $p$, the number of Shapelets $K$, the weight $\lambda$, and the scale factor $c$ are hyperparameters which need to be tuned. The scale constraint on $d_k$ avoids producing trivial solutions. The constraint on $q$ ensures that the sliding window does not exceed the boundary. The nonnegative constraint on the sparse coefficients prevents the Shapelet from being inverted.
Equation (3) is an unsupervised learning model, while time series classification is a supervised problem. To deal with this, we learn a sub-Shapelet dictionary for each class independently. Problem (3) is nonconvex. A two-step optimization strategy is commonly suitable for such problems: the sparse coefficients and the Shapelet dictionary are updated alternately, and the subproblem of each step is convex.
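The two alternating steps can be illustrated for a single Shapelet with the following Python sketch. It rests on explicit assumptions rather than on the authors' code: the per-Shapelet objective is taken to be half the squared reconstruction error plus $\lambda\alpha$ with $\alpha \ge 0$, the Shapelet is constrained by $\lVert d \rVert_2 \le c$, and the function names are hypothetical. Under these assumptions the best shift is the window with the largest inner product with the Shapelet, the coefficient is a one-sided soft threshold, and the Shapelet update is a scaled sum of the matched residual windows.

```python
import math


def update_coefficient(residual, shapelet, lam):
    """Return (alpha, shift) minimising 0.5*||residual - alpha*Q(shapelet, shift)||^2 + lam*alpha, alpha >= 0."""
    p = len(shapelet)
    norm_sq = sum(v * v for v in shapelet)
    if norm_sq == 0.0:
        return 0.0, 0                          # a zero Shapelet cannot explain anything
    best_shift, best_corr = 0, -math.inf
    for shift in range(len(residual) - p + 1):  # enumerate every admissible location
        corr = sum(r * d for r, d in zip(residual[shift:shift + p], shapelet))
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    alpha = max(0.0, (best_corr - lam) / norm_sq)   # one-sided soft threshold keeps alpha >= 0
    return alpha, best_shift


def update_shapelet(residuals, alphas, shifts, p, c):
    """Minimise sum_i 0.5*||r_i - alpha_i*Q(d, q_i)||^2 over d subject to ||d||_2 <= c."""
    g = [0.0] * p
    for residual, alpha, shift in zip(residuals, alphas, shifts):
        for j in range(p):
            g[j] += alpha * residual[shift + j]     # accumulate the matched residual windows
    g_norm = math.sqrt(sum(v * v for v in g))
    if g_norm == 0.0:
        return g                                    # nothing to fit: the zero vector is optimal
    energy = sum(alpha * alpha for alpha in alphas)
    scale = max(energy, g_norm / c)                 # rescale only if the unconstrained norm exceeds c
    return [v / scale for v in g]


residual = [0.0, 0.0, 2.0, 4.0, 2.0, 0.0]
alpha, shift = update_coefficient(residual, [1.0, 2.0, 1.0], lam=0.5)
shapelet = update_shapelet([residual], [2.0], [2], p=3, c=10.0)
```

In a full training loop these two functions would be called in turn for every time series and every Shapelet of a class until the objective stops decreasing; the stopping rule is left open here.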

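Update Sparse Coefficients.
In this step, the sparse coefficients are updated. With $d$ fixed, equation (3) becomes
$$\arg\min_{\alpha, q} \sum_{i} \frac{1}{2}\Big\lVert T_i - \sum_{k=1}^{K} \alpha_{ik}\, Q(d_k, q_{ik})\Big\rVert_2^2 + \lambda \sum_{i} \lVert \alpha_i \rVert_1 \quad \text{s.t.}\ 0 \le q_{ik} \le m - p,\ \alpha_{ik} \ge 0. \tag{4}$$
We optimize $\alpha$ and $q$ for each $T_i$ independently and update one pair $(\alpha_k, q_k)$ at a time, so the problem becomes
$$\arg\min_{\alpha_k, q_k} \frac{1}{2}\Big\lVert T_i - \sum_{j \ne k} \alpha_j Q(d_j, q_j) - \alpha_k Q(d_k, q_k)\Big\rVert_2^2 + \lambda \lvert \alpha_k \rvert \quad \text{s.t.}\ 0 \le q_k \le m - p,\ \alpha_k \ge 0, \tag{5}$$
where $\{\alpha_j\}_{j \ne k}$ and $\{q_j\}_{j \ne k}$ are fixed.
We use an enumeration method to find the best $q_k$: it evaluates the objective value at every possible location and selects the location that minimizes equation (5):
$$q_k^{*} = \arg\min_{0 \le q \le m - p}\ \min_{\alpha_k \ge 0}\ \frac{1}{2}\lVert t - \alpha_k Q(d_k, q)\rVert_2^2 + \lambda \alpha_k, \tag{6}$$
where $t \triangleq T_i - \sum_{j \ne k} \alpha_j Q(d_j, q_j)$.
Because of the nonnegative constraint on $\alpha$, the solution of equation (5) differs from that of [21]. With the optimal $q_k^{*}$, the update of $\alpha_k$ is
$$\alpha_k = \max\left(0,\ \frac{\langle t^{[1 + q_k^{*},\, p + q_k^{*}]}, d_k \rangle - \lambda}{\lVert d_k \rVert_2^2}\right). \tag{7}$$

Update Dictionary.
In this step, we fix $\alpha$ and $q$ and update the Shapelet dictionary $d$. Further, we fix $\{d_j\}_{j \ne k}$ and optimize $d_k$ independently. Thus, problem (3) becomes
$$\arg\min_{d_k} \sum_{i} \frac{1}{2}\lVert r_{ik} - \alpha_{ik}\, Q(d_k, q_{ik})\rVert_2^2 \quad \text{s.t.}\ \lVert d_k \rVert_2 \le c. \tag{8}$$
Equation (8) is a least squares problem with a quadratic constraint, and it can be solved via the Lagrange multiplier method. The optimal $d_k$ is
$$d_k = \frac{g_k}{\max\left(\sum_i \alpha_{ik}^2,\ \lVert g_k \rVert_2 / c\right)}, \qquad g_k = \sum_{i} \alpha_{ik}\, r_{ik}^{[1 + q_{ik},\, p + q_{ik}]}, \tag{9}$$
where $r_{ik} = T_i - \sum_{j \ne k} \alpha_{ij} Q(d_j, q_{ij})$, and the superscript $[1 + q_{ik}, p + q_{ik}]$ means the segment from index $1 + q_{ik}$ to $p + q_{ik}$. A proof can be found in [21].

Figure 2: Construction of the SDL atom. The red window (the sliding window at location $q$) represents the Shapelet, and the blue blocks are assigned 0.

Supervised Shapelet Dictionary
In the original Shapelet algorithm, the decision tree is embedded in the training process. In other words, the original Shapelet algorithm cannot use classifiers other than the decision tree. To unbind Shapelet discovery from the decision tree, we use the Shapelet transformation technique [14] to generate features that are suitable for any classifier. Suppose the supervised Shapelet dictionary $S$, formed by the subdictionaries of all classes, contains $H = L \times K$ Shapelets, where $L$ is the number of classes and $K$ is the number of Shapelets per class. The transformation technique calculates the minimum distance $v_{ij}$ between $T_i$ and $d_j$ and forms the feature vector $V_i = [v_{i1}, v_{i2}, \ldots, v_{iH}]$. $V$ is in effect a local reconstruction error. We do not feed the sparse coefficients $\alpha$ to the classifier because they lack discriminative ability. By applying the transformation technique to both the training set and the testing set, we get $V^{train}$ and $V^{test}$. The choice of classifier is free, and we use an SVM-based Ensemble Classifier, which is described below.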
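The transformation step can be sketched as follows. This is a minimal illustration under the assumption that the minimum distance is the smallest Euclidean distance between a Shapelet and any equal-length window of the series (Euclidean distance is the only measure considered in this work); the function name `shapelet_transform` is hypothetical.

```python
import math


def shapelet_transform(dataset, shapelets):
    """Map every series to the vector of its minimum Euclidean distances to each Shapelet."""
    features = []
    for series in dataset:
        row = []
        for shapelet in shapelets:
            p = len(shapelet)
            row.append(min(
                math.sqrt(sum((series[q + j] - shapelet[j]) ** 2 for j in range(p)))
                for q in range(len(series) - p + 1)
            ))
        features.append(row)
    return features


# Two learned Shapelets (illustrative values) applied to a training and a testing series.
shapelets = [[1, 2, 1], [0, 0]]
v_train = shapelet_transform([[0, 0, 1, 2, 1, 0]], shapelets)
v_test = shapelet_transform([[0, 0, 0, 0, 0, 0]], shapelets)
```

Each row of the result has one entry per Shapelet, so the same fixed-length representation is obtained for series of any length no shorter than the longest Shapelet.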

SVM-Based Ensemble Classifier.
SVM [22,23] is a widely used classifier with high performance and low computational cost, so we choose SVM as the base classifier. However, a single classifier is easily affected by uncontrollable factors such as noise, resulting in unstable performance. The Ensemble Learning method combines the results of weak base classifiers to form a strong classifier and improves the overall robustness of the algorithm [24]. Many ensemble SVM algorithms have been shown to be more powerful than single SVM algorithms [25,26].
Specifically, two problems need to be considered in the construction of ensemble methods. The first problem is the selection of features. We use redundant dictionaries, which result in a large number of Shapelets, so each sample is a high-dimensional vector after the transformation operation. Generally, researchers use ICA or other dimension reduction methods to reduce the features. However, different samples may be sensitive to different feature dimensions, and dimension reduction leads to a loss of information. The second problem is the selection of SVM parameters. There are many hyperparameters that need to be tuned in the SVM algorithm, and finding the optimal parameters is a complex problem. As with the first problem, different samples may be sensitive to different parameters.
To address these two problems, we design two strategies. First, instead of using all features, each base SVM is trained on randomly selected features. As long as there are enough base SVMs, every feature is used, so no information is lost and no dimension reduction is needed. Second, the parameters of each base SVM are generated randomly, which avoids the overfitting caused by committing to one specific set of parameters. As shown in Figure 3, the steps of the SVM-based Ensemble Classifier are as follows:
(1) Construct $B$ training subsets and testing subsets. Each training subset randomly selects $E$ features from the training feature set $V^{train}$, and the same $E$ features are selected from the testing feature set $V^{test}$ to construct the corresponding testing subset. A feature is a column in the feature set. So, we get a series of training subsets $\{V_b^{train}\}_{b=1}^{B}$ with $V_b^{train} = [v_{b1}^{train}, \ldots, v_{bE}^{train}]$, where each $v_{be}^{train}$ is a random column of $V^{train}$ and $v_{be}^{train} \ne v_{bf}^{train}$ for all $e \ne f$. In the same way, we get the testing subsets $\{V_b^{test}\}_{b=1}^{B}$.
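The subset construction of step (1) can be sketched in Python as below. Only the column sampling follows the description above. The parameter ranges in `sample_svm_parameters` and the majority vote in `majority_vote` are assumptions added for illustration, since the text does not state how the random parameters are drawn or how the base predictions are combined. All function names are hypothetical, and the base SVM itself is left out so that the sketch depends only on the standard library.

```python
import random
from collections import Counter


def make_feature_subsets(v_train, v_test, num_subsets, num_features, seed=0):
    """Build B (columns, train subset, test subset) triples, each using E distinct random columns."""
    rng = random.Random(seed)
    total = len(v_train[0])
    subsets = []
    for _ in range(num_subsets):
        columns = rng.sample(range(total), num_features)      # E distinct column indices
        train_b = [[row[c] for c in columns] for row in v_train]
        test_b = [[row[c] for c in columns] for row in v_test]  # same columns for testing
        subsets.append((columns, train_b, test_b))
    return subsets


def sample_svm_parameters(rng):
    """Draw one random parameter set; the log-uniform ranges are illustrative assumptions."""
    return {"C": 10 ** rng.uniform(-2, 2), "gamma": 10 ** rng.uniform(-3, 1)}


def majority_vote(predictions):
    """Combine one label list per base classifier into one label per sample (assumed rule)."""
    n_samples = len(predictions[0])
    return [Counter(p[i] for p in predictions).most_common(1)[0][0] for i in range(n_samples)]


v_train = [[0.1, 0.5, 0.9, 0.3, 0.7], [0.2, 0.4, 0.8, 0.6, 0.0], [0.9, 0.1, 0.2, 0.3, 0.4]]
v_test = [[0.3, 0.3, 0.3, 0.1, 0.2], [0.5, 0.6, 0.7, 0.8, 0.9]]
subsets = make_feature_subsets(v_train, v_test, num_subsets=4, num_features=3)
params = sample_svm_parameters(random.Random(1))
combined = majority_vote([["a", "b"], ["a", "a"], ["b", "a"]])
```

Each triple would be handed to one base SVM trained with its own `params`; any SVM implementation can be plugged in at that point.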

Dataset and Baseline Description.
The 45 datasets from the UCR Time Series Classification Repository [27] are used to verify the proposed algorithm. Table 1 summarises the details of the 45 datasets. SDL-SEC is a modified version of the classical Shapelet algorithm; thus, we choose two typical Shapelet-based algorithms for comparison. In addition, a Deep Learning algorithm is compared. The following three algorithms are chosen as baselines:
Scalable Shapelet Discovery (SD) [11] uses an online clustering and pruning technique to avoid repeatedly measuring the classification accuracy of similar subsequences. SD has low computational cost and handles GB-scale data in several minutes.
Learning Time-Series Shapelets (LTS) [28] learns the top-K Shapelets by optimizing a cost function which consists of classification accuracy and regularization terms. LTS achieves strong performance among Shapelet-based algorithms.
Fully Convolutional Network (FCN) [29] is a Deep Learning method. Deep Learning plays an important role in many domains, and FCN is a strong baseline according to its authors.
We reproduced the results of LTS and FCN with the open-source code (http://www.timeseriesclassification.com/code.php, https://github.com/cauchyturing/UCR_Time_Series_Classification_Deep_Learning_Baseline). SD is well tested, and the hardware used in [11] is close to ours, so we reuse the results in [11] directly. We also compare a single SVM classifier with the SVM-based Ensemble Classifier. The single SVM version is denoted as SDL-S, and the ensemble version is denoted as SDL-SEC. Classification accuracy, the ratio of correctly classified instances to total test instances, is used as the quantitative performance measure. The runtime, which includes training time and testing time, is used as the quantitative efficiency measure.
Our hardware environment is an i7-9750H CPU with 16 GB of RAM. We use Matlab 2019b to implement the SDL-SEC algorithm. SD and LTS are implemented in Java, and FCN is implemented in Python with TensorFlow. To be fair, all computations run on the CPU only, and we use the default settings for all baselines.

Results and Analyses.
This section describes and analyses the experimental results from two aspects: classification accuracy and runtime. The limitations of the proposed algorithm are also discussed.

Classification Accuracy Comparison.
We use two statistical indicators to compare the classification accuracies of all algorithms: total wins and average rank. The best algorithm has the most total wins and the best (numerically lowest) average rank. An algorithm with better generalization ability obtains higher accuracy on more datasets, so generalization ability is also an important index for evaluating classification accuracy. Table 2 summarises the statistical indicators of classification accuracy in the experiments. It should be noted that the accuracy of SDL-S differs slightly from that presented in [30] because of the updated parameter search strategy. As shown in Table 2, the proposed algorithm performs competitively. SDL-SEC achieves the best average rank of 1.91 and wins 15 times on the 45 datasets. This result is better than that of SDL-S, showing that the SVM-based Ensemble Classifier is effective. FCN obtains the most total wins, 28, but its average rank of 1.96 is worse than that of SDL-SEC. LTS also performs well, but both of its indicators are worse than those of SDL-SEC. The accuracy of SD is clearly lower than that of the other algorithms. Figure 5 is a 1-vs-1 comparison of the SDL-SEC algorithm with the other algorithms. In Figure 5, the points above the red line are datasets on which the baseline is more accurate, and the points below the red line are datasets on which SDL-SEC is more accurate. In the comparisons with SD and LTS, most points lie below the red line. In the comparison with FCN, the points are roughly evenly distributed on both sides of the red line. It can be seen intuitively that the overall accuracy of SDL-SEC is close to that of FCN and higher than that of SD and LTS.
By analyzing Table 2 and Figure 5, we can draw a conclusion that SDL-SEC has a high classification accuracy with good generalization ability, which can adapt to most time series classification problems.
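The two indicators used above can be computed as in the following Python sketch. The accuracy values are made up for illustration and are not the results of Table 2. Two conventions are assumptions: tied algorithms share the mean of the ranks they occupy, and every algorithm that reaches the best accuracy on a dataset is credited with a win.

```python
def average_ranks_and_wins(accuracy):
    """accuracy maps an algorithm name to its list of accuracies, one per dataset (same order)."""
    names = list(accuracy)
    n_datasets = len(accuracy[names[0]])
    rank_sum = {name: 0.0 for name in names}
    wins = {name: 0 for name in names}
    for i in range(n_datasets):
        scores = {name: accuracy[name][i] for name in names}
        best = max(scores.values())
        for name in names:
            higher = sum(1 for s in scores.values() if s > scores[name])
            equal = sum(1 for s in scores.values() if s == scores[name])
            rank_sum[name] += higher + (equal + 1) / 2   # ties share the mean of their ranks
            if scores[name] == best:
                wins[name] += 1                           # ties count as a win for each (assumption)
    average_rank = {name: rank_sum[name] / n_datasets for name in names}
    return average_rank, wins


# Illustrative accuracies of three hypothetical algorithms on three datasets.
toy_accuracy = {"A": [0.9, 0.8, 0.7], "B": [0.8, 0.8, 0.9], "C": [0.7, 0.6, 0.8]}
average_rank, wins = average_ranks_and_wins(toy_accuracy)
```

In this toy example algorithm B has the lowest average rank even though A and B win equally often, which mirrors how an algorithm can lead on average rank without leading on total wins.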

Runtime Comparison.
The runtime of each algorithm is analysed below. Table 3 shows the runtimes of the compared algorithms. An "n/a" entry in Table 3 indicates that the time or memory required by the algorithm exceeds the limits of our hardware; in those cases Table 2 uses the accuracy reported by the original authors. The results of LTS, FCN, and SDL-SEC are rounded. The runtime of the SD algorithm is the shortest; as claimed by its authors, SD is an efficient time series classification algorithm. Although SDL-SEC runs longer than SD, its runtime is generally acceptable. LTS and FCN run 2-4 orders of magnitude longer than SDL-SEC. Although FCN can be accelerated by a GPU, it remains a computationally expensive algorithm. To sum up, SDL-SEC achieves a good balance between accuracy and runtime, greatly reducing the running time while retaining high accuracy.

Conclusion and Future Work
In this paper, we present a Shapelet-based time series classification algorithm called Shapelet Dictionary Learning with SVM-based Ensemble Classifier. First, we propose a Shapelet discovery method, Shapelet Dictionary Learning, which combines Dictionary Learning and Shapelets and generates a group of Shapelets instead of searching for them. The generated Shapelets are entirely new subsequences which contain local shape features of all the time series data but do not exist in the original data. This improves the generalization ability of the Shapelets, reaches higher accuracy in time series classification, and reduces runtime at the same time. Furthermore, we propose an SVM-based Ensemble Classifier to perform the classification. The proposed SVM-based Ensemble Classifier trains a series of base SVM classifiers which are fed random features and parameters. This method decreases the dependence on feature and parameter selection and thus further improves the classification accuracy. Extensive experiments are performed on 45 benchmark datasets. The results show that the proposed algorithm has high accuracy, good robustness, and high efficiency.
We also derive future work from the limitations of the proposed algorithm. First, SDL-SEC discovers time domain shape features, so it will fail when the shape feature is not obvious. To address this issue, we may exploit a feature fusion method that combines frequency domain features and other statistical features. Second, the current SDL hyperparameter selection method is a grid search, which requires the search strategy to be designed manually. As an improvement, intelligent optimization algorithms, such as evolutionary algorithms, can be exploited to search hyperparameters automatically [31]. Third, only the Euclidean distance is considered in this work, so it is difficult to deal with real-world data with noise and uncertainty. Fuzzy Measurement [32,33] is an excellent tool for handling this kind of problem, which we will investigate in our future work.

Data Availability
The datasets used in this paper can be found at http://www.timeseriesclassification.com.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Computational Intelligence and Neuroscience