Unsupervised mining of electrocardiography (ECG) time series is a crucial task in biomedical applications. To obtain effective clustering results, the prominent features extracted by preprocessing multiple ECG time series need to be investigated. In this paper, a Harmonic Linear Dynamical System is applied to discover vital prominent features by mining the evolving hidden dynamics and correlations in ECG time series. The comprehensible and interpretable features discovered by the proposed feature extraction methodology support the accuracy and reliability of the clustering results. In particular, the empirical evaluation shows that the proposed method improves clustering performance compared with previous mainstream feature extraction approaches for ECG time series clustering. Furthermore, the experimental results on real-world datasets show that computation time scales linearly with the duration of the time series.

Clustering of multiple time series has received considerable attention in recent years in various applications, such as finance, business, science, and medicine [

Since the quality of clustering results relies strongly on good features extracted from multiple time series, a very important processing step is to identify compact features of multiple coevolving time series. This step not only converts a series of original values into more meaningful information, that is, information that is easier to understand and interpret, but also reduces the time series to a lower dimensionality that retains the most relevant features.

This paper is motivated by mining these essential features in a medical application, namely, electrocardiography (ECG) time series. The problems studied are based on challenges common across time series applications, such as time shift effects, nearby frequencies, and harmonics. By exploiting the temporal evolving trends and the correlation characteristics of coevolving time series, meaningful features that help to achieve the best clustering accuracy can be captured.

Feature extraction can describe time series efficiently, since suitable representations reduce the feature space. The resulting compact features support knowledge discovery and help to improve both the performance and the speed of mining algorithms.

Many time series representations have been proposed to extract features. The well-known dimension reduction approaches, namely, Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), are powerful tools to discover linear correlations across multiple time series [

One of the popular alternative approaches for feature extraction on multiple time series is Discrete Fourier Transform (DFT), where the original time series data is projected into the frequency domain [

The Linear Dynamical System, known as Kalman filters, has been commonly used for time series analysis because of its simple implementation and extensibility [

Another popular feature extraction method, known as Dynamic Time Warping (DTW), can handle time shifts across sequences. However, this approach ignores temporal dynamics. Therefore, applying DTW directly for clustering purposes does not give good results [

The Autoregressive Moving Average (ARMA) model is also used for feature extraction. However, ARMA parameters are not reliable features, since different sets of parameters can be obtained even from time series with similar structures. As a result, clustering performance is affected dramatically [

In this paper, we apply a prominent feature extraction approach based on a Harmonic Linear Dynamical System (HLDS) [

Our prime objective is to exploit two common characteristics of multiple coevolving time series, correlations and temporal dynamics, for meaningful feature extraction. Correlation reflects the relationships among multiple time sequences, while the dynamic property captures the temporal moving trends of multiple time series by automatically identifying a few hidden variables. For example, a particular physiological signal recorded in an ECG application characterizes a specific condition of a patient, such as malignant ventricular arrhythmia. Therefore, each time series differs from the others in its dynamics, since a time series encodes temporal dynamics along the time ticks. By capturing correlations, we can obtain well-interpretable features in the presence of time shift effects and small shifts in frequency. By exploiting the evolving temporal components, we can find clusters of time series by grouping those with similar temporal patterns.

In order to evaluate the effectiveness of the applied method considering clustering accuracy, reliability, and complexity aspects, this paper demonstrates the clustering results of ECG time series using

The rest of the paper is organized as follows. Background of the underlying Linear Dynamical System theory and the proposed model setup are given in the upcoming Section

Multidimensional time series dataset

In particular, each row vector includes all of the observations for one certain time tick which is an ordered sequence of

Linear Dynamical System (LDS), known as Kalman filter, has been used to model multidimensional time series data. By taking the definition of time series above as a matrix for a dynamical system, this means that multidimensional time series data can be presented by a matrix
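To make the model concrete, the following is a minimal sketch of how a standard LDS generates a multidimensional time series: a hidden state evolves through a linear transition, and each observation is a linear projection of that state. The names `A` (transition matrix), `C` (output projection matrix), `z0` (initial hidden state), and `simulate_lds` are assumptions chosen for illustration and are not the paper's notation. Rows of the returned matrix correspond to time ticks.

```python
import numpy as np

def simulate_lds(A, C, z0, n_ticks, noise_std=0.0, seed=0):
    """Generate observations from a linear dynamical system (illustrative sketch).

    Hidden state:  z[t+1] = A @ z[t] + noise
    Observation:   x[t]   = C @ z[t] + noise
    Returns an (n_ticks, m) matrix with one row per time tick.
    """
    rng = np.random.default_rng(seed)
    h, m = A.shape[0], C.shape[0]
    z = np.asarray(z0, dtype=float)
    X = np.empty((n_ticks, m))
    for t in range(n_ticks):
        X[t] = C @ z + noise_std * rng.standard_normal(m)
        z = A @ z + noise_std * rng.standard_normal(h)
    return X

# Hypothetical example: a 2-D rotation as hidden dynamics (period of 20 ticks)
# observed through three sequences.
theta = 2 * np.pi / 20
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 2.0]])
X = simulate_lds(A, C, z0=[1.0, 0.0], n_ticks=100)
```

With the noise switched off, as in this example, the three observed sequences are sinusoids that share one period but differ in amplitude and phase, which is the kind of structure the hidden variables are meant to capture.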

In this section, the proposed method is set up for the ECG dataset to illustrate how the interpretability of prominent features extracted from multiple time series can be exploited to improve clustering quality. Since each row of output projection matrix

First of all, the hidden dynamics are learned via Linear Dynamical System (LDS), capturing the series of hidden variables which are evolving according to the linear transformation

Secondly, after obtaining the hidden variables from the LDS, the canonical form of the hidden variables is identified. These hidden variables are hard to interpret, since they are mixed in the observation sequences. Therefore, we need to make them compact and uniquely identifiable. Equation (

In LDS, the output projection matrix

Thirdly, the harmonic mixing matrix

Lastly, in order to obtain interpretable features for each sequence, we apply dimension reduction with SVD on the harmonic magnitude matrix,

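In summary, HLDS includes four steps: (1) learning the hidden dynamics with an LDS; (2) identifying the canonical form of the hidden variables; (3) computing the harmonic mixing matrix; and (4) applying SVD-based dimension reduction to the harmonic magnitude matrix.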
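The pipeline described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `A` (transition matrix), `C` (output projection matrix), and `harmonic_features` are hypothetical, the LDS parameters are assumed to have been learned already (for example by EM, which is not shown), and the canonical form is assumed to come from an eigendecomposition of the transition matrix. Any special handling of conjugate eigenvalue pairs is omitted.

```python
import numpy as np

def harmonic_features(A, C, n_features=2):
    """Sketch of canonicalization, harmonic magnitude, and SVD reduction.

    A: (h, h) transition matrix of an already-learned LDS (assumed given).
    C: (m, h) output projection matrix, one row per observed sequence.
    Returns an (m, n_features) array of features, one row per sequence.
    """
    # Canonical form (assumed): diagonalize A, so that A @ V = V @ diag(eigvals).
    eigvals, V = np.linalg.eig(A)
    # Harmonic mixing matrix: the output projection expressed in the eigenbasis.
    C_h = C @ V
    # Harmonic magnitude matrix: taking magnitudes discards the phase.
    C_m = np.abs(C_h)
    # Dimension reduction with SVD; keep the leading components as features.
    U, s, _ = np.linalg.svd(C_m, full_matrices=False)
    return U[:, :n_features] * s[:n_features]

# Hypothetical example: rotation dynamics with unit-modulus eigenvalues.
theta = 2 * np.pi / 20
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
c = np.array([1.0, 0.0])
d = np.array([0.2, 3.0])
# The second row is the first sequence advanced by three time ticks.
C = np.vstack([c, c @ np.linalg.matrix_power(A, 3), d])
features = harmonic_features(A, C)
```

In this example the first two sequences differ only by a time shift, and they receive the same features because the shift changes only the phase of the harmonic mixing coefficients, not their magnitude. The third sequence has different amplitudes and is mapped elsewhere.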

In order to validate clustering with HLDS, we carry out experiments on real ECG datasets taken from PhysioNet

MIT-BIH Healthy/Normal Sinus Rhythm Database

MIT-BIH Malignant Ventricular Arrhythmia Database

MIT-BIH Supraventricular Arrhythmia Database

Two collections are investigated. Collection 1 contains healthy subjects and subjects with malignant ventricular arrhythmia, while collection 2 contains healthy subjects and subjects with supraventricular arrhythmia. To evaluate the effectiveness of the feature extraction for clustering, both the quality and the scalability on normalized time series are compared against previous feature extraction approaches, namely LPC cepstrum (LPCC), PCA, DFT, and the original Kalman filter. Normalization rescales the data to zero mean and unit variance to compensate for differences in level and scale across the dataset. In the experiment, we use the first two coefficients and cluster them by
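The normalization step can be illustrated with a short sketch. It assumes that each sequence is stored as a column (rows are time ticks) and that every sequence is normalized independently along the time axis. The helper name `z_normalize` and the `eps` guard against constant sequences are assumptions for illustration.

```python
import numpy as np

def z_normalize(X, axis=0, eps=1e-12):
    """Rescale each sequence to zero mean and unit variance (illustrative sketch).

    X: (n_ticks, m) matrix with one column per sequence (assumed layout).
    eps guards against division by zero for constant sequences.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    return (X - mean) / np.maximum(std, eps)

# Two hypothetical sequences with the same shape but different scale.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
Z = z_normalize(X)
```

After normalization the two columns coincide, which shows how differences in level and scale are removed before feature extraction.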

To evaluate the quality of the clustering results, we use the confusion matrix, since the ground truth label of each sequence is known. The clustering accuracy of the different methods on collection 1 of the real ECG time series is as follows: the proposed method (94.29%), LPCC (85.71%), KF (42.9%), DFT (57.14%), and PCA (40%). Compared with the previous feature extraction methods, HLDS improves clustering accuracy on the real ECG datasets by 9.1%, 54.5%, 39.39%, and 57.57% over LPCC, the original Kalman filter, DFT, and PCA, respectively.
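The improvement figures can be checked with a short calculation. The reported values appear consistent with the accuracy gap measured relative to the accuracy of the proposed method; this reading is inferred from the numbers above and is not stated explicitly in the text. The names `accuracy_pct` and `relative_gain` are hypothetical.

```python
# Accuracies on collection 1, in percent, as reported in the text.
accuracy_pct = {"HLDS": 94.29, "LPCC": 85.71, "KF": 42.9, "DFT": 57.14, "PCA": 40.0}

def relative_gain(best, other):
    """Accuracy gap as a percentage of the better method's accuracy (assumed definition)."""
    return 100.0 * (best - other) / best

gains = {
    name: relative_gain(accuracy_pct["HLDS"], acc)
    for name, acc in accuracy_pct.items()
    if name != "HLDS"
}

for name, gain in gains.items():
    print(f"HLDS vs {name}: {gain:.2f}% improvement")
```

Under this assumed definition the computed gains agree with the reported 9.1%, 54.5%, 39.39%, and 57.57% to within about 0.01 percentage points, a difference attributable to rounding of the published accuracies.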

Since this method discovers deeper hidden patterns that capture both correlation and temporal dynamics, it provides a group of distinct harmonics that helps to handle time shift effects and small shifts in frequency. These harmonic groups serve as good features, which lead to good clustering as well as visualization.

In more detail, Figure

The first two features extracted by each algorithm on the ECG dataset.

The proposed method misclusters the 17th and 32nd signals. Although both are actually malignant ventricular arrhythmia signals, their shapes are very similar to the normal case. Therefore, the method extracts features for them that resemble those of the normal signals, and they are consequently assigned to the normal cluster.

To verify them, the 17th and 32nd time sequences of Figure

Original samples of normal and malignant ventricular arrhythmia time series, respectively.

The 17th and 32nd malignant ventricular arrhythmia patterns, which were assigned to the wrong cluster.

Figure

Scatter plot of clustering visualization.

The proposed method again achieves the best clustering accuracy on collection 2, which consists of healthy subjects and subjects with supraventricular arrhythmia. The error rates of all the methods tested are as follows: proposed method (0.139), LPCC (0.2326), KF (0.4419), PCA (0.3256), and DFT (0.3953).

The computational complexity of the HLDS is shown in Figure

Linear execution time with respect to the length of sequences.

The proposed method, HLDS, addresses common challenges of time series, namely, time shifts, nearby frequencies, and harmonics. The applied method handles these challenges efficiently in a real application in the ECG time series domain. Interpretable prominent features were discovered and used for clustering as well as visualization. In most cases, HLDS gives the best result compared with other feature extraction techniques such as the Kalman filter, LPC cepstrum, DFT, and PCA. Moreover, the performance results show that execution time grows almost linearly with the length of the input sequences.

In future work, we will investigate the Harmonic Linear Dynamical System on much longer time series with missing values in various applications.

The authors declare that there is no conflict of interests regarding the publication of this paper.

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2013-H0301-13-3005). This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2013-052849).