A Novel Long- and Short-Term Memory Network with Time Series Data Analysis Capabilities

Time series data are an extremely important type of data in the real world, and they accumulate gradually over time. Because of this dynamic growth, time series data tend to have high dimensionality and large scale. When performing cluster analysis on such data, traditional feature extraction methods fall short. To improve clustering performance on time series data, this study uses a recurrent neural network (RNN) to train on the input data. First, an RNN called the long short-term memory (LSTM) network is used to extract the features of the time series data. Second, pooling technology is used to reduce the dimensionality of the output features of the last layer of the LSTM network. For long time series, the hidden layer of the LSTM network cannot remember the information from all time steps, so it is difficult to obtain a compressed representation of the global information in the last layer. It is therefore necessary to combine the information from the previous hidden units to supplement the data. By stacking the information of all hidden units and performing a pooling operation, the dimensionality of the hidden-unit information is reduced, and the memory loss caused by an excessively long sequence is compensated. Finally, considering that many time series datasets are imbalanced, the unbalanced K-means (UK-means) algorithm is used to cluster the features after dimensionality reduction. Experiments were conducted on multiple publicly available time series datasets. The experimental results show that LSTM-based feature extraction, combined with pooling-based dimensionality reduction and clustering designed for imbalanced data, performs well on time series data.


Introduction
Time series data are a common type of data in work and life. A time series dataset is a collection of observations at different moments, collected with a certain collection technology at certain time intervals. Therefore, each observation in a time series is usually time-stamped. With the continuous improvement of computing and storage capabilities, storage devices now hold large amounts of time series data. Time series data are generated widely and exist in various industries and fields. For example, camera systems in shopping malls collect a large amount of consumer-related information. The data storage departments of large financial centers and securities companies collect a large amount of stock information. In terms of environmental data, real-time weather information can be observed by artificial satellites, and a large amount of geological and mineral data can be detected by related instruments. In meteorology, the recorded precipitation, temperature, and air pollution data for each city or region are all time series data. In e-commerce, customer consumption habits, commodity transaction volumes, logistics data, and commodity evaluation data are usually time series data. Therefore, research on time series data touches all walks of life, and extracting the required information from time series data has important practical significance.
Time series data are characterized by large volume, high dimensionality, and continuous updating, which makes them relatively complicated. Because of this complexity, the analysis of time series data has long been challenging. Early work on time series data analysis can be categorized into several major periods: descriptive time series analysis, statistical time series analysis, frequency-domain analysis, time-domain analysis, and time series data mining. In the past 20 years, the analysis of time series data has received widespread attention, and various research methods have been proposed. Related research mainly includes time series similarity [1][2][3], search and query [4], dimensionality reduction [5,6], segmentation [7,8], anomaly detection [9,10], topic discovery [11], prediction [12,13], clustering [14][15][16][17][18][19][20][21][22], classification [23][24][25], and segmentation [26,27].
This article focuses on the clustering analysis of time series data, using LSTM combined with pooling technology for feature extraction and the UK-means algorithm for clustering. The work of this research is summarized as follows: (1) The usual way to use an RNN for feature extraction is to select the last hidden unit of the network to represent the data. However, the original time series cannot be expressed well using only the last layer, so this article uses pooling technology to combine the hidden-layer representations of all time steps. This method effectively reduces the data dimension while preserving the original information to the greatest extent. (2) A K-means clustering algorithm suitable for imbalanced data is introduced. The algorithm first oversamples the dataset to construct multiple balanced training subsets. Second, traditional K-means is used to cluster each subset to obtain a clustering result per training subset. Finally, an ensemble strategy is used to obtain the final clustering result.
(3) The dimensionality-reduced feature data are input into the UK-means model to obtain the clustering results for the time series data. By comparing the experimental results of different feature extraction methods, dimensionality reduction techniques, and clustering models, it is verified that the method used in this study gives the best analysis results on time series data.

Time Series Data Analysis Technology.
Time series data analysis is mainly aimed at prediction, classification, and anomaly detection. Regardless of the specific purpose, the technologies used can be roughly divided into two categories: one is based on traditional analysis techniques, and the other is based on deep learning.

Traditional Analysis Technology.
Traditional analysis methods can be divided into two categories: qualitative analysis and quantitative analysis. Qualitative analysis is often used to predict trends, and judgments can be made without referring to precise industry figures. Such figures include an industry's previous sales volume, market competition intensity, product planning strategy, and many other factors. Industry experts draw on a large amount of existing data and then apply their own judgment to synthesize conclusions about the trend of a certain indicator in the next stage [28]. When market data are not accurate and complete, especially when there are no numerical measurements, this method is the main reference method for product sales analysis in the industry. It relies on the long-term experience and intuition of industry experts and can give results quickly, but it is not accurate enough. Qualitative forecasting is a common time series analysis method in practice [29].
Quantitative analysis is a numerical analysis method. The biggest difference is that quantitative analysis relies on objective numbers, while qualitative analysis relies on intuition. Quantitative analysis is more accurate than qualitative analysis, although its time complexity increases accordingly. For more precise issues, such as changes in product sales and fluctuations in traffic flow, quantitative analysis is more convincing because it can objectively discover inherent statistical laws through numbers. Currently, quantitative analysis mainly includes regression analysis methods [30], traditional statistical analysis methods [31][32][33], and machine learning methods [15][16][17][18][34][35][36][37][38][39].

Deep Learning Technology.
The theory of deep learning was developed from the artificial neural network model. Deep learning methods introduce deeper network structures on top of artificial neural networks and conduct deeper analyses of the extracted features and temporal relationships in data. For example, the pretraining method proposed for the deep belief network model allows an artificial neural network to initialize its weights before training so that their values are not too far from the ideal values; this can significantly save computing resources. After pretraining, fine-tuning is used for training, and a deep network can be realized through these two steps. For deep network models, a layer-by-layer training method is generally adopted: a loss function is constructed from the output layer and the expected output and optimized by stochastic gradient descent, so that training proceeds layer by layer from the outside in. The advantage of a deep network is that each layer may extract different features, ultimately forming a powerful learning ability. Typical deep learning models can be found in references [40][41][42][43].

LSTM Network and Pooling Technology.
The LSTM network contains a cell and three gates: the input gate, the output gate, and the forget gate. The function of the gates is to limit the flow of data using activation functions. In Figure 1, f and g represent activation functions. Figure 1 shows that the input gate, output gate, forget gate, and cell unit calculations all involve the input data and the hidden data.
The cell state depends on the forget gate, the input gate, its previous state, and the cell unit. The hidden layer depends on the output gate and the activated cell state. The hidden-layer information at the previous moment and the current input are gathered into the three gates and, after the weighted connections, the activation function squashes the values; in this way, the output values of the three gates at the current moment are obtained. The temporary cell information is computed in the same way as the three gates: from the hidden information at the previous moment and the current input, its value at the current moment is obtained with the activation function. Then, the cell information at the previous moment and the output values of the input gate and forget gate are used to obtain the current cell state. Finally, the output of the current hidden layer is obtained from the output gate and the cell state.
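As a rough sketch of the gate computations just described, the following NumPy code runs one LSTM time step. The stacked weight layout, the sizes, and the random initialization are illustrative choices, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step: input, forget, and output gates plus the cell update.

    W_x: (4*H, D) input weights, W_h: (4*H, H) recurrent weights, b: (4*H,) biases,
    stacked in gate order [input, forget, output, candidate]."""
    H = h_prev.shape[0]
    z = W_x @ x_t + W_h @ h_prev + b   # pre-activations for all gates at once
    i = sigmoid(z[0:H])                # input gate
    f = sigmoid(z[H:2*H])              # forget gate
    o = sigmoid(z[2*H:3*H])            # output gate
    g = np.tanh(z[3*H:4*H])            # candidate cell ("temporary cell information")
    c_t = f * c_prev + i * g           # new cell state
    h_t = o * np.tanh(c_t)             # new hidden state
    return h_t, c_t

# Toy usage: D = 3 input features, H = 2 hidden units
rng = np.random.default_rng(0)
D, H = 3, 2
W_x = rng.normal(size=(4 * H, D))
W_h = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W_x, W_h, b)
print(h.shape, c.shape)  # (2,) (2,)
```

Because the hidden state is the product of a sigmoid gate and a tanh of the cell, each of its components is always strictly between -1 and 1.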
Each parameter in the LSTM network is defined as follows. The meaning of each symbol is shown in Table 1.
The forward formulas of the network are as follows:

Input gate: i_t = f(w_{il} x_t + w_{hl} h_{t-1}).
Forget gate: φ_t = f(w_{iφ} x_t + w_{hφ} h_{t-1}).
Output gate: o_t = f(w_{iw} x_t + w_{hw} h_{t-1}).
Cells: c_t = φ_t ⊙ c_{t-1} + i_t ⊙ c̃_t, with temporary cell information c̃_t = g(w_{ic} x_t + w_{hc} h_{t-1}).
Hidden layer: h_t = o_t ⊙ g(c_t).
Output: y_t = w_{h1} h_t.

The RNN can obtain the required hidden-layer information and then use average pooling to obtain the features after dimensionality reduction. Let the transformed feature be H = [h_1, h_2, . . . , h_C] and the feature dimension be C; then each component is obtained by averaging the corresponding hidden-unit value over the T time steps:

h_i = (1/T) Σ_{t=1}^{T} h_i^{(t)}, i = 1, . . . , C.

Before clustering, the elements produced by the hidden layer need to be pooled. Pooling is generally used in convolutional neural networks to reduce the dimensionality of the feature vectors output by a convolutional layer, and it can also prevent overfitting. Its purpose is to use a single value to represent a small area. Commonly used methods include maximum pooling, average pooling, and additive pooling. Maximum pooling is often applied to images because it can preserve invariance to rotation and expansion of the image. For time series data, to make the information of each time step supplement the last one, average pooling or additive pooling is most appropriate [44]. Because additive pooling easily produces large values and increases the amount of computation, this study uses average pooling. The principles of the maximum pooling and average pooling operations are compared in Figure 2, where a 2 × 2 window is selected. Figure 2 shows that average pooling takes the mean value in the window, and maximum pooling takes the maximum value in the window. In addition to selecting the size of the window, a pooling operation must also select the span, that is, the stride with which the window moves over the matrix. The span in Figure 2 is 2; in practice, a span of 1 is sometimes chosen.
This makes it easier to capture translation-invariant relationships.
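The windowed pooling described above, and the averaging of stacked hidden states over time, can be sketched as follows. The 4 × 4 matrix and the T = 50, C = 8 sizes are made-up examples.

```python
import numpy as np

def pool2d(M, win=2, stride=2, mode="avg"):
    """Pool a 2-D matrix with a win x win window moved at the given stride."""
    reduce_fn = np.mean if mode == "avg" else np.max
    rows = (M.shape[0] - win) // stride + 1
    cols = (M.shape[1] - win) // stride + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = reduce_fn(M[r*stride:r*stride+win, c*stride:c*stride+win])
    return out

M = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [1., 1., 0., 0.],
              [3., 3., 2., 2.]])
print(pool2d(M, mode="avg"))   # [[4. 5.] [2. 1.]]
print(pool2d(M, mode="max"))   # [[7. 8.] [3. 2.]]

# Averaging the stacked hidden states over time, as done before clustering:
Hstack = np.random.default_rng(1).normal(size=(50, 8))  # T=50 hidden states, C=8 features
pooled = Hstack.mean(axis=0)                            # one C-dimensional representation
```

With window 2 and stride 2 the windows do not overlap, which is the Figure 2 setting; stride 1 would make them overlap.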

Unbalanced K-Means Algorithm.
The traditional K-means algorithm is mainly aimed at cluster analysis of balanced data, whereas time series data are mostly imbalanced. When traditional K-means processes such data, its performance drops considerably. To improve the clustering performance of K-means on imbalanced data, this paper uses an unbalanced K-means algorithm (UK-means). The principle of the algorithm is shown in Figure 3.
As Figure 3 shows, UK-means first oversamples the dataset by randomly drawing M subsets from the majority class, each containing N samples. Each majority-class subset is fused with the minority-class sample set to form M datasets with a sample size of 2N. Second, K-means clustering is performed on each training subset to obtain its clustering result. Finally, the results of the training subsets are combined by average weighting to obtain the final clustering result.
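A minimal sketch of the UK-means procedure described above, under the assumption that the majority and minority groups are given. The farthest-point initialization and the centroid-averaging ensemble are simplifications chosen for illustration, not necessarily the exact scheme of [38].

```python
import numpy as np
from itertools import permutations

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means with farthest-point initialization; returns centroids."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):                       # pick each next seed far from the others
        d = np.min([((X - c) ** 2).sum(-1) for c in C], axis=0)
        C.append(X[int(np.argmax(d))])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C

def align(C, ref):
    """Permute centroids C to best match the reference centroids (small k only)."""
    best = min(permutations(range(len(C))),
               key=lambda p: ((C[list(p)] - ref) ** 2).sum())
    return C[list(best)]

def uk_means(X_major, X_minor, k=2, M=5, seed=0):
    """Draw M majority-class subsets the size of the minority set, fuse each with
    the minority set into a balanced subset of 2N samples, run k-means on each,
    then average the (aligned) centroids across subsets and assign all points."""
    rng = np.random.default_rng(seed)
    N = len(X_minor)
    runs = []
    for m in range(M):
        idx = rng.choice(len(X_major), size=N, replace=False)
        runs.append(kmeans(np.vstack([X_major[idx], X_minor]), k, seed=seed + m))
    ref = runs[0]
    C_final = np.mean([ref] + [align(C, ref) for C in runs[1:]], axis=0)
    X = np.vstack([X_major, X_minor])
    return np.argmin(((X[:, None, :] - C_final[None]) ** 2).sum(-1), axis=1)

# Toy imbalanced data: 200 majority points near (0, 0), 20 minority near (5, 5)
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 0.5, size=(200, 2))
X_minor = rng.normal(5.0, 0.5, size=(20, 2))
y = uk_means(X_major, X_minor, k=2, M=5)
```

Because every balanced subset contains as many minority as majority samples, the minority cluster is not swallowed by the majority one, which is the failure mode of plain K-means on imbalanced data.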

Framework of the Method Used.
For the cluster analysis of time series data, this paper uses a clustering framework that combines LSTM and pooling technology. Figure 4 shows a diagram of the cluster analysis framework used in this study. In Figure 4, the original sequence is first input element by element into the LSTM for feature extraction, and the hidden states obtained at each moment are gathered. Second, average pooling is used to reduce the dimensionality and obtain the transformed data representation. Finally, a clustering algorithm for imbalanced data is used to cluster the transformed data; this study uses the UK-means model, introduced in Section 2.4, for the clustering operations. In summary, the main idea of this research is to use LSTM to extract features and reduce the dimensionality of the datasets and then use UK-means to cluster the feature data to obtain the final clustering result. The flow chart of this method is shown in Figure 5.

Parameter Update in the LSTM Model.
The mean square error is used to obtain the objective:

E = (1/2) Σ_t (y_t − ŷ_t)²,

where ŷ_t is the desired output at time t. Differentiating with respect to the input gate, forget gate, output gate, and internal state gives

Input gate: ∂E/∂i_t = ∂E/∂c_t ⊙ c̃_t.
Forget gate: ∂E/∂φ_t = ∂E/∂c_t ⊙ c_{t−1}.
Output gate: ∂E/∂o_t = ∂E/∂h_t ⊙ g(c_t).
Internal state: ∂E/∂c̃_t = ∂E/∂c_t ⊙ i_t.

The error reaches the cell state either through the hidden layer at the current time or through the cell state at the next time. Therefore, the cell state gradient is

∂E/∂c_t = ∂E/∂h_t ⊙ o_t ⊙ g′(c_t) + ∂E/∂c_{t+1} ⊙ φ_{t+1}.

The gradient of the hidden layer collects the error from the output layer and the error propagated back from the gates at the next time step through the recurrent weights:

∂E/∂h_t = w_{h1}ᵀ (y_t − ŷ_t) + (terms propagated back through w_{hl}, w_{hφ}, w_{hc}, and w_{hw}).

The weight parameters mainly include w_{h1}, w_{il}, w_{iφ}, w_{ic}, w_{iw}, w_{hl}, w_{hφ}, w_{hc}, and w_{hw}. These parameters can be divided into 3 categories, namely, w_{h1}, w_i (w_{il}, w_{iφ}, w_{ic}, w_{iw}), and w_h (w_{hl}, w_{hφ}, w_{hc}, w_{hw}). Gradient descent is used to update these three types of weight parameters:

w ← w − η ∂E/∂w,

where η is the learning rate.
Mathematical Problems in Engineering
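The gradient-descent update can be illustrated on a toy problem. Only the output weights w_h1 are trained here, and the hidden vector, target, and learning rate are hand-picked for the example; in the full model, the gate weights receive analogous updates, with their gradients obtained by backpropagation through time.

```python
import numpy as np

h = np.array([0.5, -1.0, 0.25, 2.0])   # a fixed hidden-state vector for the example
target = 1.5                           # desired output
w_h1 = np.zeros_like(h)                # output weights, initialized to zero
eta = 0.1                              # learning rate (chosen for this toy problem)

for _ in range(200):
    y = w_h1 @ h                       # network output y = w_h1 . h
    grad = (y - target) * h            # dE/dw_h1 for E = 0.5 * (y - target)^2
    w_h1 = w_h1 - eta * grad           # gradient-descent update w <- w - eta * dE/dw

print(abs(float(w_h1 @ h) - target) < 1e-6)  # True: the squared error is driven to ~0
```

For a quadratic objective like this, the update converges as long as the effective step eta * ||h||^2 stays below 2, which is why the learning rate must be chosen with the scale of the hidden activations in mind.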

Experimental Data.
This article conducts experiments on the UCR public dataset.
The UCR archive provides multiple datasets based on real scenarios, each of which contains a training set and a test set in the same format. Each row in a dataset is one record: the first number is the category, and the remaining part is a time series. For ease of use, all data in the UCR archive are standardized using z-scores. All experiments use the UCR's default training and test sets without redivision. The datasets used in the experiments are introduced in Table 2.
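A sketch of how one UCR-format record might be parsed and z-scored. The delimiter handling and the toy input line are assumptions; exact file formats vary across UCR releases.

```python
import numpy as np

def parse_ucr_line(line):
    """Parse one UCR-format record: the first value is the class label,
    the remaining values are the time series (comma- or whitespace-separated)."""
    sep = "," if "," in line else None          # None => split on any whitespace
    values = line.strip().split(sep)
    label = int(float(values[0]))
    series = np.array([float(v) for v in values[1:]])
    return label, series

def zscore(x):
    """z-score standardization, as applied to all UCR series."""
    return (x - x.mean()) / x.std()

label, series = parse_ucr_line("2,1.0,2.0,3.0,4.0")  # hypothetical record
print(label, series)                                  # 2 [1. 2. 3. 4.]
z = zscore(series)                                    # zero mean, unit variance
```

After z-scoring, every series has zero mean and unit standard deviation, so clustering compares shapes rather than absolute levels.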

Experimental Environment and Evaluation Index.
The implementation language for all algorithms in this study is Python 2.7. In the hardware environment, the CPU is an Intel Core i7-7700K at 4.4 GHz, the memory is 16 GB, and the graphics card is an NVIDIA GTX 1070. This study uses the accuracy (Acc) and F-score as the two evaluation indicators to evaluate the clustering performance of the proposed framework on time series data. The evaluation indices are calculated as follows:

Acc = (TP + TN)/(TP + TN + FP + FN),
P = TP/(TP + FP), R = TP/(TP + FN),
F-score = 2PR/(P + R).

The larger the Acc value, the better the clustering performance. When the values of P and R are both high, a larger F-score indicates a better clustering effect.
The F-score is more suitable for the classification and evaluation of imbalanced data. A description of each parameter in the above formulas is given in Table 3, where P stands for positive, N stands for negative, T stands for true, and F stands for false.
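The standard definitions of Acc, precision P, recall R, and the F-score in terms of the Table 3 counts can be computed as follows; the confusion counts in the example are made up.

```python
def clustering_metrics(tp, fp, tn, fn):
    """Accuracy and F-score from the confusion counts of Table 3."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)            # precision
    r = tp / (tp + fn)            # recall
    f_score = 2 * p * r / (p + r) # harmonic mean of precision and recall
    return acc, f_score

# Example: 80 true positives, 10 false positives, 95 true negatives, 15 false negatives
acc, f = clustering_metrics(80, 10, 95, 15)
print(round(acc, 3), round(f, 3))  # 0.875 0.865
```

Because the F-score combines precision and recall, it penalizes a model that achieves high accuracy simply by predicting the majority class, which is why it suits imbalanced data.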

Experiments on Different Clustering Algorithms.
The samples are input into the LSTM network for feature extraction, average pooling is then used to reduce the dimensionality, and finally different clustering algorithms are applied to the transformed features to explore their influence on the clustering of time series data. The comparison algorithms used are K-means [17], fuzzy C-means clustering (FCM) [16], soft subspace clustering (SSC) [18], and unbalanced K-means clustering (UK-means) [38]. The experimental results are shown in Table 4 and verify the following three conclusions. (1) With the same feature extraction and dimensionality reduction technology, different clustering algorithms produce different clustering effects. (2) On the Adiac, Beef, CBF, and ChlorineCon datasets, the UK-means algorithm in the sixth column has the best clustering effect. This shows that the unbalanced clustering algorithm yields better clustering; for time series data, an unbalanced clustering algorithm is more suitable for cluster analysis. (3) On the CinCECGTorso and Coffee datasets, the clustering effect of the SSC algorithm is the best. This is because the SSC algorithm not only reflects the relationship between the sample attributes and clusters but also reflects differences among the related attributes. The clustering results obtained by UK-means on these two datasets are slightly lower than those obtained by SSC, but the performance difference is not very large. This shows that UK-means can also be used on these datasets without significantly reducing the clustering performance.

Experiments on Different Dimensionality Reduction Techniques.
After a sample passes through the LSTM network, the feature vector dimension is still very large. This study uses an average pooling technique to reduce its dimensionality. To verify the effectiveness of this dimensionality reduction method, three settings are compared here: no dimensionality reduction, maximum pooling, and average pooling. The final clustering algorithm is UK-means. The comparative experimental results are shown in Table 5. The third column of Table 5 shows the clustering results obtained without any dimensionality reduction. The fourth column shows the results obtained using maximum pooling, and the fifth column shows the results obtained using average pooling. Clearly, the values in column 5 are significantly greater than those in column 3.
This shows that using pooling technology to enhance the expressive ability of the hidden layer is very effective. The values in column 5 are slightly larger than those in column 4, which shows that average pooling is more suitable for the dimensionality reduction of time series data.

Feature Extraction Experiment.
This research uses the LSTM model to extract features from the input samples. To verify its effectiveness, the wavelet transform [45] is used as the contrast feature extraction method. Within the overall clustering framework, the dimensionality reduction step still uses average pooling, and the clustering algorithm is UK-means. The final experimental results are shown in Table 6.

Table 2: Datasets used in the experiments.

Dataset         Classes   Training size   Test size   Time series length
Adiac           37        390             391         176
Beef            5         30              30          470
CBF             3         30              900         128
ChlorineCon     3         467             3840        166
CinCECGTorso    4         40              1380        1639
Coffee          2         28              28          286

Table 6 shows the experimental results of feature extraction using the LSTM model. The third column shows the experimental results obtained using the wavelet transform for feature extraction. On the Adiac, Beef, ChlorineCon, and CinCECGTorso datasets, the values in the fourth column are all greater than those in the third column.
This shows that the LSTM-based feature extraction method works better on these 4 datasets. On the CBF and Coffee datasets, the values in the third column are greater than those in the fourth column, which shows that the clustering results obtained with the wavelet transform are slightly better than those obtained with LSTM; however, the difference is not very large. In summary, the clustering effect when using the LSTM network for feature extraction is better than that of the wavelet transform on 4 datasets (Adiac, Beef, ChlorineCon, and CinCECGTorso), while on the remaining 2 datasets (CBF and Coffee) the two feature extraction methods differ little. On balance, it is more sensible to use the LSTM method for feature extraction.

Conclusions
Because time series data have many special properties, commonly used clustering algorithms cannot achieve satisfactory results when clustering them. The purpose of this research is to find suitable models for various time series data. Research on time series data generally focuses on their chronological nature. To capture this property, this study uses an RNN, which can process the data in chronological order, to train on the data. Due to the gradient problems of traditional RNNs, they have shortcomings in practical applications.
This study uses a special RNN, namely, LSTM, to learn a dimensionality-reduced representation of a time series. For long time series, the hidden layer of the network cannot remember all of the time information, so it is difficult to compress the global information into the last layer. In response to this problem, this research performs an average pooling operation after stacking all the hidden-unit information to further reduce the dimensionality of the data. Finally, the UK-means algorithm is used to cluster the feature data after dimensionality reduction. The experiments are conducted on multiple UCR public datasets, and the results verify the effectiveness of the clustering framework used. The method used in this research involves a large number of matrix operations, such as matrix addition and matrix multiplication, which places high demands on hardware performance. The next step will be to study how to improve the efficiency of the algorithm.

Data Availability
The labeled datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.