The completion of missing values is a prevalent problem in many domains of pattern recognition and signal processing. Analyzing incomplete data may lead to a loss of power and unreliable results, especially when large subsequences are missing. This paper therefore introduces a new approach for filling successive missing values in low/uncorrelated multivariate time series that can manage a high level of uncertainty. To this end, we propose a novel fuzzy weighting-based similarity measure. The proposed method involves three main steps. Firstly, for each incomplete signal, the data before a gap and the data after this gap are considered as two separate reference time series with their respective query windows
Nowadays, huge time series are available thanks to effective low-cost sensors, the wide deployment of remote sensing systems, Internet-based measurement networks, etc. However, collected data are often incomplete for various reasons such as sensor errors, transmission problems, incorrect measurements, bad weather conditions (outdoor sensors), for manual
Most proposed models for multivariate time series analysis have difficulty processing incomplete datasets, despite their powerful techniques, because they usually require complete data. The question is then how missing values should be dealt with. Ignoring or deleting them is a simple solution, but serious problems regularly arise when it is applied, particularly in time series where each value depends on the previous ones. Furthermore, an analysis based on the systematic differences between observed and unobserved data leads to biased and unreliable results [
Considering imputation methods for multivariate time series, taking advantage of the correlations between variables is commonly applied to predict lacking data [
Particularly, imperfect time series can be modelled using fuzzy sets. The fuzzy approach makes it possible to handle incomplete data, vague, and imprecise circumstances [
Thus, this paper aims to propose a new approach, named FSMUMI, to fill large missing values in low/uncorrelated multivariate time series by developing a new similarity measure based on fuzzy logic. However, estimating the distribution of missing values and whole signals is very difficult, so our approach makes an assumption of effective patterns (or recurrent data) on each signal.
The rest of this paper is organized as follows. In Section
This section presents, first, related work about multivariate imputation methods, followed by a review on the fuzzy similarity measure and its applications.
Up to now, numerous studies have been devoted to completing missing data in multivariate time series, such as [
In view of the model-based imputation, two main methods were proposed. The first method was introduced by Schafer [
According to the concept of machine learning-based imputation, many studies focus on completion of missing data in multivariate time series. Stekhoven and Bühlmann [
Besides these principal techniques, clustering-based imputation approaches are considered powerful tools for completing missing values thanks to their ability to detect similar patterns. The objective of these techniques is to separate the data into several clusters while satisfying the following conditions: maximizing the intercluster dissimilarity and minimizing the intracluster dissimilarity. Li et al. [
In general, most of the imputation algorithms for multivariate time series take advantage of dependencies between attributes to predict missing values.
Indeed similarity-based approaches are a promising tool for time series analysis. However, many of these techniques rely on parameter tuning, and they may have shortcomings due to dependencies between variables. The objective of this study is to fill large missing values in
where
measures based on the operations of union and intersection, measures based on the maximum difference, measures based on the difference and the sum of membership grades.
In [
Concerning the similarity between two subsequences of time series, we can use the DTW cost as a similarity measure. However, to deal with the high level of uncertainty of the processed signals, several other measures can be used, such as the cosine similarity, the Euclidean distance, and the Pearson correlation coefficient. Moreover, a fuzzy-weighted combination of scores generated from different similarity measures could achieve better retrieval results than a single similarity measure [
Based on the same concepts, we propose using a fuzzy rules interpolation scheme between grades of membership of fuzzy values. This method makes it possible to build a new hybrid similarity measure for finding similar values between subsequences of time series.
The proposed imputation method is based on the retrieval and the similarity comparison of available subsequences. In order to compare the subsequences, we create a new similarity measure applying a multiple fuzzy rules interpolation. This section is divided into two parts. Firstly, we focus on the way to compute a new similarity measure between subsequences. Then, we provide details of the proposed approach (namely, Fuzzy Similarity Measure Based Uncorrelated Multivariate Imputation, FSMUMI) to impute the successive missing values of low/uncorrelated multivariate time series.
To introduce a new similarity measure using multiple fuzzy rules interpolation to solve the missing problem, we have to define an information granule, as introduced by Pedrycz [
To answer the first condition, we take into account 3 different distance measures between two subsequences:
Cosine distance is computed by (
Euclidean distance is calculated by
To satisfy the input condition of fuzzy logic rules, we normalize this distance to
Similarity measure is defined by the function (
To answer the second condition, we use these 3 distance measures (or attributes) to generate 4 fuzzy similarities (see Figure
Computing scheme of the new similarity measure.
Membership function of fuzzy similarity values.
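The three base measures named above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact formulas are not reproduced in this text, so the function names, the `d / (1 + d)` normalization of the Euclidean distance, and the range-scaled form of the similarity function are hypothetical choices, not necessarily the authors' definitions.

```python
import numpy as np

def cosine_distance(q, r):
    """1 - cosine similarity between two subsequences of equal length."""
    q, r = np.asarray(q, dtype=float), np.asarray(r, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    if denom == 0.0:
        return 1.0  # assumed convention when one subsequence is all zeros
    return 1.0 - float(np.dot(q, r) / denom)

def normalized_euclidean(q, r):
    """Euclidean distance mapped into [0, 1) by d / (1 + d) (assumed normalization)."""
    q, r = np.asarray(q, dtype=float), np.asarray(r, dtype=float)
    d = float(np.linalg.norm(q - r))
    return d / (1.0 + d)

def sim(q, r):
    """Mean of 1 / (1 + |q_i - r_i| / range(q)); equals 1 for identical curves (assumed form)."""
    q, r = np.asarray(q, dtype=float), np.asarray(r, dtype=float)
    span = q.max() - q.min()
    if span == 0.0:
        span = 1.0  # avoid division by zero for a constant query
    return float(np.mean(1.0 / (1.0 + np.abs(q - r) / span)))
```

All three functions return values in [0, 1], which is the kind of input a set of fuzzy membership functions would typically expect.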
And, finally, the new similarity measure is determined by Rule R:
Let us consider some notations about multivariate time series and the concept of large gap. A multivariate time series is represented as a matrix
Here, we deal with large missing values in low/uncorrelated multivariate time series. For isolated missing values (
The mechanism of FSMUMI approach is demonstrated in Figure
Scheme of the completion process:
This method concentrates on filling missing values in low/uncorrelated multivariate time series. For this type of data, we cannot take advantage of the relations between features to estimate missing values. We must therefore rely on the observed values of each signal to complete its own missing data, which means that the variables can be completed one by one. Further, an important point of our approach is that each incomplete signal is processed as two separate time series, one before the considered gap and one after it. This increases the search space for similar values. Moreover, by applying the proposed process one variable at a time, FSMUMI makes it possible to handle the problem of wholly missing variables (missing data at the same index in all variables).
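The splitting of one incomplete signal into two reference series with their query windows can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `split_around_gap` is hypothetical, and the query length is assumed to equal the gap length.

```python
import numpy as np

def split_around_gap(signal, gap_start, gap_len):
    """Split a 1-D signal into the data before and after a gap, plus one query per side.

    Assumption: each query window has the same length as the gap and is
    taken immediately next to it.
    """
    signal = np.asarray(signal, dtype=float)
    before = signal[:gap_start]            # reference series before the gap
    after = signal[gap_start + gap_len:]   # reference series after the gap
    if len(before) < gap_len or len(after) < gap_len:
        raise ValueError("not enough data on one side of the gap to build a query")
    query_before = before[-gap_len:]       # last values preceding the gap
    query_after = after[:gap_len]          # first values following the gap
    return before, after, query_before, query_after
```

Because only the indices outside the gap are read, the same call works whether the gap is stored as NaN or as any other placeholder.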
The proposed model is described in Algorithm
The first phase: Building queries (cf. 1 in Figure
For each incomplete signal and each
The second phase: Finding the most similar windows (cf. 2 and 3 in Figure
For the
We first find the threshold, which allows considering two windows to be similar. For each increment
We then find the most similar window to the query
The same process is performed to find the most similar window
In the proposed approach, the dynamics and the shape of data before and after a gap are a key-point of our method. This means we take into account both queries
The third phase (cf. 4 in Figure
When results from both referenced time series are available, we fill in the gap by averaging values of the window preceding
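The three phases can be illustrated with the simplified Python sketch below. It rests on several assumptions that should be read as such: a single range-scaled similarity (`sim`, an assumed form) stands in for the fuzzy-weighted measure, the threshold step is skipped, windows are scanned with a step of one, and the gap is filled with the average of the window following the best match found before the gap and the window preceding the best match found after it. All names are hypothetical.

```python
import numpy as np

def sim(q, r):
    """Stand-in similarity in (0, 1]; 1 means identical windows (assumed form)."""
    span = q.max() - q.min()
    if span == 0.0:
        span = 1.0
    return float(np.mean(1.0 / (1.0 + np.abs(q - r) / span)))

def most_similar_start(reference, query, starts):
    """Start index, among `starts`, of the reference window most similar to the query."""
    t = len(query)
    scores = [sim(query, reference[s:s + t]) for s in starts]
    return starts[int(np.argmax(scores))]

def fill_gap(signal, gap_start, gap_len):
    """Fill one gap of a univariate signal from similar windows on both sides."""
    x = np.asarray(signal, dtype=float).copy()
    t = gap_len
    before, after = x[:gap_start], x[gap_start + t:]
    if len(before) < 2 * t or len(after) < 2 * t:
        raise ValueError("each side needs at least two window lengths of data")
    # Phase 1: queries are the windows adjacent to the gap.
    query_before, query_after = before[-t:], after[:t]
    # Phase 2: scan candidate windows that do not overlap the query itself.
    sb = most_similar_start(before, query_before, range(0, len(before) - 2 * t + 1))
    sa = most_similar_start(after, query_after, range(t, len(after) - t + 1))
    # Phase 3: average the window following the match before the gap
    # with the window preceding the match after the gap.
    x[gap_start:gap_start + t] = 0.5 * (before[sb + t:sb + 2 * t] + after[sa - t:sa])
    return x
```

On a strictly periodic signal whose period equals the gap length, this sketch recovers the missing values exactly; on real data the quality depends on whether similar patterns recur elsewhere in the series.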
The experiments are performed on three multivariate time series with the same experiment process and the same gaps, described in detail below.
For the assessment of the proposed approach and the comparison of its performance to several published algorithms, we use 3 multivariate time series, one from UCI Machine Learning repository, one simulated dataset (this allows us to handle the correlations between variables and percentage of missing values), and finally a real time series hourly sampled by IFREMER (France) in the eastern English Channel.
where These data are very large so we choose only a subset of 3 signals for performing experiments.
After imputation, the imputed values are compared with the actual values in the complete series to evaluate the ability of the different imputation methods. Therefore, it is necessary to fill missing values in the water temperature. To ensure the fairness of all algorithms, filling in the water temperature series is performed by using the na.interp method ([
In the present study, we compare the proposed algorithm with 7 other approaches (Amelia II, FcM, MI, MICE, missForest, na.approx, and DTWUMI) for the imputation of multivariate time series. We use the R language to execute all these algorithms.
In order to estimate the quantitative performance of imputation approaches, six usual criteria in the literature are used as follows:
Similarity evaluates the percentage of similarity between the estimated values (
where T is the number of missing values. The similarity tends to 1 when the two curves are identical and tends to 0 when the amplitudes are strongly different.
RMSE (Root Mean Square Error) is computed as the square root of the average squared difference between
It is now well accepted that good imputation performance does not automatically lead to good estimation performance. This is why other indices such as FSD, FA2, and FB, which evaluate the shape of the two signals, are used in this study.
FSD (Fraction of Standard Deviation) is defined as
This fraction indicates whether a method is acceptable or not. For the imputation task, the closer the FSD value is to 0, the better the imputation method.
FB (Fractional Bias) determines the rate of predicted values
FA2 defines the percentage of outlier between two variables
When the FA2 value is close to 1, a model is considered perfect.
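The indices named above can be sketched in Python as follows. The exact formulas are not reproduced in this text, so the definitions below are commonly used forms and should be read as assumptions: `y` denotes the imputed values, `x` the true values, the standard deviation is the population one, and FA2 is taken as the fraction of points with `0.5 <= y/x <= 2`.

```python
import numpy as np

def similarity(y, x):
    """Mean of 1 / (1 + |y_i - x_i| / range(x)); 1 when the curves are identical."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0.0:
        span = 1.0  # avoid division by zero for a constant true series
    return float(np.mean(1.0 / (1.0 + np.abs(y - x) / span)))

def rmse(y, x):
    """Square root of the mean squared difference."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean((y - x) ** 2)))

def fsd(y, x):
    """Fraction of standard deviation: 2 * |sd(y) - sd(x)| / (sd(y) + sd(x))."""
    sy, sx = float(np.std(y)), float(np.std(x))
    return 2.0 * abs(sy - sx) / (sy + sx)

def fb(y, x):
    """Fractional bias: 2 * |mean(y) - mean(x)| / (mean(y) + mean(x))."""
    my, mx = float(np.mean(y)), float(np.mean(x))
    return 2.0 * abs(my - mx) / (my + mx)

def fa2(y, x):
    """Fraction of points whose ratio y/x lies in [0.5, 2]; x must be nonzero."""
    ratio = np.asarray(y, dtype=float) / np.asarray(x, dtype=float)
    return float(np.mean((ratio >= 0.5) & (ratio <= 2.0)))
```

With these forms, a perfect imputation gives a similarity and FA2 of 1 and an RMSE, FSD, and FB of 0, which matches the interpretation given in the text.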
In practice, imputation methods cannot be evaluated directly because the actual values are missing. We must therefore produce artificial missing data in complete time series in order to compare the performance of imputation approaches. We use a three-step technique to assess the results, detailed in the following:
In this paper, we perform experiments with seven missing data levels on three large datasets. On each signal, we create simulated gaps at rates of 1%, 2%, 3%, 4%, 5%, 7.5%, and 10% of the data in the complete signal (here the biggest gap of MAREL-Carnot data is 3,533 missing values, corresponding to 5 months of hourly samples). For every missing ratio, the approaches are run 5 times by randomly choosing the positions of the missing values in the data. We then perform
This section provides experiment results obtained from the proposed approach and compares its ability with the seven published approaches. Results are discussed in three parts, i.e., quantitative performance, visual performance, and execution times.
Tables
Average imputation performance indices of various imputation algorithms on synthetic dataset (100,000 collected points).
Gap size | Method | 1-Sim (accuracy) | 1- (accuracy) | RMSE (accuracy) | FSD (shape) | FB (shape) | 1-FA2 (shape) |
---|---|---|---|---|---|---|---|
1% | FSMUMI | 0.136 | 0.261 | 0.051 | 0.358 | 3.253 | 0.364 |
1% | Amelia | 0.275 | 0.999 | 0.143 | 0.409 | 2.252 | 0.773 |
1% | FcM | 0.231 | 0.722 | 0.096 | 1.889 | 2.208 | 0.996 |
1% | MI | 0.275 | 0.999 | 0.142 | 0.421 | 2.091 | 0.773 |
1% | MICE | 0.258 | 0.944 | 0.13 | 0.406 | 2.452 | 0.72 |
1% | missForest | 0.248 | 0.915 | 0.122 | 0.389 | 3.976 | 0.744 |
1% | na.approx |  |  |  |  |  |  |
1% | DTWUMI | 0.257 | 0.713 | 0.88 | 0.725 | 0.405 | 0.69 |
2% | FSMUMI |  | 0.295 |  |  |  |  |
2% | Amelia | 0.259 | 0.998 | 0.147 | 0.275 | 2.005 | 0.803 |
2% | FcM | 0.208 | 0.686 | 0.104 | 1.863 | 2.289 | 0.987 |
2% | MI | 0.259 | 0.998 | 0.147 | 0.268 | 2.11 | 0.81 |
2% | MICE | 0.244 | 0.968 | 0.14 | 0.255 | 7.616 | 0.759 |
2% | missForest | 0.239 | 0.968 | 0.133 | 0.279 | 3.156 | 0.792 |
2% | na.approx | 0.104 |  | 0.047 | 0.224 | 0.398 | 0.347 |
2% | DTWUMI | 0.237 | 0.775 | 0.867 | 0.509 | 8.449 | 0.646 |
3% | FSMUMI |  |  |  | 0.219 |  |  |
3% | Amelia | 0.218 | 0.911 | 0.127 |  | 6.128 | 0.76 |
3% | FcM | 0.214 | 0.601 | 0.1 | 1.832 | 1.759 | 0.989 |
3% | MI | 0.253 | 0.993 | 0.141 | 0.236 | 2.295 | 0.775 |
3% | MICE | 0.21 | 0.873 | 0.118 | 0.208 | 5.118 | 0.703 |
3% | missForest | 0.188 | 0.796 | 0.102 | 0.215 | 1.846 | 0.627 |
3% | na.approx | 0.148 | 0.43 | 0.072 | 0.372 | 2.382 | 0.577 |
3% | DTWUMI | 0.231 | 0.799 | 0.874 | 0.332 | 27.952 | 0.69 |
4% | FSMUMI |  |  |  |  |  |  |
4% | Amelia | 0.208 | 1 | 0.14 | 0.213 | 2.171 | 0.807 |
4% | FcM | 0.155 | 0.759 | 0.095 | 1.85 | 2.09 | 0.986 |
4% | MI | 0.208 | 0.999 | 0.14 | 0.196 | 2.302 | 0.807 |
4% | MICE | 0.209 | 0.987 | 0.138 | 0.22 | 3.748 | 0.801 |
4% | missForest | 0.196 | 0.968 | 0.127 | 0.216 | 3.94 | 0.827 |
4% | na.approx | 0.145 | 0.721 | 0.092 | 0.252 | 5.251 | 0.689 |
4% | DTWUMI | 0.148 | 0.586 | 0.918 | 0.185 | 12.688 | 0.719 |
5% | FSMUMI |  |  |  |  |  |  |
5% | Amelia | 0.214 | 0.997 | 0.15 | 0.147 | 2.238 | 0.79 |
5% | FcM | 0.179 | 0.715 | 0.108 | 1.818 | 2.194 | 0.993 |
5% | MI | 0.231 | 0.996 | 0.167 | 0.206 | 3.094 | 0.808 |
5% | MICE | 0.221 | 0.968 | 0.152 | 0.222 | 2.3 | 0.79 |
5% | missForest | 0.212 | 0.944 | 0.143 | 0.315 | 4.547 | 0.819 |
5% | na.approx | 0.16 | 0.8 | 0.118 | 0.352 | 18.217 | 0.622 |
5% | DTWUMI | 0.186 | 0.885 | 0.88 | 0.213 | 0.723 | 0.694 |
7.5% | FSMUMI |  |  |  | 0.069 |  |  |
7.5% | Amelia | 0.197 | 0.998 | 0.147 | 0.045 | 1.305 | 0.792 |
7.5% | FcM | 0.158 | 0.809 | 0.104 | 1.813 | 1.866 | 0.991 |
7.5% | MI | 0.2 | 0.992 | 0.15 |  | 1.645 | 0.797 |
7.5% | MICE | 0.205 | 0.988 | 0.15 | 0.057 | 10.744 | 0.799 |
7.5% | missForest | 0.188 | 0.97 | 0.136 | 0.284 | 4.396 | 0.812 |
7.5% | na.approx | 0.192 | 0.971 | 0.142 | 0.669 | 2.163 | 0.712 |
7.5% | DTWUMI | 0.133 | 0.653 | 0.908 | 0.064 | 1.113 | 0.571 |
10% | FSMUMI |  |  |  | 0.114 |  |  |
10% | Amelia | 0.202 | 0.999 | 0.147 | 0.034 | 4.062 | 0.788 |
10% | FcM | 0.164 | 0.872 | 0.104 | 1.837 | 2.201 | 0.992 |
10% | MI | 0.21 | 0.997 | 0.155 | 0.12 | 2.954 | 0.785 |
10% | MICE | 0.209 | 0.996 | 0.15 | 0.055 | 3.994 | 0.779 |
10% | missForest | 0.194 | 0.97 | 0.135 | 0.308 | 3.024 | 0.811 |
10% | na.approx | 0.183 | 0.997 | 0.129 | 0.372 | 1.455 | 0.719 |
10% | DTWUMI | 0.155 | 0.782 | 0.893 |  | 1.182 | 0.626 |
Average imputation performance indices of various imputation algorithms on simulated dataset (32,000 collected points).
Gap size | Method | 1-Sim (accuracy) | 1- (accuracy) | RMSE (accuracy) | FSD (shape) | FB (shape) | 1-FA2 (shape) |
---|---|---|---|---|---|---|---|
1% | FSMUMI |  |  |  | 0.159 | 2.51 | 0.574 |
1% | Amelia | 0.157 | 1 | 2.206 | 0.232 | 3.619 | 0.794 |
1% | FcM | 0.118 | 0.998 | 1.483 | 1.98 | 2.015 | 0.998 |
1% | MI | 0.16 | 0.999 | 2.241 | 0.2 |  | 0.799 |
1% | MICE | 0.159 | 0.998 | 2.201 | 0.214 | 1.449 | 0.801 |
1% | missForest | 0.127 | 0.998 | 1.608 | 0.836 | 12.034 | 0.861 |
1% | na.approx | 0.146 | 0.992 | 1.901 | 0.393 | 18.997 | 0.777 |
1% | DTWUMI | 0.09 | 0.552 | 1.156 |  | 6.022 |  |
2% | FSMUMI |  |  |  | 0.194 | 1.971 | 0.611 |
2% | Amelia | 0.12 | 0.998 | 2.312 | 0.107 | 2.191 | 0.794 |
2% | FcM | 0.093 | 0.999 | 1.672 | 1.985 |  | 0.998 |
2% | MI | 0.12 | 1 | 2.307 | 0.123 | 3.949 | 0.789 |
2% | MICE | 0.119 | 0.999 | 2.282 | 0.114 | 8.881 | 0.789 |
2% | missForest | 0.096 | 1 | 1.769 | 0.941 | 2.777 | 0.858 |
2% | na.approx | 0.118 | 1 | 2.261 | 0.721 | 2.059 | 0.786 |
2% | DTWUMI | 0.074 | 0.523 | 1.545 |  | 3.686 |  |
3% | FSMUMI |  |  |  | 0.076 | 10.649 | 0.582 |
3% | Amelia | 0.13 | 0.999 | 2.212 | 0.062 | 3.779 | 0.794 |
3% | FcM | 0.098 | 0.999 | 1.526 | 1.984 | 2.22 | 0.997 |
3% | MI | 0.13 | 0.999 | 2.197 | 0.078 | 9.374 | 0.795 |
3% | MICE | 0.129 | 1 | 2.19 | 0.067 |  | 0.792 |
3% | missForest | 0.102 | 0.999 | 1.626 | 0.855 | 2.407 | 0.851 |
3% | na.approx | 0.116 | 0.997 | 1.938 | 0.518 | 1.974 | 0.818 |
3% | DTWUMI | 0.073 | 0.526 | 1.189 |  | 8.725 |  |
4% | FSMUMI |  |  |  | 0.061 |  | 0.568 |
4% | Amelia | 0.122 | 1 | 2.305 | 0.032 | 2.446 | 0.764 |
4% | FcM | 0.096 | 1 | 1.607 | 1.982 | 2.325 | 0.997 |
4% | MI | 0.125 | 1 | 2.261 | 0.043 | 2.391 | 0.792 |
4% | MICE | 0.124 | 0.999 | 2.233 | 0.045 | 42.495 | 0.791 |
4% | missForest | 0.101 | 1 | 1.726 | 0.876 | 2.901 | 0.854 |
4% | na.approx | 0.109 | 1 | 1.99 | 0.475 | 1.94 | 0.811 |
4% | DTWUMI | 0.066 | 0.465 | 1.172 |  | 2.079 |  |
5% | FSMUMI |  |  |  | 0.062 | 4.508 | 0.577 |
5% | Amelia | 0.122 | 1 | 2.273 | 0.028 | 4.109 | 0.798 |
5% | FcM | 0.092 | 1 | 1.619 | 1.984 | 2.192 | 0.998 |
5% | MI | 0.123 | 1 | 2.287 | 0.024 | 5.582 | 0.797 |
5% | MICE | 0.121 | 1 | 2.267 | 0.044 | 2.326 | 0.792 |
5% | missForest | 0.097 | 0.999 | 1.731 | 0.923 | 2.473 | 0.859 |
5% | na.approx | 0.114 | 1 | 1.988 | 0.567 | 2.247 | 0.809 |
5% | DTWUMI | 0.063 | 0.454 | 1.166 |  |  |  |
7.5% | FSMUMI |  |  |  | 0.049 | 4.843 |  |
7.5% | Amelia | 0.117 | 1 | 2.232 | 0.034 | 3.306 | 0.792 |
7.5% | FcM | 0.09 | 1 | 1.605 | 1.981 | 3.562 | 0.998 |
7.5% | MI | 0.119 | 0.999 | 2.259 | 0.025 | 1.946 | 0.793 |
7.5% | MICE | 0.118 | 1 | 2.238 | 0.032 | 9.359 | 0.794 |
7.5% | missForest | 0.094 | 0.999 | 1.695 | 0.907 |  | 0.858 |
7.5% | na.approx | 0.108 | 1 | 1.958 | 0.461 | 3.089 | 0.816 |
7.5% | DTWUMI | 0.065 | 0.477 | 1.19 |  | 3.851 | 0.566 |
10% | FSMUMI |  |  |  | 0.051 | 5.558 |  |
10% | Amelia | 0.117 | 1 | 2.269 | 0.021 | 3.074 | 0.793 |
10% | FcM | 0.089 | 1 | 1.607 | 1.981 | 2.683 | 0.997 |
10% | MI | 0.118 | 0.9996 | 2.233 | 0.02 | 2.05 | 0.793 |
10% | MICE | 0.118 | 0.9998 | 2.254 | 0.018 | 3.424 | 0.793 |
10% | missForest | 0.094 | 0.9999 | 1.702 | 0.909 |  | 0.857 |
10% | na.approx | 0.11 | 1 | 1.958 | 0.541 | 2.006 | 0.798 |
10% | DTWUMI | 0.067 | 0.5371 | 1.293 |  | 3.093 | 0.577 |
Average imputation performance indices of various imputation algorithms on MAREL-Carnot dataset (35,334 collected points).
Gap size | Method | 1-Sim (accuracy) | 1- (accuracy) | RMSE (accuracy) | FSD (shape) | FB (shape) | 1-FA2 (shape) |
---|---|---|---|---|---|---|---|
1% | FSMUMI |  |  |  |  | 0.081 | 0.191 |
1% | Amelia | 0.187 | 0.544 | 5.132 | 0.378 | 0.354 | 0.482 |
1% | FcM | 0.156 | 0.342 | 4.037 | 0.4 | 0.347 | 0.338 |
1% | MI | 0.192 | 0.561 | 5.282 | 0.396 | 0.365 | 0.497 |
1% | MICE | 0.166 | 0.608 | 5.596 | 0.423 | 0.35 | 0.436 |
1% | missForest | 0.165 | 0.472 | 4.422 | 0.385 | 0.355 | 0.381 |
1% | na.approx | 0.061 | 0.171 | 1.748 | 0.067 |  |  |
1% | DTWUMI | 0.084 | 0.181 | 2.466 | 0.214 | 0.149 | 0.198 |
2% | FSMUMI | 0.045 | 0.037 | 1.446 | 0.053 | 0.083 | 0.182 |
2% | Amelia | 0.146 | 0.369 | 4.743 | 0.211 | 0.222 | 0.429 |
2% | FcM | 0.116 | 0.06 | 3.418 | 0.415 | 0.237 | 0.231 |
2% | MI | 0.146 | 0.364 | 4.72 | 0.218 | 0.228 | 0.435 |
2% | MICE | 0.129 | 0.369 | 4.711 | 0.197 | 0.21 | 0.413 |
2% | missForest | 0.116 | 0.155 | 3.575 | 0.33 | 0.193 | 0.258 |
2% | na.approx | 0.06 | 0.07 | 2.012 | 0.045 | 0.094 | 0.214 |
2% | DTWUMI |  |  |  |  |  |  |
3% | FSMUMI |  |  |  | 0.134 | 0.08 |  |
3% | Amelia | 0.176 | 0.503 | 4.694 | 0.426 | 0.224 | 0.478 |
3% | FcM | 0.139 | 0.251 | 3.35 | 0.441 | 0.237 | 0.314 |
3% | MI | 0.17 | 0.531 | 4.474 | 0.354 | 0.221 | 0.476 |
3% | MICE | 0.157 | 0.552 | 4.905 | 0.34 | 0.184 | 0.429 |
3% | missForest | 0.139 | 0.345 | 3.556 | 0.422 | 0.184 | 0.346 |
3% | na.approx | 0.068 | 0.224 | 1.79 |  |  | 0.169 |
3% | DTWUMI | 0.096 | 0.216 | 2.587 | 0.329 | 0.136 | 0.223 |
4% | FSMUMI |  |  |  | 0.094 |  | 0.183 |
4% | Amelia | 0.171 | 0.44 | 4.389 | 0.287 | 0.2 | 0.456 |
4% | FcM | 0.126 | 0.152 | 2.779 | 0.285 | 0.203 | 0.727 |
4% | MI | 0.166 | 0.41 | 4.234 | 0.277 | 0.204 | 0.444 |
4% | MICE | 0.15 | 0.379 | 4.15 | 0.268 | 0.19 | 0.411 |
4% | missForest | 0.129 | 0.234 | 3.134 | 0.23 | 0.187 | 0.303 |
4% | na.approx | 0.077 | 0.13 | 2.006 |  | 0.135 | 0.268 |
4% | DTWUMI | 0.07 | 0.105 | 1.77 | 0.15 | 0.12 |  |
5% | FSMUMI |  | 0.22 |  | 0.227 | 0.152 |  |
5% | Amelia | 0.151 | 0.551 | 4.924 | 0.303 | 0.189 | 0.461 |
5% | FcM | 0.113 | 0.337 | 3.606 | 0.301 | 0.199 | 0.254 |
5% | MI | 0.143 | 0.567 | 4.612 | 0.249 | 0.123 | 0.448 |
5% | MICE | 0.131 | 0.523 | 4.75 | 0.274 | 0.188 | 0.419 |
5% | missForest | 0.104 | 0.371 | 3.443 | 0.229 | 0.147 | 0.274 |
5% | na.approx | 0.065 |  | 2.071 |  |  | 0.233 |
5% | DTWUMI | 0.067 | 0.275 | 2.363 | 0.22 | 0.157 | 0.242 |
7.5% | FSMUMI |  |  |  | 0.075 |  |  |
7.5% | Amelia | 0.14 | 0.42 | 4.546 | 0.191 | 0.197 | 0.437 |
7.5% | FcM | 0.104 | 0.123 | 3.12 | 0.328 | 0.198 | 0.23 |
7.5% | MI | 0.142 | 0.427 | 4.624 | 0.222 | 0.222 | 0.443 |
7.5% | MICE | 0.126 | 0.38 | 4.375 | 0.206 | 0.208 | 0.437 |
7.5% | missForest | 0.112 | 0.202 | 3.587 | 0.329 | 0.228 | 0.288 |
7.5% | na.approx | 0.073 | 0.081 | 2.043 | 0.092 | 0.107 | 0.243 |
7.5% | DTWUMI | 0.06 | 0.102 | 1.999 |  | 0.074 | 0.215 |
10% | FSMUMI |  |  |  |  |  |  |
10% | Amelia | 0.14 | 0.3 | 4.294 | 0.24 | 0.142 | 0.442 |
10% | FcM | 0.1 | 0.098 | 3.68 | 0.136 | 0.101 | 0.303 |
10% | MI | 0.14 | 0.112 | 4.294 | 0.24 | 0.142 | 0.442 |
10% | MICE | 0.12 | 0.42 | 4.066 | 0.152 | 0.077 | 0.383 |
10% | missForest | 0.097 | 0.461 | 3.049 | 0.104 | 0.117 | 0.255 |
10% | na.approx | 0.071 | 0.529 | 1.873 | 0.098 | 0.094 | 0.253 |
10% | DTWUMI | 0.081 | 0.381 | 3.293 | 0.119 | 0.124 | 0.224 |
Among the considered methods, the FcM-based approach is less accurate at lower missing rates but it provides better results at larger missing ratios as regards the accuracy indices.
Unlike on the synthetic dataset, on the simulated dataset the FcM-based method always ranks third at all missing rates for the similarity and RMSE indicators. FcM is followed by the missForest algorithm for both indices.
Although the data in the second experiment are built from various functions, they are quite complex, so na.approx does not provide good results.
In contrast to the two datasets above, on the MAREL-Carnot data, na.approx gives quite good results: it consistently ranks second or third for the accuracy indices (the
Other approaches (including FcM-based imputation, MI, MICE, Amelia, and missForest) exploit the relations between attributes to estimate missing values. However, the three considered datasets have low correlations between variables (roughly 0.2 for MAREL-Carnot data,
The DTWUMI approach was proposed to fill large missing values in low/uncorrelated multivariate time series. However, this method is not as powerful as the FSMUMI method. DTWUMI only produces the best results at the 2% missing level on the MAREL-Carnot dataset and is always ranked second or third at all the remaining missing rates on the MAREL-Carnot and the simulated datasets. This is because the DTWUMI method only finds the most similar window to a query either before a gap or after this gap, and it uses only one similarity measure, the DTW cost, to retrieve the most similar window. Another reason may be that DTWUMI directly uses data from the window following or preceding the most similar window to complete the gap.
In this paper, we also compare the visualization performance of completion values yielded by various algorithms. Figures
Visual comparison of completion data of different imputation approaches with real data on the
Visual comparison of completion data of different imputation approaches with real data on the
Visual comparison of completion data of different imputation approaches with real data on the
At a 1% missing rate, the shape of imputation values produced by na.approx method is closer to the one of true values than the form of completion values given by our approach. However, at a 5% level of missing data, this method no longer shows the performance (Figure
Looking at Figure
Besides, we compare the computational time of each method on the synthetic series (in seconds, s). Table
Computational time of different methods on the synthetic series, in seconds (s), for each gap size.
Method | 1% | 2% | 3% | 4% | 5% | 7.5% | 10% |
---|---|---|---|---|---|---|---|
FSMUMI | 353.9 | 427.5 | 701.9 | 1037.8 | 1423.6 | 2525.5 | 3556.8 |
Amelia | 3.2 | 3.4 | 5.2 | 3.2 | 3.2 | 3.2 | 3.2 |
FcM | 40.9 | 39.8 | 40.0 | 41.1 | 41.2 | 46.7 | 45.6 |
MI | 844.1 | 714.0 | 739.1 | 723.3 | 724.5 | 719.7 | 726.5 |
MICE | 7021.1 | 9187.7 | 21909.6 | 13041.9 | 14833.9 | 19417.7 | 23812.6 |
missForest | 26833.8 | 24143.8 | 22969.9 | 32056.6 | 36485.8 | 42424.1 | 28521.1 |
na.approx |  |  |  |  |  |  |  |
DTWUMI | 5002.67 | 15714.8 | 37645.82 | 64669.71 | 86435.38 | 180887.78 | 273879 |
This paper proposes a novel approach for uncorrelated multivariate time series imputation using a fuzzy logic-based similarity measure, namely FSMUMI. This method makes it possible to manage uncertainty with the comprehensibility of linguistic variables. FSMUMI has been tested on different datasets and compared with published algorithms (Amelia II, FcM, MI, MICE, missForest, na.approx, and DTWUMI) on accuracy and shape criteria. The visual performance of these approaches is also investigated. The experimental results show that the proposed approach yields improved accuracy over previous methods for multivariate time series with large gaps and low or no correlation between variables. However, applying the algorithm requires assuming recurrent data and a sufficiently large dataset. In other words, our approach needs the patterns (in our case the two queries, before and after the considered gap) to exist somewhere in the database, so that missing values can be predicted from occurrences of these patterns in the data preceding or following the considered position.
In future work, we plan to (i) combine FSMUMI method with other algorithms such as Random Forest or Deep learning in order to efficiently fill incomplete values in any type of multivariate time series; (ii) investigate this approach applied to short-term/long-term forecasts in multivariate time series. We could also investigate complex fuzzy sets ([
The data used to support this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
This work was kindly supported by the Ministry of Education and Training Vietnam International Education Development, the French government, and FEDER, the region Hauts-de-France (CPER 2014-2020 MARCO). The experiments were carried out using the CALCULCO computing platform, supported by SCoSI/ULCO (Univ. Littoral).