Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things

This paper addresses missing value imputation for the Internet of Things (IoT). The IoT is now widely used in a variety of domains, such as transportation, logistics, and healthcare. However, missing values are very common in the IoT for a variety of reasons, so the experimental data are often incomplete. As a result, some work that relies on IoT data cannot be carried out normally, and the accuracy and reliability of data analysis results are reduced. Based on the characteristics of IoT data and the features of missing data in the IoT, this paper divides missing data into three types and defines three corresponding missing value imputation problems. We then propose three new models to solve the corresponding problems: the model of missing value imputation based on context and linear mean (MCL), the model of missing value imputation based on binary search (MBS), and the model of missing value imputation based on Gaussian mixture model (MGI). Experimental results show that the three models improve the accuracy, reliability, and stability of missing value imputation.


Introduction
Recently, with the rapid development of the key technologies of the Internet of Things, most platforms and systems based on the Internet of Things have been widely applied in various fields and industries, such as intelligent transportation, smart buildings, healthcare, positioning and navigation, and logistics [1][2][3]. Meanwhile, this development raises many urgent problems, and one of them is missing value imputation for the Internet of Things [4]. Most learning algorithms used to analyze data collected in the Internet of Things assume that the data are complete, so that each attribute of every instance is filled with a valid value. However, missing values are very common in the IoT for a variety of reasons, so the experimental data cannot be used normally. For this reason, the accuracy and reliability of the experimental results are greatly reduced. Therefore, it is necessary to estimate missing values in the IoT. Missing value imputation arises in various domains, such as computing and networking [5], economics [6], medicine [7], and psychology [8]. This paper addresses missing value imputation for the Internet of Things. To solve the problem, firstly, it is necessary to find the reasons for missing data [9]. There are many such reasons, such as unstable network communication, synchronization issues, unreliable sensors, and other equipment failures. Secondly, we need to study the mechanism of missing data [10]. Missing data mechanisms can be divided into three categories: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). Thirdly, we have to study the pattern of missing data. Currently, there are two kinds of missing data patterns: the monotone missing pattern (MMP) and the arbitrary missing pattern (AMP). Finally, we have to establish a missing value imputation model for the IoT and then use the model to estimate the values of the missing data. Commonly used missing value imputation algorithms include the single imputation algorithm, the multivariate imputation algorithm, the MCMC algorithm, and so forth.
Data collected in the IoT have many features, such as temporal and spatial correlation and correlation between individual properties. However, most studies of missing value imputation do not take into account the characteristics of the IoT data itself, and missing value imputation in the IoT field is rarely studied. Based on the characteristics of IoT data and the features of missing data in the IoT, this paper divides missing data into three types and defines three corresponding missing value imputation problems; we then propose three new models and solutions to solve the corresponding problems. Experimental results show that the three models improve the accuracy, reliability, and stability of missing value imputation.
The rest of the paper is organized as follows: we first review the existing literature in Section 2. In Section 3, we illustrate the Gaussian mixture model based on the expectation-maximization algorithm. In Section 4, we define the problems of missing value imputation for the IoT. In Section 5, we establish three models to solve the missing value imputation problems. We present our experiments in Section 6. Finally, we conclude our study in Section 7.

Related Work
Inspired by ROUSTIDA, an incomplete data analysis method that uses neither statistical analysis nor probability analysis, [11] proposes a new method for missing data imputation based on incomplete data clustering (MIBOI). This method requires a prespecified key parameter, the upper limit of constraint tolerance set dissimilarity, which is very difficult to choose in practical applications.
The work in [12] proposes three data imputation algorithms based on the probabilistic path-event model to solve the problem of missing data imputation in the RFID field. However, the matching algorithm used by these data imputation algorithms is inefficient.
In [13], researchers propose an imputation technique for missing data based on spatial-temporal and association rule mining (STARM) to solve the problem of missing context data. This technique can lead to data overflow, but the paper does not deal with that issue.
To solve the problem of coverage holes in wireless sensor networks (WSNs), [14] proposes a new moving-neighborhood interpolation algorithm based on the Delaunay triangulation technique. However, this algorithm does not take into account the characteristics of the data itself.
The work in [15] proposes a multiple regression model-based missing value imputation algorithm to estimate the missing data as accurately as possible in wireless sensor networks. The estimation performance of the algorithm is more stable and reliable, but the algorithm has many default preconditions.
The work in [16] proposes a new single imputation algorithm based on locally linear reconstruction (LLR) to solve the problem of missing data imputation. This method can improve the prediction performance of missing value imputation, but it has to be used in a smaller range.
In [17], researchers propose a novel nearest neighbor (NN) imputation algorithm that estimates missing values in wireless sensor networks by learning the spatial-temporal correlation between wireless sensor nodes. This method makes no assumptions about the WSN application, but the researchers do not extend the approach to other fields, such as the Internet of Things, and the algorithm lacks an analysis of the correlation between properties.

Background
3.1. Gaussian Mixture Model. The Gaussian mixture model (GMM) is commonly used for clustering. Each GMM consists of several Gaussian distributions; each of them is called a component and represents a different cluster. The components are combined linearly to form the probability density function (PDF) of the GMM.
Let the observed data be $X = \{x_1, \ldots, x_i, \ldots, x_n\}$, where each $x_i$ is a $d$-dimensional vector. We assume that $X$ is generated by a GMM with $K$ components. The function $N_k(x_i)$ is the probability density function of the $k$th component and expresses the probability that $x_i$ is generated by the $k$th component. The PDF of the GMM is as follows:
$$p(x_i) = \sum_{k=1}^{K} \pi_k N_k(x_i \mid \mu_k, \Sigma_k). \tag{1}$$
In the above formula, $\pi_k$ represents the weight of the $k$th component in the GMM; $\mu_k$ and $\Sigma_k$ represent the mean vector and covariance matrix of the $k$th component; $p(x_i)$ represents the probability that $x_i$ is generated by the GMM. Moreover, $\pi_k$ satisfies the following conditions:
$$\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \le \pi_k \le 1. \tag{2}$$
The PDF of the $k$th component, $N_k(x_i \mid \mu_k, \Sigma_k)$, is expressed as follows:
$$N_k(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (x_i - \mu_k)^{T} \Sigma_k^{-1} (x_i - \mu_k)\right). \tag{3}$$
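The mixture density described above can be evaluated numerically. The following is a minimal sketch, not the authors' implementation (the paper reports using Matlab); the function names `gaussian_pdf` and `gmm_pdf` and the example weights, means, and covariances are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a d-dimensional Gaussian component at point x."""
    d = mean.shape[0]
    diff = x - mean
    # Solve a linear system instead of inverting the covariance explicitly.
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def gmm_pdf(x, weights, means, covs):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# Hypothetical two-component mixture in two dimensions.
weights = np.array([0.4, 0.6])            # non-negative and sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), np.eye(2)])
density_at_origin = gmm_pdf(np.array([0.0, 0.0]), weights, means, covs)
```

Solving the linear system rather than forming the inverse covariance is a common numerical choice; it does not change the value of the density.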

Expectation-Maximization Algorithm. The expectation-maximization (EM) algorithm, formally proposed by Dempster, Laird, and Rubin in 1977, is an effective method for dealing with incomplete data. It is a maximum likelihood estimation algorithm that solves for model parameters from incomplete data. The EM algorithm is simple and stable and has good local convergence.
The basic principle of the EM algorithm can be expressed as follows: $X$ is the data that we can observe, but it is incomplete. The complete data is $Y = (X, Z)$, where $Z$ is the missing data. If $\theta$ represents the model parameter, $p(\theta \mid X)$ is the posterior distribution of $\theta$ given $X$. However, $p(\theta \mid X)$ is too complex for performing statistical calculations. Assuming that the missing data $Z$ is known, we obtain the posterior distribution $p(\theta \mid X, Z)$, which is relatively simple, so that a variety of statistical calculations can be performed. The assumption about the value of $Z$ can then be checked and improved. Following this cycle, a complex maximization problem is converted into a series of simple maximization problems.
Let the observed data be $X = \{x_1, \ldots, x_i, \ldots, x_n\}$, where the data are independent of each other. The implicit category of the data, $Z$, is introduced, with $Z = \{z_1, \ldots, z_n\}$ and $z_i \in \{1, 2, \ldots, K\}$. $K$ is a finite integer that must be specified in advance. $X$ and $Z$ are combined to form the complete data set $Y = (X, Z) = \{(x_1, z_1), \ldots, (x_n, z_n)\}$. The likelihood function of the complete data is as follows:
$$L(\theta \mid X, Z) = p(X, Z \mid \theta) = \prod_{i=1}^{n} p(x_i, z_i \mid \theta).$$
Each iteration of the EM algorithm works with the expected value of the logarithm of this likelihood function and consists of two steps: the E-step and the M-step. Each iteration produces a new value of the parameter $\theta$, but an initial value has to be given. Assuming that the initial value of $\theta$ is $\theta^{(0)}$, the two steps are as follows.
E-Step. Calculate the expected value of the log-likelihood function:
$$Q(\theta, \theta^{(t-1)}) = E_Z\left[\log L(\theta \mid X, Z) \mid X, \theta^{(t-1)}\right].$$
In the above function, $Q(\theta, \theta^{(t-1)})$ is a function of the parameter $\theta$; $\theta^{(t-1)}$ is the value of $\theta$ from the last iteration. The value of $Q(\theta, \theta^{(t-1)})$ is the expected value of $\log L(\theta \mid X, Z)$.
M-Step. Calculate the value $\theta^{*}$ at which $Q(\theta^{*}, \theta^{(t-1)})$ attains its maximum:
$$\theta^{*} = \arg\max_{\theta} Q(\theta, \theta^{(t-1)}).$$
It can be seen that the distribution of the random vector $Z$ is determined by $X$ and $\theta^{(t-1)}$. If $L^{*}_{t}$ represents the maximum likelihood function value of the $t$th iteration and $L^{*}_{t-1}$ represents the maximum likelihood function value of the $(t-1)$th iteration, it can be proved that the EM algorithm guarantees $L^{*}_{t} \ge L^{*}_{t-1}$, so the EM algorithm is convergent.

Gaussian Mixture Model Based on EM Algorithm.
Assume that there are $K$ clusters, that the weight of the $k$th component in the GMM is $\pi_k$, and that $\theta_k = (\mu_k, \Sigma_k)$ is the corresponding parameter. The density of $x_i$ based on $\theta_k$ is as follows:
$$p(x_i \mid \theta_k) = N_k(x_i \mid \mu_k, \Sigma_k).$$
The logarithmic likelihood function of the complete data is
$$\log L(\theta \mid X, Z) = \sum_{i=1}^{n} \log\left(\pi_{z_i} \, p(x_i \mid \theta_{z_i})\right).$$
Here, it is difficult to find the maximum by solving the equations directly, so we use the EM algorithm. In the E-step of the EM algorithm, the conditional expectation of the logarithmic likelihood function of the complete data is calculated from the observable variables $X$ and the current parameter estimate; in the M-step, the weights, means, and covariance matrices that maximize the log-likelihood function are calculated from the results of the E-step.
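The parameter estimation of the GMM based on the EM algorithm is as follows.
Initialize the weights, means, and covariance matrices. Repeat the following two steps until a satisfactory convergence criterion is reached.

E-Step. The posterior probability $\gamma_{ik}$ that $x_i$ is generated by the $k$th component is calculated:
$$\gamma_{ik} = \frac{\pi_k N_k(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N_j(x_i \mid \mu_j, \Sigma_j)}.$$

M-Step. Calculate the maximum likelihood estimates of the parameters with $\gamma_{ik}$ from the E-step:
$$n_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \pi_k = \frac{n_k}{n}, \qquad \mu_k = \frac{1}{n_k} \sum_{i=1}^{n} \gamma_{ik} x_i, \qquad \Sigma_k = \frac{1}{n_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^{T}.$$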
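The E-step and M-step loop for GMM parameter estimation can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' Matlab code: the function names, the random initialization of the means, the small regularization term `reg` added to each covariance, and the log-likelihood stopping rule are all assumptions made for the example.

```python
import numpy as np

def component_densities(X, means, covs):
    """Return an (n, K) array whose (i, k) entry is the k-th Gaussian density at x_i."""
    n, d = X.shape
    K = means.shape[0]
    dens = np.empty((n, K))
    for k in range(K):
        diff = X - means[k]
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(covs[k]), diff)
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(covs[k]))
        dens[:, k] = np.exp(-0.5 * maha) / norm
    return dens

def fit_gmm(X, K, n_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Estimate GMM weights, means, and covariances with EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization (assumption): equal weights, K random data points as means,
    # and the overall data covariance for every component.
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(n, size=K, replace=False)].copy()
    covs = np.array([np.cov(X.T).reshape(d, d) + reg * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each instance.
        weighted = component_densities(X, means, covs) * weights
        total = weighted.sum(axis=1, keepdims=True)
        resp = weighted / total
        # M-step: re-estimate weights, means, and covariances from the posteriors.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
        # Stop when the log-likelihood no longer changes noticeably.
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, covs, resp

# Hypothetical data: two groups of two-dimensional readings.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(3.0, 0.5, size=(100, 2))])
weights, means, covs, resp = fit_gmm(X, K=2)
cluster_of_instance = resp.argmax(axis=1)  # hard cluster assignment per instance
```

The sketch computes densities directly rather than in log space, so it is only suitable for well-scaled data; a production implementation would normally guard against underflow and empty components.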

Problem Definition
Nowadays, the Internet of Things is widely used in a variety of fields, such as intelligent transportation, smart buildings, healthcare, positioning and navigation, and logistics. The IoT is usually divided into three layers, from bottom to top: the perception layer, the transport layer, and the application layer. Among them, the perception layer is responsible for data collection. The perception layer contains wireless sensor nodes, wired sensor nodes, and other hardware that integrate sensor modules so that they can collect environmental data. Some nodes integrate only one kind of sensor module, such as a temperature sensor module, and can collect only one kind of environmental data, such as temperature. Other nodes integrate multiple sensor modules, such as temperature, humidity, light, and acceleration modules, and can acquire diverse environmental data. Data collected in the IoT are usually time-series data. The data collected at one time are called an instance, which generally contains multiple attributes: time, node number, sequence number, temperature, humidity, light, voltage, and acceleration. Missing data are very common in the IoT for various reasons, such as unstable network communication, synchronization issues, unreliable sensors, and other types of equipment failure. As a result, the experimental data cannot be used normally, and the accuracy and reliability of the experimental results are greatly reduced. Therefore, missing values have to be estimated. First, some definitions are given (Definitions 1 to 7). In the following, $D$ represents a data set, $X$ represents an instance, and $A$ represents an attribute. The relationship between these variables is as follows:
$$D = \{X_1, \ldots, X_n\}, \qquad X_i = (A_{i1}, \ldots, A_{im}).$$
The basic attributes uniquely identify an instance, and we determine whether data are missing through the basic attributes. The basic attributes of an instance are determined as follows: the time can be obtained from the historical time and the data collecting frequency; the node number and sequence number can be obtained from the historical sequence numbers of the nodes. Therefore, this paper addresses missing value imputation for observation attributes. In the following, $x_{ij}$ represents the value of an observation attribute, and $\hat{x}_{ij}$ represents the estimated value of missing data.
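As an illustration of how the basic attributes can reveal missing instances, the sketch below scans the sequence numbers received from one node, reports the gaps, and reconstructs the time of each missing instance from the last known time and the sampling period. The record layout `(sequence_number, time_s, value)` and the function name are hypothetical; the 30-second period matches a sampling frequency of two readings per minute.

```python
def find_missing_instances(records, period_s):
    """List (sequence_number, reconstructed_time_s) for instances absent from `records`.

    `records` holds (sequence_number, time_s, value) tuples from one node,
    sorted by sequence number (a hypothetical layout for this example).
    """
    missing = []
    for (seq_a, time_a, _), (seq_b, _, _) in zip(records, records[1:]):
        # Every sequence number skipped between two received instances is missing.
        for seq in range(seq_a + 1, seq_b):
            missing.append((seq, time_a + (seq - seq_a) * period_s))
    return missing

# Hypothetical temperature readings; sequence numbers 3 and 4 were never received.
records = [(1, 0.0, 19.8), (2, 30.0, 19.9), (5, 120.0, 20.3), (6, 150.0, 20.4)]
missing = find_missing_instances(records, period_s=30.0)
# missing == [(3, 60.0), (4, 90.0)]
```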

Model of Data Missing.
In Section 4, the problems of missing value imputation for the IoT are divided into three categories. Next, the three corresponding models of data missing are given.
The model of data missing corresponding to the first class of missing data imputation problem is shown below: In the first class of data missing model, each instance has only one observation attribute. The values of the observation attribute are missing at random, but no values are missing continuously.
In the second class of data missing model, each instance has only one observation attribute. The values of the observation attribute are missing at random. When data are missing, they are missing continuously over a certain period of time.
The model of data missing corresponding to the third class of missing data imputation problem is shown below: In the third class of data missing model, each instance has two or more observation attributes. The values of an instance may be entirely or partially missing; the values of an attribute may be missing continuously or intermittently.
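For a single observation attribute, the first two classes can be told apart by checking the neighbours of each missing value. The sketch below is an illustration only: it marks missing values with NaN, uses a hypothetical function name, and treats a missing value at either end of the series as isolated unless its one neighbour is also missing, a boundary convention the paper does not specify.

```python
import numpy as np

def classify_missing(series):
    """Label each missing position: 1 = isolated gap, 2 = part of a continuous run."""
    values = np.asarray(series, dtype=float)
    is_missing = np.isnan(values)
    labels = {}
    for i in np.flatnonzero(is_missing):
        left_missing = i > 0 and is_missing[i - 1]
        right_missing = i < len(values) - 1 and is_missing[i + 1]
        # A missing neighbour means the value belongs to a continuous gap.
        labels[int(i)] = 2 if (left_missing or right_missing) else 1
    return labels

# Hypothetical temperature series with one isolated gap and one continuous gap.
temps = [19.8, np.nan, 20.0, 20.1, np.nan, np.nan, 20.4]
labels = classify_missing(temps)
# labels == {1: 1, 4: 2, 5: 2}
```

Instances with two or more observation attributes fall under the third class regardless of how their gaps are distributed.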

Missing Value Imputation Model Based on GMM for the IoT. Data collected in the IoT are usually time-series data generated by various kinds of nodes at different locations and in different regions. These data have spatial-temporal correlation, attribute correlation, instance correlation, and node correlation, although the strength of each correlation varies.
According to the strong spatial-temporal correlation and other types of potential relevance, we build three models of missing value imputation, one for each of the three models of data missing, to solve the above three types of problems.
The model of missing value imputation based on context and linear mean (MCL) is as follows.
Step 3. The incomplete data $D_n$ are classified according to the result of clustering. Then, determine the cluster of each instance from $D_n$.
The classification rule is as follows: $\forall X_i \in D_n$, $X_i$ belongs to the cluster whose cluster center is the closest to $X_i$ by the Euclidean distance.
Step 4. $\forall X_i \in D_n$, find one or more complete instances that are the closest to $X_i$ by the Euclidean distance in the cluster of $X_i$.
Step 5. $\forall X_i \in D_n$, determine the value of $X_i$ with the mean value of the complete instances from Step 4.
The flow diagram of the MGI model is shown in Figure 1.
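Steps 3 to 5 can be sketched as follows, assuming the complete instances have already been clustered (for example with a GMM fitted by EM) so that their cluster labels and the cluster centers are available. The function name, the `n_neighbors` parameter, and the choice to measure Euclidean distance only over the attributes that are present in an incomplete instance are assumptions of this example; the paper does not state how the distance is computed when attributes are missing.

```python
import numpy as np

def mgi_impute(complete, labels, centers, incomplete, n_neighbors=3):
    """Fill NaN entries of `incomplete` from the nearest complete instances in the same cluster."""
    complete = np.asarray(complete, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    filled = np.array(incomplete, dtype=float)  # work on a copy
    for row in filled:  # each row is a view, so assignments update `filled`
        observed = ~np.isnan(row)
        # Step 3: assign the instance to the cluster with the closest center.
        center_dist = np.linalg.norm(centers[:, observed] - row[observed], axis=1)
        cluster = int(np.argmin(center_dist))
        members = complete[labels == cluster]
        # Step 4: find the closest complete instances inside that cluster.
        member_dist = np.linalg.norm(members[:, observed] - row[observed], axis=1)
        nearest = members[np.argsort(member_dist)[:n_neighbors]]
        # Step 5: impute missing attributes with the mean of those instances.
        row[~observed] = nearest[:, ~observed].mean(axis=0)
    return filled

# Hypothetical (temperature, humidity) instances in two clusters.
complete = [[20.0, 40.0], [20.2, 39.5], [20.4, 39.0], [30.0, 20.0], [30.2, 19.5]]
labels = [0, 0, 0, 1, 1]
centers = [[20.2, 39.5], [30.1, 19.75]]
incomplete = [[20.1, np.nan], [np.nan, 19.8]]
filled = mgi_impute(complete, labels, centers, incomplete, n_neighbors=2)
```

In this example the first instance receives the mean humidity of its two nearest neighbours in cluster 0 and the second receives the mean temperature of the two instances in cluster 1. The sketch does not handle instances whose observation attributes are all missing or clusters without complete instances.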

Experiment
6.1. Experimental Data and Settings. The programming software and statistical analysis tool used in this paper are Matlab (Release 2012a) and SPSS 20. The experimental data were collected from 54 sensors deployed in the Intel Berkeley Research lab between February 28 and April 5, 2004 [18]. The attributes of the data include time, node number, sequence number, temperature, humidity, light, and voltage. The sampling frequency of the sensor nodes is 2 times per minute. This paper sets up three experiments, one for each of the three data missing models.

Experiment of MCL Model.
The sample data of this experiment include 220 instances, which contain only the temperature attribute. Among the 220 instances, the temperature values of 20 instances are missing at random, and this data missing situation is consistent with the first type of data missing model. This experiment adopts the MCL model to solve the first type of missing value imputation problem.
The results of this experiment are shown in Figures 2 and 3.

Experiment of MBS Model.
In this experiment, the sample data contain 220 instances, which include only the temperature attribute. Among the 220 instances, the temperature values of 21 instances are missing at random, and this data missing situation is consistent with the second type of data missing model. This experiment adopts the MBS model to solve the second type of missing value imputation problem.
The experimental results are shown in Figures 4 and 5.

Experiment of MGI Model.
The sample data of the MGI experiment include 220 instances, which contain the temperature attribute and the humidity attribute. Among the 220 instances, the values of 20 instances are missing at random, and this data missing situation is consistent with the third type of data missing model.
Text of the flow diagram in Figure 1:
Step 1. Data set $D$ is divided into two data sets: $D_y$ and $D_n$. $D_y$ is a complete data set, which contains all instances without missing values; $D_n$ is an incomplete data set, which contains all instances with missing values.
Step 2. Cluster the complete data with the GMM based on the EM algorithm, and find the cluster center of each cluster. Then, determine the cluster of each instance.
Step 3. The incomplete data $D_n$ are classified according to the result of clustering. Then, determine the cluster of each instance from $D_n$.
Step 4. For all $X_i \in D_n$, find one or more complete instances that are the closest to $X_i$ by the Euclidean distance in the cluster of $X_i$.
Step 5. For all $X_i \in D_n$, determine the value of $X_i$ with the mean value of the complete instances from Step 4.

Analysis and Evaluation of Experimental Results.
The above experiments for the three types of data missing show that the three models of missing value imputation can estimate the missing values effectively. According to the experimental results and error analysis, the MCL model, the MBS model, and the MGI model are stable and reliable, and the estimated values from the three models have high accuracy and small error.
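An error analysis of this kind can be sketched as a comparison between the estimated values and the held-out true values. The per-value absolute error follows the quantity that the imputation problem definitions ask to minimize; the summary statistics (mean absolute error and root mean squared error) and the function name are assumptions for illustration, since the paper does not state which error measures it uses.

```python
import numpy as np

def imputation_errors(true_values, estimates):
    """Return per-value absolute errors plus MAE and RMSE summaries."""
    true_values = np.asarray(true_values, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    abs_err = np.abs(estimates - true_values)
    return {
        "abs_err": abs_err,
        "mae": abs_err.mean(),
        "rmse": np.sqrt((abs_err ** 2).mean()),
    }

# Hypothetical true and imputed temperature values.
errors = imputation_errors([20.0, 21.0, 22.0], [20.5, 21.0, 21.0])
```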

Conclusions
Based on the characteristics of the data itself and the features of missing data in the Internet of Things, this paper divides missing values into three types and defines three corresponding missing value imputation problems. We then propose three new models, the MCL model, the MBS model, and the MGI model, to solve the corresponding problems. Experimental results show that the three models improve the accuracy, reliability, and stability of missing value imputation.

Definition 1. Data collected at a time in the Internet of Things is called an instance.
Definition 2. A data set consists of multiple instances.
Definition 3. An instance consists of basic attributes and observation attributes. Time, node number, and sequence number belong to basic attributes; temperature, humidity, light, voltage, and acceleration belong to observation attributes.

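Definition 4. $\forall X_i \in D$, $\forall x_{ij} \in X_i$, if $x_{ij}$ is missing, the problem of solving the estimated value $\hat{x}_{ij}$ of $x_{ij}$ and making $|\hat{x}_{ij} - x_{ij}|$ the minimum value is called the missing data imputation problem.
Definition 5. $\forall X_i \in D$, $\forall x_{ij} \in X_i$, $X_i$ has only one observation attribute. If $x_{ij}$ is missing and $\exists (x_{(i-1)j}, x_{(i+1)j})$ such that $x_{(i-1)j}$ and $x_{(i+1)j}$ are not missing, the problem of solving the estimated value $\hat{x}_{ij}$ of $x_{ij}$ and making $|\hat{x}_{ij} - x_{ij}|$ the minimum value is called the first class of problem of missing data imputation.
Definition 6. $\forall X_i \in D$, $\forall x_{ij} \in X_i$, $X_i$ has only one observation attribute. If $x_{ij}$ is missing and $\exists x_{(i-1)j}$ such that $x_{(i-1)j}$ is missing (or $\exists x_{(i+1)j}$ such that $x_{(i+1)j}$ is missing), the problem of solving the estimated value $\hat{x}_{ij}$ of $x_{ij}$ and making $|\hat{x}_{ij} - x_{ij}|$ the minimum value is called the second class of problem of missing data imputation.
Definition 7. $\forall X_i \in D$, $\forall x_{ij} \in X_i$, $X_i$ has two or more observation attributes. If $x_{ij}$ is missing, the problem of solving the estimated value $\hat{x}_{ij}$ of $x_{ij}$ and making $|\hat{x}_{ij} - x_{ij}|$ the minimum value is called the third class of problem of missing data imputation.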

Figure 1: Flow diagram of the MGI model.

Figure 3: Error analysis of MCL model.