A Self-Learning Sensor Fault Detection Framework for Industry Monitoring IoT

Many applications based on Internet of Things (IoT) technology have recently been deployed in the industry monitoring area. Thousands of sensors of different types work together in an industry monitoring system. Sensors at different locations generate streaming data, which can be analyzed in the data center. In this paper, we propose a framework for online sensor fault detection. We motivate our technique in the context of data value fault detection and event detection. We use Statistics Sliding Windows (SSW) to hold the recent sensor data and fit each window with a Gaussian distribution. The fitting result can be used to detect data value faults. Devices on a production line may run under different workloads, so the associated sensors will have different statuses. We divide the sensors into several status groups according to the different parts of the production flow chart. In this way, the status of a sensor is associated with the others in the same group. We fit the values in the Status Transform Window (STW) to get the slope and generate a group trend vector. By comparing the current trend vector with historical ones, we can detect a rational or irrational event. To determine the parameters for each status group, we build a self-learning worker thread into our framework, which adjusts the corresponding parameters according to user feedback. We propose the group-based fault detection (GbFD) algorithm. We test the framework with a simulation dataset extracted from real data of an oil field. The test results show that GbFD detects 95% of sensor faults.


Introduction
The Internet of Things (IoT) has received increasing attention from governments, academia, and industry all over the world because of its great prospects [1][2][3]. Within the IoT application field, intelligent industry is an important branch. A wired or wireless sensor network is the basic facility of the industry monitoring IoT. These networks, comprising thousands of inexpensive sensors, report their values to the data center. The aim of the monitoring system is to safeguard the production process.
Fault detection is an important process for the industry monitoring IoT, but it is a difficult and complex task because many factors influence the data and can cause faults. Moreover, faults depend on the application and the sensor type [4][5][6]. Sensors in an industry monitoring IoT have three features: (1) Big: thousands of sensors on different devices work together; (2) Multitype: many physical quantities are needed to determine the production status; (3) Uncertainty: different workloads are required by the production plan, and some devices need to be shut down for examination. As a result, the values of correlated sensors will shift between different levels.
From the data-centric view, we focus on Outliers, Stuck-at faults, and Spikes [7]. From the application and system view, we focus on detecting rational and irrational trends. A rational trend means that the sensor value moves smoothly from one level to another for a legitimate reason, such as shutting down a device. An irrational trend means that the value changes because something is wrong with a device. The two typical mistakes of a monitoring system are taking a rational trend for an Outlier and ignoring an irrational trend while the values are still in range. In this paper, we propose a self-learning sensor fault detection framework for the industry monitoring IoT. The data model design is described in Section 3. In Section 4, the framework and the core algorithm are discussed. A simulation experiment based on real data is presented in Section 5.

Related Work
Many researchers have worked on building smart monitoring systems. Bressan et al. [8] created a solid routing infrastructure through RPL. Castellani et al. [9] concentrated on the actual implementation of the communication technology and presented a lightweight implementation of an EXI library. Yuan et al. [10] presented a parallel distributed structural health monitoring technology based on the wireless sensor network. An IoT communication framework for distributed worldwide health care applications is described in [11]. All these works focus on the basic frameworks, protocols, and communication technologies of monitoring systems but say little about sensor management.
For modeling sensor network data, Guestrin et al. [12] propose a framework in which the nodes in the network collaborate to fit a global function to each of their local measurements. This is a parametric approximation technique and has more parameters than our approach. References [13, 14] study the problem of computing order statistics in a sensor network. There has also been work on predicting and caching the values generated by the sensors [15, 16], which can result in significant communication savings. However, none of these approaches fits our setting.
A similar approach for sensor fault detection in streaming data is described by Yamanishi et al. [17]. In contrast to our work, their method does not operate on sliding windows but rather on the entire history of the data values. Chan et al. [18] extend the study of algorithms for monitoring distributed data streams from whole data streams to a time-based sliding window, but their focus is on presenting a communication-efficient algorithm.
Ding et al. [19] present an algorithm that requires low computational overhead; it compares a sensor's reading with the median value of its neighbors' readings. Gao et al. [20] approach WSN fault detection using spatial correlation, under the assumption that neighboring nodes within close range produce similar readings. Krishnamachari and Iyengar [21] address the faulty node detection problem by using localized event regions, and they assume that the system knows the location of each sensor. In an industry monitoring sensor network, finding neighbors automatically is very hard. In our approach, we instead separate sensors into groups according to the production flow charts.

Data Model Design
This section focuses on the sensor data model design. We model sensors from the value view, to detect Outliers, Stuck-at faults, and Spikes, and from the application view, to detect events. The events we are interested in are rational trends and irrational trends.

Sensor Value.
To detect Outliers, Stuck-at faults, and Spikes, we propose a statistical method. Figure 1 shows the mechanism of the Statistics Sliding Windows (SSW). For sensor $s$, the current value $v_t$ and the previous $n$ values form the current window $W_c = \{v_{t-n}, v_{t-n+1}, v_{t-n+2}, \ldots, v_t\}$. Over the recent history, we define $m$ windows of the same length and obtain the set $W_v = \{W_1, W_2, W_3, \ldots, W_m\}$. We fit the values in each sliding window with a Gaussian distribution, $W_i \to N(\mu_i, \sigma_i^2)$ (Formula (1)). If $\sigma_i^2 = 0$, a Stuck-at fault is detected, and when $\sigma_i^2$ is large enough, a Spikes fault may have occurred. A buffer $W_b$ sits in front of the current window $W_c$. As new samples enter $W_c$, the obsolete samples dropped by $W_c$ become members of $W_b$. When the size of $W_b$ reaches $n$, the oldest window in $W_v$ is discarded and the current $W_b$ joins $W_v$ as $W_1$. With SSW, a large number of historical sensor values are reduced to $m$ pairs of Gaussian parameters, so a real application need not hold all the recent values in memory.
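The window mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name, the method names, and the `spike_variance` threshold are hypothetical, and the current window is simplified to hold exactly `window_size` samples.

```python
from collections import deque
from statistics import fmean, pvariance


def fit_gaussian(values):
    """Return the (mean, variance) pair that summarizes one window."""
    return fmean(values), pvariance(values)


class StatisticsSlidingWindows:
    """Current window, eviction buffer, and Gaussian summaries of past windows."""

    def __init__(self, window_size, history_len, spike_variance):
        self.window_size = window_size
        self.spike_variance = spike_variance      # hypothetical Spikes threshold
        self.current = deque()                    # most recent samples
        self.buffer = []                          # samples dropped by the current window
        self.history = deque(maxlen=history_len)  # (mean, variance) pairs, newest first

    def push(self, value):
        self.current.append(value)
        if len(self.current) > self.window_size:
            self.buffer.append(self.current.popleft())
        if len(self.buffer) == self.window_size:
            # A full buffer becomes the newest historical window; the bounded
            # deque drops the oldest summary from the other end.
            self.history.appendleft(fit_gaussian(self.buffer))
            self.buffer = []

    def is_full(self):
        return len(self.current) == self.window_size

    def is_stuck(self):
        # Zero variance over a full window: the sensor repeats one value.
        return self.is_full() and fit_gaussian(self.current)[1] == 0

    def is_spikes(self):
        # Variance above the (assumed) threshold hints at a Spikes fault.
        return self.is_full() and fit_gaussian(self.current)[1] > self.spike_variance
```

Only the summaries in `history` are retained for past windows, which mirrors the point that the raw historical samples do not have to stay in memory.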

Status Group.
In the industry monitoring IoT, the status of a production line may be uncertain. Depending on the manufacturing technique and the workload, some devices in the production line may be shut down. In this case, the whole production line is still working, but the values of the sensors monitoring the shut-down devices will run out of range (Outliers). At the same time, related devices may also change as a result of the shutdown operation. In an industry monitoring IoT, such rational status transformations of sensors should be recognized and ignored.
Figure 2 shows a typical flow chart of a gas-processing plant. P-1, P-2, and P-3 are three parallel subpipelines controlled by V-1, V-2, and V-3, respectively. Each subpipeline can be shut down independently, and this operation affects the value of P-0-1, a pressure sensor on the main input pipeline. In our approach, we assign the sensors in the gas-processing plant to several status groups. Although the assignment depends on the application background, some common rules can be followed:
(1) Sensors on opposite sides of a valve are not in the same group. A valve is a typical controller in the industrial IoT. All the devices behind a valve can be shut down by that valve, so the sensors on its two sides may be in different statuses.
(2) If too many sensors are controlled by one slave, they should be divided into different groups. In our approach, we do not put more than 10 sensors into one status group because the possible status space expands sharply as the number of sensors increases. In this case, sensors can be grouped by their position relative to the most complex device, because the complex device is the most likely to cause a status change.
(3) The production line is normally divided into many units according to geographical position. We can ignore the relationships between sensors in different units.
For the 11 sensors shown in Figure 2, we divide them into four groups, $G_0$, $G_1$, $G_2$, and $G_3$. $G_0$ represents the status of the main input pipeline. $G_1$, $G_2$, and $G_3$ are the status groups derived from the three parallel subpipelines, respectively.

Status Model.
For a status group $G = \{s_1, s_2, \ldots, s_n\}$, the sensors in $G$ may show different trends when the production status changes. Figure 3 shows a status transformation process. The three sensors shown are stable in the first 30 seconds and the last 20 seconds. The interval from 30 s to 40 s is called the Status Transform Window (STW). Within it, two of the sensors rise to a new level while the third keeps its previous trend, and the slopes of the two rising sensors differ. We use the trend vector, which contains the slopes of all the sensors in one group, to represent the status transformation, that is, $T = \{k_1, k_2, \ldots, k_n\}$. The trend of a sensor is found by fitting its values within an STW of a specific size. Here, we use the least-squares method [22] (Formula (3)) to fit the values and record the rational trend vector by sensor index. We use the key idea of the incremental clustering algorithm [23] to handle the trend vectors: we compute the angle, via its cosine, between the current trend vector and each existing vector. If the angle is large enough, the current vector is marked as a new trend. Otherwise, a repeated trend has been found, and we only need to merge it with the closest vector. The specific clustering method is given in Section 4.
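As a rough illustration of the status model, the following Python sketch fits a least-squares slope per sensor, assembles the group trend vector, and applies an angle test against stored vectors. The function names, the averaging used to merge a repeated trend, and the handling of all-zero vectors are assumptions made for this example; the paper's own clustering details may differ.

```python
import math


def ls_slope(values):
    """Least-squares slope of values against sample indices 0..n-1 (n >= 2)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def trend_vector(group_windows):
    """One slope per sensor, ordered by sensor index within the group."""
    return [ls_slope(window) for window in group_windows]


def angle_between(a, b):
    """Angle in radians between two trend vectors, from their cosine."""
    norm_a, norm_b = math.hypot(*a), math.hypot(*b)
    if norm_a == 0 or norm_b == 0:
        # Assumption: two flat trends match; a flat and a non-flat trend do not.
        return 0.0 if norm_a == norm_b else math.pi / 2
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def inc_cluster(history, trend, max_angle):
    """Store trend as new, or merge it into the closest vector. True if new."""
    if history:
        closest = min(range(len(history)),
                      key=lambda j: angle_between(history[j], trend))
        if angle_between(history[closest], trend) <= max_angle:
            # Assumed merge rule: element-wise average with the closest vector.
            history[closest] = [(h + t) / 2
                                for h, t in zip(history[closest], trend)]
            return False
    history.append(list(trend))
    return True
```

Comparing directions through the angle rather than through raw slope differences means that two transformations with the same shape but slightly different magnitudes are treated as the same trend.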

Design of Architecture
In this section, we first present the architecture for sensor fault detection in industry monitoring and then discuss how our algorithm is realized.

A Self-Learning Framework.
The self-learning sensor fault detection architecture is shown in Figure 4. There are four modules in our approach.
(1) Application DB: all the parameters are stored in the Application DB, including the threshold $\theta_i$ for sensor $s_i$ and the historical statistics $(\mu_i, \sigma_i^2)$. The grouping information is also serialized in this database. This DB is the interface for higher-layer applications, which can read the sensor fault predictions and submit user feedback.
(2) Detection Thread: a background service that contains the main detection process. Since our approach performs online detection, a series of detection threads is created and maintained by the working thread pool and a related status queue. When new data arrives, the working thread dispatcher wakes up a pending thread to handle it.
(3) Self-Learning Thread: the self-learning thread is driven by the OS timer. It rechecks the user feedback on the detection results to revise the trend vectors.

The GbFD Algorithm.
Dividing the sensors into status groups is the key idea of our approach. For group-based sensor fault detection, we propose the group-based fault detection (GbFD) algorithm. GbFD (Algorithm 1) starts by initializing the global parameters (lines 6-7); it then instantiates the two core processes, SelfLearningThread and DetectionThread. The input $D$ of GbFD is defined as a set that contains the samples of one status group at one time instance.
In the DetectionThread, we first check for Stuck-at faults and Spikes by calling the IsStuck and IsSpikes methods. IsRatStaChange is then called twice (lines 19-21) to detect sensor status transformations. When an Outlier is detected, GbFD must decide whether it results from a rational operation, such as shutting down a device; if no Outlier occurs, we still need to check for an abnormal status transformation. The first call of IsRatStaChange is guarded by the IsOutlier method through AND logic, so calling it twice does not increase the algorithm complexity. The SelfLearningThread works as a background service. Its responsibility is to learn from the user feedback, find the missed detections and the false detections, and adjust the related parameters. A new rational trend vector is added to the rational trend history of the proper group by the IncClustering method, which reclusters the grouped trend vectors according to the specified angle threshold. The time complexity of the GbFD algorithm is dominated by the IsOutlier and IsRatStaChange methods. For a sensor in group $G$, the detection computation takes $O((N_{STW} + k) \times N_{SSW})$, where $N_{STW}$ is the STW size, $k$ is the length of $D$, and $N_{SSW}$ is the SSW size. In a real application, the two window sizes are adjustable and $k$ is less than 20 for the most part. Moreover, a proper thread dispatch mechanism ensures that GbFD can handle the real-time detection task. The self-learning process is more complex because of its many iterative operations, but it is a background service and does not require real-time performance.
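The order of checks in the detection thread can be pictured with the short Python sketch below. It is an interpretation of the prose only, since Algorithm 1 itself is not reproduced here: the `Verdict` names, the `detect` signature, and every threshold parameter are hypothetical, and `trend_angle` stands in for the result of comparing the group's current trend vector with its stored rational trends.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    STUCK_AT = "stuck-at fault"
    SPIKES = "spikes fault"
    OUTLIER = "outlier"
    RATIONAL_TREND = "rational trend"
    IRRATIONAL_TREND = "irrational trend"


def detect(value, variance, low, high, spike_variance, trend_angle, max_angle):
    """Classify one sample of one sensor.

    trend_angle is the angle (radians) between the group's current trend
    vector and the closest stored rational trend, or None when the group
    shows no status change. All thresholds are assumed parameters.
    """
    # Value-level checks come first (the IsStuck and IsSpikes steps).
    if variance == 0:
        return Verdict.STUCK_AT
    if variance > spike_variance:
        return Verdict.SPIKES

    is_outlier = value < low or value > high
    rational = trend_angle is not None and trend_angle <= max_angle

    # First status check: only evaluated together with an out-of-range value.
    if is_outlier and rational:
        return Verdict.RATIONAL_TREND
    if is_outlier:
        return Verdict.OUTLIER
    # Second status check: values in range, but the group trend is unknown.
    if trend_angle is not None and not rational:
        return Verdict.IRRATIONAL_TREND
    return Verdict.OK
```

In this reading, an out-of-range value that coincides with a known rational trend is reported as a rational operation rather than as an Outlier, and an in-range value is still flagged when the group follows a trend that matches no stored vector.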

Experimental Evaluation
We built an application to evaluate our framework. It was implemented in Java in about 2400 lines of code and uses Oracle as the Application DB.

Data Preparation.
We use real data from an oil field in China. This oil field has 20 oil/gas treatment plants, and all of the production equipment is monitored by sensors, which can be mainly classified as temperature sensors, pressure sensors, and liquid level sensors. The sampling interval of the production IoT is 60 seconds. We obtained the sample data of 10,000 sensors from 8 PM on January 1 to 8 AM on October 31, 2012. Since the production lines are relatively stable, we filtered the original data in two steps. First, production units that never produced a fault were eliminated. Second, for each plant, we discarded some of the data from its stable periods.
Table 1 shows the features of our simulation dataset. We chose more than 751 million samples from 5800 sensors. According to the corresponding flow charts, we separated the sensors into 1340 groups. Each group has 4.33 sensors on average, and the largest group contains 8 sensors. We analyzed the user feedback history and identified four typical errors. An Outlier means that a value runs out of range because of a sensor failure. We count Stuck-at faults and Spikes together as one type. A Rational Trend (RT) missed fault means that sensors report an exceptional value after a rational user operation, such as shutting down a device for examination. An Irrational Trend (IRT) missed fault means that something went wrong with the production line and the sensor values changed with it but still satisfied the threshold condition.

Experiment.
We split the simulation data into four datasets according to the time sequence. Each time, we use three of the datasets to train the GbFD algorithm and the remaining one as the testing data.
We use three pairs of SSW size and STW size, namely (30, 15), (60, 20), and (90, 25), to run the four-fold cross-validation test. The result is shown in Figure 5. When $N_{SSW} = 30$ and $N_{STW} = 15$, each fold reaches a precision of about 80%, which is not good enough. The next test gives a satisfactory result: $N_{SSW} = 60$ and $N_{STW} = 20$ yield a mean accuracy of 95%. When the two window sizes increase further, however, the accuracy of GbFD decreases. This indicates that the performance of our algorithm depends strongly on the sizes of the SSW and the STW. Choosing proper window sizes can increase the detection sensitivity, and the window sizes (60, 20) suit our simulation data.
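The splitting scheme can be sketched as follows, assuming the samples are already sorted by time. The function name and the use of equal-sized contiguous parts are illustrative assumptions, since the exact split boundaries are not stated.

```python
def time_ordered_folds(samples, k=4):
    """Yield (train, test) pairs; each contiguous part serves once as test data."""
    bounds = [round(i * len(samples) / k) for i in range(k + 1)]
    parts = [samples[bounds[i]:bounds[i + 1]] for i in range(k)]
    for i, test in enumerate(parts):
        # Training data is every part except the one held out for testing.
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        yield train, test
```

For example, `list(time_ordered_folds(list(range(8))))` produces four pairs, the first of which holds out the two earliest samples for testing and trains on the remaining six.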

Conclusions
We present a self-learning sensor fault detection framework in this paper. We propose a model that represents the sensor value, the sensor relationship, and the sensor status transformation. The GbFD algorithm is proposed to detect sensor faults, and we use real data from an oil field for validation. Experimental results show that our system detects 95% of the data faults in the simulation data, which contains 751.68 million samples from 5800 sensors.
We will continue to validate our approach on other datasets to find the proper Statistics Sliding Window size and Status Transform Window size for different application contexts. Our goal is to build a sensor health management system for the industry IoT that includes not only sensor fault detection but also sensor lifecycle prediction and sensor inspection management.

Figure 1: Estimation of the data distribution in the Statistics Sliding Windows (SSW) for one time instance.

Figure 2: Typical flow chart of a gas-processing plant.

Figure 3: Status transformation process for two time instances.

Figure 4: A self-learning sensor fault detection framework.

Figure 5: Four-fold cross-validation for precision (a, c, e) and recall (b, d, f) with different SSW and STW sizes.

Table 1: Simulation data description.