Wireless sensor networks (WSNs) are a rapidly emerging technology with a great potential in many ubiquitous applications. Although these sensors can be inexpensive, they are often relatively unreliable when deployed in harsh environments characterized by a vast amount of noisy and uncertain data, such as urban traffic control, earthquake zones, and battlefields. The data gathered by distributed sensors—which serve as the eyes and ears of the system—are delivered to a decision center or a gateway sensor node that interprets situational information from the data streams. Although many other machine learning techniques have been extensively studied, real-time data mining of high-speed and nonstationary data streams represents one of the most promising WSN solutions. This paper proposes a novel stream mining algorithm with a programmable mechanism for handling missing data. Experimental results from both synthetic and real-life data show that the new model is superior to standard algorithms.
It is anticipated that wireless sensor networks (WSNs) will enable the technology of today to be employed in future applications ranging from tracking, monitoring, and spying systems to various other technologies likely to improve aspects of everyday life. WSNs offer an inexpensive way to collect data over a distributed environment that may be harsh in nature, such as biochemical contamination sites, seismic zones, and terrain subject to extreme weather or battlegrounds. The sensors employed in WSNs, which are miniature embedded computing devices, continue to produce large volumes of streaming data obtained from their environment until the end of their lifetime. It is known that when the battery power in such sensors is exhausted, the likelihood of erroneous data being generated will grow rapidly [
Data classification is a popular data mining technique used to determine predefined classes (verdicts) to which unseen data freshly obtained from a WSN map, thereby providing situational information about current events in an environment covered by a dense network of sensors. At the core of the classification technique is a decision tree constructed by a learning algorithm that uses tree-like graphs to model the underlying relations of attributes characterized by the output signals of the sensors to predefined classes. Alternative algorithms include support vector machines, neural networks, and Bayesian networks, which offer about the same ability to model nonlinear relations between inputs and outputs. However, decision trees have been widely used in WSNs because of their simplicity and the interpretability of their rules, which can easily be derived from the structure of the tree.
The huge volume and imperfect quality of data streams pose two specific issues applicable to data mining applications, especially for decision trees used in WSNs: problems surrounding model induction and predictive accuracy. A decision tree is constructed by learning from a set of training data, a process in which a local greedy partitioning method is normally used. The training data have to be stationary and bounded in size throughout the learning process. Should new learning data arrive, the learned model must be trained again by processing the whole dataset to update the underlying relations. However, although a single WSN includes a huge number of sensor nodes, each of them has only a limited storage capacity, and it is difficult to accommodate all the training data of the whole network. This implies that data mining can be carried out only at a backend base station that meets storage and computation complexity requirements. Centralized data aggregation gives rise to problems of data synchronization and data consistency, given that the data may come from different sensors randomly distributed over the whole network. Most importantly, retraining a decision tree model incurs ever-increasing latency due to the tremendous volume of data needed. Even if only the latest data are used and old data are discarded, evolving data streams are nonstationary, and very frequent updates in which the model is repeatedly retrained are therefore needed to catch up with the level of prediction accuracy for the current trend.
The second issue is the imperfect quality of the data stream, which clearly affects the prediction accuracy of the decision tree. Noisy data confuse the decision tree with false relations of attributes to classes; such false relations effectively mislead the training algorithm to produce an enormous number of pseudopaths and nodes in the decision tree. Not only do such pseudopaths and nodes degrade accuracy and blunt predictive power, but they also result in problems of tree-size explosions. Though decision tree pruning is a technique commonly employed to remove redundant tree branches and nodes, it surely adds to overall computational complexity and overheads. Given the scarcity of memory space and computational power in WSNs, finding appropriate solutions to alleviate these problems has become an urgent task.
This paper proposes an alternative type of decision tree, the very fast decision tree (VFDT), to be used in place of traditional decision tree classification algorithms. The VFDT is a new data mining classification algorithm that both offers a lightweight design and can progressively construct a decision tree from scratch while continuing to embrace new inputs from running data streams. The VFDT can effectively perform a test-and-train process each time a new segment of data arrives. In contrast with traditional algorithms, the VFDT does not require that the full dataset be read as part of the learning process, but adjusts the decision tree in accordance with the latest incoming data and accumulated statistical counts. As a preemptive approach to minimizing the impacts of imperfect data streams, a data cache and missing-data-guessing mechanism called the auxiliary reconciliation control (ARC) is proposed to function as a sidekick to the VFDT. The ARC is designed to resolve the data synchronization problems by ensuring data are pipelined into the VFDT one window at a time. At the same time, it predicts missing values, replaces noise, and handles slight delays and fluctuations in incoming data streams before they even enter the VFDT classifier.
To the best of our knowledge, this novel data mining model is the first attempt to alleviate problems of imperfect data in WSNs using a stream mining algorithm and an auxiliary control. This paper makes two key contributions to the literature: it applies stream mining techniques to WSNs, and it provides an ARC-cache combination to deal with imperfect data streams. The remainder of this paper is organized as follows. Section
Mining WSN data is said to be constrained by certain limitations and characteristics of WSNs [
Local-type and fusion-type wireless sensor network arrangements.
Local type
Fusion type
This paper focuses on the important WSN task of classification. It is applicable to almost all kinds of WSN applications, for example, in detecting whether a monitored biomedical patient is suffering from an illness, tracking whether a herd of cattle is moving along the normal route, determining whether a large machine is operating normally, estimating whether a rainforest is growing in balance, or ascertaining whether an anomaly of any kind has arisen in any other type of environment. A decision tree classifier makes predictions or classifications according to predefined classes based on test samples by traversing a tree of possible decisions. WSNs commonly adopt a decision tree method because the trees that represent relations between attributes and classes are informative and intuitively understood. Each path through a decision tree is a sequence of conditions that describe a class. Rules can be derived from such decision tree paths and can be used in a WSN to distinguish an outcome or phenomenon based on measurements observed from sensed data. The simplicity of a decision tree offers useful insights due to the transparent model learning process it follows. The model is learnt by first observing a complete set of training samples. Each sample has several attributes, each of which may be represented by a signal given by a sensor. A sample record may take the form (
It is known that a major cause of overfitting in a decision tree is the inclusion of contradicting samples in the model learning process. Noisy data and data with missing values are usually the culprits when contradicting samples appear. Unfortunately, such samples are inevitable in distributed communication environments such as WSNs. Two measures are commonly employed to define the extent of values missing from a set of data [
The simplest but not the ideal way to deal with missing values is to discard sample instances with missing values. Alternatively, when the missing values represent only a small percentage of the data set, they can be converted into a new variable. A more commonly adopted method known as imputation is to substitute missing values with analyzed or predicted values. Previous studies have compared the performance of different imputation methods in replacing missing values for a decision tree classifier. To the best of our knowledge, few studies examine how to handle missing data for stream mining types of decision tree classifiers.
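As a minimal illustration of imputation in general (not of the ARC proposed in this paper), the sketch below fills missing entries with the mean of the observed values for a numeric attribute and with the most frequent value for a nominal attribute. The function name `impute_column` and the use of `None` as the missing-value marker are assumptions made only for this example.

```python
from collections import Counter
from statistics import mean

def impute_column(values):
    """Replace None entries with the mean (numeric) or mode (nominal) of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from, so the gaps are left as they are
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed)
    if numeric:
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```

Such a statistic-based fill is cheap but ignores the relations between attributes, which is the gap that predictor-based imputation tries to close.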
An unresolved problem in stream mining research is how to detect concept changes due to noise-infested data and missing values. A streaming ensemble algorithm (SEA) [
This section presents a mathematical model to address the problem of missing values in data stream mining for a sensor network. The assumptions of the model are as follows.
(1) The model is defined from the perspective of a centralized data stream mining engine or base station in the WSN, where aggregated sensed data are sent to a single classification tree for classification.
(2) The mining process runs continuously as the data stream arrives in segments. The length of each data segment is equal to the width of the sliding window. A whole segment will enter the VFDT during each window of time.
(3) The data stream is characterized by a train of data records. Each record has one or more attributes. Each attribute is assumed to hold a value given by a sensor in the case of a local-type WSN, and the value can come from the sink of a cluster of sensors in the case of a fusion-type WSN.
(4) The values for the attributes of a record are assumed to arrive in synchronization across a number of different sensors and/or sinks of clusters.
(5) A unique time stamp is added to each record. The time stamps increase in uniform intervals.
(6) All values for the attributes of the same record are used at the same time (including missing values) for each iteration of the test-and-train process at the VFDT.
(7) The names record and instance are used interchangeably.
(8) The attributes of data records take data of the following formats: nominal, numeric, binary, or mixed.
The original mathematical model for minimizing the overall number of missing values influencing the VFDT within a window size of timeout
Because the existence of all
The objective function (
In a stream-based classification, the VFDT decision tree is built incrementally over time by splitting nodes into two using a small amount of the incoming data stream. How many samples have to be seen by the learning model to expand a node depends on a statistical method called the Hoeffding bound or additive Chernoff bound. This bound is used to decide how many samples are statistically required before each node is split. As the data arrive, the tree is evaluated and its tree nodes can be expanded. The following equations essentially depict the building blocks of the stream mining model using the Hoeffding bound. The tree they represent is generally known as the Hoeffding tree (HT), which grows by holding to the Hoeffding bound as a yardstick. The heuristic evaluation function is used to judge when to convert a leaf at the bottom of the tree into a conditional node, thereby pushing it up the tree. Given that a node split occurs when there is sufficient evidence that a new conditional node is needed, replacing the terminal leaf with the relevant decision node better reflects current conditions as represented by the tree rules.
In (
Let us assume that we have a real-valued random variable
The VFDT is operated according to a simultaneous test-and-train process, meaning that when a new data segment arrives, the attribute values of the segment will pass down the tree from the root to one of the most likely leaves. In this way, the tree engages in a testing process also known as a classification or prediction exercise based on sample data. At the same pass (traversing through the tree), if the sample data carry a known class
A workflow representing the VFDT algorithm tree building process.
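The split test described in this section can be illustrated with a short numerical sketch. It uses the standard form of the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the heuristic evaluation function, delta is one minus the desired confidence, and n is the number of samples seen at the leaf. The function names and the `tie_threshold` parameter (a tie-breaking rule taken from the general VFDT literature, not from the text above) are assumptions for illustration only.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the
    observed mean with probability 1 - delta after n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
    """Split a leaf when the best attribute beats the runner-up by more
    than epsilon, or when epsilon is so small that the two are tied."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold
```

For example, with R = 1, delta = 1e-7, and n = 1000 the bound is roughly 0.09, so a gain difference of 0.2 triggers a split while a difference of 0.02 does not. Because epsilon shrinks with the square root of n, quadrupling the number of samples halves the bound.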
The ARC is a set of data preprocessing functions used to solve the problem of imperfect data streams before they enter the VFDT. The ARC can be programmed as a standalone program which may run in parallel and in synchronization with the test-and-train VFDT operation. Synchronization is facilitated by using a sliding window that allows one segment of data to arrive at a time at regular intervals. When no data arrive, the ARC and the VFDT simply stand still without any action. The operational rate of the sliding window should be no greater than the speed at which the VFDT is operated and faster than the speed at which the WSN sensors transmit data.
When data segments arrive as a stream, one segment at a time will initially be cached. The sliding window closes for a brief moment. While the window is closed, the ARC will attempt to correct four different types of imperfect data (if any) in the cache: missing values, noise, delayed data, and data fluctuations. The correction methods employed for each type are described in the following section. After the data have been manipulated, with missing values guessed on a best-effort basis and noise eliminated, the processed data enter the VFDT for instant testing and training. A class prediction/classification output and a failure and anomaly report will then be generated by the VFDT and ARC, respectively. The end user could employ the VFDT output for subsequent decision making if implemented as a final base station, or could feed it into a further cluster of the WSN as an intermediate classification result derived from its own cluster. The failure and anomaly report contains statistics on variables such as the percentage of missing values, noise, delay, and data fluctuations as additional information about the quality of current data traffic. This information could be used as a reference indicator to gauge the reliability of the classification result based on the current quality of the data stream. It could also be used as an alarm signal to alert the network administrator to initiate repairs to the network infrastructure should the statistics in the report show a recurring problem over time. The sliding window will open again when the output results are sent, the data cache will be cleared, the VFDT will have been incrementally trained, and the gateway sensor node will be ready to receive the next incoming segment of data. Only statistics and accumulative counts remain at the ARC and VFDT throughout this continuous operation, thus providing a lightweight operating environment. No historical data need to be stored anywhere at this node. Figure
The workflow of the ARC and VFDT in a gateway sensor node.
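The window-by-window workflow can be sketched as a simple loop: one segment is cached, repaired, tested, used for training, and then discarded. In this sketch the `MajorityClassLearner` class is only a stand-in for the VFDT (it keeps class counts and nothing else), and `repair` stands in for the ARC; both names, and the representation of a segment as a list of `(record, label)` pairs, are assumptions made for illustration.

```python
from collections import Counter

class MajorityClassLearner:
    """Stand-in for the VFDT: keeps only class counts, never the records."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, record):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, record, label):
        self.counts[label] += 1

def process_stream(segments, repair, learner):
    """Test-then-train on one cached segment at a time; return overall accuracy."""
    correct = total = 0
    for segment in segments:                          # one window-sized segment is cached
        cache = [(repair(x), y) for x, y in segment]  # repairs happen while the window is closed
        for x, y in cache:
            if learner.predict(x) == y:               # test first
                correct += 1
            learner.learn(x, y)                       # then train on the same record
            total += 1
        cache.clear()                                 # no historical data is kept
    return correct / total if total else 0.0
```

A call such as `process_stream(segments, lambda x: x, MajorityClassLearner())` runs the loop with no repair step, which gives a baseline against which a real repair function can be compared.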
To tackle the problem of missing values in a data stream, a number of prediction algorithms are commonly used to guess approximate values based on past data. Although many algorithms can be used in the ARC, the one deployed should ideally achieve the highest level of accuracy while consuming the least computational resources and time. Some popular choices we use here for simulation experiments include, but are not limited to, mean, naïve Bayesian, and C4.5 decision tree algorithms for nominal data and mean mode, linear regression, discretized naïve Bayesian, and M5P algorithms for numeric data. Missing value estimation algorithms require a substantial amount of past data to function. For example, before using a C4.5 decision tree algorithm as a predictor for missing values, a classifier must be built using statistics from a sample of a sufficient size.
To further lighten the workload induced by the ARC at the gateway sensor node, the estimation algorithm kicks in only when two conditions are met. First, sufficient statistics must be obtained from past data. Second, the trained classifier (regardless of which algorithms are used) will retrain itself only when prediction error reaches a certain threshold. The ARC therefore registers an error rate
The mechanism adopted for handling noise in the data stream is similar to that used to estimate missing values. Noise is considered to consist of values that lie far outside the range of normal values. A surge or interruption in radio signals along a wireless communication link will bring such values up or down to an extreme. However, because this rarely happens in practice, noise has a low probability of occurrence. In our model, we can safely assume that noise is equivalent to an outlier in our data samples because both noise and outliers share the same statistical characteristics. The ARC therefore uses an outlier detection algorithm instead of a missing value prediction algorithm to handle noise. However, as argued in [
Flow chart of ARC operations.
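A lightweight way to realize these ARC operations on a single numeric attribute is sketched below. It is not the authors' implementation: it swaps in a running mean as the missing-value estimator and a k-standard-deviation rule as the outlier detector, and it waits for `min_samples` observations before acting, in the spirit of the sufficient-statistics condition. The names `RunningStats` and `clean_value`, and the defaults `k=3.0` and `min_samples=10`, are assumptions for illustration.

```python
import math

class RunningStats:
    """Welford's online mean and variance; stores counts only, never the raw stream."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self._m2 += d * (x - self.mean)

    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

def clean_value(x, stats, k=3.0, min_samples=10):
    """Return (value, status), where status is 'ok', 'missing', or 'noise'."""
    ready = stats.n >= min_samples            # enough past data to estimate from
    if x is None:
        return (stats.mean if ready else None), "missing"
    if ready and stats.std() > 0 and abs(x - stats.mean) > k * stats.std():
        return stats.mean, "noise"            # outlier replaced; statistics left untouched
    stats.update(x)
    return x, "ok"
```

The status strings can be tallied per window to produce the kind of failure and anomaly statistics that the ARC reports alongside the classification output.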
Additional buffer space is required to overcome delays and fluctuation problems in data mining in WSNs. The additional buffer is called a bucket, which is a preceding space in front of the cache. The bucket can be implemented in the same gateway sensor node at the outermost interface position between the cache and the sink connector. The function of the bucket is essentially that of a synchronized bulk transfer receptacle that regularly shuttles between the data stream inlet and the data cache. It must operate at specific intervals, the frequency of which must be no lower than that of the sliding window. The concept of bucket transfer is analogous to that of a cable car or a lift in a building that carries multiple passengers in bulk. Although passengers (data) can walk into a lift (bucket) asynchronously, they must do so within a certain time limit that has to be shorter than the real-time operational requirement of the lift (VFDT). A slight time latency caused by different sensors can therefore be tolerated. Figure
(a) Lateral view of bucket cache—when the ARC-cache
To address the issue of data fluctuations, which is one of the requirements of our imperfect data stream handling model, we propose the use of a lower-bound
Replenish ( Initialize FOR ( {
// More-than-bound data
IF (
// Less-than-bound data
ELSE }
Smoothing data traffic fluctuations.
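Because the symbols of the pseudocode above did not survive, the following is only one plausible reading of the smoothing idea, written under stated assumptions: each interval, the bucket passes at most `upper` records to the cache and holds the excess for the next interval (more-than-bound data), and it replenishes a short batch up to `lower` records with filler entries that a missing-value estimator could later fill (less-than-bound data). The names `smooth_traffic`, `lower`, `upper`, and `make_filler` are hypothetical.

```python
from collections import deque

def smooth_traffic(arrivals_per_tick, lower, upper, make_filler=lambda: None):
    """Yield one cache-ready batch per tick with lower <= len(batch) <= upper.

    Records still held in the bucket when the input ends are not flushed
    in this sketch.
    """
    bucket = deque()                      # carry-over held in front of the cache
    for arrivals in arrivals_per_tick:
        bucket.extend(arrivals)
        # More-than-bound data: pass on at most `upper` records, keep the rest
        batch = [bucket.popleft() for _ in range(min(upper, len(bucket)))]
        # Less-than-bound data: replenish up to `lower` with filler records
        while len(batch) < lower:
            batch.append(make_filler())
        yield batch
```

With `lower=2` and `upper=3`, a burst of five records followed by one record and then an empty interval is delivered as batches of three, three, and two, the last consisting of two fillers.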
Simulation experiments are conducted to validate our theoretical model comprising the ARC-cache and the VFDT. The aim of this section is to evaluate the performance of our proposed methods in dealing with missing values in data streams. Several different types of data streams are used in the experiments to facilitate a thorough comparison, including those generated synthetically from data generators and real-life data.
For the simulation, we implement a VFDT program and extend it by incorporating the functions of an ARC model for guessing missing values in data streams. Though the estimation method employed should be generic, the following methods are used in our experiments: the mean, naïve Bayesian, and C4.5 decision tree methods for nominal data, and the mean mode, linear regression, discretized naïve Bayesian, and M5P approaches for numeric data. The simulation system is built with a Java open source toolkit called Massive Online Analysis (MOA) stream mining software. The software package comes with a standard data stream generator and a Hoeffding tree algorithm. The data streams used in the experiments are stored in ARFF file format. The run time environment is Java JDK 1.5 and WEKA 3.6, and the computing platform is a Windows 7 64-bit workstation with an Intel quad core 2.83 GHz CPU and 8 GB of RAM.
An MOA stream generator is used to create a synthetic data stream comprising one million data records. We wrote a customized Java software program which randomly adds missing values according to a parameter set by the user, the missing data percentage (MDP). There is another optional control that allows the user to place missing values at either the beginning or the end of the data stream. It is well known that decision trees in stream mining models are unstable in the initial stage of learning. Inserting missing values at this early stage only lengthens the process of training the model to maturity. It may make more sense to observe the impact of missing values after the VFDT model is established and see how it responds to the imperfect stream. The MOA stream generator generates the four different synthetic datasets shown in Table
Synthetic datasets.
Name | Attribute number | Attribute type | Class number | Instance number |
---|---|---|---|---|
LED7 | 7 | Nominal | 10 | 1,000,000 |
LED24 | 24 | Nominal | 10 | 1,000,000 |
SEA | 3 | Numeric | 2 | 1,000,000 |
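The missing-value injector used to prepare these streams can be approximated by the sketch below. It assumes that the MDP is the fraction of records that each receive one missing attribute and that the affected records form a contiguous block at the chosen end of the stream; the original program may differ on both points. The function name `inject_missing`, the `at` and `seed` parameters, and the use of `None` as the missing marker are hypothetical.

```python
import random

def inject_missing(records, mdp, at="start", seed=0):
    """Blank one random attribute in round(mdp * len(records)) records,
    taken from the start or the end of the stream. Returns a new list."""
    rng = random.Random(seed)
    n = len(records)
    k = round(mdp * n)
    targets = range(k) if at == "start" else range(n - k, n)
    out = [list(r) for r in records]      # copy so the input stream is untouched
    for i in targets:
        out[i][rng.randrange(len(out[i]))] = None
    return out
```

Placing the block at the end of the stream corresponds to the scenario in which the VFDT is allowed to mature on clean data before the imperfect portion arrives.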
LED7 is a data stream that is simpler than the rest in that it has only 7 nominal attributes. In this experiment, we configure the MDP at 20%, 40%, 60%, 80%, and 100%, with missing values inserted randomly at defined positions in the data stream. As expected, in comparison with a perfect data stream where MDP = 0%, a higher MDP comes with a lower level of VFDT classification accuracy. Figure
Comparison of accuracy of missing values in the
Comparison of accuracy of missing values at the
LED24 is a more complicated data stream with 24 nominal attributes and a total of one million instance records. We add missing values at MDP = 50% to this dataset. To handle imperfect data, the ARC-cache is applied together with the VFDT with different window sizes: 250, 500, 750, and 1000. A C4.5 decision tree function in WEKA is chosen as the ARC construction method in this case. The experimental results shown in Figure
Missing values replacement performance comparison between ARC-cache and WEKA (standard function for replacing missing values).
Dataset | Wrong instance number | Wrong prediction (%) | VFDT accuracy |
---|---|---|---|
MDP = 0% | N/A | N/A | 0.899269 |
MDP = 50% | N/A | N/A | 0.828151 |
WEKA (mean) | 210521 | 42.10% | 0.867028 |
ARC ( | 166712 | 33.34% | 0.886413 |
ARC ( | 166679 | 33.34% | 0.886413 |
ARC ( | 166903 | 33.38% | 0.886413 |
Accuracy of ARC-cache and VFDT in LED24 data stream, missing data added at the
SEA is a data stream consisting of 3 numeric attributes, two nominal classes, and one million instances. We also add missing values at MDP = 50% to this dataset. To handle such an imperfect data stream, we use the ARC-cache and VFDT with different window sizes of 500, 750, and 1000. The missing value predictor in the ARC is initially trained by the M5P function in WEKA. The experimental results shown in Figure
Accuracy of ARC-cache and VFDT in SEA data stream, missing data added at the
The results are loaded into a margin curve chart visualized in WEKA to evaluate the model generated by a different data stream (as shown in Figure
Margin curves of accuracy testing in SEA dataset.
Nonmissing
MDP = 50%
Replace by Weka (mean)
Replace by ARC model
In this experiment, we use a set of real-world data streams downloaded from the 1998 KDD Cup competition provided by Paralyzed Veterans of America (PVA) [
In common with the previous experiment, we compare the ARC-cache and VFDT method with the standard missing values replacement method found in WEKA using means. The results of the comparison are shown in Figure
(a) Performance of ARC-cache and VFDT missing values replacement method; (b) magnified version of the diagram.
Another dataset from the real world is introduced to the experiment. This dataset, which can be downloaded from UCI Machine Learning [
A large number of missing values are deliberately added to the dataset. The missing completely at random (MCAR) method is chosen for use in this scenario. Forty percent of the total number of instances (records) is replaced by missing values, meaning 65,944 instances are added completely at random. The distributions of missing values for different attributes are 8,056 (person sequence name), 8,165 (tag identifier sensor), 7,921 (timestamp), 8,051 (date), 8,092 (
We observe two significant phenomena from the results shown in Figure
Percentage performance gain realized using the ARC-cache and VFDT method for different attributes in comparison with that achieved with the non-ARC-cache model.
Regarding the learning speed of the proposed method, we compared the ARC learning speed for each missing value in the same test. Figure
Learning speed comparison using the ARC-cache and VFDT method for different attributes.
An example that shows the stability of processing time for ARC-cache.
The complex nature of incomplete and infinite streaming data in WSNs has escalated the challenges faced in data mining applications concerning knowledge induction and time-critical decision making. Traditional data mining models employed in WSNs work mainly on the basis of relatively structured and stationary historical data and may have to be updated periodically in batch mode. The retraining process consumes time, as it requires repeated archiving and scanning of the whole database. Data stream mining is a process that can be undertaken at the front line in a manner that embraces incoming data streams. We propose using a Very Fast Decision Tree (VFDT) in place of traditional data mining models employed in WSNs due to its benefit of lightweight operation and its lack of a data storage requirement.
To the best of our knowledge, no prior study has investigated the impact of imperfect data streams or solutions related to data stream mining in WSNs, although the preprocessing of missing values is a well-known step in the traditional knowledge discovery process. This paper proposes a holistic model for handling imperfect data streams based on four features that riddle data transmitted among WSNs: missing values, noise, delayed data arrival, and data fluctuations. The model has a missing value predicting mechanism called the auxiliary reconciliation control (ARC). A bucket concept is also proposed to smooth traffic fluctuations and minimize the impact caused by late arriving data. Together with the VFDT, the ARC-cache facilitates data stream mining in the presence of noise and missing values. To prove the efficacy of our model, a simulation prototype is implemented based on ARC-cache and VFDT theories by using a Java platform. Experimental results consistently indicate that, when mining data streams that contain missing values, the ARC-cache and VFDT method yields better accuracy than the VFDT without the ARC-cache. One reason for this improved performance is ascribed to the improved predictive power of the ARC in comparison with other statistical counting methods for handling missing values, as the ARC computes the information gains of almost all other attributes with nonmissing data. In future research, we will continue to investigate the impact of noisy or corrupted data and irregular data stream patterns on data stream mining in WSNs.
Number of sensors to which the current WSN gateway is currently connecting
Number of attributes in sensor
Timeout parameter, in unit of number of time stamps
Number of class labels in the VFDT
Sensor index
Attribute index
Attribute value index
Timestamp index per complete instance being collected
Class label index
Number of values of attribute
Attributes collected by sensor
Attribute
Target class label
Indexed value of attribute
A record
Number of all attributes in
Time stamp
The abstract function of VFDT, classifies
Number of samples needed to update VFDT model
Heuristic evaluation of split attribute
Parameter to justify whether it is time to update VFDT model, when
Range of
The confidence of VFDT model update.
The importance of attribute
The importance of sensor
A binary variable taking the value of 0 if
A binary variable taking the value of 0 if sensor
The authors are thankful for the financial support from the research grant “Real-time Data Stream Mining,” Grant no. RG070/09-10S/FCC/FST, offered by the University of Macau, FST, and RDAO.