Classification Based on Pruning and Double Covered Rule Sets for the Internet of Things Applications

The Internet of Things (IOT) has attracted wide attention in recent years. IOT users generate large amounts of data, which makes mining useful knowledge from IOT a great challenge. Classification is an effective strategy for predicting the needs of IOT users. However, many traditional rule-based classifiers cannot guarantee that every instance is covered by at least two classification rules, so they cannot achieve high accuracy on some datasets. In this paper, we propose a new rule-based classifier, CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P induces two different rule sets, A and B. Every instance in the training set is covered by at least one rule in rule set A and by at least one rule in rule set B. To improve the quality of rule set B, we prune the length of its rules. Our experimental results indicate that CDCR-P is feasible and achieves high accuracy.


Introduction
The Internet of Things has been one of the hot topics of recent years. It integrates many kinds of modern technology, and these technologies produce large-scale data in IOT. Handling such large data requires techniques and methods from data mining and machine learning [1][2][3][4][5][6].
As one of the most important tasks of data mining, classification has been widely applied in IOT. The main idea of classification is to build classification rules. According to these rules, we can predict the class label of unknown objects.
Traditional rule-based classifiers, such as FOIL [7], CPAR [8], and CMER [9], usually use a greedy approach. These methods repeatedly search for the current best rule or best rules and remove the examples covered by them. They cannot guarantee that every instance is covered by at least two classification rules. As a result, some traditional classifiers produce fewer classification rules, and their accuracy may not be high. Decision tree classifiers, such as ID3 [10], C4.5 [11], and TASC [12], produce classification rules by constructing classification trees. The process of building a decision tree does not need to delete any examples, but each example can find only one matching rule in the classification rule set. That is why decision trees often generate small rule sets and cannot achieve high accuracy on some data.
Aiming at these weaknesses, we propose a novel double covered rule sets classifier called CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P generates two different rule sets, A and B, and then prunes rule set B. Each instance is covered by at least one rule from rule set A and, at the same time, by at least one rule from rule set B. CDCR-P has four aspects. First, CDCR-P generates rule set A. We select the best values which can just cover the training set to construct a candidate set, and CDCR-P employs this candidate set to produce rule set A. Second, in order to induce rule set B, we remove the values of the candidate set from the training data and select other best values to induce rule set B. Rule set B is therefore fully different from rule set A. Third, each instance can find at least two matching rules: one from rule set A and another from rule set B. Fourth, we prune the length of the rules in rule set B so as to improve its quality. Our method has the following advantages.
(1) CDCR-P produces two rule sets and can therefore generate a large number of classification rules.
(2) All instances in the training set are matched by at least two classification rules.
(3) CDCR-P can achieve high accuracy by combining rule set A with rule set B.
Algorithm 1: Dividing training set into three small datasets.
(1) $T_1 = \emptyset$, $CV_1 = \emptyset$, Length = 0;
(2) compute the information gain of each sample;
(3) sort the samples according to the information gain in descending order; if two samples have the same information gain, sort them according to the support;
(4) Length = $T$.Count × 1/3;
(5)-(6) … where … is the first element of …;
(7) for each tuple in …
(8) while … end while
(13) end for
(14) end while
(15) return $T_1$, $CV_1$
The paper is organized as follows. In Section 2, we introduce the method of CVCR (Classification based on Value Covered Rules). In Section 3, we propose a new classifier CDCR-P and discuss how to use CDCR-P to classify new objects. We report our experimental results in Section 4. We finally conclude our study in Section 5.

Classification Based on Value Covered Rules
In this section, we introduce the value covered classifier CVCR (Classification based on Value Covered Rules). Suppose $T = \{t_1, t_2, \dots, t_n\}$ is a set of tuples. Each tuple has $m$ attributes $\{a_1, a_2, \dots, a_m\}$. Let $C$ be a finite set of class labels $\{c_1, c_2, \dots, c_q\}$ and $S$ be a set consisting of $s$ data samples. A rule $r$ consists of several samples and a class label $c$, and takes the form $v_1 \wedge v_2 \wedge \dots \wedge v_l \to c$. A rule set is formed by the rules extracted by one classifier. If a tuple $t$ satisfies $v_1 \wedge v_2 \wedge \dots \wedge v_l$ from rule $r$, then $t$ is matched by $r$, and $r$ predicts that $t$ belongs to class $c$.

Definition 1 (information gain). Let $s_i$ be the number of samples of $S$ in class $c_i$. The information gain of an attribute value is denoted by $I(s_1, s_2, \dots, s_q)$ and is defined as follows:
$$I(s_1, s_2, \dots, s_q) = -\sum_{i=1}^{q} p_i \log_2 p_i,$$
where $p_i$ is the probability that a literal belongs to class label $c_i$; $p_i$ is estimated by $s_i / s$.
CVCR finds a set of values which can cover the whole training set. The process of constructing CVCR is as follows.
First, CVCR sorts all literals according to the information gain in descending order and selects the best attribute values $v_1, v_2, \dots, v_k$ which can just cover the training set $T$. The values $v_1, v_2, \dots, v_k$ form a candidate set. Let these values split $T$ into subdatasets $D_1, D_2, \dots, D_k$, respectively. Second, for each $v_i$, CVCR connects $v_i$ with the attribute values $v_{i1}, v_{i2}, \dots, v_{ij}$ which can just cover dataset $D_i$ to produce patterns. Finally, the above steps are repeated until the information gain of each pattern is equal to 0.
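To make Definition 1 and the first CVCR step concrete, the following minimal Python sketch computes the measure of Definition 1 for one attribute value and then picks values, best first, until every training tuple is covered. The dictionary-based tuple layout, the `class` key, and the function names `information_gain` and `select_cover_values` are assumptions made for illustration and do not come from the paper. The ranking of the literals is supplied by the caller, and skipping a value that covers no new tuple is one possible reading of "just cover".

```python
from collections import Counter
from math import log2


def information_gain(rows, attribute, value, label="class"):
    """Measure of Definition 1 for the literal attribute == value.

    Returns sum_i p_i * log2(1 / p_i), where p_i is the share of class i
    among the rows containing the literal (0.0 when they share one class).
    """
    counts = Counter(r[label] for r in rows if r.get(attribute) == value)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the literal does not occur in rows")
    return sum((n / total) * log2(total / n) for n in counts.values())


def select_cover_values(rows, ranked_literals):
    """Pick literals in the given order until all rows are covered.

    ranked_literals is a list of (attribute, value) pairs, best first.
    A literal is kept only if it covers a row not covered so far
    (assumed interpretation). Returns (chosen literals, uncovered indices).
    """
    uncovered = set(range(len(rows)))
    chosen = []
    for attribute, value in ranked_literals:
        if not uncovered:
            break
        newly = {i for i in uncovered if rows[i].get(attribute) == value}
        if newly:
            chosen.append((attribute, value))
            uncovered -= newly
    return chosen, uncovered
```

A value whose covered tuples all share one class scores 0 under this measure, which is the stopping condition used for patterns in the text.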
The experimental results of CVCR are shown in Table 2. They show that CVCR can achieve higher accuracy than ID3 and FOIL. Because CVCR contains the globally optimal attribute values, CVCR is more feasible than ID3. However, CVCR still produces few classification rules and cannot guarantee that each instance is matched by at least two rules.

Classification Based on Pruning and Double Covered Rule Sets
In this section, we present a new method, CDCR-P. First, we show how to induce rule sets A and B. Second, we describe how to prune rule set B. Finally, we show how to use the two rule sets A and B to classify new objects.

Constructing Rule Sets A and B.
Based on the idea of CVCR, we mine the training data in more depth. This approach divides the training set $T$ into three small datasets $T_1$, $T_2$, and $T_3$ according to the candidate set. The method contains four steps.
Step 1, we select the best attribute values $v_1, v_2, \dots, v_k$ from the candidate set which can just cover one-third of the training set. Attribute values $v_1, v_2, \dots, v_k$ have less information gain. The tuples which contain one of $v_1, v_2, \dots, v_k$ form the small dataset $T_1$. The process is shown as Algorithm 1. We form $T_2$ and $T_3$ in the same way as $T_1$.
The Scientific World Journal
Step 2, according to $v_1, v_2, \dots, v_k$, $T_1$ is split into datasets $D_1, D_2, \dots, D_k$. We find cv (a set of cover values) from $D_i$ on the basis of information gain. The measure of cv is the same as in CVCR, shown as Algorithm 2. CDCR connects $v_i$ with cv to produce patterns. If the information gain of a pattern $p$ is equal to 0, $p \to c$ belongs to rule set A, shown as Algorithm 3.
Step 3, CDCR recalculates the CV (cover values) in $T_1$, excluding $v_1, v_2, \dots, v_k$. CV splits $T_1$ into several datasets, and CDCR connects the covered values in each dataset to produce new rules. These rules belong to rule set B, shown as Algorithm 4. Finally, we remove $T_1$ from $T$ and iterate the process until $T_2$ and $T_3$ are trained. Rule set A is the same as the rule set of CVCR. Both rule sets A and B belong to CDCR.
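Step 1 above can be sketched as follows. This is a minimal illustration, assuming tuples are Python dictionaries and that the candidate values arrive already ordered; the function name `take_one_third` and its return layout are hypothetical and are not taken from Algorithm 1. Values are consumed in order until the tuples they cover reach one-third of the training set, so the resulting small dataset may slightly exceed one-third.

```python
def take_one_third(rows, candidate_values):
    """Split off a small dataset covered by the first candidate values.

    candidate_values is an ordered list of (attribute, value) pairs.
    Returns (small dataset, values used, remaining rows).
    """
    target = len(rows) / 3
    covered = set()
    used = []
    for attribute, value in candidate_values:
        if len(covered) >= target:
            break
        # Every tuple containing the value joins the small dataset.
        covered |= {i for i, r in enumerate(rows) if r.get(attribute) == value}
        used.append((attribute, value))
    small = [rows[i] for i in sorted(covered)]
    rest = [r for i, r in enumerate(rows) if i not in covered]
    return small, used, rest
```

Calling the function again on the remaining rows with the unused candidate values would give a second small dataset in the same way.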

Pruning Rule Set B.
In order to improve the quality of rule set B, we introduce a new method, CDCR-P (Classification based on the Pruning and Double Covered Rule sets).
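Definition 2 (confidence). The confidence of sample $v$ is defined as follows:
$$\mathrm{conf}(v) = \frac{\mathrm{count}(v, c)}{\mathrm{count}(v)},$$
where $\mathrm{count}(v, c)$ means the number of tuples which contain sample $v$ in class $c$, and $\mathrm{count}(v)$ means the number of tuples which contain sample $v$.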
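As a small illustration of Definition 2, the sketch below computes the confidence of a pattern for a given class. It assumes that confidence is the share of the tuples containing the pattern that also carry the class label; the dictionary layout, the `class` key, and the name `confidence` are illustrative choices.

```python
def confidence(rows, pattern, cls, label="class"):
    """Share of the rows matching pattern that belong to class cls.

    pattern is an iterable of (attribute, value) pairs that must all hold.
    Returns 0.0 when no row matches the pattern.
    """
    matching = [r for r in rows if all(r.get(a) == v for a, v in pattern)]
    if not matching:
        return 0.0
    return sum(r[label] == cls for r in matching) / len(matching)
```

A confidence of 1.0 corresponds to the 100% case discussed in this subsection.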
The confidence of the rules that CDCR generates is equal to 100%. We modify the length of the rules in rule set B. The rules are generated when the confidence is 100% in the small dataset $T_i$ instead of in the whole training set $T$. Each rule is marked with the confidence in $T$. Thus, the rules of rule set B in CDCR-P are shorter than those of rule set B in CDCR.
Input: Training data $T$, CV
Output: Rule set $R$
Method:
(1) Rule set $R = \emptyset$, dataset $D = \emptyset$, cover value cv $= \emptyset$, candidate set cs $= \emptyset$, InitQueue $Q$;
(2) add the cover values which can cover $D$ to cs, with cs $\notin$ CV;
(3) while cs $\neq \emptyset$
(4)   $Q$.push($x$), where $x$ is the first element of cs;
(5)   cs = cs − {$x$};
(6) end while
(7) while !$Q$.empty()
(8)   $x$ = $Q$.front(); $Q$.pop();
(9)   if $x$.length > max rule length, continue;
(10)  compute the information gain of $x$;
(11)  if $x$.information gain == 0 …
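The queue-driven steps listed above can be illustrated with a short Python sketch. The listing breaks off at step (11), so how a non-pure pattern is extended is an assumption here: the sketch appends every literal of a not yet used attribute that occurs in the covered tuples. The function name `grow_rules`, the dictionary layout, and the default `max_rule_length` are likewise hypothetical.

```python
from collections import deque


def grow_rules(rows, seeds, max_rule_length=3, label="class"):
    """Breadth-first pattern growth starting from seed literals.

    seeds is a list of (attribute, value) pairs. A pattern whose covered
    rows share one class (information gain 0) becomes a rule; other
    patterns are extended by one literal (assumed extension step).
    """
    rules = []
    queue = deque((seed,) for seed in seeds)
    seen = set()
    while queue:
        pattern = queue.popleft()
        if len(pattern) > max_rule_length:
            continue  # step (9): skip patterns that are too long
        covered = [r for r in rows if all(r.get(a) == v for a, v in pattern)]
        if not covered:
            continue
        classes = {r[label] for r in covered}
        if len(classes) == 1:
            rules.append((pattern, classes.pop()))  # pure pattern -> rule
            continue
        used = {a for a, _ in pattern}
        extensions = sorted(
            {(a, v) for r in covered for a, v in r.items()
             if a != label and a not in used}
        )
        for literal in extensions:
            new_pattern = tuple(sorted(pattern + (literal,)))
            if new_pattern not in seen:
                seen.add(new_pattern)
                queue.append(new_pattern)
    return rules
```

The sketch requires the values of any one attribute to be mutually comparable, because candidate extensions are sorted to keep the output deterministic.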

Classifying Unknown Examples.
In this part, we describe how to use CDCR and CDCR-P to classify unknown instances.
Definition 3 (support). The support of sample $v$ is denoted by
$$\mathrm{sup}(v) = \frac{\mathrm{count}(v)}{|T|},$$
where $\mathrm{count}(v)$ means the number of tuples which contain sample $v$, and $|T|$ is the number of tuples in the training data.
When testing unknown examples, CDCR selects the matched rule with the highest support. If several rules have the same support, we select the class with the largest number of matched rules.
CDCR-P first considers the rule with the highest confidence. If two rules have the same confidence, CDCR-P sorts the two rules according to the support.
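The two selection strategies can be sketched in Python as follows. The rule representation (a dictionary with `pattern`, `cls`, `confidence`, and `support` keys) and the function names are assumptions for illustration. For CDCR, the tie-break is read here as a vote over all matched rules, restricted to the classes that share the highest support; this is one possible interpretation of the text.

```python
from collections import Counter


def matches(rule, example):
    """True when the example satisfies every literal of the rule."""
    return all(example.get(a) == v for a, v in rule["pattern"])


def classify_cdcr(rules, example):
    """Highest support wins; ties go to the class with most matched rules."""
    matched = [r for r in rules if matches(r, example)]
    if not matched:
        return None  # no rule matches: a missing match
    top = max(r["support"] for r in matched)
    tied = {r["cls"] for r in matched if r["support"] == top}
    if len(tied) == 1:
        return next(iter(tied))
    votes = Counter(r["cls"] for r in matched if r["cls"] in tied)
    return votes.most_common(1)[0][0]


def classify_cdcr_p(rules, example):
    """Highest confidence wins; equal confidence is resolved by support."""
    matched = [r for r in rules if matches(r, example)]
    if not matched:
        return None
    best = max(matched, key=lambda r: (r["confidence"], r["support"]))
    return best["cls"]
```

Returning `None` for an example without a matching rule makes it straightforward to count missing matches separately from wrong predictions.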

Experiments
We report experimental results on 14 UCI datasets. The characteristics of each dataset are shown in Table 1. In Table 2, we give the accuracy of ID3, FOIL, CVCR, CDCR, and CDCR-P. Figure 1 gives the accuracy of ID3, FOIL, and CVCR. CVCR employs the idea of covered values; these cover values are the globally optimal attribute values in the training data. From Figure 1 and Table 2 we can see that CVCR achieves higher accuracy than ID3 and FOIL. Figure 2 gives the accuracy of CVCR, CDCR, and CDCR-P. CDCR not only uses the method of covered values but also produces two rule sets, A and B. Each instance can be matched by at least one rule from rule set A and at least one rule from rule set B. From Figure 2 and Table 2 we can see that CDCR achieves higher accuracy than CVCR. Building on the advantages of CVCR and CDCR, CDCR-P prunes the length of the rules in rule set B. The experimental results show that CDCR-P has the highest accuracy.
Table 3 displays the missing match rate of ID3, FOIL, CVCR, CDCR, and CDCR-P. CVCR can produce more rules than ID3 and FOIL, and from Table 3 we can see that CVCR clearly decreases the missing match rate. CDCR produces two rule sets. Therefore, CDCR produces more rules than CVCR. From Table 3 we can see that the missing match rate of CDCR is lower than that of CVCR. CDCR-P modifies the length of the rules in rule set B; the quality of the rules in CDCR-P is higher than in CDCR. The experiments indicate that the missing match rate of CDCR-P is the lowest.
From all the above experimental results, we can conclude the following. (1) It is necessary to construct two rule sets. (2) It is necessary to prune rule set B. (3) CDCR-P can achieve high accuracy and has an excellent missing match rate.

Conclusions
Classification has been widely applied in IOT, and accuracy is an important factor in classification tasks. Traditional rule-based classifiers cannot guarantee that all test cases are matched by two rules, and they usually generate few classification rules. Thus, the accuracy of these algorithms may be low on some data. In this paper, a novel approach, CDCR-P, is proposed. CDCR-P generates two rule sets: rule set A and rule set B. All instances can be matched by at least one rule in rule set A and by at least one rule in rule set B. This method greatly increases the number of extracted rules and thus gets more information from the training data. Our experimental results show that CDCR-P can produce more rules and achieve high accuracy. In future research, we will perform an in-depth study on combining distributed data mining with IOT in order to improve the efficiency of CDCR-P.