A Novel Classification Approach through Integration of Rough Sets and Back-Propagation Neural Network

Classification is an important theme in data mining. Rough sets and neural networks are among the most common techniques applied to data mining problems. In order to extract useful knowledge and classify ambiguous patterns effectively, this paper presents a hybrid algorithm based on the integration of rough sets and a BP neural network to construct a novel classification system. The attribute values were first discretized through a PSO algorithm to establish a decision table. An attribute reduction algorithm and a rules extraction method based on rough sets were proposed, and the flowchart of the proposed approach was designed. Finally, a prototype system was developed and some simulation examples were carried out. Simulation results indicated that the proposed approach was feasible and accurate and outperformed other methods.


Introduction
Across a wide variety of fields, data are being accumulated at a dramatic pace, especially in the age of the internet [1,2]. There is much useful information hidden in the accumulated voluminous data, but it is very hard for us to obtain it. Thus, a new generation of computational tools is needed to assist humans in extracting knowledge and classifying the rapidly growing digital data; otherwise, these data are useless for us.
Neural networks are applied in several engineering fields such as classification problems and pattern recognition. An artificial neural network in its most general form attempts to produce systems that work in a way similar to biological nervous systems. It has the ability to simulate human thinking and possesses powerful capabilities and incomparable superiority in establishing nonlinear and experiential knowledge simulation models [3-6]. The back-propagation neural network (BP neural network for short) is the core of feed-forward networks and embodies the essence of artificial neural networks. However, the conventional BP algorithm has a slow convergence rate, weak fault tolerance, and nonunique results. Although many improved strategies have been proposed, such as the additional momentum method, the adaptive learning method, the elastic BP method, and the conjugate gradient method [7-9], the problems above have still not been solved completely, especially when BP neural networks are applied in multidimensional and uncertain fields. Therefore, newer techniques must be coupled with neural networks to create more efficient and complex intelligent systems [10].
In an effort to model vagueness, rough sets theory was proposed by Pawlak in 1982 [11,12]. Its outstanding feature is that it does not need a specific relation description of certain characteristics or attributes, but determines the approximate region of the existing problem and finds out the inherent rules through indiscernibility relations and indiscernibility classes. This theory has been successfully applied in many fields such as machine learning, data mining, data analysis, and expert systems [13,14]. In this paper, rough sets theory and BP neural network are presented as an integrated method because they can discover patterns in ambiguous and imperfect data and provide tools for data analysis. The decision table is established reasonably through discretizing attribute values. The features of the decision table are analyzed using rough sets, and subsequently a classification model based on these features is built through the integration of rough sets and BP neural network. Actually, this model uses the rules extracted through rough sets as the basis for the structure of the BP neural network.

Literature Review
In recent years, most classifier architectures have been constructed based on artificial intelligence (AI) algorithms, with the unceasing development and improvement of AI theory. In this section, we try to summarize some recent literature relevant to the construction methods of classification systems. Hassan et al. constructed a classifier based on neural networks and applied rough sets theory to attribute reduction to preprocess the training set of the neural networks [15]. Berardi et al. proposed a principled approach to building and evaluating neural network classification models for the implementation of decision support systems (DSS) and verified the optimization speed and sample classification accuracy [16]. In [17,18], architectures of neural networks with fuzzy input were proposed to effectively improve the classification performance of neural networks. Hu et al. presented a self-organized feature mapping neural network to reasonably determine the input parameters in pavement structure design [19]. In [20,21], wavelet neural networks were applied to the construction of classification systems and better prediction results were obtained. In [22,23], a classifier in which an improved particle swarm algorithm was used to optimize the parameters of a wavelet neural network was established, and its application to nonlinear identification problems demonstrated its strong generalization capability. Sengur et al. described the usage of wavelet packet neural networks for the texture classification problem and provided a wavelet packet feature extractor and a multilayer perceptron classifier [24]. Hassan presented a novel classifier architecture based on rough sets and dynamic scaling in connection with wavelet neural networks, and its effectiveness was verified by experiments [25].
Although many approaches for the establishment of classification systems have been presented in the above literature, they have some common disadvantages, summarized as follows. On one hand, the classification effect of a single neural network cannot be guaranteed when the network is applied in multidimensional and uncertain fields. On the other hand, the combined method of rough sets and neural networks can only deal with classification and recognition problems that possess discretized datasets and cannot process problems that contain continuous datasets.
In this paper, a novel classification system based on the integration of rough sets and BP neural network is proposed. The attribute values are discretized through PSO and the decision table is constructed. The proposed model takes account of attribute reduction by discarding redundant attributes, and a rule set is generated from the decision table. A simulation example and comparisons with other methods are carried out, and the proposed approach is proved feasible and outperforms the others.

Basic Theory
3.1. Rough Sets Theory. Fundamental to rough sets theory is the idea of an information system IS = (U, C), which is essentially a finite data table consisting of columns labeled by attributes and rows labeled by objects of interest, where the entries of the table are attribute values. U is a nonempty finite set of objects called the universe and C is a nonempty finite set of attributes, where each a ∈ C is called a condition attribute. In IS = (U, C), every B ⊆ C generates an indiscernibility relation Ind(B) on U, and U/Ind(B) is the partition of U induced by B. For all x ∈ U, the equivalence class of x in relation U/Ind(B) can be defined as follows:

[x]_Ind(B) = {y ∈ U : a(y) = a(x), ∀a ∈ B}.

According to Ind(B), two crisp sets, called the lower and the upper approximation of the set of objects X, can be defined as follows:

B_X = {x ∈ U : [x]_Ind(B) ⊆ X},
B^X = {x ∈ U : [x]_Ind(B) ∩ X ≠ ∅},

where [x]_Ind(B) denotes the equivalence class of x under Ind(B). The lower approximation B_X is the set of all objects from U which can be certainly classified as elements of X employing the set of attributes B. The upper approximation B^X is the set of objects of U which can be possibly classified as elements of X using the set of attributes B [25]. A decision table DT = (U, C ∪ D) is a special form of an information system whose major feature is the distinguished attribute set D, where C ∩ D = Ø and D is called the decision attribute set. Generally speaking, there is a certain degree of dependency between condition attributes and decision attributes, which can be defined as follows:

γ_C(D) = |POS_C(D)| / |U|,

where POS_C(D) is the C-positive region of D, that is, the union of all equivalence classes of U/Ind(C) that are contained in a single class of U/Ind(D).
Due to the relevance between condition attribute and decision attribute, not all condition attributes are necessary for decision attribute so as to introduce the attribute reduction which is a smaller set of attributes that can classify objects with the same discriminating capability as that of the original set.As well known, the reduct is not the only one [26,27].
When we want to determine the reduct of a decision table, the significance degree of attributes is commonly used and can be defined as follows:

sig(a, B; D) = γ_{B ∪ {a}}(D) − γ_B(D),

where sig(a, B; D) denotes the significance degree of attribute a relative to the attribute subset B and the decision attribute set D. For all a ∈ C, if POS_{C−{a}}(D) = POS_C(D), then a in C is superfluous for D; otherwise, a in C is indispensable for D. The set composed of all attributes that are indispensable for D is called the core of C relative to D, namely, the relative core CORE_D(C). This relative core cannot be removed from the system without a loss in the knowledge that can be derived from it.
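To make these definitions concrete, the positive region, the dependency degree γ_B(D), and the significance degree sig(a, B; D) can be sketched in Python. This is a minimal illustration, not the paper's implementation; the table layout (a list of dict rows) and the function names are assumptions.

```python
from collections import defaultdict

def partition(table, attrs):
    """Group object indices into indiscernibility classes on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(table, cond_attrs, dec_attr):
    """Objects whose condition class falls inside a single decision class."""
    pos = set()
    for block in partition(table, cond_attrs):
        if len({table[i][dec_attr] for i in block}) == 1:
            pos.update(block)          # block is certainly classified
    return pos

def dependency(table, cond_attrs, dec_attr):
    """gamma_B(D) = |POS_B(D)| / |U|."""
    return len(positive_region(table, cond_attrs, dec_attr)) / len(table)

def significance(table, a, B, dec_attr):
    """sig(a, B; D) = gamma_{B union {a}}(D) - gamma_B(D)."""
    return (dependency(table, list(B) + [a], dec_attr)
            - dependency(table, list(B), dec_attr))
```

For example, on a four-object table where attribute c1 alone leaves one inconsistent block, dependency on {c1} is 0.5 while adding c2 raises it to 1.0, giving sig(c2, {c1}; D) = 0.5.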
Once determined, the relative core of the decision table must first be judged as to whether it is a reduct of C relative to D, namely, a relative reduct. The judgment rules are shown as follows.
In addition, decision rules can be perceived as data patterns which represent the relationships between the attribute values of a classification system [11,12]. The form of the decision rules can be shown as IF F THEN D, where F is the conditional part of the rule, a conjunction of selectors over condition attributes, and D is the decision part, which usually describes the class of the decision attribute.

3.2. BP Neural Network. A BP neural network (BP-NN) is a back-propagating neural network; it belongs to the class of networks whose learning is supervised, and the learning rules are provided by the training set to describe the network behavior. The topology structure of BP-NN is shown in Figure 1.
In Figure 1, the input vector X = {x_1, x_2, ..., x_n} is furnished by the condition attribute values of the reduced decision table and the output vector Y = {y_1, y_2, ..., y_m} is the predicted class of the decision attribute. The output of the hidden layer can be calculated as follows:

H_j = f(∑_{i=1}^{n} w_{ij} x_i − a_j), j = 1, 2, ..., l,

where w_{ij} is the connection weight between the input and hidden layers; l is the number of hidden layer nodes; a_j is the threshold of hidden layer node j; f is the activation function of the hidden layer and can be chosen as a linear function or the sigmoid function f(x) = 1/(1 + e^{−x}).
The output of the output layer can be calculated as follows:

O_k = ∑_{j=1}^{l} H_j w_{jk} − b_k, k = 1, 2, ..., m,

where w_{jk} is the connection weight between the hidden and output layers; m is the number of output layer nodes; b_k is the threshold of output layer node k.
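The two layer equations above can be sketched directly. This is a minimal forward-pass illustration under the stated formulas; the function and parameter names (`forward`, `w_ih`, `w_ho`) are hypothetical, not from the paper.

```python
import math

def sigmoid(x):
    """Hidden-layer activation f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, a, w_ho, b):
    """One forward pass:
    H_j = f(sum_i w_ij * x_i - a_j),  O_k = sum_j H_j * w_jk - b_k."""
    H = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(len(x))) - a[j])
         for j in range(len(a))]
    O = [sum(H[j] * w_ho[j][k] for j in range(len(H))) - b[k]
         for k in range(len(b))]
    return H, O
```

With all weights and thresholds at zero and a unit hidden-to-output weight, each hidden node outputs sigmoid(0) = 0.5, which propagates unchanged to the output.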
The training of the network parameters w_{ij}, w_{jk}, a_j, and b_k depends on the error value, which can be calculated by the following equation:

e_k = Y_k − O_k, k = 1, 2, ..., m,

where O_k and Y_k are the current and desired output values of the network, respectively. The weight values and thresholds of the network are then updated in the direction that reduces this error, with each adjustment scaled by η(t), where t is the current iteration number and η(t) is the learning rate, whose range is the interval [0, 1].
The learning rate has a great influence on the generalization effect of the neural network. A larger learning rate significantly modifies the weight values and thresholds and increases the learning speed, but an overlarge learning rate produces large fluctuations in the learning process, while an excessively small learning rate leads to slow network convergence and makes it difficult for the weight values and thresholds to stabilize. Therefore, this paper presents a variable learning rate algorithm to solve the above problems, which can be described as follows:

η(t) = η_max − (η_max − η_min) · t / t_max,

where η_max and η_min are the maximum and minimum of the learning rate and t_max is the maximum number of iterations.
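The variable learning rate can be sketched in one line. The linear decay form is an assumption consistent with the listed η_max, η_min, and t_max parameters (the paper's equation is typeset ambiguously); the defaults below match the values used later in the simulation (η_max = 0.9, η_min = 0.4, t_max = 500).

```python
def variable_learning_rate(t, t_max, eta_max=0.9, eta_min=0.4):
    """Linearly anneal the learning rate from eta_max down to eta_min
    over t_max iterations."""
    return eta_max - (eta_max - eta_min) * t / t_max
```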

The Proposed Approach
This section tries to present a new approach aiming at connecting rough sets with BP neural network to establish a classification system.The section has five main parts and can be elaborated through the following subsections.

4.1. The System Construction Process. The construction process of the classification system is as follows.
(1) Preprocessing the historical dataset to obtain the sample data is the application precondition of rough sets and BP neural network. In the process of preprocessing, the historical dataset is discretized through the PSO algorithm to construct the decision table.
(2) Simplify the input of the neural network, including the input dimension and the number of training samples, through attribute reduction based on the significance degree algorithm. Deleting identical rows in the decision table simplifies the training samples, and eliminating superfluous columns (condition attributes) reduces the network input dimension. Thus, reduce the decision table using some reduct or the union of several calculated reducts; that is, remove from the table the attributes not belonging to the chosen reducts.
(3) The BP neural network integrated with rough sets is used to build the network over the reduced set of data. Decision rules are extracted from the reduced decision table, as the basis of the network structure, through a value reduction algorithm of rough sets.
(4) Perform network learning for the parameters w_{ij}, w_{jk}, a_j, and b_k. Repeat this step until there is no significant change in the output of the network. The system construction process is shown in Figure 2.

4.2. The Discretization Algorithm for Decision Table.
Due to the fact that rough sets theory cannot directly process continuous attribute values, a decision table composed of continuous values must be discretized first. Thus, the positions of the breakpoints must be selected appropriately to discretize the attribute values, so as to obtain fewer breakpoints, a larger dependency degree between attributes, and less redundant information. In this paper, the basic particle swarm optimization (PSO) algorithm described in [28] is used to optimize the positions of the breakpoints. In PSO, a swarm of particles is initialized randomly in the solution space and each particle is updated by a certain rule to explore the optimal solution g_best after several iterations. Particles are updated by tracking two "extremums" in each iteration: one is the optimal solution p_best found by the particle itself, and the other is the optimal solution g_best found by the whole swarm. The specific iteration formulas can be expressed as follows:

v_i(t+1) = ω(t) v_i(t) + c_1 r (p_best,i − x_i(t)) + c_2 r (g_best − x_i(t)),
x_i(t+1) = x_i(t) + v_i(t+1),

where i = 1, 2, ..., M; M is the number of particles; t is the current iteration number; c_1 and c_2 are the acceleration factors; r is a random number between 0 and 1; x_i is the current position of particle i; v_i is the current update speed of particle i; ω(t) is the current inertia weight and can be updated by the following equation:

ω(t) = ω_max − (ω_max − ω_min) · t / t_max,

where ω_max and ω_min are the maximum and minimum of the inertia weight and t_max is the maximum number of iterations. The main discretization steps for the decision table through PSO can be described as follows.
(1) Normalize the attribute values and initialize the breakpoints number h, the maximum iterations t_max, the minimum dependency degree between condition attributes and decision attribute γ_min, the number of particles M, the initial positions x_i^0, and the initial velocities v_i^0, which are h × n matrices (n is the number of attributes).
(2) Discretize the attribute values with the position of particle i and calculate the dependency degree γ_i as the fitness value of each particle to initialize p_best and g_best.
(3) Update the velocity and position of each particle through the iteration formulas above, recalculate the fitness values, and update p_best and g_best; if γ ≥ γ_min or the maximum number of iterations is reached, stop; otherwise go to step (1) to search for the optimal particles. Then, compare the optimal particles and select the better one as the basis of discretization for the decision table.
The more detailed discretization process is elaborated in Section 4.5.
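The particle update and the linearly decreasing inertia weight can be sketched as follows. This is a minimal illustration of one PSO iteration, not the paper's code; the defaults mirror the parameter values used later in the simulation (c_1 = c_2 = 1.6, v_max = 0.6, ω_max = 0.9, ω_min = 0.4), and the function names are assumptions.

```python
import random

def inertia(t, t_max, w_max=0.9, w_min=0.4):
    """Linearly decrease the inertia weight from w_max to w_min."""
    return w_max - (w_max - w_min) * t / t_max

def pso_step(positions, velocities, p_best, g_best, w,
             c1=1.6, c2=1.6, v_max=0.6):
    """One velocity/position update for every particle
    (each particle is a list of floats, one entry per breakpoint)."""
    for i, x in enumerate(positions):
        for d in range(len(x)):
            r1, r2 = random.random(), random.random()
            v = (w * velocities[i][d]
                 + c1 * r1 * (p_best[i][d] - x[d])
                 + c2 * r2 * (g_best[d] - x[d]))
            velocities[i][d] = max(-v_max, min(v_max, v))  # clamp speed
            x[d] += velocities[i][d]
```

In the discretization setting, each particle position encodes a candidate set of breakpoints and the fitness is the dependency degree of the resulting discretized decision table.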

4.3. Rough Sets for the Attribution Reduction.
One of the fundamental steps in the proposed method is the reduction of pattern dimensionality through feature extraction and feature selection [10]. Rough sets theory provides tools for expressing inexact dependencies within data [29]. Features may be irrelevant (having no effect on the processing performance), and attribute reduction can discard the redundant information to enable the classification of more objects with high accuracy [30].
In this paper, the algorithm based on attribute significance degree is applied to the attribute reduction of the decision table. The specific reduction steps can be described as follows.
(1) Compute the relative core CORE_D(C) of the decision table and initialize reduct = CORE_D(C) and redundant = C − reduct.
(2) Compute POS_reduct(D); if POS_reduct(D) = POS_C(D), the reduction is complete.
(3) The attribute in redundant with the maximum significance degree is marked as a_max. Update the reduct and redundant through the following equations:

reduct = reduct ∪ {a_max},
redundant = redundant − {a_max}.

(4) Go to step (2) until POS_reduct(D) satisfies the condition; then output RED_D(C) = reduct.
Therefore, the irrelevant features are discarded and the decision table (DT) is reduced. Thus, a reduced decision table is constructed and can be regarded as the training set to optimize the structure of the RS-BPNN classifier.
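The reduction loop above can be sketched as a greedy search. This is a minimal self-contained illustration under the assumption that the reduct is grown from the relative core by repeatedly adding the attribute with maximum significance; the helper names are hypothetical.

```python
from collections import defaultdict

def dependency(table, attrs, dec):
    """gamma_B(D): fraction of objects consistently classified by attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    pos = sum(len(b) for b in blocks.values()
              if len({table[i][dec] for i in b}) == 1)
    return pos / len(table)

def greedy_reduct(table, cond_attrs, dec, core=()):
    """Grow a reduct from the relative core until its positive region
    matches that of the full condition attribute set."""
    reduct = list(core)
    redundant = [a for a in cond_attrs if a not in reduct]
    target = dependency(table, cond_attrs, dec)
    while dependency(table, reduct, dec) < target and redundant:
        # pick the attribute whose addition raises gamma the most
        a_max = max(redundant,
                    key=lambda a: dependency(table, reduct + [a], dec))
        reduct.append(a_max)
        redundant.remove(a_max)
    return reduct
```

Note that greedy selection yields *a* reduct, not necessarily the minimum one; as stated in Section 3.1, the reduct is not unique.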

4.4. Rough Sets for the Rules Extraction.
The knowledge in a trained network is encoded in its connection weights and is distributed numerically throughout the whole structure of the neural network. For the knowledge to be useful in the context of a classification system, it has to be articulated in a symbolic form, usually as IF-THEN rules. Since a neural network is a "black box" for users, the rules implicit in the connection weights are difficult to understand, and extracting rules from a neural network is extremely tough because of the nonlinear and complicated nature of the data transformation conducted in the multiple hidden layers [31]. In this paper, the decision rules are instead extracted indirectly from the reduced decision table using the value reduction algorithm of rough sets [32]. This algorithm can be elaborated in the following steps.
Step 1. Check the condition attributes of the reduced decision table column by column. If deleting a column brings conflicting objects, then preserve the original attribute value of these conflicting objects; if it brings no conflicts but produces duplicate objects, then mark this attribute value of the duplicate objects as "*"; mark this attribute value as "?" for the other objects.
Step 2. Delete the possible duplicate objects and check every object that contains the mark "?". If the decision can be determined by the unmarked attribute values alone, then "?" should be changed to "*"; otherwise it should be restored to the original attribute value. If all condition attributes of a particular object are marked, then the attribute values marked with "?" should be restored to the original attribute values.
Step 3. Delete the objects whose condition attributes are all marked as "*", and the possible duplicate objects.
Step 4. If only one condition attribute value differs between two objects and this attribute value of one object is marked as "*", then: if that object's decision can be determined by its unmarked attribute values, delete the other object; otherwise delete this object.
Each object in the decision table value-reduced through the above steps represents a classification rule, and the number of attributes not marked as "*" in an object makes up the condition number of that rule. Moreover, this rule set provides the rational basis for the structure of the BP neural network.
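Once extracted, rules of this form can be applied directly to new objects. The sketch below shows matching against conditions that may contain the "*" wildcard; it is a minimal illustration, and the function names and dict-based rule encoding are assumptions, not the paper's representation.

```python
def rule_matches(condition, obj):
    """A condition like {'x1': 1, 'x3': 2, 'x6': '*'} matches an object
    when every non-wildcard attribute value agrees."""
    return all(v == '*' or obj[a] == v for a, v in condition.items())

def classify(rules, obj):
    """Return the decision of the first matching (condition, decision)
    rule, or None when no rule fires."""
    for condition, decision in rules:
        if rule_matches(condition, obj):
            return decision
    return None
```

For instance, a rule such as "123*1 → class 1" over attributes {x1, x3, x5, x6, x8} matches any object with x1 = 1, x3 = 2, x5 = 3, x8 = 1, regardless of x6.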

4.5. The Flowchart of the Proposed Approach. According to the above description of the classifier based on the integrated algorithm of rough sets and BP neural network, the proposed approach is an iterative algorithm and can be coded easily on a computer; the flowchart is summarized in Figure 3.

Simulation Example
A classifier named RS-BPNN (short for rough sets with BP neural network) was set up and implemented in VC 6.0. In this section, an engineering application of shearer running status classification in a coal mine is put forward as a simulation example to verify the feasibility and effectiveness of the proposed approach. The classification capabilities of different types of neural network models are compared, and the proposed approach is shown to outperform the others.

5.1. The Construction of Decision Table.
Due to the poor working conditions of coal mining, there are many parameters relating to the running status of the shearer, mainly including cutting motor current, cutting motor temperature, traction motor current, traction motor temperature, traction speed, tilt angle of the shearer body, tilt angle of the rocker, and vibration frequency of the rocker transmission gearbox, marked as x_1, x_2, x_3, x_4, x_5, x_6, x_7, and x_8, respectively. According to the information database acquired from the 11070 coal face in the number 13 Mine of Pingdingshan Coal Industry Group Co., 200 groups of datasets of shearer running status were rearranged as shown in Table 1. The "running status" referred to the cutting status of the shearer rocker and could be expressed by the ratio of actual cutting load to rated load. Therefore, in the decision table DT, the condition attribute set and decision attribute set could be expressed as C = {x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8} and D = {d} (d denotes the shearer running status). The values of the decision attribute were discretized based on the following assumptions: for the ratio of actual cutting load to rated load being greater than 1.6, d = 4; for this ratio being in the interval (1.2, 1.6], d = 3; for this ratio being in the interval (0.8, 1.2], d = 2; for this ratio being 0.8 or less, d = 1. The discretized values of the condition attributes could be determined through the algorithm described in Section 4.2. The condition attributes to be discretized were x_1 ∼ x_8, so n = 8. The other parameters of PSO were initialized as follows: M = 100, t_max = 200, γ_min = 0.9, v_max = 0.6, c_1 = c_2 = 1.6, r is a random number from [0, 1], ω_max = 0.9, and ω_min = 0.4. As the breakpoints number h had a great influence on the dependency degree γ_C(D) of the discretized decision table, two important parameters, dependency degree and simulation time, were compared when h was assigned various values. The comparison results are shown in Figure 4.
From Figure 4, it was observed that the dependency degree and simulation time both showed an ascending tendency with the increase of the breakpoints number h. The dependency degree was not significantly increased while the simulation time was obviously increased when h ≥ 2. So h was set equal to 2, giving γ_C(D) = 0.98. The breakpoints of each attribute value are shown in Table 2. The breakpoints were applied to discretize the parameter values in Table 1: if the value was in the interval (0, Br_1], the discretized attribute value was labeled as 1; if the value was in the interval (Br_1, Br_2), it was labeled as 2; otherwise, it was labeled as 3. The discretized decision table is shown in Table 3.
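The labeling scheme above can be sketched as a small helper, where br1 and br2 stand for the per-attribute breakpoints Br_1 and Br_2 of Table 2 (the function name is illustrative):

```python
def discretize(value, br1, br2):
    """Map a continuous attribute value to a label using two breakpoints:
    (0, br1] -> 1, (br1, br2) -> 2, [br2, inf) -> 3."""
    if value <= br1:
        return 1
    if value < br2:
        return 2
    return 3
```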

5.2. Relative Reduct and Rules Extraction.
Through the attribute reduction algorithm presented in Section 4.3, the function "CORE(C, D)" was invoked to obtain the relative core of the decision table, CORE_D(C) = {x_1, x_8}, as shown in Figure 5. As the minimum relative reduct of the decision table, RED_D(C) = {x_1, x_3, x_5, x_6, x_8} was determined by invoking the function "Reduct." In the module "Rough Set for Attribution Reduction," a new decision table of 200 × 6 was established from the condition attributes {x_1, x_3, x_5, x_6, x_8} and the decision attribute {d}. The function "Refresh (DT)" was called to reject the repeated objects in this new decision table, and a reduced decision table containing 79 objects was finally provided, as shown in Figure 6. Subsequently, the decision rules were extracted through the function "Extract rules," which was based on the value reduction algorithm, and 25 rules were obtained, as shown in Figure 6. The extracted rules could reveal the potential rules (knowledge) in the dataset; for example, the first rule "123*11" in Figure 6 illustrated that if cutting motor current x_1 ∈ (0, 48.56], traction motor current x_3 ∈ (82.37, 95.42), traction speed x_5 ∈ [3.24, +∞), tilt angle of shearer body x_6 was an arbitrary value, and vibration frequency of rocker transmission gearbox x_8 ∈ (0, 651.28], then the running status class was 1, which was in accordance with the engineering practice. Moreover, the RS-BPNN classifier was, in fact, diagnosing the status of objects according to these potential rules (knowledge), and the structure of the network could be constructed reasonably based on the number of rules so as to enhance the classification accuracy.

5.3. The Structure and Testing of RS-BPNN Classifier.
Because 25 rules (knowledge) were obtained from the dataset, the RS-BPNN was constructed with double hidden layers and the number of hidden neurons was 5 × 5, which could completely contain these rules (knowledge). The number of input layer nodes was n = 5 and the number of output layer nodes was m = 1. The other parameters of the network were determined as follows: η_max = 0.9, η_min = 0.4, t_max = 500; the initial connection weights were assigned somewhat random values. The structure of the RS-BPNN classifier is shown in Figure 7.
The reduced decision table, as the training set, was presented to the network several times. The testing set, composed of 40 samples extracted randomly from the information database, was not seen by the network during the training phase and was only used for testing the generalization of the neural network after training, as shown in Table 4. After the training phase, the RS-BPNN classifier was obtained. In order to test the performance of this classifier conveniently, a testing interface was designed as shown in Figure 8. The testing set was imported with the "Import samples" button and processed with the "Discretize" and "Reduce attribution" functions. The prediction results of the RS-BPNN classifier were output through the "Classify" function in Figure 8. The contrast of predicted class and actual class is shown clearly in Figure 9.
When the classification accuracy of RS-BPNN was computed in Figure 8, the output of the classifier could not be exactly equal to the desired output. Therefore, the output was allowed to differ from the desired value: if the difference between the classifier output and the required decision value was less than some preset value, the output was regarded as correct. This tolerance margin was decided during simulation evaluation and was finally set to 5% of the desired value. As seen from Figures 8 and 9, the classification accuracy was 90.00% and the average classification error was 2.02%. The present model showed higher accuracy and lower error for the classification of shearer running status. The testing results showed that the proposed classifier was satisfactory and could be used in engineering applications in the future.
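The 5% tolerance criterion can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def tolerant_accuracy(outputs, desired, tol=0.05):
    """An output counts as correct when it differs from the desired
    value by less than tol (here 5%) of that desired value."""
    correct = sum(1 for o, d in zip(outputs, desired)
                  if abs(o - d) < tol * abs(d))
    return correct / len(outputs)
```

For example, with desired classes [1.0, 2.0] and outputs [1.02, 2.5], only the first output falls within its 5% margin, giving an accuracy of 0.5.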

5.4. Discussion.
To evaluate the classification capabilities of different types of neural network models (WNN, short for wavelet neural network; RS-WNN, short for rough sets coupled with wavelet neural network; BP-NN, short for back-propagation neural network; and RS-BPNN, short for rough sets coupled with BP neural network), both inputs and outputs were normalized in order to obtain the best model. The number of nodes in the hidden layer of WNN and RS-WNN was equal to the number of wavelet bases: if the number was too small, WNN/RS-WNN might not reflect the complex functional relationship between the input data and output values; on the contrary, too large a number might create such a complex network that it could lead to very large output errors caused by overfitting of the training sample set. Therefore, a search mechanism was needed to find the optimal number of nodes in the hidden layer for the WNN and RS-WNN models. In this study, various numbers of nodes in the hidden layer were checked to find the best one; WNN and RS-WNN yielded the best results when the number of hidden nodes was 8.
In this subsection, WNN, RS-WNN, BP-NN, and RS-BPNN were applied to solve the problem of the above simulation example. The configurations of the simulation environment for the four algorithms were uniform and consistent with the above simulation example. Initially, for each network, the connection weights were assigned somewhat random values. In order to avoid random error, the training input set was presented to the networks 100 times and the average values were calculated. The training accuracy, testing accuracy, classification error, and classification time of the four algorithms are shown in Table 5.
From the table, it was observed that the proposed model had a better classification capability and better performance than the other neural network models in predicting the nonlinear, dynamic, and chaotic behaviors in the dataset, and the proposed model was shown to outperform the others. Furthermore, the comparison results suggest that the proposed model represents a good new method for classification and decision making, and the new method can be treated as a promising tool for extracting rules from datasets in industrial fields.
In order to illustrate the superiority of the proposed method, the Iris dataset [23], a benchmark dataset from the UCI database, was used to verify the classifiers based on the above four types of neural networks, and the compared results are shown in Table 6. The sample data were first reduced based on the attribute reduction algorithm and then used to train the neural networks, so the networks coupled with rough sets (RS-WNN and RS-BPNN) could acquire better classification accuracy than the single networks (WNN and BP-NN). The rules were extracted through the value reduction algorithm to construct the RS-BPNN structure, so the classification accuracy of the classifier based on RS-BPNN was higher than that of the RS-WNN classifier.

Conclusions
This paper proposed a novel classifier model based on BP neural network and rough sets theory for application in nonlinear systems. The decision table was constructed and discretized reasonably through the PSO algorithm. The decision rules were extracted by the use of a value reduction algorithm, which provided the basis for the network structure. In order to verify its feasibility and superiority, the proposed approach was applied to a classification problem from an industrial example. The results of the comparison simulations showed that the proposed approach could generate more accurate classifications at less classification time than the other neural network approaches and the rough sets based approach. The classification performance of the proposed model demonstrates that this method can be extended to other types of classification problems such as gear fault identification, lithology recognition, and coal-rock interface recognition.

Figure 1: The topology structure of BP-NN.

Figure 3: The flowchart of the proposed approach.

Figure 4: The change curves of dependency degree and simulation time.

Figure 9: The contrast of prediction class and actual class.

Table 1: The running status of shearer and its corresponding parameters.

Table 2: The corresponding breakpoints of each attribute value at h = 2.

Table 3: The discretized decision table.

Table 4: The testing samples for the classifier.

Table 5: Classification performance of the four models of neural networks.

Table 6: Classification results of the four methods.