Random Fuzzy Granular Decision Tree

In this study, the classification problem is solved from the view of granular computing. That is, the classification problem is equivalently transformed into the fuzzy granular space to solve. Most classification algorithms can handle only numerical data; the random fuzzy granular decision tree (RFGDT) can handle not only numerical data but also nonnumerical data such as information granules. The work proceeds in four steps. First, an adaptive global random clustering (AGRC) algorithm is proposed, which can adaptively find the optimal cluster centers, maximize the ratio of the inter-cluster standard deviation to the intra-cluster standard deviation, and avoid falling into local optima. Second, on the basis of AGRC, a parallel model is designed for fuzzy granulation of data to construct the granular space, which greatly enhances efficiency compared with serial granulation. Third, in the fuzzy granular space, we design the RFGDT to classify fuzzy granules; it selects important features as tree nodes based on the information gain ratio and avoids overfitting via the proposed pruning algorithm. Finally, we employ datasets from the UC Irvine Machine Learning Repository for verification. Theory and experimental results show that RFGDT has high efficiency and accuracy and is robust in solving classification problems.


Introduction
The classification problem is a necessary research topic in data mining fields. Among classification approaches, the decision tree is very effective. It has the advantages of high classification accuracy, few parameters, and strong interpretability. Decision trees have been widely adopted in business, medical care, etc., and have achieved remarkable results. Data generated in daily life is increasing rapidly, which brings opportunities and challenges for decision trees to solve large-scale data classification problems. The decision tree is an inductive learning algorithm based on examples. With in-depth research on decision tree algorithms and diversified needs in practical applications, a variety of learning algorithms and models for constructing decision trees have been proposed.

Information Entropy Decision Tree.
Quinlan proposed the ID3 decision tree algorithm.
This algorithm employs information gain from information theory to evaluate feature quality when splitting tree nodes, and the feature with the largest information gain and the corresponding split point are used to construct the node [1]. The ID3 algorithm is a clear and quick method, but it also has some obvious shortcomings. First, regarding the scope of datasets it can process, it is not suitable for datasets containing continuous features; second, it tends to choose conditional features with more values as the optimal splitting features. In the same year, Schlimmer and Fisher
proposed the ID4 decision tree method, which constructs decision trees in an incremental manner [2]. Two years later, the ID5 algorithm was presented by Utgoff et al., which allows the structure of an existing decision tree to be modified by adding new training instances without retraining [3]. In addition, Xiaohu Liu and his colleagues discussed decision tree construction that, when selecting the split feature, considers both the information gain of the conditional feature at the current node and that of the conditional feature at the next node [4]. Aiming at the shortcomings of ID3 in selecting features, Xizhao Wang et al. selected appropriate decision tree branches for merging during construction. This algorithm can increase comprehensibility, improve generality, and reduce complexity [5]. The C4.5 decision tree, proposed by Quinlan in 1993, improved performance compared with the ID3 algorithm [6]. To solve the bias of information gain when selecting split features, this method employs the information gain rate as the metric for choosing split features. The C4.5 algorithm also has the advantage of processing continuous features, discrete features, and incomplete datasets, giving it stronger applicability than ID3. When constructing the C4.5 decision tree, a pruning operation is adopted, which can enhance the efficiency of the decision tree, reduce its scale, and effectively avoid overfitting. Domingos and his colleagues presented a very fast decision tree for data streams, called VFDT [7]. The algorithm shortens the training time of the decision tree by using sampling technology. Hulten designed the CVFDT algorithm, which extended VFDT and required users to give parameters in advance [8].

Gini Coefficient Decision Tree.
In 1984, Breiman and his colleagues designed the CART algorithm, which adopts the Gini coefficient as the metric for feature splitting and chooses the conditional feature with the smallest Gini coefficient as the splitting feature of a tree node to generate a decision tree [17]. When generating a decision tree, if the purity of a spanning tree node is greater than or equal to a threshold assigned in advance, the node stops dividing, and the majority label of the instance data covered by the node is used as the label of the leaf node. Meanwhile, the approach adopts resampling to analyze the accuracy of the constructed decision tree and perform pruning operations, and the decision tree with high accuracy and the smallest size is selected as the final decision tree. Here, the pruning method is minimum cost complexity pruning (MCCP), which can solve the problem of overfitting, reduce the tree size, and improve interpretability. CART also has its own shortcomings: owing to the limitation of computer memory, the method cannot effectively process large-scale datasets. Aiming at this shortcoming of the CART algorithm, in 1996 Mehta et al. proposed the SLIQ decision tree [18]. When building the decision tree, instances are presorted, and then a breadth-first method is used to choose the optimal split feature. The SLIQ algorithm has the advantages of fast calculation speed and the ability to process larger datasets. When the data exceed memory capacity, the method employs feature lists, classification lists, and class histograms to obtain the solution. However, when the amount of data is sufficiently large, the algorithm still faces the problem of insufficient memory [19]. In 1998, Rastogi et al. presented the PUBLIC decision tree algorithm [20]. The algorithm integrates the tree construction process with the tree adjustment process.
By calculating the cost function value of a tree node, it is judged whether the node needs to be pruned or expanded; if it is not expanded, the node is marked as a leaf node. The combination of the establishment process and the adjustment process greatly increases the training efficiency of the PUBLIC decision tree. The Rain Forest is a decision tree framework proposed by Gehrke and his colleagues in 2000 [21]. The research aim of the algorithm was to enhance the scalability of decision trees and reduce the consumption of computer memory resources as much as possible.

Rough Set Decision Tree.
Incorporating rough set theory into the decision tree gives it the ability to handle uncertain, incomplete, and inconsistent data. When using rough sets to generate a decision tree, the main research focus is how to use rough set theory to choose node-splitting features. Miao and his colleagues designed a rough-set-based multivariate decision tree algorithm. The method first selects the conditional features in the kernel to construct a multivariate test and then generates a new feature to split the node [22]. The advantage of this algorithm is its relatively high training efficiency, but because there are too many variables in the nodes of the decision tree, it is difficult to interpret. Wang et al. designed a fuzzy decision tree on the basis of rough sets, which uses fuzzy integration to keep the results consistent [23]. In 2011, Zhai and his colleagues adopted the fuzzy rough set to generate a decision tree and presented a new selection criterion for fuzzy condition features [24]. Wei et al. constructed a decision tree according to a variable precision rough set, which allows small errors in the classification process and improves the generalization ability of the decision tree [25]. Jiang and his colleagues designed an incremental decision tree learning method via rough sets and used it to study the problem of network intrusion detection [26]. In 2012, Hu and others proposed a monotonous ordered mutual information decision tree. This decision tree employs dominance rough set theory to establish a new metric of feature quality for constructing tree nodes. It is resistant to noise, can handle monotonic classification problems, and performs well on general classification problems [27]. On the basis of this algorithm, Qian and his colleagues designed a fusion monotonic decision tree.
The algorithm uses feature selection technology to generate multiple data feature distributions to construct multiple decision trees and employs these trees to make comprehensive decisions [28]. Pei and his colleagues designed a monotonically constrained multivariate decision tree. The algorithm first uses the ordered mutual information splitting criterion to generate different data subsets and then optimizes these subsets to build a decision tree [29].

Parallel Decision Tree.
Nonparallel or serial decision trees have received extensive and in-depth research and development, and many decision tree models and algorithms have been proposed; however, owing to the recursive nature of decision trees and the computing platforms involved, research on decision tree parallelization is comparatively limited. The following is a brief review of the current research status of parallel decision trees. Research on parallel decision trees started with SPRINT, proposed by Shafer et al. in 1996 [30]. The algorithm tries to avoid the problem of insufficient memory by improving the data structure during the growth of the decision tree, but when computing the tree's splitting nodes, it requires the entire instance set to be broadcast among processors. Kufrin et al. discussed the parallelization of decision trees and introduced a parallelization framework in 1997 [31]. One year later, Joshi and his team proposed a parallel decision tree similar to SPRINT. Unlike the traditional depth-first construction of decision trees, the algorithm adopts breadth-first growth within a parallel framework.
This method can avoid possible load imbalance problems [32]. Srivastava and his team presented two parallel decision tree models based on synchronous and asynchronous construction approaches, but both models suffer from large communication overhead and load imbalance [33]. Shent et al. gave a parallel decision tree that divides the input instances into four subsets to build a decision tree, and applied the generated parallel decision tree to the user authentication problem [34]. With in-depth research on parallel decision trees and the emergence of the MapReduce distributed computing framework, Panda et al. designed a parallel decision tree in 2009 that relies on multiple distributed computing frameworks [35]. Walkowiak and his colleagues focused on the parallelization of decision trees and proposed an optimization model of network computing for the distributed implementation of decision trees [36]. In 2012, Yin et al. adopted the scalable regression tree algorithm to give a parallel implementation of the Planet algorithm under the MapReduce framework [37]. In response to the problem of overlearning or overfitting, Wang Ran et al. proposed an extreme learning machine decision tree based on a parallel computing framework, but its disadvantage is that it can handle only numerical datasets, not mixed datasets [38]. The parallel C4.5 decision tree proposed in [39] considers both the overfitting problem and mixed data types. For the ordered mutual information decision tree widely used in monotonic classification problems, Mu and his colleagues presented a fast version and gave its parallel implementation [40]. Other parallel decision tree approaches have been proposed as well, such as the distributed fuzzy decision tree [41] and the parallel Pearson correlation coefficient decision tree [42].
In addition to the above research, Li and other researchers proposed some classification and alignment algorithms [43][44][45][46][47][48][49] from the perspective of granular computing, which have good performance.

Contributions
In this study, a decision tree is constructed in granular space to solve the classification problem. The main contributions are as follows: (i) We propose AGRC, which adaptively gives the optimal cluster centers; it is a global optimization method and can avoid falling into local optima. (ii) We design a parallel granulation method based on the above clustering algorithm, which solves the high complexity of traditional serial granulation and enhances granulation efficiency. (iii) In granular space, we define fuzzy granules and related operators and select features based on the information gain ratio to construct a fuzzy granular decision tree for classification; to avoid overlearning, we also design the corresponding pruning algorithm. The method presented can solve binary or multiclass classification problems and gives feature importance according to the order in which the tree nodes are generated.

The Problem
Let S = (X, R, Y, h, V) be a classification system. The goal is to design a classification algorithm whose parameters are obtained from statistics of the instances; the label of an instance can then be predicted by the model. The symbols are explained in detail as follows. X = {x_1, x_2, ..., x_n} is an instance set. R = {r_1, r_2, ..., r_m} represents a feature set. V = ∪_{r∈R} V_r, where V_r denotes the value region of feature r. h: X × R → V expresses an information function that allocates a value to each feature, that is, ∀r ∈ R, x ∈ X, h(x, r) ∈ V_r. Y = {y_1, y_2, ..., y_l} indicates a label set, where y_j is the label of instance x_i.
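As a minimal illustration, the system S and the information function h can be represented directly; the class and method names below (`ClassificationSystem`, `value_region`) are our own, not from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ClassificationSystem:
    """Container for the classification system S = (X, R, Y, h, V)."""
    X: List[str]                     # instance identifiers x_1 .. x_n
    R: List[str]                     # feature names r_1 .. r_m
    Y: Dict[str, str]                # label of each instance
    h: Dict[Tuple[str, str], float]  # information function h(x, r)

    def value_region(self, r: str):
        """V_r: the set of values feature r takes over X."""
        return {self.h[(x, r)] for x in self.X}


S = ClassificationSystem(
    X=["x1", "x2"],
    R=["r1"],
    Y={"x1": "y1", "x2": "y2"},
    h={("x1", "r1"): 0.2, ("x2", "r1"): 0.8},
)
```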

The Algorithm Description
To obtain the solution of the classification problem described in Section 3, the model is as follows. First, to enhance efficiency as much as possible, we cluster the data; an adaptive clustering algorithm is designed that obtains the number of clusters and the cluster centers automatically. Second, on the basis of the number of centers and the cluster centers, parallel granulation is executed by calculating the distances between instances and cluster centers and dividing the instance set into instance subsets. Third, the instance classification problem is converted into a granule classification problem in granular space; fuzzy granules, related operators, and a cost function are defined there, and splitting nodes are found by optimizing the cost function to build a fuzzy granular decision tree. Fourth, the pruning algorithm for RFGDT is designed. Finally, the label of a test instance can be predicted by the RFGDT. The overview is described in Figure 1.

Theory of AGRC.
K-means is an unsupervised clustering approach. The cluster centers and their number need to be specified in advance, and the result depends on the initial cluster centers: if they are poorly selected, the algorithm falls into a local optimum. We propose the Adaptive Global Random Clustering (AGRC) algorithm, which adaptively selects the cluster centers and their number; the initial selection of cluster centers is random, making it a global optimization approach. The idea is as follows. If the between-cluster variance α² is large and the within-cluster variances θ_k² are small, then the clustering performance is good. Hence, α² / Σ_{k=1}^{K} θ_k² can be used as an evaluation indicator, where α represents the inter-cluster standard deviation and θ_k denotes the standard deviation of the k-th cluster. The goal is to increase the ratio α² / Σ_{k=1}^{K} θ_k² until the maximum number of iterations is reached. In each iteration, we obtain a new set of cluster centers, and these parameters (the number of cluster centers, the cluster centers themselves, and the evaluation value) are combined into an evaluation set. When the process is over, the cluster centers corresponding to the largest evaluation value in the evaluation set are the optimal parameters. Within each iteration, we select an instance as the next cluster center with a certain probability until the number of cluster centers reaches the value preset for that iteration. The process is described as follows: Step I: remove instances with missing feature values.
Step III: assign the maximum number of iterations Max, the evaluation set E = ∅ (composed of the inter-cluster standard deviation, the within-cluster standard deviations, and the evaluation value), and the iteration counter t = 1.
Step IV: assign the current cluster center set C_t = ∅ and the current number of cluster centers k = 0, and generate a random number n within (1, N) as the number of cluster centers.
Step V: randomly select one instance x_i as a cluster center and set k = k + 1.
Step VI: calculate the shortest distance D(x) between each remaining instance and all current cluster centers; the probability of instance x_j being selected as the next cluster center is proportional to D(x_j)².
Step VII: if x_j is selected as a cluster center, set C_t = C_t ∪ {x_j} and k = k + 1.
Step VIII: if k < n, go to Step V; otherwise, go to Step IX.
Step IX: calculate the inter-cluster standard deviation and the within-cluster standard deviations, and update the evaluation set.
Step X: update the iteration counter t ← t + 1.
Step XI: if t > Max, go to Step XII; otherwise, go to Step IV.
Step XII: in the evaluation set E, the cluster center set C* with the largest ratio of the inter-cluster standard deviation to the sum of within-cluster standard deviations, together with the corresponding number of cluster centers K = |C*| (where |·| denotes the number of elements of a set), is the optimal solution.
The principle of Adaptive Global Random Clustering is given above. On the basis of this principle, the procedure is shown in Algorithm 1.
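A compact sketch of Steps I-XII, assuming Euclidean distance and a D²-weighted selection probability (as in k-means++) for the probabilities whose exact formulas are not reproduced here; `agrc` and its internals are illustrative names, not the paper's implementation:

```python
import math
import random


def agrc(X, max_iter=50, seed=0):
    """Adaptive Global Random Clustering sketch.

    X is a list of numeric feature vectors. Each iteration draws a random
    number of centers, picks them with D^2-weighted probability, and scores
    the result by the ratio of inter-cluster to within-cluster variance.
    """
    rng = random.Random(seed)
    N = len(X)

    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    best_score, best_centers = -1.0, None
    for _ in range(max_iter):
        n = rng.randint(2, max(2, N - 1))        # random number of centers
        centers = [list(rng.choice(X))]
        while len(centers) < n:
            # shortest squared distance of each instance to current centers
            d2 = [min(dist(x, c) for c in centers) ** 2 for x in X]
            if sum(d2) == 0:
                break
            # next center drawn with probability proportional to D(x)^2
            centers.append(list(rng.choices(X, weights=d2, k=1)[0]))
        # assign each instance to its nearest center
        clusters = [[] for _ in centers]
        for x in X:
            j = min(range(len(centers)), key=lambda i: dist(x, centers[i]))
            clusters[j].append(x)
        # inter-cluster variance alpha^2 and within-cluster variances theta_k^2
        mean = [sum(col) / N for col in zip(*X)]
        alpha2 = sum(dist(c, mean) ** 2 for c in centers) / len(centers)
        theta2 = sum(
            (sum(dist(x, c) ** 2 for x in cl) / len(cl)) if cl else 0.0
            for c, cl in zip(centers, clusters)
        )
        score = alpha2 / theta2 if theta2 > 0 else 0.0
        if score > best_score:
            best_score, best_centers = score, centers
    return best_centers, best_score
```

The evaluation set of the text is collapsed here into tracking only the best-scoring center set, which is all Step XII ultimately extracts.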

From Data to Fuzzy Granules.
Now we introduce how to implement parallel fuzzy granulation of data via the cluster centers. We adopt AGRC to get the cluster center set C = {c_1, c_2, ..., c_K}. Let the instance set be X and the feature set be R. For ∀x_i ∈ X, c_j ∈ C, and r ∈ R, the similarity between x_i and c_j w.r.t. r is d(x_i, c_j, r), where 0 ≤ h(x_i, r) ≤ 1, 0 ≤ h(c_j, r) ≤ 1, and 0 ≤ d(x_i, c_j, r) ≤ 1. A fuzzy granule q(x_i, r) generated by x_i joins these similarities, where "−" represents a separator and "+" denotes the union of elements; in other words, q(x_i, r) records the similarity between x_i and the cluster centers, and its cardinal number is computed from the similarities it contains. For ∀x, z ∈ X, operators between the fuzzy granules generated by x and z are defined accordingly. For a feature subset U = {r_1, r_2, ..., r_|U|} with |U| ≤ |R|, the fuzzy granular array Q(x, U) generated by x on U joins the fuzzy granules q(x, r) over r ∈ U, and its cardinal number aggregates the granule cardinalities. Likewise, for the fuzzy granular arrays Q(x, U) and Q(z, U) generated by the instances x and z on the feature subset U, operators between arrays and the difference between two fuzzy granular arrays are defined (equation (19)). From this information granulation, we can see that fuzzy granules and fuzzy granular arrays are generated together with their operators. The fuzzy granules constitute the space called the fuzzy granular space.

Algorithm 1 (AGRC) takes the instance set X and the maximum number of iterations Max as input and outputs the optimal cluster center set C* and its number K. It removes instances with missing feature values, initializes the current cluster center set to the empty set, randomly generates the number n of cluster centers in (1, N), randomly selects an instance of the dataset as a cluster center, repeatedly calculates the probability of each remaining instance x_j being selected as the next cluster center and draws against a randomly generated probability, then calculates, stores, and updates the inter-cluster and within-cluster standard deviations in the evaluation set E, and finally chooses from E the cluster center set with the largest ratio.

Mathematical Problems in Engineering

Theorem 1. For ∀x_i, x_j ∈ X, ∀U ⊆ R, ∀r ∈ U, the similarity of the fuzzy granules generated on the feature subset U is determined by the similarities of the granules on each feature r ∈ U. The proof follows from equation (7) together with equations (13) and (14).

Below we give an example to explain the granulation process and its measurement (see Table 1).

Example 1. As shown in Table 1, let X, R, and Y be the instance set, the feature set, and the label set, respectively. C = {c_1, c_2} represents the cluster center set, and the parameter is λ = 0.5. The fuzzy granulation proceeds as follows. According to the definition of q, we obtain the fuzzy granules of instance x_1; similarly, we obtain those of x_2. Hence, the distance between instances x_1 and x_2 on the feature set R with λ = 0.5 can be computed from their fuzzy granular arrays.
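A sketch of the granulation step, assuming the simple similarity d(x, c, r) = 1 − |h(x, r) − h(c, r)| for values normalized to [0, 1] and taking the cardinal number as the sum of memberships; the paper's exact equations for d and the cardinality may differ, so treat these as illustrative choices:

```python
def similarity(h_x, h_c):
    """Assumed similarity d(x, c, r) = 1 - |h(x, r) - h(c, r)|."""
    return 1.0 - abs(h_x - h_c)


def fuzzy_granule(x, centers, r, h):
    """q(x, r): similarities between instance x and every cluster center on r."""
    return [similarity(h[(x, r)], h[(c, r)]) for c in centers]


def granular_array(x, centers, U, h):
    """Q(x, U): one fuzzy granule per feature in the subset U."""
    return [fuzzy_granule(x, centers, r, h) for r in U]


def cardinality(granule):
    """Cardinal number of a granule, taken here as the sum of memberships."""
    return sum(granule)
```

Each instance thus becomes a |U| × K array of membership values, which is the object the decision tree later classifies.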

Random Fuzzy Granular Decision Tree.
RFGDT embodies the structure and expresses the course of classifying instances on the basis of features. It comprises a series of if-then rules, or it can be regarded as a conditional probability distribution defined on the fuzzy granular space and the label space. Its strengths are readability and high efficiency. When learning, we employ the training data to generate an RFGDT by minimizing a cost function. When predicting, test data are classified via the model. There are three steps in the learning of an RFGDT, namely, feature selection, tree construction, and tree pruning.
An RFGDT describes the feature structure for classifying fuzzy granules in the fuzzy granular space and is composed of directed edges and nodes. An internal node expresses a feature, and a leaf node denotes a label.
During classification, the model starts from the root node, tests a certain fuzzy granule of the instance, and assigns the fuzzy granule to a child node according to the result; each child node corresponds to one value of the feature. In this way, fuzzy granules are tested and allocated recursively until they reach a leaf node, where they are assigned the label of that leaf node. The definition of the fuzzy granular rule set is as follows.
Definition 1. Suppose that S = (X, R, C, Y) is a decision system, where X = {x_1, x_2, ..., x_n} is an instance set, R = {r_1, r_2, ..., r_m} is a feature set, Y = {y_1, y_2, ..., y_l} is a label set, and C = {c_1, c_2, ..., c_K} is a cluster center set. For ∀x ∈ X, a fuzzy granular array Q(x, R) can be generated by instance x.
Then, the fuzzy granular space can be generated by Q(x, R), where W, G ∈ R^{K×m} are fuzzy granular array coefficients. Suppose that ∀x ∈ X, y ∈ Y; a rule rule_R(x) = [Q(R, x), y] can be made up of a fuzzy granular array and a label. Thus, a rule set RULE_R = {rule_R(x) | ∀x ∈ X} can be constructed.

If-Then Rule.
An RFGDT can also be regarded as a set of if-then rules. It is converted into an if-then rule set as follows: a rule is constructed for every path from the root node to a leaf node; the features of the internal nodes correspond to the conditions of the rule, and the label of the leaf node corresponds to the conclusion of the rule. The paths of an RFGDT (corresponding to the if-then rule set) have a key property: they are mutually exclusive and complete. This means that every fuzzy granular array is covered by exactly one path, or rule. Coverage here means that the features of the fuzzy granular array are consistent with the features on the path.
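The path-to-rule conversion can be sketched as follows, encoding a tree as nested dicts (an illustrative choice, not the paper's data structure); each root-to-leaf path yields one rule, and together the rules are mutually exclusive and complete:

```python
# A node is either a label (leaf) or a dict {feature: {value: subtree}}.
def tree_to_rules(node, conditions=()):
    """Enumerate one if-then rule per root-to-leaf path.

    Returns (conditions, conclusion) pairs: the internal-node tests along
    the path form the rule's conditions, the leaf label its conclusion.
    """
    if not isinstance(node, dict):          # leaf: conclusion of the rule
        return [(list(conditions), node)]
    rules = []
    for feature, branches in node.items():
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, conditions + ((feature, value),))
    return rules


tree = {"r1": {"low": "y1", "high": {"r2": {"low": "y2", "high": "y1"}}}}
```

Because each branch partitions a feature's values, any input matching the tree falls on exactly one enumerated path.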

Learning.
The learning of RFGDT is to generalize a set of classification rules from the training data. There may be more than one RFGDT consistent with the training data (i.e., able to classify the training data correctly). Our purpose is to find an RFGDT that contradicts the training data as little as possible and has strong generalization ability. In other words, RFGDT learning estimates the conditional probability on the training data. Infinitely many conditional probability models exist under different divisions of the fuzzy granular space. The chosen conditional probability model should not only fit the training data well but also predict the test data well. RFGDT learning uses a cost function to express this aim. As mentioned below, the cost function of RFGDT learning is usually a regularized maximum likelihood function, and the strategy is to minimize it. The RFGDT learning algorithm recursively selects the optimal feature and segments the training data on the basis of that feature, so that each subdataset is classified as well as possible. This process corresponds to the division of the fuzzy granular space and the formation of RFGDTs. At the beginning, the root node is constructed and all fuzzy granular arrays are placed at the root node. The algorithm chooses an optimal feature and divides the training data into several subsets on the basis of this feature, so that each subset has the best classification under the current conditions. If these subsets have been correctly classified, the algorithm constructs leaf nodes and assigns the subsets to the corresponding leaf nodes; if some subsets still cannot be correctly classified, the algorithm selects new optimal features for them, continues to segment them, constructs the corresponding nodes, and proceeds recursively until all training data subsets are basically classified correctly or no suitable feature remains.
Finally, each subset is assigned to a leaf node, i.e., has a clear category. This generates an RFGDT. The RFGDT produced may classify the training data well but the test data poorly, that is, overfitting may occur. We need to prune the tree from the bottom up to make it simpler and thus enhance its generalization ability. Specifically, we remove leaf nodes that are too finely subdivided, make them fall back to the parent node or even an ancestor node, and then turn that parent or ancestor node into a new leaf node.
If the number of features is large, features can also be selected at the beginning of RFGDT learning, keeping only features with sufficient classification ability for the training data. We conclude that the learning algorithm comprises feature selection, RFGDT construction, and RFGDT pruning. Since an RFGDT denotes a conditional probability distribution, RFGDTs of different depths correspond to probability models of different complexity. The generation of an RFGDT corresponds to local selection of the model, and the pruning of an RFGDT corresponds to global selection: generation considers only the local optimum, while pruning considers the global optimum.

Feature Selection and Cost Function Construction.
Selecting important features can enhance the efficiency of RFGDT learning. If the result of using a feature for classification differs little from the result of random classification, the feature is said to have no classification ability; empirically, discarding such features has little effect on accuracy. Here, we redefine the information gain ratio and use this criterion as the cost function for constructing an RFGDT. First, the empirical entropy of the dataset X is defined by

E(Q) = −Σ_{i=1}^{l} (|Q_i|/|Q|) log₂(|Q_i|/|Q|),

where Q denotes the set composed of fuzzy granular arrays, |Q| expresses the number of elements of the set, Q_i is the subset composed of fuzzy granular arrays whose classification is y_i, and |Q_i| is the number of elements of that subset. Stipulate 0 log₂(0) = 0. It can be seen from the definition that the entropy depends only on the distribution of Q and has nothing to do with the values in Q. The greater the entropy, the greater the uncertainty of the random variable, and 0 ≤ E(Q) ≤ log₂(l) can be verified from the definition. The empirical conditional entropy of feature r on the fuzzy granular array set Q is

E(Q|r) = −Σ_{j=1}^{t} (|Q_j|/|Q|) Σ_{i=1}^{l} (|Q_{ji}|/|Q_j|) log₂(|Q_{ji}|/|Q_j|),

where Q_j is the subset of fuzzy granular arrays taking the value r_j on feature r, |Q_j| is the number of elements of the subset Q_j, Q_{ji} is the subset of fuzzy granular arrays taking the value r_j on feature r with label y_i, and |Q_{ji}| is the number of elements of the subset Q_{ji}. The information gain is calculated as

g(Q, r) = E(Q) − E(Q|r).

We can now write the information gain ratio δ(Q, r) as

δ(Q, r) = g(Q, r) / E_r(Q).

Here, E_r(Q) = −Σ_{j=1}^{t} (|Q_j|/|Q|) log₂(|Q_j|/|Q|), where t denotes the number of values taken on feature r.
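These definitions compute directly; a sketch on plain label and value lists (the function names are ours):

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy E(Q) = -sum_i (|Q_i|/|Q|) log2(|Q_i|/|Q|)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain ratio delta(Q, r) for one feature r, where
    values[i] is the value of r on the i-th fuzzy granular array."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())  # E(Q|r)
    split = entropy(values)                                       # E_r(Q)
    gain = entropy(labels) - cond                                 # g(Q, r)
    return gain / split if split > 0 else 0.0
```

A feature that splits the labels perfectly attains ratio 1, while a feature independent of the labels attains 0.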

RFGDT Generation.
We adopt the information gain ratio criterion to select features and recursively build an RFGDT. The specific method is as follows.
The information gain ratio for each feature of each subdataset is calculated in the Map stage. Then, in the Reduce stage, the information gain ratios on the corresponding features of each subdataset are summed. The feature with the largest sum of information gain ratios is chosen as the feature of the node, and child nodes are constructed from the different feature values. We call the above approach recursively on the child nodes to build an RFGDT until the information gain ratios of all features are very small or no features are left to choose from. The algorithm is as follows: Step I: Q is randomly divided into s sub fuzzy granular array sets Q^(j), j = 1, 2, ..., s.
Step II: in the Map phase, the Map function uses each feature as the key and the information gain ratio as the value, namely, [key, value] = [r, δ(Q^(j), r)].
Step III: in the Reduce phase, the Hadoop distributed system first aggregates the outputs of all Map functions by key and then uses these aggregated intermediate results as the input to the Reduce phase. The aggregated intermediate results are [key, value] = [r_k, List(δ(Q^(j), r_k))], k = 1, 2, ..., s.

Step IV: if all fuzzy granular arrays in Q belong to the same class y i , the algorithm sets T as a single-node tree, adopts y i as the label of the node, and returns T.
Step V: if R = ∅, set T as a single-node tree, use the label y_i with the largest number of fuzzy granular arrays in Q as the class of the node, and return T.
Step VI: otherwise, calculate the sum of the information gain ratios of each feature in R over Q according to equation (28), and select the feature r* with the largest sum of information gain ratios as the split feature.
Step VII: if the information gain ratio of r* is less than the threshold ε, set T as a single-node tree, use the class y_i with the largest number of fuzzy granular arrays in Q as the label of the node, and return T.
Step VIII: otherwise, for each possible value v_j of r*, according to r* = v_j, divide Q into a number of nonempty subsets Q_j; the label with the largest number of fuzzy granular arrays in Q_j is used as a mark, and the child nodes are constructed. The tree T is formed by the node and its child nodes; return T.
Step IX: for each child node i, use Q_i as the training set and R − {r*} as the feature set, recursively call Steps I to VIII to get the subtree T_i, and return T_i. The procedure is described in Algorithm 2.
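Steps I-III can be simulated in plain Python without a Hadoop cluster; rows are (feature-dict, label) pairs, and the names `map_phase` and `choose_split` are illustrative stand-ins for the Map and Reduce functions:

```python
import math
from collections import Counter


def _entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())


def _gain_ratio(rows, r):
    """Gain ratio of feature r on one subset of (features, label) rows."""
    ys = [y for _, y in rows]
    groups = {}
    for feats, y in rows:
        groups.setdefault(feats[r], []).append(y)
    cond = sum(len(g) / len(ys) * _entropy(g) for g in groups.values())
    split = _entropy([feats[r] for feats, _ in rows])
    return (_entropy(ys) - cond) / split if split > 0 else 0.0


def map_phase(subset, features):
    """Map over one subset Q^(j): emit [key, value] = [r, delta(Q^(j), r)]."""
    return [(r, _gain_ratio(subset, r)) for r in features]


def choose_split(subsets, features):
    """Reduce: sum the per-subset gain ratios by key (feature) and
    return the feature r* with the largest sum."""
    totals = dict.fromkeys(features, 0.0)
    for subset in subsets:
        for r, delta in map_phase(subset, features):
            totals[r] += delta
    return max(totals, key=totals.get)
```

In a real Hadoop job the shuffle stage performs the by-key grouping that `choose_split` does in its inner loop.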

Pruning.
The algorithm recursively generates the RFGDT until it can go no further. A tree generated in this way is often very accurate on the training data but less accurate on test data, i.e., overfitting occurs. The main reason is that too much consideration is given to classifying the training data correctly, thereby building an overly complex tree. The solution is to reduce the complexity of the constructed tree and simplify it; this simplification process is called pruning. Specifically, pruning cuts some subtrees or leaf nodes from the constructed tree and uses their root node or parent node as a new leaf node, thereby simplifying the classification tree. The pruning of the RFGDT can be achieved by minimizing the overall cost function of the tree.
Suppose that the tree T has |T| leaf nodes, t is a leaf node of T with N_t fuzzy granular arrays, of which N_tk belong to label y_k (k = 1, 2, ..., l), E_t(T) is the empirical entropy on leaf node t, and α is a parameter (α ≥ 0). Then the cost function of RFGDT learning can be written as

C_α(T) = Σ_{t=1}^{|T|} N_t E_t(T) + α|T| = C(T) + α|T|. (31)

In equation (31), C(T) denotes the prediction error of the algorithm on the training data, i.e., the degree of fit between the algorithm and the training data; |T| expresses model complexity; and the parameter α ≥ 0 controls the trade-off between them. A larger α selects a simpler model (tree) and a smaller α a more complex one; α = 0 means that only the fit between the model and the training data is considered, and the complexity of the model is ignored. Pruning selects, for a given α, the model with the smallest cost function, i.e., the subtree with the smallest cost function. For a fixed α, a larger subtree fits the training data better but has higher complexity; conversely, a smaller subtree has lower model complexity but often fits the training data less well. The cost function expresses the balance between the two. RFGDT generation only considers fitting the training data better by increasing the information gain, whereas RFGDT pruning also considers reducing model complexity by optimizing the cost function. RFGDT generation is a local learning model, and RFGDT pruning is a global one. The following is the RFGDT pruning algorithm.
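As a minimal numeric sketch of this cost function, the cost of a tree can be computed directly from per-leaf label counts; the counts below are hypothetical and chosen only to make the α trade-off visible.

```python
import math

def leaf_entropy(counts):
    """Empirical entropy E_t(T) = -sum_k (N_tk/N_t) log(N_tk/N_t)."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def cost(leaves, alpha):
    """C_alpha(T) = sum_t N_t * E_t(T) + alpha * |T|,
    where `leaves` is a list of per-leaf label-count lists."""
    return sum(sum(c) * leaf_entropy(c) for c in leaves) + alpha * len(leaves)

# Two pure leaves incur only the complexity penalty; merging them into
# one mixed leaf trades training entropy against a smaller |T|.
pure = [[10, 0], [0, 10]]   # |T| = 2, zero training error
merged = [[10, 10]]         # |T| = 1, entropy log 2 per instance
```

With these counts, the single merged leaf becomes the cheaper tree once α exceeds 20·ln 2 ≈ 13.9, matching the text: a larger α favors the simpler (pruned) tree.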
Step I: empirical entropy of each node is calculated.
Step II: recursively retract upward from the leaf nodes of the tree. Suppose that T_b and T_a are the whole trees before and after a group of leaf nodes is retracted into its parent node, with corresponding cost function values C_α(T_b) and C_α(T_a), respectively. If C_α(T_a) ≤ C_α(T_b), then prune, that is, the parent node becomes a new leaf node.
Step III: go back to Step II until pruning can no longer continue, and obtain the subtree T_α with the smallest cost function.
The algorithm is shown in Algorithm 3.
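Steps I–III can be sketched as a bottom-up pass over a toy tree. The nested-tuple encoding and the example counts are assumptions for illustration, not the paper's data structure: a subtree is collapsed into a single leaf whenever the pruned tree's cost C_α(T_a) does not exceed C_α(T_b).

```python
import math

def H(counts):
    """Empirical entropy of a leaf's label counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def leaves(t):
    """Collect the label-count lists of all leaves of tree t."""
    return [t[1]] if t[0] == 'leaf' else [c for ch in t[1] for c in leaves(ch)]

def cost(t, alpha):
    """C_alpha(T) = sum_t N_t * E_t(T) + alpha * |T|."""
    ls = leaves(t)
    return sum(sum(c) * H(c) for c in ls) + alpha * len(ls)

def prune(t, alpha):
    """Recursively retract leaves into their parent when the pruned
    tree's cost does not exceed the unpruned tree's cost."""
    if t[0] == 'leaf':
        return t
    node = ('node', [prune(ch, alpha) for ch in t[1]])
    merged = ('leaf', [sum(col) for col in zip(*leaves(node))])
    return merged if cost(merged, alpha) <= cost(node, alpha) else node

# Toy tree: two nearly pure leaves under one parent.
tree = ('node', [('leaf', [8, 1]), ('leaf', [1, 8])])
```

With α = 0 the split survives (it fits the training counts better); with a large α the two leaves retract into their parent, exactly the Step II condition.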

Label Prediction.
After the RFGDT is constructed, a given test instance is first transformed into a fuzzy granular array, and the trained RFGDT is then used to predict its label. The method is described in Algorithm 4.
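Prediction is a plain root-to-leaf walk on feature values. The dict-based node layout below is an assumption for illustration (Algorithm 4 additionally granulates the test instance against the cluster centers first, which is omitted here):

```python
def predict(node, granule):
    """Follow feature values from the root to a leaf and return its label.
    `granule` maps feature name -> value in the fuzzy granular array."""
    while 'label' not in node:
        node = node['children'][granule[node['feature']]]
    return node['label']

# Hypothetical two-level tree over features 'r1' and 'r2'.
tree = {'feature': 'r1',
        'children': {'low':  {'label': 'neg'},
                     'high': {'feature': 'r2',
                              'children': {0: {'label': 'neg'},
                                           1: {'label': 'pos'}}}}}
```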

Experimental Analysis
This paper employs 4 datasets from the UC Irvine Machine Learning Repository as data sources for the experimental tests and constructs 8 further datasets with 1% noise and 3% noise, respectively, as shown in Table 2. Ten-fold cross-validation was adopted in the experiments: 90% and 70% of the data, respectively, were chosen randomly as the training set, the remaining data were taken as the test set for one run of verification, and the process was repeated ten times. Running time and average accuracy were used as performance measures. As illustrated in Figure 2, the proposed parallel clustering fuzzy granulation was compared with serial clustering fuzzy granulation and serial granulation on efficiency. C4.5, Support Vector Machines (SVMs), Convolutional Neural Networks (CNN), and RFGDT were compared on average accuracy. In the RFGDT classifier, the quantity of cluster centers is a key parameter that affects classification performance, such as accuracy. The relation between the quantity of cluster centers and the average accuracy was therefore analyzed.
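The evaluation protocol above (repeated random 90% or 70% holdouts, averaging accuracy over ten runs) can be sketched as follows; `train_fn` and `predict_fn` are placeholders for any of the compared classifiers.

```python
import random

def repeated_holdout(data, train_fn, predict_fn,
                     train_frac=0.9, repeats=10, seed=0):
    """Average test accuracy over `repeats` random train/test splits.
    `data` is a list of (features, label) pairs."""
    rng = random.Random(seed)
    accs = []
    for _ in range(repeats):
        rows = data[:]
        rng.shuffle(rows)                      # random split each repetition
        cut = int(len(rows) * train_frac)
        train, test = rows[:cut], rows[cut:]
        model = train_fn(train)
        hits = sum(predict_fn(model, x) == y for x, y in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)
```

Setting `train_frac=0.7` reproduces the paper's second setting; each repetition sees a different random split, which is why the text argues the averaged accuracy is an objective estimate.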
Fuzzy granulation as a serial computing task has low efficiency, which cannot meet practical needs. If a computing task is parallelizable, the serial computing task can be converted into a parallel one; this paper adopts parallel granulation via clustering. MapReduce is commonly employed to handle parallel tasks over large-scale data: fuzzy granulation of a large-scale dataset can be divided into several subtasks, and the subdatasets can be assigned to computing nodes through a hierarchical computing model. Due to its simple and easy-to-use programming interface, MapReduce has been widely used as a parallel programming model and computing framework. The main idea is as follows: a job is divided into multiple independently runnable Map tasks; these Map tasks are distributed to several processors for execution and generate intermediate results; the Reduce tasks then combine these intermediate results to produce the final output. The MapReduce calculation process thus has two parts, Map and Reduce. Map receives the input [key, value] (see Tables 3 and 4). The output of Reduce has the form key = x_i, value = [q(x_i, r_1), q(x_i, r_2), ..., q(x_i, r_m)] (see Table 5). Without clustering, the complexity of granulation is O(n²·m); with clustering, the granulation time can reach O(n·k·m) (k < n); if fuzzy granulation proceeds by the parallel method, the complexity can achieve O(n·k·m/B), where B denotes the quantity of subsets, k the quantity of cluster centers, n the quantity of instances, and m the quantity of features, and each subset corresponds to one Map.

Input: instance set X and threshold ε. Output: root T of the fuzzy granular decision tree.
(1) Normalize the instances into [0, 1].
(2) For i = 1 to |X_t|, x_i ∈ X_t: for j = 1 to M, for each r ∈ R, fuzzy granulate instance x_i as q(x_i, r) = Σ_{j=1}^{k} d(x_i, c_j, r)/c_j.
(3) Build a fuzzy granular array Q(x, R) = Σ_{j=1}^{|R|} q(x, r_j)/r_j.
(4) Get the label y_{x_i} of x_i; a rule can be built.
(5) Q is randomly divided into s subfuzzy granular array sets Q^(j), j = 1, 2, ..., s.
(6) Map stage: in the Map function, each feature r is used as the key and the information gain ratio as the value, namely, [key, value] = [r, δ(Q^(j), r)].
(7) If every Q_i ∈ Q has the same class y_i, set T as a single-node tree, take y_i as the label of the node, and return T.
(8) Reduce stage: [key, value] = [r_k, List(δ(Q^(j), r_k), j = 1, 2, ..., s)].
(9) If R = ∅, set T as a single-node tree, use the label y_i with the largest number of fuzzy granular arrays in Q as the label of the node, and return T. Otherwise, for each r ∈ R, calculate the information gain ratio δ(Q^(j), r) of r on Q^(j), and select the feature with the largest sum of information gain ratios.
(10) If the information gain ratio of the selected feature r satisfies δ(Q, r) < ε, set T as a single-node tree, use the label y_i with the largest number of fuzzy granular arrays in Q as the class of the node, and return T; otherwise, for each possible value v_j of r, divide Q according to r = v_j into nonempty subsets Q_i, take the label with the largest number of fuzzy granular arrays in Q_i as the mark, construct the subnodes, form the tree T from the node and its subnodes, and return T.
ALGORITHM 2: Construction of RFGDT.

Input: fuzzy granular decision tree T and parameter α. Output: pruned subtree T_α.
(1) Calculate the empirical entropy of each node: E_t(T) = −Σ_k (N_tk/N_t) log(N_tk/N_t).
(2) Recursively retract upward from the leaf nodes of the tree. Suppose that T_b and T_a are the whole trees before and after a group of leaf nodes is retracted into its parent node, with corresponding cost function values C_α(T_b) and C_α(T_a).
(3) If C_α(T_a) ≤ C_α(T_b), pruning is executed, that is, the parent node is changed into a new leaf node.
(4) Return to step (2) until pruning can no longer continue, and obtain the subtree T_α with the smallest cost function.
ALGORITHM 3: Pruning of RFGDT.

Input: test instance x, root node of the RFGDT, and cluster center set C. Output: label y_x of instance x.
(1) Obtain the fuzzy clustering granular vector Q(x, R).
(2) Follow the path from the root node to a leaf node according to the feature values, and obtain the label from the leaf node.
ALGORITHM 4: RFGDT prediction.

Figure 2 compares the efficiency of traditional serial granulation, serial granulation with clustering, and the proposed parallel granulation with clustering, where the abscissa expresses the quantity of instances and the ordinate the average time taken for granulation. Below, n represents the quantity of instances and K the quantity of cluster centers. When n = 5000 and K = 3692, serial granulation cost 50 min, serial clustering granulation ran for 37 min, and parallel clustering granulation took only 9 min; relative to serial granulation, serial clustering granulation and parallel clustering granulation reduced the time by 26.00% and 82.00%, respectively. When n = 20,000 and K = 11,524, serial granulation took 800 min, while serial clustering granulation ran for 461 min (a 42.38% improvement); parallel clustering granulation executed in only 115 min, an improvement of 85.63% and 75.05% over the other two methods, respectively. When n = 30,000 and K = 16,398, the running time of parallel clustering granulation improved by 45.33% and 86.33%, respectively, compared with the other two methods. We can conclude that as the quantity of instances increases, the serial clustering granulation and parallel clustering granulation methods improve efficiency to a great extent.
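The B-way split behind the O(n·k·m/B) figure can be sketched with a thread pool as a simplified stand-in for the MapReduce deployment. Both the pool and the `membership` function are illustrative assumptions: `membership` uses a simple inverse-distance score in place of the paper's d(x, c, r) granulation against the k cluster centers.

```python
from concurrent.futures import ThreadPoolExecutor

def membership(x, centers):
    """Granulate one instance: one score per cluster center and feature
    (an inverse-distance placeholder for d(x, c, r))."""
    return [[1.0 / (1.0 + abs(xr - cr)) for xr, cr in zip(x, c)]
            for c in centers]

def granulate_parallel(X, centers, B=4):
    """Split X into B subsets, granulate each subset in its own worker
    (each subset plays the role of one Map), and reassemble the results
    in the original instance order."""
    chunks = [X[i::B] for i in range(B)]
    with ThreadPoolExecutor(max_workers=B) as pool:
        parts = list(pool.map(
            lambda ch: [membership(x, centers) for x in ch], chunks))
    out = [None] * len(X)
    for i, part in enumerate(parts):
        for j, g in enumerate(part):
            out[i + j * B] = g   # element j of chunk i came from X[i + j*B]
    return out
```

Each worker touches roughly n/B instances, so with B workers the per-worker granulation cost drops from O(n·k·m) toward O(n·k·m/B), mirroring the analysis above.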
Taking 90% of the dataset Wine Quality as the training set, we obtain the following results. As shown in Figure 3(a), when the quantity of cluster centers was less than 3006, the average accuracy of RFGDT was lower than that of the other three methods. As the quantity of cluster centers rises, the average accuracy of RFGDT increases rapidly; in particular, when the quantity of cluster centers was 3006, it reached a peak value of 0.969, while the average accuracies of SVMs, C4.5, and CNN were 0.961, 0.952, and 0.948 (i.e., improvements of 0.83%, 1.79%, and 2.22%, respectively). When the quantity of cluster centers was greater than 3006, the average accuracy of RFGDT decreased slightly but was still higher than that of the other three methods. After adding 1% noise to the data, as illustrated in Figure 3(b), when the quantity of cluster centers was 3006, the average accuracy of RFGDT reached a peak value of 0.964, while the average accuracies of SVMs, C4.5, and CNN were 0.923, 0.914, and 0.892, respectively; RFGDT improved on them by 4.44%, 5.47%, and 8.07%, respectively. Comparing the clean and 1%-noise settings, SVMs, C4.5, and CNN dropped by 3.95%, 3.99%, and 5.91%, respectively, while RFGDT dropped by only 0.52% at the peak value. The statistics show that RFGDT is more robust and stable against noise. When 3% noise is added to the data, as exhibited in Figure 3(c), SVMs, C4.5, CNN, and RFGDT decreased by about 0.54%, 1.97%, 0.56%, and 0.42%, respectively, compared with the 1%-noise data. We then took 70% of the data as the training set to verify performance. Overall, the average accuracies of all four methods declined. From Figures 3(d)-3(f), RFGDT performs better than the other three algorithms when the number of cluster centers is more than 2000.
As illustrated in Figure 4(a), the dataset Bank Marketing contained nearly 50,000 instances, about 10 times the scale of the dataset Wine Quality. The shape of the average accuracy curve of RFGDT was similar to Figure 3: high in the middle and low on both sides. When the quantity of cluster centers was K = 30,100, the average accuracy of RFGDT reached its largest value, 0.963, while SVMs, C4.5, and CNN reached 0.951, 0.947, and 0.955, respectively (i.e., improvements of 1.26%, 1.69%, and 0.84%, respectively). In the Bank Marketing dataset with noise, as demonstrated in Figure 4(b), RFGDT reached a peak value of 0.956 at K = 30,100; compared with SVMs, C4.5, and CNN, RFGDT increased by 3.13%, 4.82%, and 2.69%, respectively. RFGDT dropped by only 0.73%, while SVMs, C4.5, and CNN dropped by 2.59%, 3.84%, and 2.69%, respectively. As can be seen, RFGDT is not sensitive to noise, while C4.5 is more sensitive to it. Figure 4(c) shows that all four algorithms degraded when the percentage of noisy data was 3%; however, RFGDT performed better than SVMs, C4.5, and CNN by about 3.61%, 5.22%, and 2.16% in average accuracy, respectively. When 70% of the data were taken as the training set, all four algorithms decreased compared with the 90% training set; however, as shown in Figures 4(d)-4(f), RFGDT outperforms the other three algorithms under most parameter settings. The quantity of instances in the dataset Localization Data for Person Activity was more than 160,000. As illustrated in Figure 5(a), without noise, when K = 85,060, the average accuracy curve of RFGDT reached a peak value of 0.953, while that of SVMs was 0.932; overall, RFGDT performs better than SVMs, SVMs is better than C4.5, and RFGDT is slightly better than CNN. In the dataset with noise, as demonstrated in Figure 5(b), compared with SVMs, C4.5, and CNN, the peak value of RFGDT increased by 4.30%, 4.86%, and 2.38%, respectively.
Compared with SVMs and C4.5, RFGDT and CNN are less sensitive to noise, as also shown in Figure 5(c). The dimension of the dataset IDA2016Challenge is much higher than that of the other three datasets. We took 90% and 70% of the data as training sets for testing, respectively, and on this basis added 1% and 3% noise to the dataset, respectively. The detailed results are as follows. As shown in Figure 6(a), when K = 39,800, RFGDT achieved a peak value of 0.956, while SVMs, C4.5, and CNN reached only 0.932, 0.921, and 0.947, respectively (i.e., improvements of about 2.58%, 3.80%, and 0.95%, respectively). After adding 1% noise, as illustrated in Figure 6(b), the highest value of RFGDT was 0.947, and RFGDT increased by about 4.30%, 5.11%, and 2.38% compared with SVMs, C4.5, and CNN, respectively. After adding 3% noise, as demonstrated in Figure 6(c), all four algorithms degraded, but RFGDT remained better than the other three. When 70% of the data is taken as the training set, RFGDT still outperforms SVMs, C4.5, and CNN, as exhibited in Figures 6(d)-6(f).
Besides the datasets mentioned above, we also applied the algorithm to predicting Alzheimer's disease from voice. This dataset came from the University of Pittsburgh and was stored in the form of speech and text from participants, comprising elderly controls, people with possible Alzheimer's disease, and people with other dementia diagnoses. The corpus included 1263 instances. Mel-frequency cepstral coefficients (MFCCs) of the corpus were extracted as features for prediction: we calculated the first 20 dimensions together with their first-order and second-order differences, which were concatenated to obtain 57-dimensional features. The precision of RFGDT reached a maximum of 0.932. Predicting Alzheimer's disease by voice is thus a simple and low-cost method compared with magnetic resonance imaging, which is very meaningful and valuable for the diagnosis of Alzheimer's disease.
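The feature construction above (cepstral coefficients concatenated with their first- and second-order differences per frame) can be sketched in plain Python. The coefficient values below are hypothetical, and the exact output dimensionality depends on how many coefficients are kept and how the differences are computed; this is only a sketch of the concatenation step.

```python
def delta(frames):
    """First-order time difference for a list of coefficient vectors,
    using the previous/next frame (boundary frames are repeated)."""
    n = len(frames)
    padded = [frames[0]] + frames + [frames[-1]]
    return [[(b - a) / 2.0 for a, b in zip(padded[t], padded[t + 2])]
            for t in range(n)]

def stack_features(mfcc):
    """Concatenate coefficients, delta, and delta-delta per frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)  # second-order difference = delta of delta
    return [m + a + b for m, a, b in zip(mfcc, d1, d2)]
```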
From the above analysis, we can see that the average accuracy of RFGDT is better than that of SVMs, C4.5, and CNN on the six datasets. On smaller datasets, CNN performs worse than SVMs and C4.5. Especially on datasets containing noise, the average accuracy of RFGDT is stable and less sensitive to noise. Judging from the shape of RFGDT's average accuracy curve, it is high in the middle and low on both sides: when the value of K is small, the performance of RFGDT is weaker than the other three algorithms, and as K increases, RFGDT becomes better than them. We use ten-fold cross-validation, with the test and training sets obtained randomly; each run evaluates the algorithm on a different training and test set, so despite the randomness the performance of the algorithm is evaluated objectively. Class imbalance among instances will also affect the performance of the algorithm. For noisy datasets, we also found that RFGDT is more robust. The main reason lies in the fuzzy granulation process: RFGDT embodies a global comparison idea, which can overcome noise interference to some extent. This is also an advantage of RFGDT. At the same time, we found that the choice of the value of K is key to classification: a K that is too small reduces classification accuracy, whereas a K that is too large introduces noise and also reduces the classification effect, so a reasonable value of K is essential to the performance of the algorithm. Compared with classical methods, the granulation process costs some time, but it can be executed offline; moreover, parallel granulation improves efficiency greatly. In the meantime, in the granular space, the accuracy of classification can be enhanced.

Discussion
In this study, we propose an RFGDT that is suitable for binary or multiclass classification problems. In the algorithm, the idea of parallel distributed granulation is introduced, which improves the efficiency of data granulation; in the parallel granulation process, we design AGRC for granulation. We transform a data classification problem into the fuzzy granular space to find the solution. In the fuzzy granular space, we define fuzzy granules and fuzzy granular arrays on the basis of the operators designed. The aim is to use the information gain ratio to select features as split points to recursively construct the fuzzy granular decision tree. To avoid overfitting, we also design the pruning algorithm of RFGDT, which can improve performance further. In the future, we will apply it to cloud computing and big data.

Data Availability
The dataset used to support the findings of this study is from the UC Irvine Machine Learning Repository.

Conflicts of Interest
The authors declare that they have no conflicts of interest.