A Novel Multiway Splits Decision Tree for Multiple Types of Data

Classical decision trees such as C4.5 and CART partition the feature space using axis-parallel splits. Oblique decision trees use oblique splits based on linear combinations of features to potentially simplify the boundary structure. Although oblique decision trees often have higher generalization accuracy, most oblique split methods are not directly applicable to categorical data and are computationally expensive. In this paper, we propose a multiway splits decision tree (MSDT) algorithm, which adopts feature weighting and clustering. This method can combine multiple numerical features, multiple categorical features, or multiple mixed features. Experimental results show that MSDT has excellent performance for multiple types of data.


Introduction
Despite the great success of deep neural network (DNN) models in image processing, speech recognition, and other fields in recent years, decision trees remain competitive with DNN schemes: they offer interpretability, fewer parameters, and good robustness to noise, and they can be applied to large-scale data sets at low computational cost. Therefore, the decision tree is still one of the hotspots in the field of machine learning today [1][2][3]. Research has mainly focused on the construction of decision trees, split criteria [4], decision tree ensembles [5,6], hybrids with other learners [7][8][9], decision trees for semisupervised learning [10], and so on.
Despite practical success, the optimal construction of decision trees has been theoretically proven to be NP-complete [11]. To avoid local optima, some researchers adopted evolutionary algorithms to build decision trees [12][13][14]. However, due to their time complexity, the most popular algorithms, such as ID3 [15], C4.5 [16], and CART [17], and their various modifications [18] are greedy by nature and construct the decision tree in a top-down, recursive manner. Besides, they act on only one dimension at a time and thus produce axis-parallel splits. In the induction of a decision tree, if a candidate feature is numerical, a suitable cut point needs to be searched. Instances in the training set are divided into the left node or the right node according to the following test:

x_l ≤ θ_l, (1)

where x_l denotes the value of the instance on the feature A_l and θ_l is the cut point. Axis-parallel trees have the advantages of fast induction and strong comprehensibility. However, in the case of highly correlated features, a very bad situation may arise. Figure 1 gives an illustration. The parallel splits will be carried out many times with a staircase-like structure, which leads to a complex decision tree structure.
To solve this problem of axis-parallel decision trees, some researchers introduced oblique decision trees. In such trees, a nonleaf node tests a linear combination of features, i.e.,

Σ_{l=1}^{p} a_l x_l ≤ θ, (2)

where a_l represents the coefficient for the lth feature, θ is the threshold, and p is the number of features. In Figure 1, the instances of the two classes can be completely separated by one oblique split. Therefore, it is generally believed that oblique splits can often produce smaller decision trees and better generalization performance on the same data.
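The difference between the two split types can be sketched as two routing predicates; this is a minimal illustration, and the feature values, coefficients, and thresholds below are made up for the example:

```python
import numpy as np

def axis_parallel_split(x, l, theta):
    """Univariate test of formula (1): route on a single feature, x[l] <= theta."""
    return "left" if x[l] <= theta else "right"

def oblique_split(x, a, theta):
    """Oblique test of formula (2): route on a linear combination of all features."""
    return "left" if float(np.dot(a, x)) <= theta else "right"

x = np.array([1.0, 2.0])
print(axis_parallel_split(x, 0, 1.5))               # only x[0] is inspected
print(oblique_split(x, np.array([0.5, 0.5]), 2.0))  # 0.5*1.0 + 0.5*2.0 = 1.5 <= 2.0
```

A staircase boundary like the one in Figure 1 needs many axis-parallel tests of the first kind, while a single oblique test of the second kind can realize it directly.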
It is much more difficult to search for the optimal oblique hyperplanes than for the optimal axis-parallel hyperplanes. To solve this problem, numerous techniques have been applied, for example, hill-climbing [17], simulated annealing [19], and genetic algorithms [20]. Among them, much work has focused on reducing the risk of falling into local optima: the Simulated Annealing Decision Tree (SADT) [19] used the simulated annealing algorithm, and the OC1 method [21] combined the ideas of CART-LC [17] and SADT.
In the search for oblique hyperplanes, both simulated annealing and genetic algorithms try thousands of candidates, resulting in low time efficiency. Consequently, many researchers used linear discriminant analysis, linear regression, perceptrons, SVM, and other methods to find suitable oblique hyperplanes. Fisher's decision tree (FDT) [22] exploits the dimensionality reduction of Fisher's linear discriminant together with the decomposition strategy of decision trees to obtain an oblique decision tree; FDT is only applicable to binary classification problems. Based on ADTree [23], Hong et al. [24] proposed the multivariate ADTree and presented and discussed its different variations (Fisher's ADTree, Sparse ADTree, and Regularized Logistic ADTree). Wickramarachchi et al. [25] explored a decision tree algorithm (HHCART) that uses a series of Householder matrices to reflect the training data during tree construction. Shah and Sastry [26] defined separability of instances as the split criterion that optimized their evaluation function at each node and presented the Alopex Perceptron Decision Tree algorithm for learning a decision tree. Menze et al. [27] presented an oblique tree forest method, which used LDA and ridge regression to conduct oblique splits.
With the above oblique methods, trees with fewer nodes and better accuracy can be obtained. However, they also have deficiencies, mainly in three aspects.

Inability to Directly Employ the Methods for Categorical Data.
The oblique splits use a linear combination of features; therefore, categorical features must be converted into one or more numerical features [28]. This transformation may introduce new biases into the classification problem, thus reducing the generalization ability of the models.

High Time Cost.
The oblique splits usually require complex matrix computations when using linear discriminant analysis, ridge regression, or similar methods. Although these methods are more efficient than simulated annealing and genetic algorithms, they are still more expensive than axis-parallel methods such as C4.5.

Some Methods Are Not Suitable for Multiclassification Problems.
Generally, the oblique split methods conduct binary splits. Although a binary tree can also be used directly for multiclassification problems, some binary splits rely on the class label (e.g., FDA and the original SVM), which restricts algorithms such as FDT [22] to binary classification. In addition, some models need to convert multiclassification problems into binary ones [7].
To overcome the above shortcomings, this paper proposes a multiway splits decision tree for multiple types of data (numerical, categorical, and mixed). The specific characteristics of this method are as follows: (i) categorical features are handled directly; (ii) the time complexity is similar to that of the axis-parallel split algorithms; (iii) multiclassification problems need not be converted into binary ones, because multiway splits are used directly. The remainder of the paper is organized as follows. In Section 2, we briefly review the RELIEF-F and k-means algorithms. Section 3 presents our algorithm and discusses its time complexity. Section 4 presents and analyzes the experimental results compared with other decision trees. The last section concludes the paper.

Preliminaries
The proposed decision tree method needs to weight the features by the RELIEF-F algorithm and split the nodes by the weighted k-means algorithm. Therefore, this section reviews the two algorithms and their variations.

RELIEF-F Algorithms.
The RELIEF algorithm [29] is popular for feature selection. It estimates the weights of features according to the correlation between each feature and the class label. RELIEF randomly samples an instance R from the training set and then searches for its two nearest neighbors H and M: H is from the same class (called the near Hit) and M is from a different class (called the near Miss). If the distance between R and H on feature A is less than the distance between R and M, RELIEF increases A's weight; otherwise, it decreases the weight.
In fact, RELIEF's estimate W(A) of feature A is an approximation of the following difference of probabilities:

W(A) = P(different value of A | nearest instance from a different class) − P(different value of A | nearest instance from the same class), (3)

where P(·|·) represents the conditional probability. The RELIEF algorithm only deals with binary classification problems. Kononenko proposed an extension called RELIEF-F for multiclassification problems [30]. The algorithm picks m instances. For each instance R, its k_nn nearest neighbors are searched in each class, and the weight W(A) is updated as follows:

W(A) = W(A) − Σ_{j=1}^{k_nn} diff(A, R, H_j) / (m · k_nn) + Σ_{T ≠ class(R)} [ p(T) / (1 − p(class(R))) ] Σ_{j=1}^{k_nn} diff(A, R, M_j(T)) / (m · k_nn), (4)

where p(T) represents the proportion of class T instances among all instances, H_j represents the jth nearest neighbor to R in the same class, and M_j(T) represents the jth nearest neighbor to R in class T. diff(A, R_1, R_2) calculates the difference between two instances R_1 and R_2 on feature A as follows:

diff(A, R_1, R_2) = |value(A, R_1) − value(A, R_2)| / (max(A) − min(A)) for a numerical A; for a categorical A, it equals 0 if value(A, R_1) = value(A, R_2) and 1 otherwise. (5)
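The weight update above can be sketched in a few lines. This is a hedged sketch of RELIEF-F for numerical features under the paper's setting (k_nn = 1 works with the code below; classes are assumed to have at least k_nn + 1 members), not a reference implementation:

```python
import numpy as np

def relief_f(X, y, m=None, k_nn=1, seed=0):
    """Sketch of RELIEF-F weighting: for each sampled instance R, subtract the
    normalised per-feature differences to its nearest hits and add the
    class-prior-weighted differences to its nearest misses per class."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    m = m or n
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0)      # normalise diffs to [0, 1]
    span[span == 0] = 1.0
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(p)
    for idx in rng.choice(n, size=m, replace=False):
        R, cR = X[idx], y[idx]
        d = np.abs((X - R) / span).sum(axis=1)  # Manhattan distance to all instances
        d[idx] = np.inf                         # exclude R itself
        for c in classes:
            # k_nn nearest neighbours of R inside class c
            nearest = np.argsort(np.where(y == c, d, np.inf))[:k_nn]
            diff = np.abs(X[nearest] - R) / span
            if c == cR:   # nearest hits: decrease weights
                W -= diff.sum(axis=0) / (m * k_nn)
            else:         # nearest misses: increase, weighted by the class prior
                W += (prior[c] / (1 - prior[cR])) * diff.sum(axis=0) / (m * k_nn)
    return W
```

On data where one feature separates the classes and another is constant, the discriminative feature receives a positive weight and the constant one a weight of zero, which is the behaviour the node-split step relies on.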

k-Means, k-Modes, and k-Prototypes.
The k-means algorithm is widely used in real-world applications due to its simplicity and efficiency.
Let D be a set of n instances, characterized by p features, to be clustered into k clusters C_1, C_2, ..., C_k. First, randomly pick some instances as the centers μ_1, μ_2, ..., μ_k of the initial k clusters, and then calculate the cluster label for each instance x_i as follows:

c_i = argmin_{j ∈ {1,...,k}} ||x_i − μ_j||^2. (6)

After all the instances are partitioned, each cluster center is updated by the following formula:

μ_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i. (7)

Repeat formulas (6) and (7) until the objective E in formula (8) converges to a local optimum or the preset number of iterations is reached:

E = Σ_{j=1}^{k} Σ_{x_i ∈ C_j} ||x_i − μ_j||^2. (8)

However, the classical k-means only works on numerical data. The k-modes and k-prototypes algorithms are variants of k-means for categorical and mixed data, respectively [31]. When k-modes processes categorical variables, the center of each cluster is represented by modes; when calculating the distance between an instance and a cluster center, the distance on each feature is calculated by formula (5) and then accumulated.
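The alternation of the assignment and update steps can be sketched as a minimal Lloyd-style loop (the iteration cap mirrors the small l_max the paper uses later; the data below is illustrative):

```python
import numpy as np

def kmeans(X, init_centers, max_iter=6):
    """Minimal k-means: assign each instance to its nearest center
    (formula (6)), recompute each center as the mean of its cluster
    (formula (7)), and stop when assignments no longer change or after
    max_iter iterations."""
    centers = init_centers.astype(float).copy()
    labels = None
    for _ in range(max_iter):
        # squared Euclidean distance of every instance to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # objective (8) can no longer decrease
        labels = new_labels
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, np.array([[0.0, 0.0], [10.0, 10.0]]))
print(labels)   # [0 0 1 1]
print(centers)  # [[ 0.   0.5] [10.  10.5]]
```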
It is straightforward to integrate k-means and k-modes into k-prototypes. The distance dis(x_i, μ_j) between instance x_i and cluster center μ_j is

dis(x_i, μ_j) = dis_n(x_i, μ_j) + c · dis_c(x_i, μ_j), (9)

where dis_n(x_i, μ_j) represents the distance on the numerical variables and dis_c(x_i, μ_j) represents the distance on the categorical variables, respectively. The coefficient c is used to adjust the proportion of dis_n(x_i, μ_j) and dis_c(x_i, μ_j) in the total distance.

Our Proposed Algorithm
Our proposed MSDT differs from most oblique methods in three ways: (i) MSDT does not use greedy methods to pursue maximum impurity reduction, (ii) MSDT uses a combination of multiple variables to perform multiway splits at nonleaf nodes, and (iii) MSDT treats categorical features in a similar way to numerical features.

Multiway Splits.
Most oblique methods conduct binary splits, while the proposed algorithm performs multiway splits; that is, in one split, multiple hyperplanes are generated simultaneously, and the feature space is divided into several disjoint regions. Ho [32] categorized the linear split methods into three types, axis-parallel linear splits, oblique linear splits, and piecewise linear splits; our method falls into the third. Piecewise linear split methods find k anchors in feature space, and each instance is assigned to its nearest anchor. Figure 2 shows a 5-way split of a two-dimensional feature space.
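The nearest-anchor rule is a Voronoi partition of the feature space and can be sketched directly (the anchor coordinates below are illustrative):

```python
import numpy as np

def multiway_split(X, anchors):
    """Piecewise linear (multiway) split: k anchors induce k disjoint
    regions, and each instance is routed to the child of its nearest
    anchor (squared Euclidean distance)."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])  # 3 anchors -> 3-way split
X = np.array([[1.0, 0.0], [9.0, 1.0], [5.0, 7.0]])
print(multiway_split(X, anchors))  # [0 1 2]
```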

Location of Anchor.
Finding suitable split hyperplanes is the key problem in most decision tree induction algorithms. Under piecewise linear splits, finding appropriate hyperplanes is equivalent to finding appropriate anchors. Usually, the anchors are the class centroids or cluster centers generated by some clustering algorithm. In MSDT, we first use RELIEF-F to weight the features and then use k-means with a weighted distance to cluster the instances.

Why Do We Use k-Means?
If the instances are linearly separable, it is obviously more efficient to simply use the class centroids rather than cluster centers as anchors. However, when the instances of some classes are distributed in different regions of the feature space, the class centroids may no longer be suitable anchors. For example, in Figure 3, the circular instances are distributed in two different areas. If the solid line, which is perpendicular to the line between the two class centroids, is used to separate the instances, the effect is obviously not satisfactory. The instances in Figure 3 are clearly distributed into two clusters. If the instances are divided by the dotted line, the perpendicular bisector of the two cluster centers, at least the circular instances on the right side of the figure can be distinguished.
The proposed split method is based on the clustering assumption, which states that samples belonging to the same cluster belong to the same class. k-means methods partition instances according to some (dis)similarity measure; hence, the leaf nodes of MSDT can be regarded as prototypes, and the class of a test instance depends on which prototype the instance is more similar to. Univariate decision trees produce a comprehensible classification model thanks to their knowledge representation: a decision tree is a graphical representation and can be easily converted into a set of rules written in natural language. Some researchers believe that multivariate decision trees cannot be converted into comprehensible rules; others think that a multivariate tree with fewer nodes is easy to understand. MSDT is easy to understand for two reasons. One is that MSDT has fewer nodes than univariate decision trees. The other is that similarity to a prototype is easy for users to understand and can replace the rules generated by a univariate decision tree.

Why Do We Weight Features?
The original k-means is an unsupervised clustering algorithm suitable for unlabeled data, and its optimization goal is to minimize (8).
The goal of a split is to reduce the class impurity of the current node as much as possible. Note that the two goals are not the same. Therefore, we estimate the correlations between features and the label to weight the features. When calculating the distance from an instance to a cluster center, we give a larger weight to a feature strongly related to the label, which enlarges that feature's contribution to the distance; otherwise, we give a smaller weight, which reduces the contribution of the uncorrelated feature. In this way, the optimization goal of the k-means algorithm becomes close to that of the node split. Figure 4 shows an example of the effectiveness of feature weighting. The solid line comes from unweighted features, and the dotted line comes from weighted features with the weight of A_1 set to 0.05 and the weight of A_2 set to 0.95. It is obvious that some instances have been corrected.
To further illustrate the role of feature weighting, we carry out a simple experiment on the iris dataset: its 150 samples come from three classes, with 50 samples per class. Directly using the k-means algorithm to cluster yields 10 misclassified samples; the specific results are shown in Table 1. Then, we use the RELIEF-F algorithm to calculate the weights of the four features, which are 0.09, 0.14, 0.34, and 0.39, respectively. In the process of k-means clustering, the distances between instances and cluster centers are calculated by

dis(x_i, μ_j) = sqrt( Σ_{l=1}^{p} w_l · (x_il − μ_jl)^2 ), (10)

where p indicates the number of features and w_l indicates the weight of the lth feature. We obtain 6 misclassified samples, and the specific results are shown in Table 2.
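One plausible reading of formula (10) is a weighted Euclidean distance, sketched below; the concrete points are made up, and the weights 0.05 and 0.95 are reused from the Figure 4 example purely for illustration:

```python
import numpy as np

def weighted_distance(x, mu, w):
    """Formula (10) as a weighted Euclidean distance: features with larger
    RELIEF-F weights contribute more to the instance-to-center distance."""
    x, mu, w = (np.asarray(v, dtype=float) for v in (x, mu, w))
    return float(np.sqrt((w * (x - mu) ** 2).sum()))

x = [0.0, 0.0]
mu1, mu2 = [3.0, 1.0], [1.0, 3.0]    # x is unweighted-equidistant from both centers
w = [0.05, 0.95]                     # A2 strongly related to the label
# up-weighting A2 pulls x toward the center it matches on A2
print(weighted_distance(x, mu1, w) < weighted_distance(x, mu2, w))  # True
```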
Our proposed split method is shown in Algorithm 1, which will be used to split nodes for numerical data.
In the fifth step of Algorithm 1, l_max represents the maximum number of iterations. In the experiments, we set it to 6 by default. The main reason for such a small value is the time efficiency of the algorithm. In addition, the purpose of clustering is to split nodes; even if the clustering algorithm does not converge, the partition results can still be accepted.

Categorical Feature.
As mentioned in the previous subsection, the split method can be directly applied to numerical features. For categorical features, RELIEF-F algorithm can still be used to weight features. However, in the process of clustering, the representation of cluster center and the distance from instance to cluster center need to be redefined.
The k-modes algorithm extends k-means by replacing the means of numerical variables with the modes of categorical variables. However, modes give a less precise distance calculation; moreover, when a feature has several modes, choosing different modes may lead to opposite conclusions.
Here is an example. Suppose there are two clusters C_1 and C_2 described by two categorical features A_1 and A_2, each containing 10 instances, as shown in Table 3. The mode of both C_1 and C_2 on A_1 is a11, which makes A_1 useless for distinguishing the distances between instances and the clusters.
There are two modes for A_2 in each of C_1 and C_2. Suppose that there is an instance q = (a11, a21). If μ_1 = (a11, a21) is selected as the center of C_1 and μ_2 = (a11, a23) for C_2, the distance between q and μ_1 is 0 and the distance between q and μ_2 is 1; hence, q is nearer to C_1. If μ_1 = (a11, a22) is selected as the center of C_1 and μ_2 = (a11, a21) for C_2, the distance between q and μ_1 is 1 and the distance between q and μ_2 is 0; hence, q is nearer to C_2.
To avoid the imprecision and ambiguity of the mode-based distance measure, we use the probability estimate of each categorical feature value to represent the cluster center and define a function to calculate the distance from an instance to a cluster center.
Let D be a set of categorical data described by p categorical features, containing n instances partitioned into k clusters. There are d_{l,j} different values {ω_1, ω_2, ..., ω_{d_{l,j}}} for the lth feature A_l in the jth cluster C_j, l ∈ {1, 2, ..., p}, j ∈ {1, 2, ..., k}.
Definition 1. C_{j,ω} represents the set of instances in C_j whose value on A_l is ω. The conditional probability is estimated as follows:

P(ω | j) = |C_{j,ω}| / |C_j|. (11)

S_{j,l} is the summary of all values of A_l in C_j, defined as follows:

S_{j,l} = ( P(ω_1 | j), P(ω_2 | j), ..., P(ω_{d_{l,j}} | j) ). (12)
Definition 2. The center of C_j is represented by the following vector:

μ_j = ( S_{j,1}, S_{j,2}, ..., S_{j,p} ). (13)

Mathematical Problems in Engineering
Definition 3. diff(A_l, ω, S_{j,l}) represents the distance between a value ω and S_{j,l} on A_l:

diff(A_l, ω, S_{j,l}) = 1 − P(ω | j). (14)

Definition 4. dis_c(x_i, μ_j) represents the weighted distance between instance x_i and center μ_j:

dis_c(x_i, μ_j) = Σ_{l=1}^{p} w_l · diff(A_l, x_il, S_{j,l}). (15)

According to formula (15), with the weights of both features equal to 1 in the above example, the distances between instance q = (a11, a21) and the two cluster centers (μ_1 and μ_2) of Table 3 are 0.7 = 0.1 + 0.6 and 1.2 = 0.6 + 0.6, respectively. This means that q is closer to C_1, which accords with intuition.
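Definitions 1–4 can be sketched directly. The per-value probabilities below are assumptions chosen only so that the sketch reproduces the 0.7 and 1.2 distances of the running example (Table 3 itself is not reproduced here; note a11 stays the mode of both clusters on A_1):

```python
def diff(value, S):
    """Definition 3: distance between a feature value and a cluster summary S,
    where S maps each value to its estimated P(value | cluster)."""
    return 1.0 - S.get(value, 0.0)

def dis_c(x, mu, w):
    """Definition 4: weighted sum of per-feature diffs between instance x and
    a center mu represented as one probability summary per feature."""
    return sum(wl * diff(v, S) for wl, v, S in zip(w, x, mu))

# Assumed summaries: P(a11|C1)=0.9, P(a21|C1)=0.4, P(a11|C2)=0.4, P(a21|C2)=0.4
mu1 = [{"a11": 0.9, "a12": 0.1}, {"a21": 0.4, "a22": 0.4, "a23": 0.2}]
mu2 = [{"a11": 0.4, "a12": 0.3, "a13": 0.3}, {"a21": 0.4, "a23": 0.4, "a22": 0.2}]
q, w = ["a11", "a21"], [1.0, 1.0]
print(round(dis_c(q, mu1, w), 1))  # 0.7  (= 0.1 + 0.6)
print(round(dis_c(q, mu2, w), 1))  # 1.2  (= 0.6 + 0.6)
```

Because the summary keeps the whole value distribution, the ambiguity of picking one mode among several disappears.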
To cluster categorical data, we use formula (13) to replace formula (7) in steps 4 and 7 of Algorithm 1, and formula (15) to replace formula (10) in step 6.

Mixed Features Data.
For mixed data, the vector of the cluster center consists of two parts: one is the means of the numerical features, and the other is the vector shown in (13). In this case, we use (9) to calculate the distance from an instance to a cluster center, where dis_n and dis_c are obtained by (10) and (15), respectively. As the ratio of numerical to categorical features differs across datasets, we choose the c in (9) that yields the greatest reduction of the Gini index, with c ∈ {0.1, 0.2, ..., 0.9}.
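Selecting c by impurity reduction can be sketched as follows. The helper `split_fn` is hypothetical and stands in for the k-prototypes partition produced at a given c; only the Gini bookkeeping is shown:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def gini_reduction(parent, children):
    """Impurity reduction of a multiway split: parent Gini minus the
    size-weighted Gini of the child nodes."""
    n = len(parent)
    return gini(parent) - sum(len(ch) / n * gini(ch) for ch in children)

def best_c(parent, split_fn, grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Try each c on the grid; split_fn(c) must return the child label lists.
    Keep the c whose split reduces the Gini index the most."""
    return max(grid, key=lambda c: gini_reduction(parent, split_fn(c)))

# a pure 2-way split of a balanced binary node removes all impurity
print(gini_reduction([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 0.5
```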

MSDT and Time Complexity Analysis.
The multi_split function is used for node splits. Algorithm 2 describes the construction process of MSDT.
In step 2 of Algorithm 1, RELIEF-F is used to get the weights.
The time complexity of RELIEF-F is O(m · p · n · log2 k_nn), where p is the number of features, n is the number of instances, m is the number of sampled instances, and k_nn is the number of nearest neighbors. In this paper, m is set to log2 n and k_nn is set to 1, so the log2 k_nn factor is negligible and the time complexity of RELIEF-F here is O(p · n · log2 n).
Steps 4 to 9 of Algorithm 1 are the clustering process, whose time complexity is O(I · p · n · k), where k is the number of clusters and I is the number of iterations. When we use Algorithm 1 to split nodes, the maximum number of iterations l_max is 6, so the time complexity may reach O(6 · p · n · k) in the worst case.
Considering the above two parts, the time complexity of Algorithm 1 is O((6k + log2 n) · p · n). Compared with the classical axis-parallel splits, there is an extra factor k: when k is large, this algorithm is less efficient than the axis-parallel algorithms. Compared with binary splits, if the decision trees have the same number of nodes, the operations in k-way splits are clearly fewer than in binary splits.
OC1 [21] is a classic oblique decision tree whose time complexity is O(p · n^2 · log2 n) in the worst case. In [25], the time complexities of HHCART(A) and HHCART(D) are O((p + n · log2 n) · p^2 · k) and O((p + log2 n) · p · n · k), respectively. In [22], the speed of FDT for splitting a node is close to or even better than that of the axis-parallel split methods; its time complexity is O(p^2 · n). Unfortunately, it can only be applied to binary classification problems.
In summary, when k is small, the efficiency of the proposed split method is close to classical axis-parallel split methods, and it is better than most oblique split methods.

Experiments
In this section, we use experimental results to demonstrate the effectiveness and performance of our proposed algorithm. The first part illustrates the effectiveness of clustering, feature weighting, and the novel distance calculation method for categorical features. The second part compares MSDT with classical decision trees and two other oblique trees. Finally, we use the larger dataset covertype to compare with two axis-parallel trees.

Datasets.
As shown in Table 4, the 20 UCI datasets [33] are used to evaluate the proposed algorithm, where the number of instances, the number of classes, and the feature types are listed.

Algorithm 1: multi_split.
Input: current node training set D.
Output: partition of D into C_1, C_2, ..., C_k, cluster centers μ_1, μ_2, ..., μ_k, and the weights w.
(1) Initialize the number of clusters k with the number of classes in D.
(2) Input D, call RELIEF-F to generate w.
...

Comparison of Different Piecewise Linear Split Methods.
The piecewise linear split methods can be summarized in two steps: first, find appropriate anchors; then, divide instances according to the nearest anchor. On this basis, our proposed algorithm improves three aspects: feature weighting, clustering, and dedicated categorical feature processing. This section combines these three changes into multiple functions and compares their performance on multiple types of data. These functions are shown in Table 5.
The pessimistic pruning algorithm is adopted after the decision trees are generated. In addition, the average results of all experiments are obtained by 10 repetitions of 10-fold cross-validation.

Numerical Data.
For numerical data, the proposed algorithm uses weighted k-means to optimize the cluster center positions. To demonstrate the roles of clustering and feature weighting, we implement four different node split functions to generate decision trees. Fun0 directly uses the center of each class as the anchor; instances are divided according to the nearest anchor, with Euclidean distance. Fun1 also uses the center of each class as the anchor, but when selecting the nearest anchor for each instance, RELIEF-F is first used to calculate the weight of each feature, features whose weights are less than 1/5 of the maximum are then removed, and finally the distance is calculated according to formula (10). Fun2 uses the center of each class as the initial cluster center of k-means and the outputs of k-means as the partition results. Fun3 combines Fun1 with Fun2 and is our proposed algorithm for numerical data. Table 6 gives the classification accuracy of the 4 functions on 10 datasets, with the best entry in each row bolded. As can be seen, Fun3 obtains the best accuracy on 9 of 10 datasets, and its average improvement over Fun0 is 4.16%; in particular, the accuracy increases by more than 8% on Glass and Letter. The average accuracy over the 10 datasets shows that Fun1 is about 1.07% higher than Fun0 and Fun3 is 1.39% higher than Fun2, indicating that feature weighting improves classification performance. Fun2 is about 2.77% higher than Fun0 and Fun3 is 3.09% higher than Fun1; this improvement comes from clustering.

Categorical and Mixed Data.
For categorical and mixed data, we implement eight different split functions to generate decision trees. Fun0 directly uses the center of each class as the anchor; for categorical features, modes replace the means used for numerical features as the components of the anchors. When calculating the distance between an instance and an anchor, the distance on each feature is calculated by formula (5) and then summed. The difference between Fun1 and Fun0 is that the weight of each feature is calculated by RELIEF-F; features whose weights are less than 1/5 of the maximum are then removed, and the distance between the instance and the anchor is obtained by formulas (15) and (9). Fun2 adds a clustering process to Fun0; k-modes and k-prototypes are used for categorical data and mixed data, respectively. Fun3 combines Fun1 and Fun2. Fun4-7 correspond to Fun0-3, respectively, except that on the categorical features the cluster centers and distances are calculated by the method described in Section 3.3 (formulas (13) and (15), respectively). Fun7 is our proposed algorithm for categorical and mixed data. Table 7 gives the classification accuracy of the 8 functions on 5 categorical datasets (Balance, Car, Chess, Hayes, and MONK) and 5 mixed datasets (Abalone, CMC, Flags, TAE, and Zoo), with the best entry in each row bolded. Except on CMC and Zoo, Fun7 obtains the best accuracy, and its average is 11.77% higher than Fun0. As can be seen, Fun1 is better than Fun0, Fun3 is better than Fun2, Fun5 is better than Fun4, and Fun7 is better than Fun6, with an average improvement of 5.37%; this is the contribution of feature weighting. Likewise, Fun2 is better than Fun0, Fun3 than Fun1, Fun6 than Fun4, and Fun7 than Fun5, with an average improvement of 4.45%; the reason is the use of clustering. Meanwhile, Fun4-7 are on average 1.48% better than Fun0-3; this improvement comes from using the statistical distribution of feature values instead of modes.

Comparison with Other Decision Trees.
To verify the performance of our proposed algorithm, we selected four decision trees: J48 (WEKA's implementation of C4.5), CART_SL (scikit-learn's implementation of optimal CART), OC1, and HHCART(A). Since CART_SL and OC1 do not support categorical features, we convert categorical features to numerical ones using one-hot encoding. Ten repetitions of 10-fold cross-validation were used to report the average accuracy and tree size of the 5 classifiers on the test set; the Friedman test and Nemenyi test are used to analyze the differences among the algorithms. The accuracy on the numerical datasets is shown in Table 8. As can be seen, MSDT obtains the best accuracy on 5 of 10 datasets, and its average accuracy is 81.68%, which is 1.91%, 4.42%, 1.15%, and 1.81% higher than the other four trees, respectively. To further demonstrate the differences among the classifiers, the Friedman test is used. We use the average ranks of the 5 classifiers on the 10 datasets to calculate F_F = 7.129032. Here, with 5 algorithms and 10 datasets, F_F follows the F-distribution with 4 and 36 degrees of freedom, and the critical value is F(4, 36) = 2.634.

Algorithm 2: construction of MSDT.
(2) Procedure grow(node)
(3) If |D| is less than minparent or the instances are not partitionable (all instances are of the same class or have the same feature values) Then
(4) Mark node as a leaf and label it with the class of the majority of instances in D.
(5) Return node
(6) End If
(7) Call multi_split to get the clusters C_1, C_2, ..., C_k, centers μ_1, μ_2, ..., μ_k, and w.
(8) Save the values of μ_1, μ_2, ..., μ_k and w into the current node for the prediction.
[Table 4 (excerpt): Hayes-Roth (Hayes): 160, 4, 3; 15. MONK's problems (MONK): 432, 6, 2; 16. Abalone: 4177, 7&1, 3; 17. Contraceptive method choice (CMC): 1473, 2&7, 3; 18. Flags: 194, 10&19, 8; 19. Teaching assistant evaluation (TAE): 151, 1&4, 3; 20. Zoo: 101, 15&1, 7 (instances, features, classes).]

The Nemenyi test is used for the post hoc analysis, with critical distance

CD = q_α · sqrt( k(k + 1) / (6N) ),

where k is the number of algorithms and N is the number of datasets. When k = 5, N = 10, and significance α = 0.05, q_α = 2.728, the calculated critical distance is CD = 1.92899. From these results, MSDT and OC1 have obvious performance advantages over CART_SL. The accuracy on the categorical and mixed datasets is shown in Table 9. MSDT obtains the best accuracy on 4 of 10 datasets, and its average accuracy is 78.48%, which is 3.62%, 1.1%, 1.56%, and 1.88% higher than the other four trees, respectively. We use the average ranks of the 5 classifiers on the 10 datasets to calculate F_F = 1.48951; the critical value is again F(4, 36) = 2.634, so we cannot reject the null hypothesis; namely, there is no significant difference among the five classifiers. On categorical and mixed data, then, the advantages of the three multivariate decision trees over the two univariate decision trees are not so obvious. Especially for OC1, each categorical feature is transformed into multiple numerical features by one-hot encoding, which greatly increases the dimension of the feature space; in the new feature space, the data becomes very sparse, and OC1 cannot find suitable split hyperplanes. The tree sizes on the 20 datasets are shown in Table 10. In terms of model structural complexity, the average number of nodes of the three multivariate decision trees is lower than that of the two univariate decision trees. We use the average ranks of the 5 classifiers on the 20 datasets to calculate F_F = 3.35294. Here, with 5 algorithms and 20 datasets, F_F follows the F-distribution with 4 and 76 degrees of freedom, and the critical value is F(4, 76) = 2.492.
So we reject the null hypothesis. The Nemenyi method is used for the post hoc test, giving a critical distance CD = 1.364. From these results, MSDT has an obvious performance advantage compared with J48.
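The two critical distances quoted above follow from the standard Nemenyi formula and can be checked in a few lines:

```python
import math

def nemenyi_cd(q_alpha, k, N):
    """Nemenyi critical distance CD = q_alpha * sqrt(k*(k+1) / (6*N)),
    for k algorithms compared across N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

print(round(nemenyi_cd(2.728, 5, 10), 5))  # 1.92899 (accuracy comparison, 10 datasets)
print(round(nemenyi_cd(2.728, 5, 20), 3))  # 1.364  (tree-size comparison, 20 datasets)
```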

Comparison on Big Data.
The dataset covertype comes from UCI [35]. It is a 7-class problem with 581012 instances and 54 features; 10 of the 54 features are numerical, and the remainder are Boolean. MSDT and J48 regard the Boolean features as categorical, while CART_SL regards them as numerical. Ten repetitions of 10-fold cross-validation are used. Table 11 provides the accuracy of the three classifiers, the size of the trees, and the time to build the trees. The three classifiers achieve similar accuracies on covertype. In terms of tree size, MSDT has the fewest nodes. The running time in Table 11 is the time to build a tree and does not include the time consumed by loading data and testing. J48 runs slower than CART_SL, which does not mean there is a significant difference in time complexity between the two algorithms; the difference may be caused by the different development languages. There are two reasons why MSDT incurs the highest time cost. One is that the time complexity of our proposed method is higher than that of the axis-parallel methods when splitting a node. The other is that the axis-parallel methods mainly perform relational operations, for instance, "<", whereas our method needs to calculate a large number of distances, which requires arithmetic operations on real numbers. Although multiway splits reduce the number of node splits, the experimental results show that the time consumed by our method is about 2 to 3 times that of the axis-parallel methods.

Conclusion
The decision trees generated by oblique splits often have better generalization ability and fewer nodes. However, most oblique split methods are time-consuming and cannot be directly used for categorical data, and some of them can only be used for binary classification problems. Our proposed algorithm MSDT uses feature weighting and clustering to perform multiway splits at nonleaf nodes and can be directly applied to multiclassification problems. Meanwhile, it has a time complexity similar to that of the axis-parallel algorithms. In addition, we give a representation of the cluster center and of the distance from an instance to a cluster center that enables clustering to be used on categorical and mixed data. Experimental results show that MSDT has good generalization accuracy on multiple types of data.

Data Availability
The data used to support the findings of this study are included in the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.