A Searching Method of Candidate Segmentation Point in SPRINT Classification

The SPRINT algorithm is a classical algorithm for building decision trees, a widely used method of data classification. However, SPRINT incurs a high computational cost when calculating attribute segmentations. In this paper, an improved SPRINT algorithm is proposed that searches for better candidate segmentation points for discrete and continuous attributes. The experimental results demonstrate that, by improving the segmentation of continuous and discrete attributes, the proposed algorithm reduces the computation cost and improves efficiency.


Introduction
In recent years, with the rapid development of the economy and the continuous improvement of computer technology, a large number of databases have come into use in business management, scientific research, and engineering development. Finding valuable information in such massive stored data is a very difficult task. Data mining helps people extract valuable information from large, incomplete, random, and fuzzy data. Classification is a very important part of data mining. Its purpose is to construct a function or a model by which data can be assigned to one of the given categories. The classification model can thus be used to forecast data [1,2]. The prediction model is derived from historical data records to represent the trend of the given data, so that it can be used to forecast future data.
The ID3 algorithm is a significant algorithm for building a decision tree [3,4]. It uses the information gain to select the attribute tested at each node of the decision tree. However, ID3 is biased toward attributes that take a large number of values. The improved method C4.5, based on ID3 [5,6], uses the information gain ratio instead of the information gain to select attributes, which improves the efficiency of decision trees. Many further improved algorithms based on ID3 have since been proposed, including SLIQ and SPRINT. The SLIQ algorithm [7] can handle classification of large datasets. The SPRINT algorithm [8][9][10], based on SLIQ, is not restricted by memory, and its processing speed is considerable.
The SPRINT algorithm has many advantages: it is not restricted by memory, and it is a scalable and parallel method of building decision trees. It also has shortcomings. For example, finding the best segmentation point of a discrete attribute requires a large amount of calculation, and the partition of continuous attributes is unreasonable.
Based on these issues, this paper proposes a new method of searching for the best segmentation point. For discrete attributes, the new method reduces time complexity by avoiding unnecessary computation. For continuous attributes, discretizing the attribute values reduces the depth of the decision tree and improves its classification efficiency.

Journal of Electrical and Computer Engineering
[...] knowledge from large scale datasets and represent them in a graphically intuitive way.
The paper [1] presents the Importance Aided Decision Tree (IADT), which takes feature importance as additional domain knowledge for enhancing the performance of learners. A decision tree algorithm finds the most important attribute at each node, so the feature importance mechanism of that paper is relevant domain knowledge for the decision tree algorithm. For automatically designing decision trees, Barros et al. [2] propose a hyperheuristic evolutionary decision tree algorithm tailored to a specific type of classification dataset. The algorithm evolves design components of top-down decision tree induction algorithms.
The key of the ID3 algorithm is to use the information gain as the reference value for testing attributes, which leads to lower classification accuracy [3]. The authors in [4] therefore proposed a new scheme to overcome this shortcoming of ID3. When selecting the best segmentation attribute, their method uses as a heuristic an improved information gain based on the dependency degree of condition attributes.
Ersoy et al. [5] proposed an improved C4.5 classification algorithm with the hypothesis generation process. The algorithm adopts -best Multi-Hypothesis Tracker (MHT) to reduce the number of generated hypothesis especially in high clutter scenarios.
In order to solve the security problems of intrusion detection system (IDS), attack scenarios and patterns should be analyzed and categorized. The enhanced C4.5 [6] is a combination of tree classifiers for solving security risks in the intrusion detection system. The mechanism uses a multiple level hybrid classifier which relies on labeled training data and mixed data. Thus, the IDS system based on C4.5 mechanism can be trained with unlabeled data and is capable of detecting previous attacks.
The SLIQ decision tree produces sharp decision boundaries, which are hardly found in real classification problems. The paper [7] therefore proposes a fuzzy SLIQ (Supervised Learning In Quest) decision tree, in which the authors construct a fuzzy decision boundary instead of a crisp one. To avoid incomprehensible induction rules in a large and deep decision tree, fuzzy SLIQ constructs a fuzzy binary decision tree, which significantly reduces the tree size.
SPRINT decision tree algorithm can predict the quality level of system modules, which is good for software testing [8]. The paper presents an improved SPRINT algorithm to calibrate classification trees. It provides a unique tree-pruning technique based on the minimum description length (MDL) principle. Based on this, SPRINT tree-based software quality classification mechanisms are used to predict whether a software module is fault-prone or not fault-prone.

Description of SPRINT Algorithm.
The SPRINT algorithm places no limit on the number of input records, and its processing speed is considerable. In the initialization phase, the algorithm creates an attribute list and a corresponding statistics table (histogram) for each attribute of the sample data. The elements of an attribute list are called attribute records, each consisting of a record label, an attribute value, and a class. A statistics table describes the class distribution for an attribute; its two rows, C_above and C_below, describe the class distribution of the processed samples and of the unprocessed samples, respectively.
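As an illustration of the initialization phase, the following minimal Python sketch builds one attribute list per attribute from a handful of records. The names `AttributeRecord` and `build_attribute_lists` and the sample records are assumptions made for this example; they do not come from the original algorithm description. Lists of numeric attributes are sorted by value once, so that later scans can visit candidate points in order.

```python
from collections import namedtuple

# One entry of an attribute list: record label (row id), attribute value, class.
AttributeRecord = namedtuple("AttributeRecord", ["rid", "value", "cls"])


def build_attribute_lists(records, attributes, class_key):
    """Build one attribute list per attribute; numeric lists are sorted by value."""
    lists = {}
    for attr in attributes:
        entries = [
            AttributeRecord(rid, rec[attr], rec[class_key])
            for rid, rec in enumerate(records)
        ]
        # Only continuous (numeric) attributes are pre-sorted.
        if all(isinstance(e.value, (int, float)) for e in entries):
            entries.sort(key=lambda e: e.value)
        lists[attr] = entries
    return lists


# Hypothetical sample data, for illustration only.
records = [
    {"age": 42, "vocation": "worker", "risk": "low"},
    {"age": 23, "vocation": "student", "risk": "high"},
    {"age": 35, "vocation": "clerk", "risk": "low"},
]
attribute_lists = build_attribute_lists(records, ["age", "vocation"], "risk")
```

Keeping the record label in every entry is what allows the lists of different attributes to be split consistently once the best segmentation has been chosen.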
Steps of the original SPRINT algorithm are as follows:

If (node N meets the termination conditions) {
    Put node N into the queue, labeled as a leaf node;
    Return;
}

For (each attribute A) {
    Update the histogram in real time;
    Calculate and evaluate the segmentation index of each candidate segmentation point, and find the best segmentation point;
}
Find the best segmentation for node N among the best segmentations of all attributes;
Based on it, divide N into two parts N1, N2;

The algorithm terminates in any of the following three cases:
(1) No attribute can be used as a testing attribute.
(2) All the training samples at the node belong to the same class; the node is then used as a leaf node and labeled with this class.
(3) The number of training samples is less than the user-defined threshold.
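The three termination conditions can be sketched as a small Python check. This is only an illustration: the function names, the parameter `min_samples` (standing for the user-defined threshold), and the majority-class rule for labeling a mixed leaf are assumptions of this sketch, not details given in the text.

```python
from collections import Counter


def should_stop(class_labels, candidate_attributes, min_samples):
    """Return True when any of the three termination conditions holds."""
    if not candidate_attributes:          # (1) no attribute left to test
        return True
    if len(set(class_labels)) <= 1:       # (2) all samples share one class
        return True
    if len(class_labels) < min_samples:   # (3) fewer samples than the threshold
        return True
    return False


def leaf_label(class_labels):
    """Label a leaf with its majority class (assumed rule for mixed leaves)."""
    return Counter(class_labels).most_common(1)[0][0]
```

For example, `should_stop(["low", "low"], ["age"], 1)` holds because of condition (2), while `should_stop(["low", "high", "low"], ["age"], 2)` does not hold and the node would be split further.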

Segmentation of Attributes.
The traditional SPRINT algorithm uses the Gini index [5] to search for the best segmentation attribute: the segmentation with the minimum Gini index corresponds to the largest information gain.
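For a dataset $S$ containing $n$ classes, Gini is defined as

$$\mathrm{Gini}(S) = 1 - \sum_{j=1}^{n} p_j^2, \tag{1}$$

where $p_j$ is the frequency of class $j$ in $S$. If a partition divides the dataset $S$ into two subsets $S_1$ and $S_2$, $|S_1|$ and $|S_2|$ represent the number of records in subsets $S_1$ and $S_2$, respectively. After the segmentation, the Gini value is

$$\mathrm{Gini}_{\mathrm{split}}(S) = \frac{|S_1|}{|S|}\,\mathrm{Gini}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Gini}(S_2). \tag{2}$$

A segmentation of attribute values providing the least Gini value is chosen as the best segmentation [9].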
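The Gini computation described above can be sketched in a few lines of Python. The function names `gini` and `gini_split` are chosen for this illustration only. The first function computes one minus the sum of squared class frequencies of a set of class labels, and the second weights the Gini values of the two subsets by their share of the records.

```python
from collections import Counter


def gini(labels):
    """Gini of a set of class labels: 1 - sum of squared class frequencies."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_split(left_labels, right_labels):
    """Gini after a segmentation: size-weighted sum of the subsets' Gini values."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total) * gini(left_labels) \
        + (len(right_labels) / total) * gini(right_labels)
```

With two balanced classes, `gini(["high", "high", "low", "low"])` is 0.5. A segmentation that separates the classes perfectly gives `gini_split(["high", "high"], ["low", "low"])` equal to 0.0, whereas one that leaves both subsets mixed gives 0.5, so the first segmentation would be preferred.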
For discrete attributes and continuous attributes, the SPRINT algorithm uses different processing methods.
To find the segmentation point of a discrete attribute [7], assume that the attribute takes $n$ distinct values, which should be divided into two parts. Every subset of the attribute values is considered as a possible partition, and the corresponding Gini value is obtained. There are $2^n$ possible partitions in total. The Gini value of each partition is calculated exhaustively, and the best segmentation is then obtained.
To find the partitioning point of a continuous attribute, note that a split can only occur between two adjacent values. The values of the continuous attribute are first sorted, and the candidate segmentation points are the midpoints between adjacent values.
During a scan of the sorted values, the statistics table is updated each time a record is read. The statistics table contains all the information needed to calculate the Gini index. The Gini index is then calculated at each candidate point to find the segmentation point with the minimum Gini value.
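A minimal Python sketch of this single scan is given below. It is an illustration under assumptions: the function names are hypothetical, the two counters `processed` and `remaining` play the role of the two rows of the statistics table, and ties between equal attribute values are skipped because no split can separate them.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini index computed directly from a class-count table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_continuous_split(values, labels):
    """Scan sorted (value, class) pairs once; return (threshold, gini) or None."""
    pairs = sorted(zip(values, labels))
    processed = Counter()                          # classes of records already read
    remaining = Counter(lab for _, lab in pairs)   # classes of records not yet read
    n = len(pairs)
    best = None
    for i in range(n - 1):
        value, label = pairs[i]
        processed[label] += 1                      # update the statistics table
        remaining[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue                               # no split between equal values
        left = i + 1
        score = (left / n) * gini_from_counts(processed) \
            + ((n - left) / n) * gini_from_counts(remaining)
        if best is None or score < best[1]:
            best = ((value + next_value) / 2, score)
    return best
```

For the hypothetical ages `[23, 25, 31, 40]` with classes `["high", "high", "low", "low"]`, the scan returns the midpoint 28.0 with a Gini value of 0.0, since that threshold separates the two classes completely.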
Although the traditional method can find the best segmentation point, it must traverse all of the possible segmentations of a discrete attribute [8], which gives the algorithm a high time complexity. For continuous attributes, dividing the values into two consecutive parts cannot, in most cases, reflect the distribution of the attribute values.

Segmentation of Discrete Attribute.
Taking credit risk of bank as an example, the data record is shown in Table 1.
The values of a discrete attribute with $n$ distinct values are divided into two sets, so there are $2^n$ possible partitions, which means that the Gini index must be calculated $2^n$ times. In Table 1, the attribute takes four values (student, worker, clerk, and retiree), so $2^4$ partitions should be considered. Taking into account the commutative law of addition in formula (2), exchanging the two subsets leaves the Gini value unchanged. According to this property, the number of Gini index calculations can be reduced for the segmentation of discrete attributes in the SPRINT algorithm. In order to reduce the time complexity of the SPRINT algorithm, this paper proposes an improved discrete attribute partition algorithm.
Let $S$ be the collection of values of a discrete attribute, and let $n$ be the number of values in $S$. The attribute value set is divided into two sets $S_1$ and $S_2$: $m$ values are selected from $S$ and put into $S_1$. The initial value of $m$ is 1, and $m$ grows by one at each step while $m < n/2$. Partitions with $m > n/2$ only repeat those with $m < n/2$, with $S_1$ and $S_2$ exchanged. When $n$ is odd, $m$ can never be equal to $n/2$. If $n$ is even and $m$ is equal to $n/2$, there are $\binom{n}{n/2}$ combinations of attribute values for $S_1$, and half of these combinations coincide with the $S_2$ of another combination. So after selecting values for $S_1$, we need to search for the same collection among the collections $S_2$. If the same collection exists in $S_2$, the collection in $S_1$ is deleted. The search stops once the number of collections kept for $S_1$ reaches $\frac{1}{2}\binom{n}{n/2}$. At the same time, when all values in a subset belong to the same class, this subset can be a leaf node that does not need to be partitioned, so the two cases {all values} and {empty} can be ignored. In summary, this paper proposes a new algorithm that reduces the calculation of candidate segmentation points for discrete attributes. For a discrete attribute with $n$ different values, the improved algorithm is as follows.
Step 1. Initialize a class partition table (with four fields: number, first collection, second collection, and Gini value), and set the counter to 1 and the number of selected values $m = 1$.
Step 2. If $m < n/2$, where $n$ is the number of attribute values, $m$ values are placed in the first collection of the class partition table, the Gini index of this division is calculated, and the procedure carries on to the next iteration.
Step 4. Put $m$ values in the first collection and the others into the second collection. Search the list of second collections for the values of the first collection, to find out whether a second collection identical to the first collection exists. If so, this partition is deleted; otherwise, calculate the Gini index of this partition.
Step 6. Find the minimum Gini value in the optimized class partition table.
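The idea behind these steps can be sketched in Python as follows. This is a simplified illustration under assumptions, not the authors' implementation: the function names and the sample data are hypothetical, a set of already recorded first collections stands in for the class partition table, and a candidate is skipped when its second collection was recorded earlier as a first collection, which can only happen when exactly half of the values are selected.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    """Gini of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def candidate_partitions(values):
    """Yield (first, second) value sets without empty, full, or mirrored splits."""
    distinct = sorted(set(values))
    n = len(distinct)
    everything = frozenset(distinct)
    recorded = set()                       # first collections seen so far
    for m in range(1, n // 2 + 1):         # larger m would only mirror smaller m
        for combo in combinations(distinct, m):
            first = frozenset(combo)
            second = everything - first
            if second in recorded:         # mirrored duplicate (only when m == n / 2)
                continue
            recorded.add(first)
            yield first, second


def best_discrete_split(values, labels):
    """Return (first, second, gini) of the partition with the minimum Gini value."""
    n = len(labels)
    best = None
    for first, second in candidate_partitions(values):
        left = [lab for v, lab in zip(values, labels) if v in first]
        right = [lab for v, lab in zip(values, labels) if v not in first]
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or score < best[2]:
            best = (first, second, score)
    return best
```

For the four values student, worker, clerk, and retiree, `candidate_partitions` yields 7 partitions: four with one value in the first collection and three of the six possible two-value selections. On hypothetical records in which only students are labeled high risk, `best_discrete_split` selects the partition {student} versus the remaining values.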
It can be seen that the improved algorithm eliminates repeated and unnecessary operations, which greatly reduces the computation and the time needed to create a decision tree.
The values in collection 1 and collection 2 are sorted in ascending or descending order, and the following processing is performed on collection 1 and collection 2.
Step 2. Sort the series in descending order. Find the top $t$ values in the series; the corresponding $a_i$ and $a_{i+1}$ ($b_i$ and $b_{i+1}$) are candidate segmentation points.
Step 3. There are $2t$ candidate segmentation points in collection 1 (collection 2), so $4t$ candidate segmentation points are found in total.
Step 6. Repeat Step 5 until the number of series of is .
Step 7. Divide all values into $k + 1$ blocks using the $k$ segmentation points. Through the above steps, the values of the continuous attribute have been divided into $k + 1$ blocks; these $k + 1$ blocks are treated as $k + 1$ discrete attribute values, and the segmentation method for discrete attribute values is then used to process them.
Step 8. Initialize a class partition table (with four fields: number, first collection, second collection, and Gini value), and set the counter to 1 and the number of selected blocks $m = 1$.
Step 9. If $m < (k + 1)/2$, where $k + 1$ is the number of blocks, $m$ blocks are placed randomly in the first collection of the class partition table, the Gini index of this division is calculated, and the procedure carries on to the next iteration.
Step 11. Put $m$ blocks in the first collection and the others into the second collection. Search the list of second collections for these blocks, to find out whether a second collection identical to the first collection exists. If so, this partition is deleted; otherwise, calculate the Gini index of this partition.
Step 13. Find the minimum Gini value in the optimized class partition table.
Steps 8-13 are the same as the improved algorithm on discrete attributes.
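The last stage, in which the blocks of a continuous attribute are treated as discrete values, can be illustrated with a short Python sketch. It assumes that the segmentation points have already been chosen by the preceding steps, which are not reproduced here; the function name `to_blocks`, the cut points, and the sample ages are hypothetical, and a value equal to a cut point is assigned to the upper block by convention.

```python
from bisect import bisect_right


def to_blocks(values, cut_points):
    """Map each continuous value to its block index, from 0 up to len(cut_points)."""
    cuts = sorted(cut_points)
    return [bisect_right(cuts, v) for v in values]


# Hypothetical ages and two hypothetical segmentation points give three blocks.
ages = [18, 25, 33, 47, 61]
blocks = to_blocks(ages, [30, 50])
```

Here `blocks` is `[0, 0, 1, 1, 2]`. These block indices can then be handled exactly like the values of a discrete attribute when the candidate partitions are enumerated.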

Experiment and Simulation
This experiment uses the dataset of Function [11] as experimental samples. The attributes of the dataset include age, salary, vocation, level, and others. The dataset contains discrete attributes (for example, vocation) and continuous attributes (for example, age). VC++ 6.0 is the experimental platform. A comparison of the original SPRINT algorithm [9] and the improved SPRINT algorithm is shown in Table 2.
The data in Table 2 are visualized in Figure 1.
The amount of data increases across the five sets, so the time cost also grows. As shown in Figure 1, the improved SPRINT algorithm greatly reduces the time needed to generate decision trees. The classification accuracy of the decision tree generated by the improved SPRINT algorithm is also tested.
The comparison results of classification accuracy are shown in Table 3.
As shown in Table 3, the improved SPRINT algorithm achieves almost the same or slightly better classification accuracy than the original algorithm. As the scale of the dataset increases, the classification accuracy decreases accordingly. The decision tree becomes larger as the amount of data increases, which may cause the decrease in accuracy. Controlling the size of the decision tree requires further research.

Conclusion
In summary, the improved SPRINT algorithm speeds up the search for the best segmentation by searching for better candidate segmentation points for discrete and continuous attributes, which removes unnecessary operations, increases the speed of generating decision trees, and greatly reduces the time cost.