This article describes how misclassification costs supplied with the individual training objects can be used in the construction of decision trees for minimal-cost rather than minimal-error class decisions. This is demonstrated by defining modified, cost-dependent probabilities and a new, cost-dependent information measure, and by using a cost-sensitive extension of the CAL5 algorithm for learning decision trees. The cost-dependent information measure ensures the selection of the (locally) next best, that is, cost-minimizing, discriminating attribute in the sequential construction of the classification trees. It is shown to be a cost-dependent generalization of the classical information measure introduced by Shannon, which depends only on classical probabilities. It is therefore of general importance and extends classical information theory, knowledge processing, and cognitive science, since subjective evaluations of decision alternatives can be included in the entropy and the transferred information. Decision trees can then be viewed as cost-minimizing decoders for class symbols emitted by a source and coded by feature vectors. Experiments with two artificial datasets and one application example show that this approach is more accurate than a method that uses class-dependent costs given by experts a priori.

The inductive construction of classifiers from training sets is one of the most common research areas in machine learning (ML) and therefore in human-computer interaction. The traditional task is to find a hypothesis (a classifier) that minimizes the mean classification error (see, e.g., [

One way to incorporate costs in classification learning is to use a cost function that specifies the mean misclassification costs in a class-dependent manner a priori [

One example of application is the classification of a bank’s credit applicants as either “good customers” (who will pay back their credit) or “bad customers” (who are not likely to pay back their loans in full). A classifier for this task can be constructed from the bank’s records of past credit applications, containing personal information on customers, information on the actual loans (amount, duration, etc.), back payments on loans, and the bank’s actual profit or loss. The loss occurring in a single case can be seen, in a natural way, as the misclassification cost for that example. In the case of a good customer, the cost is the bank’s loss if that customer has been rejected. Where bad customers are concerned, the cost is simply the actual loss if the loan is not paid back in full. There are many other applications in which major costs resulting from false decisions must be avoided. For example, a medical diagnosis must not overlook a dangerous disease such as cancer: even if the disease is unlikely, failing to detect it can lead to very high costs (in the extreme case, the death of the patient). Another example is searching for texts in a text database, such as on the Internet, where the importance of a text depends on the goals of specific user groups. Yet another application is the modeling of cognitive and general behavioral processes in which, for example, emotional evaluation plays a major role.

One approach for using example-dependent costs has already been discussed and applied in the context of rough classifiers [

On the other hand, decision trees have the advantage of being able to be broken up into a set of rules which can be interpreted in a human-like fashion. Therefore, decision tree learning can be used as a tool for automatic knowledge acquisition in human-machine interaction, for example, in expert systems. In the cost-dependent case introduced here, this process can also be controlled by a factor of subjective importance defined by the cost of false decisions.

This article is structured as follows. Section

We start with the introduction of cost-dependent probabilities and a cost-dependent information measure as a generalization of the information measure introduced by Shannon, which only depends on classical probabilities [

In this construction (learning) of decision trees, which can be viewed as sequential decoders of class symbols coded by feature vectors, the next attribute for branching has to be selected if no unique class decision is possible yet. The attribute

Now we introduce a cost

Furthermore, we define the cost-transformed conditional probabilities

In the cost-free case, the well-known Bayes decision rule decides for the class that has the highest probability [
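Since the exact notation is truncated in this excerpt, the following is only a minimal sketch of the cost-dependent variant of this rule. It assumes that each class k comes with a single misclassification cost c_k and that the decision maximizes the cost-weighted posterior p(k|x)·c_k; the normalization of the cost-transformed probabilities cancels in the argmax, so it is omitted.

```python
import numpy as np

def bayes_decision(posteriors, costs):
    """Cost-dependent Bayes decision (sketch): pick the class k that
    maximizes p(k | x) * c_k, i.e. the class with the highest
    cost-transformed posterior.  With equal costs this reduces to the
    classical maximum-posterior rule."""
    weighted = np.asarray(posteriors, dtype=float) * np.asarray(costs, dtype=float)
    return int(np.argmax(weighted))
```

For instance, with posteriors (0.7, 0.3) and equal costs the rule picks class 0; raising the cost of missing class 1 to ten times that of class 0 flips the decision to class 1.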

The class decision is usually based on the observed attribute values, not only on the prior class probabilities and costs. In the following, we will restrict our considerations to a single attribute

Note that the cost of misclassification also measures the (perhaps subjective or problem-dependent) importance of the class

We have simplified the definitions of cost-dependent probabilities here by regarding the attributes

Now we can define a cost-dependent entropy
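The precise formulas are truncated in this excerpt. One common way to realize such a measure, assumed in the sketch below, is to weight each class probability by its cost, renormalize, and insert the result into Shannon's entropy and transinformation formulas; with equal costs this reduces to the classical quantities.

```python
import numpy as np

def cost_probs(p, costs):
    """Cost-transformed probabilities (assumed form): weight each class
    probability by its misclassification cost and renormalize."""
    w = np.asarray(p, dtype=float) * np.asarray(costs, dtype=float)
    return w / w.sum()

def entropy(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def transinformation(p_joint, costs):
    """Cost-dependent transinformation (sketch).

    p_joint[i, k]: joint probability of interval i and class k.
    Returns H_c(class) - sum_i p(i) * H_c(class | interval i),
    where H_c is the entropy of the cost-transformed probabilities.
    With equal costs this is the classical transinformation.
    """
    p_joint = np.asarray(p_joint, dtype=float)
    p_int = p_joint.sum(axis=1)   # marginal over intervals
    p_cls = p_joint.sum(axis=0)   # marginal over classes
    h_prior = entropy(cost_probs(p_cls, costs))
    h_cond = sum(p_int[i] * entropy(cost_probs(p_joint[i] / p_int[i], costs))
                 for i in range(len(p_int)) if p_int[i] > 0)
    return h_prior - h_cond
```

Perfectly separating intervals yield the full (cost-transformed) prior entropy, while statistically independent intervals yield zero, as in the classical case.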

The following proposition states two important properties of the cost-sensitive transinformation and relates it to its classical, cost-independent counterpart.

(a) If the costs

(b) If the mean cost

(a) If

(b) We consider the case of two classes and set

The function

An attribute

As mentioned in the introduction, the results obtained here are applied in cognitive science for a theoretical foundation of cost-controlled human behavior. One application of the explanation of the generation and possible control of psychopathological behavior is described in a paper written together with a well-known German psychotherapist and researcher in psychoanalysis [

The following section provides an overview of how a decision tree is constructed with CAL5. A comparison with other decision tree algorithms can be found in Section

The CAL5 algorithm [

In the following, we will give a more detailed description of CAL5. If, during tree construction, a terminal node (leaf) representing an interval

We assume that one (real-valued) attribute

In the first step of the interval construction, all values of the training objects reaching the terminal node are ordered along the new dimension

With the confidence interval, the following “metadecision” is made for each tentative

If, for a class

If, for all classes

If neither

This procedure is repeated recursively until all intervals of
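The recursive metadecision above can be sketched as follows. CAL5's exact confidence-interval formula is not reproduced in this excerpt, so a normal-approximation band is substituted (an assumption), and `threshold` plays the role of the decision threshold; all function names are ours.

```python
import math

def confidence_interval(n_k, n):
    """Crude normal-approximation 95% confidence band for the class
    frequency n_k/n (a stand-in for CAL5's exact formula)."""
    p = n_k / n
    eps = 1.96 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - eps), min(1.0, p + eps)

def metadecision(counts, threshold):
    """CAL5-style metadecision for one tentative interval.

    counts: class -> number of training objects in the interval.
    Returns ('decide', class) if one class confidently dominates,
    ('branch', None) if no class can reach the threshold, or
    ('extend', None) if the statistics are still too poor.
    """
    n = sum(counts.values())
    uppers = {}
    for k, n_k in counts.items():
        lo, hi = confidence_interval(n_k, n)
        if lo >= threshold:                  # rule (1): dominant class
            return ('decide', k)
        uppers[k] = hi
    if all(hi < threshold for hi in uppers.values()):
        return ('branch', None)              # rule (2): branch further
    return ('extend', None)                  # rule (3): extend the interval
```

With 95 of 100 objects in one class and a threshold of 0.8, the interval is closed with a class decision; with a 5/5 split and threshold 0.9, no class can dominate and a new branch is required.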

This computation of transinformation is done for all available attributes apart from the one immediately preceding the current node. The attribute that delivers the maximum transinformation is used for branching. Branching stops if either the terminal nodes of the branch contain a unique class label or the statistics in an interval are too poor to make either decision

Note also that interval

If the costs for not recognizing some classes are given, the original CAL5 uses class dependent thresholds

CAL5 has been compared with other algorithms for classification learning, particularly within the scope of the famous European STATLOG project described in detail in the book [

Considering the experimental results of the STATLOG project, CAL5 showed good results on average. Its performance, averaged over the 24 datasets used in STATLOG, was similar to that of C4.5 (even a little better [

An algorithm for the construction of decision trees using “total costs” (defined as the sum of misclassification costs and the cost for measuring an attribute

In this section, we describe an extended version of CAL5 capable of handling example-dependent costs. In Section

Now we extend the metadecision rules (1)–(3) and the (locally applied) optimal branching procedure introduced in Section

The algorithm performs an interval formation if a new branch has to be constructed with attribute

Based on decision theory, the following rule for class decision can be derived (see also Appendix and [

It can now be seen that, as a measure of importance, high costs

Using the class-dependent thresholds

If, for a class

If, for all classes

Case (3) of the description given in Section

For applying Bayes’ decision rule (1), the value of

Note that the transinformation computed for each attribute occurring in the feature vector (apart from the one immediately preceding on the path in the tree) for finding the optimal attribute for the new branch is now cost-dependent, since the intervals

The following offers a concise summary and description of the cost-dependent construction of decision trees with CAL5_OC. The algorithm has the following input and parameters:

a training set

the confidence value

the decision threshold

The output of the algorithm is a decision tree with decision nodes and leaves. A decision node is labeled with an attribute

The learning algorithm constructs intervals for an attribute

We implement the rules (1)–(3) defined in Section

Note that in addition to

Now we focus on a single attribute

The function

Let

Extend interval

In a postprocessing stage, it is possible to merge two intervals

Evaluate

Let

Create a decision node labeled with

create a leaf labeled with

create a subtree recursively by evaluating

create a leaf labeled with a class

Return the tree.
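Because the step-by-step description above is heavily truncated in this excerpt, the following is a drastically simplified, cost-free skeleton of the overall loop only: form tentative intervals on each attribute, decide for a class if one clearly dominates, and otherwise branch on the attribute with maximal transinformation. Quantile-based intervals and a purity threshold stand in for CAL5's order-statistics and confidence-based rules; all names are ours.

```python
import numpy as np

def transinfo(y_by_interval, n_classes):
    """Classical transinformation between interval index and class label."""
    counts = np.array([[np.sum(y == k) for k in range(n_classes)]
                       for y in y_by_interval], dtype=float)
    n = counts.sum()
    if n == 0:
        return 0.0
    p_joint = counts / n
    p_i = p_joint.sum(axis=1, keepdims=True)
    p_k = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (p_i @ p_k)[mask])).sum())

def split_intervals(x, y, n_intervals=4):
    """Tentative intervals by quantiles (a stand-in for CAL5's
    order-statistics-based interval formation)."""
    edges = np.quantile(x, np.linspace(0, 1, n_intervals + 1)[1:-1])
    idx = np.searchsorted(edges, x)
    return [y[idx == i] for i in range(n_intervals)], edges

def build_tree(X, y, n_classes, threshold=0.9, exclude=None, depth=0, max_depth=3):
    """Tiny CAL5-flavoured tree builder: create a leaf if one class
    dominates, otherwise branch on the attribute whose intervals give
    maximal transinformation (skipping the attribute used just above)."""
    counts = np.bincount(y, minlength=n_classes)
    if depth >= max_depth or counts.max() / counts.sum() >= threshold:
        return {'leaf': int(counts.argmax())}
    best, best_t = None, -1.0
    for a in range(X.shape[1]):
        if a == exclude:
            continue
        y_by_int, _ = split_intervals(X[:, a], y)
        t = transinfo(y_by_int, n_classes)
        if t > best_t:
            best, best_t = a, t
    y_by_int, edges = split_intervals(X[:, best], y)
    idx = np.searchsorted(edges, X[:, best])
    children = []
    for i in range(len(y_by_int)):
        m = idx == i
        if m.sum() == 0:
            children.append({'leaf': int(counts.argmax())})
        else:
            children.append(build_tree(X[m], y[m], n_classes, threshold,
                                       exclude=best, depth=depth + 1,
                                       max_depth=max_depth))
    return {'attr': best, 'edges': edges.tolist(), 'children': children}
```

On a toy set in which attribute 0 separates the classes and attribute 1 is constant, the builder branches on attribute 0 and produces pure leaves.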

The theoretical considerations from the previous sections are demonstrated in the following by experiments using two artificial datasets and one example of application [

Figure

Classifier for dataset NSEP (without costs). Errors of type 1 correspond to misclassifications for both classes occurring in intervals for which a class decision was made. Errors of type 2 correspond to misclassifications in intervals where no confident class decision could be made and the final class was assigned based on a majority decision.

The dataset (MULTI) shown in Figure

Classifier for the MULTI dataset (without costs).

From the NSEP dataset in Section

Cost functions for NSEP.

Figure

Classifier for NSEP_OC dataset (with object-dependent costs).

Now the discrimination function is piecewise linear: the decision region for class 1 is enlarged in the positive half-space of attribute

For the sake of comparison, we ran both the original CAL5 (constructed without costs) and CAL5_OC (constructed with object-dependent costs as described in Section

For the MULTI dataset, the cost functions

Classifier for MULTI_OC dataset (with object-dependent costs).

Comparing this with Figure

We conducted experiments with the German credit dataset from the STATLOG project [

In order to evaluate CAL5_OC, we designed the following cost model which features example-dependent costs as opposed to only class-dependent costs. If a good customer is incorrectly classified as a bad customer, we assumed a cost of

If a bad customer is incorrectly classified as a good customer, we assumed that 75% of the entire credit amount will be lost (normally a customer will pay back at least part of the loan). When averaging the example-dependent costs for each class, we arrived at a ratio close to that originally given by experts, 1 : 5, which underpins the plausibility of our model. Note that when applying our approach to data from a real bank, we do not have to design a cost function based on the attributes. Instead, the cost values are naturally specified for individual customers. In the case of the German credit dataset, however, we did not have access to these values. In the following, we regard the example-dependent costs defined by the cost function above as the real costs of individual cases.
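The bad-customer part of this cost model can be written down directly; the cost for rejecting a good customer is not fully reproduced in this excerpt, so it is passed in as a hypothetical parameter in the sketch below rather than guessed.

```python
def misclassification_cost(label, credit_amount, good_customer_cost):
    """Example-dependent cost model for the credit data (sketch).

    label: +1 for a good customer, -1 for a bad customer.
    A bad customer wrongly accepted is assumed to cost 75% of the
    credit amount (as stated in the text); the cost of wrongly
    rejecting a good customer is supplied by the caller, since its
    exact definition is not reproduced here (hypothetical parameter).
    """
    if label == -1:
        return 0.75 * credit_amount
    return good_customer_cost
```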

In our experiments, we aimed to compare results obtained with example-dependent costs to results obtained with class-dependent costs only. This means that learning with CAL5 was based on a cost matrix, while learning with CAL5_OC was based on the example-dependent costs. However, evaluation was always performed with respect to the example-dependent costs, since we viewed them as the real costs of the examples. For this comparison, we did not use the given cost matrix estimated by the experts, but rather a new matrix estimated from the example-dependent costs. We constructed this new cost matrix by computing the average of the costs of class +1 and class −1 examples from the training set, which resulted in 6.27 and 29.51, respectively (the credit amounts were normalized so that they lay in the interval
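Constructing the class-dependent cost matrix from the example-dependent costs amounts to a per-class average, as sketched below (the function name is ours, and the numbers in the test are synthetic, not the 6.27/29.51 reported in the text).

```python
import numpy as np

def cost_matrix_from_examples(costs, labels, classes=(+1, -1)):
    """Average the example-dependent misclassification costs per class,
    yielding one class-dependent cost value per class (the procedure
    used to obtain the new cost matrix from the training set)."""
    costs = np.asarray(costs, dtype=float)
    labels = np.asarray(labels)
    return {k: float(costs[labels == k].mean()) for k in classes}
```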

We ran CAL5 with the new matrix and CAL5_OC with object-dependent costs using the German credit dataset with the (optimized) parameters

We compared the modified decision tree algorithm CAL5_OC with an extended perceptron algorithm (DIPOL, a piecewise linear classifier [

We also took a cost-proportional resampling method into consideration, as described in [
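Cost-proportional resampling, as commonly described (and assumed here to be the variant referenced), draws training examples with probability proportional to their individual costs, so that an error-minimizing learner trained on the resampled set approximately minimizes costs on the original one. The function below is a minimal sketch.

```python
import numpy as np

def cost_proportional_resample(X, y, costs, n=None, rng=None):
    """Draw a bootstrap sample in which each example's inclusion
    probability is proportional to its misclassification cost, so that
    a cost-blind learner on the sample approximately minimizes cost."""
    rng = rng or np.random.default_rng(0)
    costs = np.asarray(costs, dtype=float)
    p = costs / costs.sum()
    n = n or len(y)
    idx = rng.choice(len(y), size=n, replace=True, p=p)
    return X[idx], np.asarray(y)[idx]
```

An example whose cost dominates the total will, as expected, dominate the resampled training set.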

We applied all four approaches to the modified version of the German credit dataset described in Section

German credit dataset: results for different learning algorithms. “Default” is the classifier obtained when always predicting the class that maximizes

| Algorithm | Estimated mean costs |
|---|---|
| Default | 4.38 |
| SVMLight | 3.07 |
| DIPOL_OC | 3.67 |
| Resampling DIPOL | 3.61 |
| CAL5_OC | 2.97 |

The datasets NSEP_OC and MULTI_OC were mainly designed to demonstrate the qualitative behavior of the cost-sensitive learning methods. However, for the cost-sensitive versions of DIPOL and the SVM, diagrams corresponding to the ones for CAL5_OC (Figures

In this article, we introduced a new, cost-dependent information measure and described how object-dependent costs can be used to learn decision trees (decoders) for cost-optimal instead of error-minimal decisions. This was demonstrated through the use of decision theory and by defining CAL5_OC, a cost-minimizing extension of the CAL5 algorithm, which automatically converts real-valued attributes into discrete-valued ones by constructing intervals. The cost-dependent information measure was used to select the (locally) best next attribute for tree building. It can be used in other algorithms for decision tree learning, and it is of general importance for information theory and for modeling in cognitive science and human-computer interaction, because control of behavior by error is replaced by control through the costs of false decisions. There are many practical applications of classification learning in which minimizing the costs of decisions plays a role, such as medical diagnosis and the financial domain.

Experiments with two artificial datasets and one example of application show the feasibility of our approach and that it is more adequate than a method using cost matrices given a priori by experts whenever training objects with individual costs are available. Since decision trees constructed with CAL5_OC also separate the classes in the feature space by axis-parallel hyperplanes, they can be used to obtain symbolic representations of classes and rules that depend on, and are ordered by, their importance. In the future, it would be interesting to introduce misclassification costs into methods for constructing decision trees with hyperplanes in general position, using distances to the hyperplanes as a measure of confidence for class decisions [

In contrast, for instance, to the cost-sensitive extension of DIPOL, CAL5_OC in its current form is not able to handle misclassification costs that depend not only on the original class of the example but also on the class into which it might be classified incorrectly. This means that each example in the training set must come with a whole vector of cost values corresponding to the different possible classes. We think that, in practice, these cost vectors per example might be difficult to obtain, whereas a single cost value per example (as it is used by CAL5_OC) could be given as the cost that occurred for the respective example in the past.

We also did not consider costs for measuring attributes (e.g., [

In Section

Decide for class

To prove that the second rule is also necessary, we start by formulating a condition for the rejection of a class decision. This is given by the following.

This inequality can be transformed to

The value of

The rule for a decision for a single class in

Do not reject a class decision if there is at least one class

If condition (a) also holds for class