Cost-Sensitive Attribute Reduction in Decision-Theoretic Rough Set Models

In recent years, decision-theoretic rough set theory and its applications have been studied extensively, including the attribute reduction problem. However, most researchers focus only on decision cost and ignore test cost. In this paper, we study the attribute reduction problem with both types of costs in decision-theoretic rough set models. A new definition of attribute reduct is given, and attribute reduction is formulated as an optimization problem that aims to minimize the total cost of classification. Both backtracking and heuristic algorithms for the new problem are then proposed. The algorithms are tested on four UCI (University of California, Irvine) datasets. Experimental results demonstrate the efficiency and the effectiveness of both algorithms. This study provides a new insight into the attribute reduction problem in decision-theoretic rough set models.


Introduction
We are involved in decision making all the time. Most decisions are based on a group of criteria. In this case, decision making often aims at finding a proper balance or tradeoff among multiple criteria. There are a series of methods for analyzing multicriteria decision making, such as game theory. Game theory is an effective mathematical method for formulating decision problems as competition between several entities [1]. These entities, or players, aspire either to achieve a dominant position over the other players or to cooperate with each other in order to find a position that benefits all [2]. Researchers have accumulated a vast literature on game theory and its applications. For example, recent advances in the study of evolutionary games are reviewed in [3][4][5], and some strategies in the spatial ultimatum game are discussed in [6,7]. However, most of these studies do not consider attribute reduction, which can significantly reduce the computational complexity.
In contrast to the works mentioned above, attribute reduction is an important concept in rough set theory. It supports the wide applications of rough sets. Moreover, classical rough sets [8][9][10] and their extensions [11][12][13][14][15] can be used in conflict analysis [16], a field related to decision making and game theory. Decision-theoretic rough sets (DTRS) [12,13] may be particularly relevant to decision making and may benefit from some new insights provided by game theory. In rough set theory, a concept is usually described by three classification regions: positive region, boundary region, and negative region. The three regions in DTRS are systematically calculated based on a set of loss functions according to the Bayesian decision procedure. The loss functions can be interpreted based on practical notions of costs and risks. In DTRS models, an object is classified into a particular region because the cost of classifying it into that region is less than the cost of classifying it into the other regions. The expected cost of classifying a set of objects is called the decision cost.
Generally speaking, attribute reduction can be interpreted as a process of finding the minimal set of attributes that can preserve or improve one or several criteria. The minimal set of attributes is called an attribute reduct. Some researchers have investigated the attribute reduction problem in DTRS models. Most of them addressed the problem based on the preservation or extension of the positive region or the nonnegative region [17][18][19]. However, for DTRS, the regions are nonmonotonic with respect to the set inclusion of attributes [18][19][20], so it is difficult to evaluate and interpret region-based attribute reduction. To tackle the problem, minimal-decision-cost attribute reduction was discussed in [20]. However, most existing studies of attribute reduction in DTRS concern only decision costs and not test costs.
Test cost is the time, money, or other resources one pays for obtaining a data item of an object. Most existing attribute reduction problems assume that the data are already stored in datasets and available without charge. However, data are often not free in reality. Recently, the topic of test costs has drawn attention due to its broad applications. According to the data models constructed in [21], the issues of test-cost-sensitive attribute reduction have been studied based on classical rough sets [22,23], neighborhood rough sets [24], covering rough sets [25,26], and so forth. In these works, both backtracking and heuristic algorithms have been implemented in the open source software Coser [27]. Unfortunately, few works have addressed attribute reduction with test cost in the context of DTRS.
In this paper, we study the cost-sensitive attribute reduction problem for DTRS by considering the tradeoff between test costs and decision costs, which is closely related to decision making and game theory. Since the purpose of decision making is to minimize the cost, the process of attribute reduction should help in minimizing the total cost, namely, the sum of test cost and decision cost. A decreasing average-total-cost attribute reduct is defined, which ensures that the total cost of decision making will be decreased or unchanged by using the reduct. In view of this, a minimal average-total-cost reduct (MACR) in DTRS models is introduced. An optimization problem is constructed in order to minimize the average total cost. It is a generalization of the minimal-decision-cost attribute reduction problem discussed in [20].
Both backtracking and heuristic algorithms are proposed to deal with the new attribute reduction problem. The backtracking algorithm is designed to find an optimal reduct for small datasets. However, for large datasets, it is not easy to find a minimal cost attribute subset. Therefore, we propose a heuristic algorithm to deal with this problem. To study the performance of both algorithms, experiments are undertaken on four datasets from the UCI library [28] through the software Coser. Experimental results show that the efficiency of the backtracking algorithm is acceptable, especially when the loss functions are not much larger than the test costs, while the heuristic algorithm is rather efficient and generates a minimal total cost reduct in most cases. Even when the reduct is not optimal, it is still acceptable from a statistical perspective. Moreover, both algorithms perform well on classification accuracy with CART and RBF-kernel SVM classifiers. Meanwhile, the number of selected attributes is effectively reduced by the two algorithms.
The rest of the paper is organized as follows. In Section 2, we review the main ideas of DTRS. Section 3 gives a detailed explanation of minimal-total-cost attribute reduction in DTRS models and proposes an optimization problem. In Section 4, we present a backtracking algorithm and a heuristic algorithm to address the optimization problem. Experimental settings and results are discussed in Section 5. Section 6 concludes and suggests further research trends.

Decision-Theoretic Rough Set Models
In this section, we review some basic notions of the DTRS model [12,13,17], which provides the theoretical basis for our method.

Definition 1. A decision system (DS) $S$ is the 5-tuple
$$S = (U, C, D, \{V_a\}, \{I_a\}),$$
where $U$ is a finite nonempty set of objects called the universe, $C$ is the set of conditional attributes, $D$ is the set of decision attributes with only discrete values, $V_a$ is the set of values for each $a \in C \cup D$, and $I_a : U \to V_a$ is an information function for each $a \in C \cup D$.
In a decision system, given a set of conditional attributes $B \subseteq C$, the equivalence class of an object $x$ with respect to $B$, namely, $\{y \in U \mid I_a(x) = I_a(y), \forall a \in B\}$, is denoted by $[x]_B$, or $[x]$ if $B$ is understood. In DTRS models, the set of states $\Omega = \{X, \neg X\}$ indicates that an object is in a decision class $X$ and not in $X$, respectively. The probabilities for these two complementary states can be denoted by $P(X \mid [x]) = |X \cap [x]| / |[x]|$ and $P(\neg X \mid [x]) = 1 - P(X \mid [x])$. With respect to the three regions, positive region $\mathrm{POS}(X)$, boundary region $\mathrm{BND}(X)$, and negative region $\mathrm{NEG}(X)$, the set of actions regarding the state $X$ is given by $\mathcal{A} = \{a_P, a_B, a_N\}$, where $a_P$, $a_B$, and $a_N$ represent the three actions of classifying an object $x$ into the three regions, respectively. Let $\lambda_{PP}$, $\lambda_{BP}$, and $\lambda_{NP}$ denote the costs incurred for taking actions $a_P$, $a_B$, and $a_N$, respectively, when an object belongs to $X$, and let $\lambda_{PN}$, $\lambda_{BN}$, and $\lambda_{NN}$ denote the costs incurred for taking the same actions when the object does not belong to $X$. The loss functions regarding the states $X$ and $\neg X$ can be expressed as a 2 × 3 matrix given in Table 1.
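Based on the loss functions, the expected costs of taking different actions for objects in $[x]$ can be expressed as
$$R(a_P \mid [x]) = \lambda_{PP} P(X \mid [x]) + \lambda_{PN} P(\neg X \mid [x]),$$
$$R(a_B \mid [x]) = \lambda_{BP} P(X \mid [x]) + \lambda_{BN} P(\neg X \mid [x]),$$
$$R(a_N \mid [x]) = \lambda_{NP} P(X \mid [x]) + \lambda_{NN} P(\neg X \mid [x]).$$
The Bayesian decision procedure leads to the following minimal-risk decision rules:
(P) If $R(a_P \mid [x]) \le R(a_B \mid [x])$ and $R(a_P \mid [x]) \le R(a_N \mid [x])$, decide $x \in \mathrm{POS}(X)$;
(B) If $R(a_B \mid [x]) \le R(a_P \mid [x])$ and $R(a_B \mid [x]) \le R(a_N \mid [x])$, decide $x \in \mathrm{BND}(X)$;
(N) If $R(a_N \mid [x]) \le R(a_P \mid [x])$ and $R(a_N \mid [x]) \le R(a_B \mid [x])$, decide $x \in \mathrm{NEG}(X)$.

Consider a special kind of loss functions with
$$\lambda_{PP} \le \lambda_{BP} < \lambda_{NP}, \qquad \lambda_{NN} \le \lambda_{BN} < \lambda_{PN}. \quad (3)$$
That is, the cost of classifying an object $x$ belonging to $X$ into the positive region $\mathrm{POS}(X)$ is less than or equal to the cost of classifying $x$ into the boundary region $\mathrm{BND}(X)$, and both of these costs are strictly less than the cost of classifying $x$ into the negative region $\mathrm{NEG}(X)$. The reverse order of costs is used for classifying an object that does not belong to $X$. The decision rules can be reexpressed as follows:
(P) If $P(X \mid [x]) \ge \alpha$ and $P(X \mid [x]) \ge \gamma$, decide $x \in \mathrm{POS}(X)$;
(B) If $P(X \mid [x]) \le \alpha$ and $P(X \mid [x]) \ge \beta$, decide $x \in \mathrm{BND}(X)$;
(N) If $P(X \mid [x]) \le \beta$ and $P(X \mid [x]) \le \gamma$, decide $x \in \mathrm{NEG}(X)$,
where the parameters $\alpha$, $\beta$, and $\gamma$ are defined as
$$\alpha = \frac{\lambda_{PN} - \lambda_{BN}}{(\lambda_{PN} - \lambda_{BN}) + (\lambda_{BP} - \lambda_{PP})}, \quad \beta = \frac{\lambda_{BN} - \lambda_{NN}}{(\lambda_{BN} - \lambda_{NN}) + (\lambda_{NP} - \lambda_{BP})}, \quad \gamma = \frac{\lambda_{PN} - \lambda_{NN}}{(\lambda_{PN} - \lambda_{NN}) + (\lambda_{NP} - \lambda_{PP})}.$$
When
$$\frac{\lambda_{NP} - \lambda_{BP}}{\lambda_{BN} - \lambda_{NN}} > \frac{\lambda_{BP} - \lambda_{PP}}{\lambda_{PN} - \lambda_{BN}}, \quad (5)$$
we have $0 \le \beta < \gamma < \alpha \le 1$. After tie-breaking, the simplified rules are obtained as follows:
(P1) If $P(X \mid [x]) \ge \alpha$, decide $x \in \mathrm{POS}(X)$;
(B1) If $\beta < P(X \mid [x]) < \alpha$, decide $x \in \mathrm{BND}(X)$;
(N1) If $P(X \mid [x]) \le \beta$, decide $x \in \mathrm{NEG}(X)$.

Let $\pi_D = \{D_1, D_2, \ldots, D_m\}$ denote the partition of the universe $U$ induced by $D$. Based on the thresholds $(\alpha, \beta)$, one can divide the universe $U$ into three regions of the decision partition $\pi_D$:
$$\mathrm{POS}_{(\alpha,\beta)}(\pi_D) = \{x \in U \mid P(D_{\max}([x]) \mid [x]) \ge \alpha\},$$
$$\mathrm{BND}_{(\alpha,\beta)}(\pi_D) = \{x \in U \mid \beta < P(D_{\max}([x]) \mid [x]) < \alpha\}, \quad (6)$$
$$\mathrm{NEG}_{(\alpha,\beta)}(\pi_D) = \{x \in U \mid P(D_{\max}([x]) \mid [x]) \le \beta\},$$
where $D_{\max}([x]) = \arg\max_{D_i \in \pi_D} \{|[x] \cap D_i| / |[x]|\}$. Let $p = P(D_{\max}([x]) \mid [x])$; the Bayesian expected costs of the positive rule, boundary rule, and negative rule can be expressed, respectively, as follows:
$$p \cdot \lambda_{PP} + (1 - p) \cdot \lambda_{PN}, \qquad p \cdot \lambda_{BP} + (1 - p) \cdot \lambda_{BN}, \qquad p \cdot \lambda_{NP} + (1 - p) \cdot \lambda_{NN}. \quad (7)$$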
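As an illustration of how the thresholds follow from the loss functions, the sketch below computes $\alpha$, $\beta$, and $\gamma$ with the standard DTRS formulas and applies the simplified three-way rules. The function names `thresholds` and `three_way` and the sample loss values are assumptions made for this example, not part of the original model description.

```python
from fractions import Fraction


def thresholds(l_pp, l_bp, l_np, l_pn, l_bn, l_nn):
    """Return (alpha, beta, gamma) for integer losses satisfying
    l_pp <= l_bp < l_np and l_nn <= l_bn < l_pn."""
    alpha = Fraction(l_pn - l_bn, (l_pn - l_bn) + (l_bp - l_pp))
    beta = Fraction(l_bn - l_nn, (l_bn - l_nn) + (l_np - l_bp))
    gamma = Fraction(l_pn - l_nn, (l_pn - l_nn) + (l_np - l_pp))
    return alpha, beta, gamma


def three_way(p, alpha, beta):
    """Simplified rules: POS if p >= alpha, NEG if p <= beta, else BND."""
    if p >= alpha:
        return "POS"
    if p <= beta:
        return "NEG"
    return "BND"


# Hypothetical loss values, chosen only to satisfy the ordering constraints.
alpha, beta, gamma = thresholds(l_pp=0, l_bp=2, l_np=10, l_pn=8, l_bn=3, l_nn=0)
region = three_way(Fraction(1, 2), alpha, beta)
```

Exact fractions are used here so that comparisons against the thresholds are not affected by floating-point rounding.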

Minimal-Total-Cost Attribute Reduction in Decision-Theoretic Rough Set Models
In this section, we focus on cost-sensitive attribute reduction based on test costs and decision costs in DTRS models. The objective of attribute reduction is to minimize the total cost by considering a tradeoff between test costs and decision costs. Minimizing the total cost is equivalent to minimizing the average total cost (ATC), so we study the minimal average-total-cost reduct problem.
Test cost is intrinsic to data. There are a number of test-cost-sensitive decision systems. A corresponding hierarchy consisting of six models was proposed in [21]. Here, we consider only the test-cost-independent decision system, which is the simplest though most widely used model.

Definition 2 (see [21]). A test-cost-independent decision system (TCI-DS) $S$ is the 6-tuple
$$S = (U, C, D, \{V_a\}, \{I_a\}, c),$$
where $U$, $C$, $D$, $\{V_a\}$, and $\{I_a\}$ have the same meanings as in a DS and $c : C \to \mathbb{R}^{+}$ is the test cost function. Test costs are independent of one another; that is, $c(B) = \sum_{a \in B} c(a)$ for any $B \subseteq C$.
By introducing test cost into DTRS models, we can obtain the following definition.

Definition 3. A test-cost-independent cost-sensitive decision system in DTRS models (DTRS-TCI-CDS) $S$ is the 7-tuple
$$S = (U, C, D, \{V_a\}, \{I_a\}, c, (\lambda_{ij})_{2 \times 3}),$$
where $U$, $C$, $D$, $\{V_a\}$, $\{I_a\}$, and $c$ have the same meanings as in Definition 2 and $(\lambda_{ij})_{2 \times 3}$ is the loss function matrix listed in Table 1, where $i \in \{P, B, N\}$ and $j \in \{P, N\}$.
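An example of DTRS-TCI-CDS is given in Tables 2, 3, and 4, which present, respectively, a decision system where $U = \{x_1, x_2, \ldots, x_9\}$ and $C = \{a_1, a_2, \ldots, a_6\}$, a corresponding test cost vector, and a corresponding loss function matrix.

Table 2: An example decision system.

For a given DTRS-TCI-CDS and $B \subseteq C$, the decision cost is composed of the three types of cost formulated in (7), so the decision cost can be expressed as
$$\mathrm{COST}(U, B) = \sum_{x \in \mathrm{POS}^{B}_{(\alpha,\beta)}(\pi_D)} \bigl(p_x \lambda_{PP} + (1 - p_x) \lambda_{PN}\bigr) + \sum_{x \in \mathrm{BND}^{B}_{(\alpha,\beta)}(\pi_D)} \bigl(p_x \lambda_{BP} + (1 - p_x) \lambda_{BN}\bigr) + \sum_{x \in \mathrm{NEG}^{B}_{(\alpha,\beta)}(\pi_D)} \bigl(p_x \lambda_{NP} + (1 - p_x) \lambda_{NN}\bigr),$$
where $p_x = P(D_{\max}([x]_B) \mid [x]_B)$ and the three regions are computed from the equivalence classes $[x]_B$. According to (6), we can rewrite the decision cost formulation as
$$\mathrm{COST}(U, B) = \sum_{p_x \ge \alpha} \bigl(p_x \lambda_{PP} + (1 - p_x) \lambda_{PN}\bigr) + \sum_{\beta < p_x < \alpha} \bigl(p_x \lambda_{BP} + (1 - p_x) \lambda_{BN}\bigr) + \sum_{p_x \le \beta} \bigl(p_x \lambda_{NP} + (1 - p_x) \lambda_{NN}\bigr).$$
Obviously, we can obtain the average decision cost as follows:
$$\overline{\mathrm{COST}}(U, B) = \frac{\mathrm{COST}(U, B)}{|U|}.$$
Because the test cost of any object is the same for the test set $B$, the average total cost (ATC) is given by
$$\mathrm{ATC}(U, B) = c(B) + \overline{\mathrm{COST}}(U, B).$$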
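The computation of the average total cost can be sketched as follows, assuming that the decision cost is summed over objects using the majority-class probability of each equivalence class and that test costs are additive. The function name `average_total_cost`, the dictionary-based data layout, and the toy data are hypothetical; they do not reproduce Tables 2 to 4.

```python
from collections import Counter, defaultdict


def average_total_cost(rows, decisions, subset, test_cost, loss, alpha, beta):
    """ATC = test cost of `subset` + average decision cost.

    rows: list of dicts mapping attribute name -> value (one per object)
    decisions: list of decision labels, aligned with rows
    loss: dict with keys 'PP', 'BP', 'NP', 'PN', 'BN', 'NN'
    """
    subset = sorted(subset)
    # Group decision labels by equivalence class under the chosen attributes.
    blocks = defaultdict(list)
    for row, label in zip(rows, decisions):
        blocks[tuple(row[a] for a in subset)].append(label)

    decision_cost = 0.0
    for labels in blocks.values():
        # Probability of the majority decision class within the block.
        p = max(Counter(labels).values()) / len(labels)
        if p >= alpha:
            action = "P"
        elif p <= beta:
            action = "N"
        else:
            action = "B"
        per_object = p * loss[action + "P"] + (1 - p) * loss[action + "N"]
        decision_cost += per_object * len(labels)

    return sum(test_cost[a] for a in subset) + decision_cost / len(rows)


# Toy data (assumed for illustration only).
rows = [{"a1": 0, "a2": 0}, {"a1": 0, "a2": 1},
        {"a1": 1, "a2": 0}, {"a1": 1, "a2": 1}]
decisions = ["y", "n", "y", "y"]
test_cost = {"a1": 1, "a2": 2}
loss = {"PP": 0, "BP": 2, "NP": 10, "PN": 8, "BN": 3, "NN": 0}
atc_a1 = average_total_cost(rows, decisions, ["a1"], test_cost, loss, 5 / 7, 3 / 11)
```

In this toy setting, adding an attribute removes decision cost but adds test cost, which is the tradeoff the ATC is meant to capture.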
Similar to [20], we study the decreasing cost attribute reduction to avoid the interpretation difficulties in region preservation based definitions.The definition of decreasing average-total-cost attribute reduct is presented as follows.
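Definition 4. In a DTRS-TCI-CDS $S = (U, C, D, \{V_a\}, \{I_a\}, c, (\lambda_{ij})_{2 \times 3})$, $R \subseteq C$ is a decreasing average-total-cost attribute reduct if and only if
(1) $\mathrm{ATC}(U, R) \le \mathrm{ATC}(U, C)$;
(2) $\forall R' \subset R$, $\mathrm{ATC}(U, R') > \mathrm{ATC}(U, R)$.

According to the definition, we choose the subsets of $C$ which ensure that the ATC of decision making is decreased or unchanged in the process of attribute reduction.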
In most situations, users want to obtain the smallest total cost in the classification procedure, so we propose an optimization problem with the objective of minimizing the average total classification cost. The attribute set that makes ATC minimal is called a minimal average-total-cost reduct (MACR), and the optimization problem is called the MACR problem. That is, $R \subseteq C$ is an MACR if $\mathrm{ATC}(U, R) = \min\{\mathrm{ATC}(U, B) \mid B \subseteq C\}$, and the MACR problem takes a DTRS-TCI-CDS as input and seeks such an $R$. If we set $c(a) = k$ for all $a \in C$, where $k$ is a constant, the MACR problem is essentially the minimal-decision-cost attribute reduct problem [20], so the former is a generalization of the latter.

Algorithms
Since the MACR problem is a combinatorial problem and it is not easy to obtain the optimal solution in linear time, we use a heuristic approach to obtain an approximately optimal solution. However, to evaluate the performance of a heuristic algorithm in terms of the quality of the solution, we should find an optimal reduct first, so an exhaustive algorithm is also needed. In this section, we propose a backtracking algorithm and a δ-weighted heuristic algorithm to address the MACR problem.

The Backtracking Attribute Reduction Algorithm.
The backtracking algorithm is illustrated in Algorithm 1. In order to invoke this backtracking algorithm, several global variables should be explicitly initialized as follows: (1) $R = \emptyset$ is a reduct with minimal average total cost; (2) $\mathrm{cmc} = \mathrm{ATC}(U, C)$ is the currently minimal average total cost; (3) $l = 0$ is the current level test index lower bound. The backtracking algorithm is denoted as backtracking($B$, $l$). A reduct with minimal ATC will be stored in $R$ at the end of the algorithm execution. Generally, the search space of the attribute reduction algorithm is $2^{|C|}$. To reduce the search space, we employ one pruning technique shown in lines 3 through 5 in Algorithm 1. The attribute subset $B$ will be discarded if the test cost of $B$ is not less than the currently minimal average total cost (cmc), because the decision costs are nonnegative in real applications.
Note that total costs may decrease with the addition of attributes, which means that ATC under an attribute set may be less than that under some of its subsets.That is different from the previous works which considered only test cost [25], in which test costs increase when more attributes are selected.
The following example gives an intuitive understanding.
Example 7. Take the DTRS-TCI-CDS listed in Tables 2-4 for example. By computation, we find that ATC is 3974.8 when the selected attribute set is $B = \{a_4\}$, while ATC is reduced to 3346.2 when $B = \{a_4, a_6\}$.
Therefore, no matter whether the currently selected attribute subset $B$ satisfies $\mathrm{ATC}(U, B) < \mathrm{cmc}$ or not, $B$ continues expanding to search for a minimal ATC, which is shown in line 10 of Algorithm 1.
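The search strategy described in this subsection can be sketched roughly as below. This is not a transcription of Algorithm 1: the function name `backtrack_macr`, the callable `atc` (assumed to return the average total cost of a subset), and the choice to start from the full attribute set as the incumbent are assumptions made for illustration.

```python
def backtrack_macr(attributes, test_cost, atc):
    """Exhaustive search for a subset with minimal ATC, with test-cost pruning.

    attributes: list of attribute names
    test_cost: dict mapping attribute name -> positive test cost
    atc: callable taking a frozenset of attributes and returning its ATC
    """
    full = frozenset(attributes)
    best = {"subset": full, "cost": atc(full)}  # incumbent, plays the role of cmc

    def search(current, lower):
        for i in range(lower, len(attributes)):
            candidate = current | {attributes[i]}
            # Decision costs are nonnegative, so ATC >= test cost. If the test
            # cost alone reaches the incumbent, neither this subset nor any of
            # its supersets can improve on it.
            if sum(test_cost[a] for a in candidate) >= best["cost"]:
                continue
            cost = atc(candidate)
            if cost < best["cost"]:
                best["subset"], best["cost"] = candidate, cost
            # Keep expanding even when cost >= incumbent: ATC is not monotonic
            # in the attribute set, so a superset may still be cheaper.
            search(candidate, i + 1)

    search(frozenset(), 0)
    return best["subset"], best["cost"]
```

The index lower bound ensures that each subset is generated at most once, and the pruning test only ever compares test cost against the incumbent, in line with the argument that decision costs are nonnegative.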

The 𝛿-Weighted Heuristic Attribute Reduction Algorithm.
The δ-weighted heuristic attribute reduction algorithm is listed in Algorithm 2, in which the algorithm framework contains two main steps. Let $B$ denote the set of currently selected attributes. First, we combine the current best attribute subset $B' \subseteq (C - B)$ with $B$ according to the heuristic attribute significance function $f(B, B', c)$ until $B$ becomes a superreduct. This step is essentially attribute addition. Then, we delete attributes $a$ from $B$ to guarantee that $B$ has the current minimal total cost. Lines 4 through 13 contain the key code of the addition step. There are two main differences from existing works [22,25,29]. One is the heuristic attribute significance function. We propose the δ-weighted attribute significance function $f(B, B', c)$, where $a'_i$ is an attribute in $B'$ and $c(a'_i)$ is the test cost of $a'_i$. The other difference is the computation steps. At first, the dimension of $B'$ is 1, which means that we test the remaining attributes one by one. However, since the positive region may shrink with the addition of attributes in DTRS models [19], it may happen that, for all $a_i \in C$, the difference $|\mathrm{POS}^{B \cup \{a_i\}}_{(\alpha,\beta)}(\pi_D)| - |\mathrm{POS}^{B}_{(\alpha,\beta)}(\pi_D)|$ is not more than 0. In this case, we cannot choose a suitable attribute to make the current $\mathrm{POS}^{B}_{(\alpha,\beta)}(\pi_D)$ expand to reach $\mathrm{POS}^{B}_{(\alpha,\beta)}(\pi_D) \supseteq \mathrm{POS}^{C}_{(\alpha,\beta)}(\pi_D)$. To address this situation, we gradually increase the dimension of $B'$, that is, we consider multiple attributes simultaneously, and compute the corresponding values of the attribute significance function until at least one value is more than 0.
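A rough sketch of the addition and deletion framework is given below. It is not Algorithm 2 itself: the significance score used here (positive-region gain multiplied by the group's total test cost raised to the power δ, with δ ≤ 0) is an assumed form in the spirit of the weighted functions in [22], and the names `heuristic_reduct`, `pos`, and `atc` are hypothetical.

```python
from itertools import combinations


def heuristic_reduct(attributes, test_cost, pos, atc, delta=-1.0):
    """Greedy addition followed by deletion.

    pos: callable, frozenset of attributes -> set of objects in the positive region
    atc: callable, frozenset of attributes -> average total cost
    delta: assumed non-positive exponent that penalizes expensive attributes
    """
    full = frozenset(attributes)
    selected = frozenset()

    # Addition step: grow `selected` until its positive region covers that of C.
    while not pos(selected) >= pos(full):
        remaining = [a for a in attributes if a not in selected]
        best_group, best_score = None, 0.0
        # Try single attributes first; enlarge the group only if no gain is found.
        for size in range(1, len(remaining) + 1):
            for group in combinations(remaining, size):
                gain = len(pos(selected | frozenset(group))) - len(pos(selected))
                score = gain * sum(test_cost[a] for a in group) ** delta
                if score > best_score:
                    best_group, best_score = group, score
            if best_group is not None:
                break
        # Fall back to all remaining attributes if no group shows positive gain.
        selected |= frozenset(best_group if best_group is not None else remaining)

    # Deletion step: drop attributes, most expensive first, while ATC does not rise.
    for a in sorted(selected, key=test_cost.get, reverse=True):
        if len(selected) > 1 and atc(selected - {a}) <= atc(selected):
            selected -= {a}
    return selected
```

Trying groups of increasing size mirrors the idea of raising the dimension of the candidate subset when no single attribute enlarges the positive region.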

Experiments
In this section, the performance of our two algorithms is studied. We try to answer the following questions by experimentation.
(1) Are both the backtracking algorithm and the heuristic algorithm efficient?
(2) Is the heuristic algorithm effective for the MACR problem?
Since there are no intrinsic test costs and loss functions in the datasets mentioned above, we create these data for the experiments. First, we generate test costs that are always represented by positive integers. Let $a_i$ be a condition attribute; $c(a_i)$ is set to a random number in [1,100] subject to the uniform distribution discussed in [22]. Then, we produce loss functions $\lambda_{ij}$ ($i \in \{P, B, N\}$, $j \in \{P, N\}$), which are random nonnegative integers satisfying (3) and (5). Since the loss functions are often larger than test costs in real life, we set the average of $\lambda_{ij}$ to be in [100, 5000]. Of course, the assumptions on cost values could easily be changed if necessary. To observe whether the algorithm efficiency is influenced by the ratio of loss functions to test costs, the experiments shown below are undertaken with two groups of cost settings for each dataset listed in Table 5. Each group contains 100 different cost settings. Test costs in both groups are the same, but the loss functions are different. The average values of loss functions (ALF) in group 1 and group 2 are around 500 and 3000, respectively. Experiments are undertaken on a PC with an Intel 2.20 GHz CPU and 4 GB memory.
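One possible way to draw such cost settings is rejection sampling, sketched below. The function name `random_costs`, the upper bound `loss_max`, and the sampling scheme for the loss functions are assumptions for illustration; only the uniform test costs in [1, 100] and the two constraints on the loss functions come from the setup described above.

```python
import random


def random_costs(attributes, rng, loss_max=1000):
    """Draw integer test costs in [1, 100] and a loss matrix satisfying
    l_pp <= l_bp < l_np, l_nn <= l_bn < l_pn, and the threshold condition."""
    test_cost = {a: rng.randint(1, 100) for a in attributes}
    while True:
        l_pp, l_bp, l_np = sorted(rng.randint(0, loss_max) for _ in range(3))
        l_nn, l_bn, l_pn = sorted(rng.randint(0, loss_max) for _ in range(3))
        strict = l_bp < l_np and l_bn < l_pn
        # Cross-multiplied form of
        # (l_np - l_bp) / (l_bn - l_nn) > (l_bp - l_pp) / (l_pn - l_bn),
        # which also covers the case l_bn == l_nn without dividing by zero.
        ordered = (l_np - l_bp) * (l_pn - l_bn) > (l_bp - l_pp) * (l_bn - l_nn)
        if strict and ordered:
            loss = {"PP": l_pp, "BP": l_bp, "NP": l_np,
                    "PN": l_pn, "BN": l_bn, "NN": l_nn}
            return test_cost, loss


# Seeded generator so that a cost setting can be reproduced.
test_cost, loss = random_costs(["a1", "a2", "a3"], random.Random(0))
```

Changing `loss_max` shifts the average loss relative to the test costs, which is one simple way to vary the ratio studied in the two groups.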

Efficiencies of the Two Algorithms.
We study the efficiencies of both algorithms using two metrics. One is the number of backtrack steps, that is, the number of times Algorithm 1 is invoked. By comparing it with the size of the search space [25], the efficiency of the backtracking algorithm is investigated. The other is the run-time comparison between the two algorithms, which is used to study the efficiency of the heuristic algorithm. The search space size and the average number of backtrack steps for Algorithm 1 are depicted in Table 6, and the average run-time for both algorithms is shown in Table 7, where the unit of run-time is 1 ms.
From the results, we note the following.
(1) In both groups, the number of backtrack steps is less than the search space size, which demonstrates the effectiveness of the pruning technique in Algorithm 1.
(2) As ALF increases, both the backtrack steps and the run-time of Algorithm 1 grow, which means that the efficiency of the backtracking algorithm is influenced by the ratio of loss functions to test costs. The reason is that, when the loss functions are much larger than the test costs, the currently minimal ATC, namely, cmc in Algorithm 1, is also high compared to the current test costs. In this case, the pruning technique shown in lines 3 to 4 of Algorithm 1 has little effect.
(3) The run-time of Algorithm 2 is small compared with that of Algorithm 1, especially for the dataset Zoo. Therefore, the heuristic algorithm is very efficient. Moreover, the run-time of the heuristic algorithm is stable as ALF increases.
In short, the heuristic algorithm is efficient. Although the backtracking algorithm is sometimes not very efficient, it is still needed to evaluate the performance of a heuristic algorithm in terms of the quality of the solution.

Effectiveness of the Two Algorithms.
In this part, we observe the effectiveness of both algorithms by using four metrics. First, two metrics defined in [22], namely, finding optimal factor (FOF) and average exceeding factor (AEF), are computed to measure the performance of the heuristic algorithm from the perspective of cost. In the computations, the results of the backtracking algorithm are used to evaluate the effectiveness of the heuristic algorithm. The results of the two metrics are shown in Figure 1.
From the results, we note the following.
(1) The values of FOF and AEF are not significantly different between ALF ≈ 500 and ALF ≈ 3000. This suggests that the performance of the heuristic algorithm is little influenced by the ratio of loss functions to test costs.
(2) All FOF are above 0.5, and all AEF are below 0.1.In other words, the results are acceptable.
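The two cost-based metrics can be computed as in the following sketch, taking FOF as the fraction of cost settings in which the heuristic reaches the optimal cost and AEF as the mean relative excess over the optimal cost. The function name `fof_and_aef` and the list-based inputs are assumptions; the precise definitions are those of [22].

```python
import math


def fof_and_aef(heuristic_costs, optimal_costs):
    """Return (FOF, AEF) for paired heuristic and optimal average total costs.

    Both inputs are sequences of equal length, one entry per cost setting,
    and every optimal cost is assumed to be positive.
    """
    pairs = list(zip(heuristic_costs, optimal_costs))
    # A run counts as a hit when the heuristic cost matches the optimum.
    hits = sum(1 for h, o in pairs if math.isclose(h, o))
    # Exceeding factor of one run: relative excess over the optimal cost.
    exceeding = [(h - o) / o for h, o in pairs]
    return hits / len(pairs), sum(exceeding) / len(pairs)


# Hypothetical costs from four cost settings.
fof, aef = fof_and_aef([10.0, 12.0, 8.0, 5.0], [10.0, 10.0, 8.0, 4.0])
```

A tolerance-based comparison is used for the hit count because the costs are floating-point averages.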
Then, we compare the classification performances of the original data and the reduced data obtained by our two algorithms based on 10-fold cross validation. CART and RBF-kernel SVM are used as learning algorithms, respectively. The results are depicted in Tables 8-9. We also present the comparison of the average numbers of selected attributes, which is shown in Table 10.
From the results, we observe the following.
(1) The values of classification accuracy by our algorithms are a little lower than those by the raw data, but the numbers of selected attributes are effectively reduced, which is consistent with the essence of DTRS models. Different from the classical rough set, classification error is acceptable within a certain range according to the thresholds in DTRS models. Consequently, the reduction effectiveness is improved.
(2) As ALF increases, all numbers of selected attributes grow, and the classification performance on most datasets improves. This means that the tolerance for classification error decreases when the classification costs increase.
(3) For all datasets, the classification performance of Algorithm 1 is a little better than that of Algorithm 2; meanwhile, Algorithm 1 selects more attributes in most cases.

Conclusions
In this paper, we address the cost-sensitive attribute reduction problem in DTRS models. By considering the tradeoff between decision costs and test costs, the minimal average-total-cost attribute reduct is defined, and the corresponding optimization problem is proposed. Both backtracking and heuristic algorithms are designed to deal with the optimization problem. Experimental results demonstrate the efficiency and the effectiveness of both algorithms. By combining test costs with the existing elements in DTRS models, such as the loss functions and the probabilistic approaches, our model is practical in real applications.
The following research topics deserve further investigation.
(1) The MACR problem could be addressed again based on more complicated test-cost-sensitive decision systems (DS), such as the simple common-test-cost DS and the complex common-test-cost DS [21]. The corresponding algorithms may also be more complicated.
(2) Sometimes the costs one could afford are limited. We could consider the attribute reduction problem with a test cost constraint or a total cost constraint in DTRS models.
(3) Recently, from the viewpoint of rough set theory, Yao [30,31] has discussed three-way decisions, which may have many real-world applications. One could explore the cost-sensitive attribute reduction problem for three-way decisions with decision-theoretic rough sets.
In summary, this study suggests new research trends concerning decision-theoretic rough set theory, the attribute reduction problem, and cost-sensitive learning applications.

Figure 1: Effectiveness of the heuristic algorithm, measured by two metrics, for ALF around 500 and 3000, respectively. (a) Finding optimal factor (FOF) is the fraction of experiments in which the search finds an optimal reduct. The higher the FOF, the better the heuristic algorithm. All FOF are above 0.5. (b) Average exceeding factor (AEF) is the average value of the fractions beyond the minimal average total costs. The lower the AEF, the better the algorithm. All AEF are below 0.1.

Table 1: The loss function matrix.

Table 3: An example test cost vector.

Table 4: An example loss function matrix.

Table 6: The average number of backtrack steps for Algorithm 1.

Table 7: Average run-time comparison.

Table 8: Classification performance comparison with CART classifier.

Table 9: Classification performance comparison with RBF-kernel SVM classifier.

Table 10: The comparison of the average numbers of selected attributes.