Recently, the importance of mobile cloud computing has increased. Mobile devices can collect personal data from various sensors within a short period of time, and this sensor-based data contains valuable information about users. Advanced computation power and data analysis technology based on cloud computing provide an opportunity to classify massive sensor data into given labels. The random forest algorithm is known as a black box model whose internal process is hard to interpret. In this paper, we propose a method that analyzes variable impact in the random forest algorithm to clarify which variables affect classification accuracy the most. We apply the Shapley Value to random forests to analyze variable impact. Under the assumption that every variable cooperates as a player in a cooperative game, the Shapley Value fairly distributes the payoff among the variables. Our proposed method calculates the relative contributions of the variables within the classification process. We analyze the influence of the variables and rank them by their effect on classification accuracy. The proposed method demonstrates its suitability for data interpretation in black box models such as random forests, so that the algorithm is applicable in a mobile cloud computing environment.
Mobile cloud computing has become a significant topic for data mining. Since multimodal sensor data is gathered from mobile devices, data mining in a mobile cloud environment is an important research area. Multidimensional data from mobile devices, such as health information and GPS data, increases exponentially, so it becomes difficult to handle manually.
Research is in progress on measuring variable impact in classification and regression on big data with multidimensional attributes by using data mining algorithms. As data becomes more complex, research on interpreting the meaning of classification and regression results is becoming more important. The main problem of multidimensional data analysis is the curse of dimensionality. Since high-dimensional data streams in real time, which is so-called “small
Assume the situation that the doctor who diagnosed patient
We assume two people
As the examples above suggest, the need for research on variable impact measurement is increasing. Even if the prediction accuracy of the learning algorithm is high, there is a danger that the reliability of the doctor’s diagnosis may deteriorate if the physician cannot directly confirm the cause of the algorithm's result. Likewise, in the second case, it is very important for the banking industry to determine which customer data affected the classification result before deciding whether to approve the customer’s loan.
Recently, in the field of bioinformatics, as personal medical data becomes more complicated and accumulates in real time, related work has been proposed [
However, the random forest algorithm has a critical problem. Since it is a black box model, we cannot see which variables affect the classification result. It is therefore important to interpret the classification result with a variable importance measure. Hapfelmeier et al. [
In this paper, we propose a new method that more accurately captures the relative influence of each variable on classification when measuring variable impact with the random forest algorithm. To this end, we incorporate the Shapley Value, a concept from economic game theory, into the MDA index.
The random forest algorithm, a kind of ensemble learning technique, generates several decision trees by bootstrapping the training data and trains each of them with randomization. The results of all the trees are then combined: the predictions are averaged in the case of regression, and the majority vote is taken in the case of classification. By training randomized decision trees and then averaging them, random forests mitigate the overfitting problem by reducing the variance compared with a single decision tree. In particular, random forests are well suited to the field of bioinformatics, since studies show that they perform well when classifying data with multidimensional attributes but a small number of data for each “small
The principle of random forest operation is as follows. First, various subsets are randomly generated from the existing training data for random forest learning. The most important characteristic of the random forest is bagging. Bagging was proposed by Breiman [
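The bootstrap-and-aggregate idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `bootstrap_indices` and `bagged_prediction` are hypothetical, and each "learner" is assumed to simply predict the mean of its bootstrap sample, where a real random forest would fit a decision tree.

```python
import random
from statistics import mean


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def bagged_prediction(train_y, n_models, rng):
    """Average the predictions of n_models trivial learners.

    Assumption: each learner predicts the mean target of its bootstrap
    sample. A real random forest would train a decision tree here.
    """
    n = len(train_y)
    predictions = []
    for _ in range(n_models):
        in_bag, _ = bootstrap_indices(n, rng)
        predictions.append(mean(train_y[i] for i in in_bag))
    return mean(predictions)


rng = random.Random(0)
y = [3.0, 5.0, 4.0, 6.0, 2.0]  # toy targets, for illustration only
in_bag, oob = bootstrap_indices(len(y), rng)
print("in-bag:", in_bag, "out-of-bag:", oob)
print("bagged prediction:", bagged_prediction(y, n_models=50, rng=rng))
```

The out-of-bag indices returned by `bootstrap_indices` are the samples a given tree never saw, which is what makes them usable for error estimation.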
Linear regression analysis and decision trees are the most frequently used algorithms for verifying the influence of classification results [
There are two main indices for measuring the influence of a variable on classification in a random forest. One is the Mean Decrease Impurity (MDI) index, which measures the impact of a variable by totaling the decrease in impurity as the classification is performed. The other is the Mean Decrease Accuracy (MDA) index, which totals the decrease in accuracy depending on the presence or absence of a specific variable. However, since both indices are biased by the order of variables in the tree structure, they have the disadvantage of reporting a larger influence than the actual value. According to [
This paper has the following contributions:
We propose a technique for measuring variable impact based on the Shapley Value in random forest regression. The proposed method attempts to solve the problem that highly correlated variables gain relatively high contributions regardless of their real contribution to the prediction.
We propose a method that demonstrates the impact of variable coalitions. Considering that not only individual variables but also sets of variables have an impact, our proposed method is able to inspect the interaction between variables. Using a variable with a high impact ranking as a partitioning variable in the tree is expected to improve overall accuracy.
Finally, we propose a coherent ranking of variable impacts based on the marginal contribution of each variable.
The rest of this paper is organized as follows. In Section
In this section, we discuss the previous research for measuring variable impact index. In Section
We explain the related research of variable impact measurement index in a random forest. The representative methods of the variable impact measurement index are Mean Decrease Impurity (MDI) and Mean Decrease Accuracy (MDA) proposed by Breiman [
Breiman [
The equation of variable importance (VI) for variable
MDI has the advantage of being easy to compute, but it has the disadvantage that it can be biased toward categorical variables that contain many categories. For example, when continuous variables and categorical variables with several classes are candidates under the same conditions, a categorical variable may appear to partition the data more finely and is therefore more likely to be selected. When a tree is split on a specific variable, the most effective partition is the one that yields the lowest impurity. If a single partition reduces the impurity by the maximum amount, it is considered an efficient partition, which means a high contribution to tree partitioning.
Conversely, when splitting on a specific variable, if the decrease in impurity before and after the split is 0, the split is meaningless because the variable does not separate the data. In this case, the importance of the variable is judged to be zero.
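The two cases above (a maximally efficient partition and a partition with zero impurity decrease) can be checked with a small sketch. It assumes Gini impurity as the impurity measure and a binary split; the helper names `gini` and `impurity_decrease` are illustrative and not taken from the paper.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy node with two classes, for illustration only.
parent = ["a", "a", "b", "b"]
perfect = impurity_decrease(parent, ["a", "a"], ["b", "b"])  # children are pure
useless = impurity_decrease(parent, ["a", "b"], ["a", "b"])  # children mirror the parent
print(perfect, useless)
```

An MDI-style importance would accumulate such decreases over every node that splits on a given variable, so a variable whose splits never reduce impurity ends up with importance zero.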
MDA is also called permutation importance. The intuition is that, for a decision tree built on a subsample of the training data, randomly permuting the values of a feature that is not useful for predicting the outcome should leave the prediction accuracy almost unchanged. Out-of-bag (OOB) estimation is a subsampling technique that calculates the prediction error of each training sample using bootstrap aggregation. MDA calculates variable importance by permutation and uses the OOB samples to divide its sample data. In other words, the importance is estimated by computing the OOB accuracy before and after the permutation of variable
Since
Strobl et al. [
Regression coefficient simulation design [
















5  5  2  0 



0  0  0  0  0 
In this section, we examine related studies on data mining techniques that apply the Shapley Value. Most of these studies apply the Shapley Value to obtain an objective measure of the importance of variables or features in various algorithms.
Cohen et al. [
Lipovetsky and Conklin [
In a dynamic environment where multiple agents communicate with each other, each agent looks for a single equilibrium point to determine its behavior. In this study, Bowling and Manuela [
In this section, we explain the Shapley Value model, which comes from game theory in economics. We explain the Shapley Value step by step in Sections
The Shapley Value, proposed by Lloyd Shapley in 1953, is a theory of fair distribution among players who share a mutual interest in a cooperative game. In game theory, games can be divided into two types. One is the cooperative game, in which players form coalitions by mutual agreement to maximize their communal payoff; the other is the noncooperative game, in which players maximize their own interests by acting individually rather than collaborating with each other. Under the Shapley Value, players form coalitions and create a common payoff. The players in each coalition then receive differentiated payoffs based on a fair distribution according to their contributions.
The following concept is used to describe the Shapley Value [
According to [
First, Shapley Value follows the
For each
Second, Shapley Value follows the
For each carrier
Third, Shapley Value follows the
For any two games
The Shapley Value is a theory that equitably and reasonably distributes the collective payoff generated from the coalition to its players. Therefore, the following formula is used to calculate the Shapley Value for the player
Given a coalition game
A set of players is formed
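For reference, the Shapley Value is usually written in the following standard form. The symbols are assumptions chosen for this illustration: $N$ is the set of players with $n = |N|$, $v$ is the characteristic function that assigns a payoff to each coalition (with $v(\emptyset) = 0$), $S$ is a coalition that does not contain player $i$, and $\phi_i(v)$ is the payoff distributed to player $i$.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

Each term is the marginal contribution of player $i$ to coalition $S$, weighted by the fraction of player orderings in which exactly the members of $S$ precede $i$. The values sum to the payoff of the grand coalition, $\sum_{i \in N} \phi_i(v) = v(N)$.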
In this section, we explain the proposed method. We apply the Shapley Value from game theory to solve the problems of previous research on variable impact in the random forest algorithm. Our method follows a five-step process. The details are as follows.
First, we calculate the contribution of each variable. When the random forest algorithm generates its regression trees, we traverse each tree path and assign a value to each variable used in the tree. Each path corresponds to a single coalition. This contribution calculation is based on the MDA method, which randomly permutes a variable and measures the change in prediction accuracy, so that we can calculate the marginal contribution of each variable.
Second, we construct coalitions of all the variables used in the random forest. We treat the individual variables as players in a cooperative game and form coalitions by connecting the contributions of specific variables. Each variable has its own payoff according to its joint contribution with each coalition. Figure
Construction step in Shapley Value applied method.
Third, we assign a contribution value to every coalition. The number of coalitions equals the size of the power set of the variables used in the regression tree. We compare the coalitions formed in step 5.2 with the power set of the variables used in the random forest. If a coalition in the power set does not correspond to any tree path, and therefore has no assigned value, its value is assumed to be zero, because a coalition that did not appear in the regression tree is judged to have made no contribution to prediction accuracy.
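The assignment step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the variable names `x1`, `x2`, `x3` and the contribution numbers in `path_values` are hypothetical, and the empty coalition is omitted because its value is conventionally zero.

```python
from itertools import combinations


def power_set(variables):
    """All non-empty subsets of the variables, as frozensets."""
    items = list(variables)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def assign_coalition_values(variables, path_values):
    """Give every coalition in the power set a value.

    path_values holds contributions for coalitions that appeared on a
    tree path; every other coalition is assumed to contribute 0.0.
    """
    return {coalition: path_values.get(coalition, 0.0) for coalition in power_set(variables)}


# Hypothetical contributions observed on tree paths.
path_values = {
    frozenset({"x1"}): 0.2,
    frozenset({"x1", "x2"}): 0.5,
}
values = assign_coalition_values(["x1", "x2", "x3"], path_values)
print(len(values))                    # number of non-empty coalitions
print(values[frozenset({"x3"})])      # never on a path, so zero
```

Because the power set grows as 2^n in the number of variables, enumerating it explicitly like this is only practical for a small number of variables.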
Fourth, we calculate variable impact using Shapley Value method. We combine variables and their contribution as
Finally, we provide a coherent ranking based on the variable impact. The Shapley Value yields the impact of individual variables as well as a priority ordering based on the values assigned to the coalitions. The ranking can therefore be considered not only for individual variables but also for coalitions. The variables can be listed from the highest impact to the lowest, or the reverse. In future work, this ranking can be used as a dimension reduction method to improve prediction accuracy.
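The last two steps can be sketched with an exact Shapley computation over a dictionary of coalition values, followed by a ranking. This is a minimal sketch using the standard Shapley formula, not the authors' implementation: the coalition values below form a made-up toy game, and `shapley_values` and `rank` are hypothetical helper names. Coalitions missing from the dictionary, including the empty one, are treated as having value zero.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values from a dict mapping frozenset coalitions to payoffs."""
    n = len(players)

    def v(coalition):
        # Unassigned coalitions (and the empty coalition) are worth 0.0.
        return value.get(frozenset(coalition), 0.0)

    phi = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (v(subset + (player,)) - v(subset))
        phi[player] = total
    return phi


def rank(scores):
    """Sort items by score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy game (illustrative numbers): x3 adds nothing to any coalition.
value = {
    frozenset({"x1"}): 1.0,
    frozenset({"x2"}): 1.0,
    frozenset({"x1", "x2"}): 2.0,
    frozenset({"x1", "x3"}): 1.0,
    frozenset({"x2", "x3"}): 1.0,
    frozenset({"x1", "x2", "x3"}): 2.0,
}
phi = shapley_values(["x1", "x2", "x3"], value)
print(rank(phi))    # ranking of individual variables
print(rank(value))  # ranking of coalitions by assigned value
```

In this toy game `x3` is a null player and receives a Shapley value of zero, while the values of `x1` and `x2` sum to the payoff of the full coalition.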
In this section, we measure variable impact using the Shapley Value-based method in random forest regression. In the experiments, we compare our proposed method with two previously studied measuring techniques: MDI and MDA.
The experimental environment is an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz/2.592 GHz with 16.0 GB of RAM running x64 Windows. We use R and Python 3.5 as programming languages: we mainly use Python for the experiments on variable impact measuring techniques, and we use R with the randomForest package for data visualization and for applying previous work on the random forest algorithm.
In the previous experiment, we figure out the bias selection problem in previous variable impact measuring techniques: MDI and MDA. We set certain formula to simplify the problem. Assuming that there is a formula
However, MDI and MDA show certain bias in variable selection stage. Variable impact measurement results of
Figures
The variable impact of MDI.
The variable impact of MDA.
When
To solve this bias selection problem, we applied the Shapley Value-based technique. We generated 10, 150, and 300 regression trees, respectively, to measure the variable impact. Table
Variable impact of three variables (
MDI  MDA  SVC  


0.2509  0.1980  0.0185 

0.6085  0.3709  0.0229 

0.1407  0.0952  0.0201 
As shown in Tables
Variable impact of three variables (
MDI  MDA  SVC  


0.2479  0.1865  0.0276 

0.6003  0.3561  0.0267 

0.1518  0.0833  0.0314 
Variable impact of three variables (
MDI  MDA  SVC  


0.2629  0.1996  0.0339 

0.5841  0.3536  0.0260 

0.154  0.0801  0.0137 
However, the proposed method based on Shapley Value (SVC) reduces the difference in influence between these variables. As Table
Unlike the previous methods, which showed a large difference in impact between variables, the Shapley Value-based method mitigates this bias problem by reducing the difference in influence between variables.
Figures
The variable impact of SVC.
The variable impact of three techniques.
On the other hand, the result of SVC are shown in Figure
However, our proposed method has the limitation that the range of the variable impact varies greatly with the number of trees. The drawback comes out when
Figure
A boxplot graph of variable impacts with three variables on MDI technique.
A boxplot graph of variable impacts with three variables on MDA technique.
However, Figure
A boxplot graph of variable impacts with three variables on SVC technique.
Therefore, if there is a significant change in the combination of candidate variables, the range of variable impact can fluctuate. In terms of the influence of
Nevertheless, the contribution of this research is that, even though the range of variable impact fluctuates more than with MDI or MDA, SVC can be judged more reliable with respect to the relative relationships between variables. Our proposed method significantly reduced the bias introduced by MDI and MDA in the variable impact calculation.
In this experiment, we use a real dataset which is named Boston Housing Data [
Figure
The graph of variable impact measurement of random forest regression.
However, this ranking does not account for the impact of correlated variables. For example, there is the phenomenon called multicollinearity. Multicollinearity means that two or more input variables are highly correlated, so that the impact of those variables is overestimated. Since this phenomenon distorts the relative importance of the input variables for the prediction target, we need to minimize the possibility of multicollinearity.
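One simple way to screen for candidate multicollinearity is to look at pairwise Pearson correlations between input variables. The sketch below is illustrative only: the column names and numbers are toy data, the 0.8 threshold is an arbitrary assumption, and pairwise correlation is a rough screen rather than a complete multicollinearity diagnostic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for index, a in enumerate(names):
        for b in names[index + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy columns, for illustration only.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],  # roughly twice "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to "a"
}
pairs = correlated_pairs(columns)
print(pairs)
```

Variables flagged this way are candidates for closer inspection before their measured impacts are compared.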
In this dataset, there is a high correlation among NOX, INDUS, and TAX. INDUS is the proportion of non-retail business acres per town, and NOX is the nitric oxides concentration. It can be inferred that INDUS and NOX are positively correlated: as the proportion of industrial area increases, the nitric oxides concentration also increases. The tax rate also increases as INDUS increases. Therefore, INDUS, NOX, and TAX are highly correlated.
However, the measured impact of those correlated variables is relatively high. Even though those variables receive smaller contributions than LSTAT or RM, the ranking must be reliable in order to support a reliable decision. It is more efficient to use only one of the variables that are correlated with each other, and eliminating unnecessary variables according to the variable impact ranking reduces dimensionality. To resolve this problem, we use our proposed method. The experimental steps are as follows.
First, we compare the prediction accuracy of the random forest regression tree before and after randomly permuting a certain variable, so that the marginal contribution of that variable can be calculated. Second, we construct coalitions, treating individual variables as players in a cooperative game, by connecting the contributions of specific variables. Each variable has its own payoff according to its joint contribution with each coalition. Third, we assigned a power set as a coalition of
We implemented MDI and MDA in Python to compare their variable impact measurements with those of the Shapley Value-based method proposed in this research. In MDA, we shuffle the dataset 10 times to permute the variables randomly. We used a cross-validation technique that permutes the variables randomly for comparison.
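A permutation-based (MDA-style) importance with 10 shuffles per variable can be sketched as below. This is not the authors' implementation: `model` is a hypothetical stand-in for a fitted regressor, the data are synthetic, importance is measured as the mean increase in mean squared error, and for brevity the same data are used for scoring instead of cross-validation folds.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions on rows X with targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column is shuffled n_repeats times."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled copy.
            X_perm = [row[:j] + [column[k]] + row[j + 1:] for k, row in enumerate(X)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted regressor: it only uses feature 0.
def model(row):
    return 3.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]  # synthetic inputs
y = [3.0 * row[0] for row in X]                          # synthetic targets
imp = permutation_importance(model, X, y)
print(imp)
```

Shuffling the feature that the model ignores leaves the error unchanged, so its importance is zero, while shuffling the feature the model depends on increases the error.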
Tables
A comparison of variable impact measuring technique on Boston Housing Data (
Variable  MDI  MDA  SVC
CRIM  0.0164  0.012  0.0701 
ZN  0.0015  0.0009  0.0240 
INDUS  0.052  0.0206  0.007 
CHAS  0.0009  0.0001  0 
NOX  0.0119  0.0065  0.057 
RM  0.5842  0.803  0.1705 
AGE  0.0307  0.0118  0.0718 
DIS  0.0177  0.0108  0.0733 
RAD  0.003  0.0013  0.0678 
TAX  0.017  0.0118  0 
PTRATIO  0.0387  0.0203  0.0889 
B  0.0123  0.0021  0.0648 
LSTAT  0.2138  0.2404  0.3555 
A comparison of variable impact measuring technique on Boston Housing Data (
Variable  MDI  MDA  SVC
CRIM  0.0187  0.0125  0.0954 
ZN  0.0034  0.0008  0.0620 
INDUS  0.0547  0.02  0 
CHAS  0.0008  0.0001  0.0069 
NOX  0.0209  0.0077  0.0887 
RM  0.5396  0.7758  0.4954 
AGE  0.0252  0.0138  0.0913 
DIS  0.019  0.0105  0.0104 
RAD  0.0065  0.0013  0.0972 
TAX  0.0173  0.0118  0.0996 
PTRATIO  0.0252  0.0178  0.0286 
B  0.0107  0.0024  0.0958 
LSTAT  0.2579  0.2337  0.3652 
A comparison of variable impact measuring technique on Boston Housing Data (
Variable  MDI  MDA  SVC
CRIM  0.0213  0.012  0.0144 
ZN  0.0058  0.001  0.0023 
INDUS  0.0647  0.0173  0 
CHAS  0.0011  0.0001  0.0202 
NOX  0.0174  0.0062  0.0172 
RM  0.5255  0.7853  0.483 
AGE  0.0285  0.0126  0.021 
DIS  0.0204  0.0097  0.0226 
RAD  0.0045  0.001  0.02 
TAX  0.0169  0.0119  0.0075 
PTRATIO  0.0466  0.0162  0.0171 
B  0.0097  0.0015  0.0192 
LSTAT  0.2378  0.2258  0.982 
The value of SVC with
As we mentioned before, RM and LSTAT are notably the most important variables in Boston house price prediction. Regardless of the number of trees trained by the random forest, RM and LSTAT hold the highest rankings. With our proposed method, Shapley Value-based Calculation (SVC), the highest rankings remain the same as in MDI and MDA.
We find that our proposed method addresses the multicollinearity issue that biases variable impact measurement in MDI and MDA. Table
On the other hand, our proposed method reduces the possible multicollinearity problem. When
The result of the Shapley Value-based method is that the variable impact of CHAS is zero. CHAS refers to the Charles River dummy variable. This is consistent with MDI and MDA, where CHAS has the lowest variable impact. It is therefore possible to eliminate CHAS as a variable that does not contribute to the prediction. That a null player with no contribution receives no payoff follows from the axioms of the Shapley Value.
Figure
The boxplot of variable impacts in MDI.
The variable impacts with total variables
The variable impacts with minor variables
The boxplot of variable impacts in MDA.
The variable impacts with total variables
The variable impacts with minor variables
The boxplot of variable impacts in SVC.
The variable impacts with total variables
The variable impacts with minor variables
However, the experiment reveals a limitation of our proposed method: SVC provides much less stable variable impacts than the other techniques. The fluctuation in the range of variable impact seems to blur the experimental result. Nevertheless, on average, the variable impacts of SVC reflect variable importance better than the other techniques.
Table
Coherent ranking based on SVC.
Rank  Variable coalition  Value 

1 

2.6902 
2 

0.7613 
3 

0.5866 
4 

0.5694 
5 

0.4391 
6 

0.3147 
7 

0.2797 
8 

0.2589 
9 

0.2438 
10 

0.2247 
In this paper, we proposed a method to measure the influence of variables using the Shapley Value in the random forest algorithm. One of the existing methods for measuring the classification impact of variables is the Mean Decrease Impurity technique, which uses the Gini coefficient to determine the influence of variables through the reduction of data impurity. The other is the Mean Decrease Accuracy method, which measures the influence on classification by calculating the difference in prediction accuracy after permuting a variable. Both indices are commonly used to measure the classification impact of variables on real data.
Our proposed approach performs better than other approaches for two main reasons. First, our approach tries to solve the multicollinearity problem found in other techniques. In the previous approaches, the variable impact calculation was less accurate because of correlated variables. In this paper, we proposed a Shapley Value-based approach so that the payoff is fairly distributed among variables according to their contributions. Second, our approach considers not only the impact of individual variables but also the impact of groups of variables. There are synergies between variables that perform effectively when they are combined. Previous approaches did not consider the impact of groups. In this paper, we consider the impact of groups so that we can inspect the synergies between variables.
Through this research, we have made the following three contributions. First, this paper presents the problems of existing techniques for measuring the influence of variables on classification with random forests and tries to solve them by incorporating the Shapley Value from economic theory. Although the Shapley Value has been applied to a variety of machine learning and data mining algorithms, this is the first study to incorporate it to measure the classification impact of variables in random forests. Second, the proposed method yields the priority of the variables that affect the accuracy of the classification result. This priority can be used to improve the accuracy of random forest prediction. Finally, this research improves the interpretability of the black box model. The interpretation of variable importance is critical in classification problems. Our proposed method is suitable for measuring variable impact in black box models such as random forests. Furthermore, the algorithm is applicable in a mobile cloud computing environment.
In future work, we will conduct experiments with several different datasets. Moreover, we will study how to reduce the computational complexity in order to improve the performance of Shapley Value-based variable impact measuring techniques.
Player willing to participate in cooperative game
Total set of
Coalition of the players who share common payoff
A payoff that players gain from the coalition.
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (=1 if tract bounds river; 0 otherwise)
NOX: Nitric oxides concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property-tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in $1000s.
The authors declare that they have no conflicts of interest.
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (no. R0115161009, Development of Smart Learning Interaction Contents for Acquiring Foreign Languages through Experiential Awareness).