Predicting Bank Operational Efficiency Using Machine Learning Algorithm: Comparative Study of Decision Tree, Random Forest, and Neural Networks



Introduction
The financial crisis that hit Ghana from 2015 to 2018 has raised various issues with respect to the efficiency of banks and the safety of depositors' funds in the banking industry. To mitigate this crisis, the central bank (Bank of Ghana) instituted measures to reform the banking sector. The sole aim is to provide efficient banking services to the Ghanaian economy and to make the local banks globally competitive. First, the government decided to avoid closing distressed banks by recapitalizing them. For instance, the central bank directed a state-owned bank (Ghana Commercial Bank) to take over UT Bank and Capital Bank in 2017 [1]. Second, the Bank of Ghana also avoided bank closures, in order to protect depositors' funds, by merging five distressed banks, namely, Unibank, Beige Bank, Construction Bank, Royal Bank, and Sovereign Bank [2]. This consolidation process was envisaged to restore the financial viability of the distressed banks. Finally, the central bank raised the minimum capital requirement of all commercial banks in Ghana to 400 million Ghana Cedis [3]. As part of measures to improve the banking sector and restore customers' confidence, efficiency and performance analysis in the banking industry has become a hot issue.
This is because bank managers and other stakeholders want to detect and mitigate the underlying causes of inefficiencies within their banking operations. As nonparametric models, DEA models share some similarities with machine learning algorithms. For example, neither DEA nor machine learning algorithms make assumptions about the functional form that links inputs to outputs. A bank branch's efficiency is also a comprehensive measure that draws on various performance aspects and many financial variables [4]. This indicates that the relationship between bank efficiency and multiple variables is highly complex and not straightforward. Machine learning algorithms have also been viewed as a good tool for approximating numerous nonparametric and nonlinear problems [5]. This means that the banking industry provides good opportunities for applying combined DEA and machine learning models. There is also little literature on developing-country bank branch efficiency using DEA and machine learning algorithms. This paper presents a combination of DEA and three machine learning approaches for evaluating bank efficiency and performance using 444 Ghanaian bank branches. The results were also compared with the corresponding efficiency ratings obtained from CRS DEA. Finally, the prediction accuracies of the three machine learning models were compared. The motivation behind this study is the fact that the unit-invariance property of DEA is similar to the scale preprocessing required by machine learning algorithms such as NNs. This supports the rationale for comparing the results of pure DEA with those of the combined DEA-machine learning models.
The rest of the paper is organized as follows. Section 2 gives a brief review of related works on the topic. Section 3 presents the methodology and the framework used in the study. Section 4 gives the results of both DEA and the three machine learning algorithms, together with analysis and further discussion. Finally, Section 5 presents our conclusions, recommendations, and future work suggested by the study.
The study suggested that, for a collection of IT investments, IT had a substantial impact on organizations' revenues. In 2004, [20] also used DEA to assess the efficiency of 27 banks and suggested a positive impact of IT on the banks' efficiency. Chen et al. [21] also studied 27 bank DMUs and found only three firms to be efficient in both efficiency calculation phases. Another work [22] assessed the performance of 40 Internet companies using a DEA model which, according to the authors, works well and can also be used to differentiate the causes of inefficiency. A study [23] attempted to evaluate the impact of ICT on the productivity of hotels in Portugal through Data Envelopment Analysis (DEA). The study not only demonstrated how important ICT is in realizing advanced levels of productivity; it also discussed other explicit concerns that should be taken into consideration so that positive returns on ICT investment can be achieved. Comparatively, DEA is a better way to arrange and evaluate data since it allows efficiency to change over time and requires no prior assumption on the specification of the best practice frontier [24]. It has also been reported in the literature [24] that DEA is a prominent method for performance analysis in the banking industry. However, the DEA frontier is very sensitive to the presence of outliers and statistical noise, and it can hardly be used to predict the performance of other decision making units [24]. As a result, studies have recently started introducing machine learning as a good substitute to support decision makers in approximating efficiency frontiers [24]. For example, [25] demonstrated how a machine learning algorithm, such as a decision tree, can be combined with DEA to predict the impact of IT on firms' performance. In another work [26], the authors used three decision tree algorithms, namely, C5.0, C4.5, and CART, to build decision tree predictive models. The study suggested that the C5.0 algorithm gave an accuracy of 100%, followed by the CART algorithm with an accuracy of 84.6% and, finally, the C4.5 algorithm with an accuracy of 83.34% on average. The study, therefore, recommended the C5.0 predictive model for predicting the financial performance of rural banks in Ghana. Chen et al. [27] also applied an innovative Data Envelopment Analysis method under a stochastic environment.
The results of the study reveal that the overall efficiency level of the Chinese banks remains low. This, according to the authors, was considerably determined by the contextual variables of the ownership structure and cost structure of the Chinese banks [27]. Another study [28] used the Synthetic Minority Oversampling Technique (SMOTE) to convert imbalanced data into a balanced form. The authors used Lasso regression to remove redundant features from the failure prediction model. The results of this study are applicable to various stakeholders, such as shareholders, lenders, and borrowers, for measuring the financial stress on banks [28]. Another work [29] found very similar performance for random forest and logistic regression, with random forest slightly superior. Both models yield an AUC of ∼0.65, and the results indicate that they are able to correctly predict ∼60% of both healthy and financially distressed companies ahead of time [29]. A study [30] also compared the accuracy of two approaches, traditional statistical techniques and machine learning techniques, in an attempt to predict the failure of 3000 US banks. The empirical results of the study reveal that the artificial neural network and K-nearest neighbor methods were the most accurate. Finally, [4] utilized Multinomial Logistic Regression to select the most significant predictor variables for building a neural network model and suggested that the models in each case yielded a favorable classification and prediction accuracy rate.

Basics of Data Envelopment Analysis (DEA)
Data Envelopment Analysis is a nonparametric method that produces a comparative ratio of weighted outputs to weighted inputs for each Decision Making Unit (DMU) under consideration [31][32][33]. This study presumes that there are n DMUs to be evaluated, in this case n = 444. Each DMU consumes m different inputs.
Specifically, DMUj consumes the amount $x_{ij}$ of input i and generates a quantity $y_{rj}$ of output r. The study further assumes that $x_{ij} > 0$ and $y_{rj} > 0$. The input-oriented efficiency of a specific DMU0 under the assumption of Variable Returns to Scale (VRS) can be derived from the following primal-dual linear programs, the BCC model proposed by [31]. The BCC Envelopment model is as follows:

$$\min_{\theta,\lambda,s^{+},s^{-}} z_0 = \theta - \varepsilon\,\vec{1}s^{+} - \varepsilon\,\vec{1}s^{-} \quad \text{subject to} \quad Y\lambda - s^{+} = Y_0,\quad \theta X_0 - X\lambda - s^{-} = 0,\quad \vec{1}\lambda = 1,\quad \lambda, s^{+}, s^{-} \ge 0, \tag{1}$$

where $s^{+}$ and $s^{-}$ are the slacks in the system. The BCC Multiplier form is as follows:

$$\max_{\mu,v,u_0} w_0 = \mu^{T}Y_0 + u_0. \tag{2}$$

This is subject to $v^{T}X_0 = 1$ and

$$\mu^{T}Y - v^{T}X + u_0\vec{1} \le 0,\quad \mu^{T} \ge \varepsilon\vec{1},\quad v^{T} \ge \varepsilon\vec{1},\quad u_0 \text{ free}. \tag{3}$$

Here X and Y are the matrices of observed inputs and outputs, $X_0$ and $Y_0$ are the input and output vectors of DMU0, and $\varepsilon$ is a small positive constant. Determining the DEA efficiencies involves solving n linear programs of the above form, one for each DMU. The optimal value of the variable θ determines the proportional reduction of all inputs for DMU0 that will move it onto the frontier, which is the envelopment surface defined by the efficient DMUs in the sample. DMU0 is DEA efficient if there exists an optimal solution $(\mu^{*}, v^{*})$ of (2)-(3) with $\mu^{*} > 0$ and $v^{*} > 0$, and an optimal solution $(\theta^{*}, \lambda^{*})$ of (1), such that

$$z_0^{*} = w_0^{*} = 1, \tag{4}$$

where $z_0^{*}$ is the optimal value of the BCC Envelopment form and $w_0^{*}$ is the optimal value of the BCC Multiplier form.
From here on, an asterisk (*) denotes an optimal value. The condition $\mu^{*} > 0$ and $v^{*} > 0$ ensures that DMU0 lies on the efficient frontier and that the slack values of all constraints in (1) are 0, by the complementary slackness property of dual programs. This model allows Variable Returns to Scale (VRS), as proposed by [31].
If the convexity constraint ($\vec{1}\lambda = 1$) in (1) and the variable $u_0$ in (2) and (3) are removed, the feasible region of (1) is enlarged, which decreases the number of efficient DMUs. All DMUs are then taken to operate at Constant Returns to Scale (CRS), and the resulting DEA model is the CCR model proposed by [32].
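To make the CRS case concrete, the sketch below covers only the special case of one input and one output, where the CCR program has a closed form: each DMU's score is its output/input ratio divided by the best ratio in the sample. This is an illustration under that simplifying assumption, not the multi-input BuildHull procedure used in the study; the function name and the toy figures are hypothetical.

```python
from typing import List, Sequence


def ccr_efficiency_1x1(inputs: Sequence[float], outputs: Sequence[float]) -> List[float]:
    """Input-oriented CCR (CRS) scores for DMUs with ONE input and ONE output.

    In this special case the linear program collapses to a ratio:
    score_j = (y_j / x_j) / max_k (y_k / x_k).
    """
    if len(inputs) != len(outputs) or len(inputs) == 0:
        raise ValueError("inputs and outputs must be non-empty and equally long")
    if any(x <= 0 for x in inputs) or any(y <= 0 for y in outputs):
        raise ValueError("DEA assumes strictly positive inputs and outputs")
    productivity = [y / x for x, y in zip(inputs, outputs)]
    best = max(productivity)
    return [p / best for p in productivity]


# Hypothetical toy data: number of employees (input) and deposits (output).
employees = [10.0, 20.0, 40.0]
deposits = [50.0, 80.0, 100.0]
scores = ccr_efficiency_1x1(employees, deposits)
```

With several inputs or outputs no such shortcut exists, and one linear program per DMU has to be solved, as described above.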

Decision Tree Algorithm.
The decision tree is one of the leading machine learning algorithms; it provides a graphical or diagrammatic illustration of a technique for classifying, predicting, and evaluating an item of interest. It is an easy and commonly used classification method. It supports decision analysis by employing a tree-like structure of decisions and their potential outcomes [34]. At each node in a decision tree, an attribute must be selected to divide the node's instances into subgroups.
A decision tree accepts a set of labeled data as input and produces as output a tree in which each end node (leaf) is a decision (a class) and each nonend (internal) node represents a test [35]. The most common decision tree algorithms are ID3, CART, CHAID, and C4.5 with its extension C5.0. For this study, C5.0, which is an extension of ID3 and the successor of the C4.5 algorithm proposed by Ross Quinlan in 1994 [36], was adopted and implemented in R studio using R code with the C5.0 package [37].
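The attribute selection at each node can be illustrated with information gain, the entropy-based criterion used by ID3; the C4.5 family refines it with the gain ratio. The sketch below is a minimal illustration of that idea for one numeric attribute and a candidate threshold. It is not the C5.0 implementation, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values: Sequence[float], labels: Sequence[Hashable],
                     threshold: float) -> float:
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical attribute values with efficiency classes (A: efficient, B: inefficient).
attribute = [1.0, 2.0, 3.0, 4.0]
classes = ["B", "B", "A", "A"]
gain_at_2 = information_gain(attribute, classes, threshold=2.0)
gain_at_1 = information_gain(attribute, classes, threshold=1.0)
```

A tree builder would evaluate such a score for every candidate attribute and threshold and split on the best one; here the threshold of 2.0 separates the two classes perfectly and therefore scores higher than 1.0.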

Random Forest Algorithm.
The random forest (RF) algorithm, developed by Breiman in 2001 and cited by [38], is a classification and prediction algorithm that utilizes an ensemble of classification trees [39][40][41]. RF is an ensemble machine learning algorithm [39]. The fundamental principle of the RF algorithm is that constructing a small DT with limited features is computationally inexpensive [39]. Thus, it is possible to construct numerous small, weak decision trees in parallel and merge them into one strong learner, either by averaging their predictions or by selecting the most popular class. In terms of application and practicability, RF algorithms are considered to be among the most accurate learning algorithms to date [39].
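The merging step described above can be sketched as a majority vote over weak learners. The example below uses hand-written one-feature "stumps" in place of trained trees, so it shows only the voting, not the bootstrap sampling or random feature selection of Breiman's algorithm; all names, thresholds, and feature values are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

Stump = Callable[[Sequence[float]], str]


def make_stump(feature_index: int, threshold: float,
               above: str = "A", below: str = "B") -> Stump:
    """Return a one-split classifier: class `above` if the feature exceeds the threshold."""
    def predict(row: Sequence[float]) -> str:
        return above if row[feature_index] > threshold else below
    return predict


def majority_vote(trees: Sequence[Stump], row: Sequence[float]) -> str:
    """Combine the weak learners by returning the most frequently predicted class."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical stumps, each looking at a different feature.
forest = [make_stump(0, 0.5), make_stump(1, 0.3), make_stump(2, 0.9)]
prediction_1 = majority_vote(forest, (0.7, 0.2, 0.95))  # votes: A, B, A
prediction_2 = majority_vote(forest, (0.1, 0.2, 0.95))  # votes: B, B, A
```

An odd number of voters is used so that a two-class vote cannot tie.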
The RF algorithm adopted for this study was Leo Breiman and Adele Cutler's random forest algorithm [39,42]. This was implemented in R studio using R code with the "randomForest" package [43,44]. A random forest model has good modeling and predictive ability. An important feature of Breiman's algorithm, according to [45], is the variable importance calculation.

Artificial Neural Network.
An artificial neural network (ANN) is a type of Artificial Intelligence (AI) technique that mimics the behavior of the human brain [46,47]. A neural network is a massively parallel distributed processor made up of simple processing units that have a natural tendency for storing experiential knowledge and making it available for use. ANNs can be grouped into two major categories: feed-forward and feedback (recurrent) networks. In the former, no loops are formed by the network connections, while one or more loops may exist in the latter. The most commonly used family of feed-forward networks is a layered network in which neurons are organized into layers with connections strictly in one direction from one layer to another [48].
The basic NN system without a hidden layer consists of only two layers: the input layer and the output layer. This is normally called the skip layer because it amounts to straightforward linear regression modeling in an NN design. The input layer communicates directly with the output layer without involving a hidden layer. This study adopted the backpropagation algorithm for building the neural network model for predictions. Linear combination functions and sigmoid transfer functions were used. The S-shaped or binary sigmoidal function is, by far, the most common transfer function [49].
The formula for the sigmoid is given by

$$f(x) = \frac{1}{1 + e^{-x}}.$$

The code for building the NN model was written in R using the RMiner studio with the "neuralnet" package [50]. This study used only one hidden layer with five (5) neurons in a three (3) layer network. This number of hidden neurons was chosen based on the equation $N_h = N_i - 1$ proposed by [51] and cited by [52], where $N_h$ is the number of hidden neurons to be used and $N_i$ is the number of input neurons. For this study, the number of inputs was 6, which implies that $N_i = 6$ and hence $N_h = 5$.
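The sigmoid transfer function and the hidden-layer sizing rule can be sketched as follows, together with a forward pass through a network with 6 inputs, 5 hidden neurons, and 1 output. This is a minimal illustration of the architecture only; training by backpropagation is not shown, and the function names and the all-zero placeholder weights are assumptions made for the example.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Binary sigmoid f(x) = 1 / (1 + e^(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_neurons(n_inputs: int) -> int:
    """Sizing rule N_h = N_i - 1 adopted in the study."""
    return n_inputs - 1


def forward(inputs: Sequence[float],
            hidden_weights: Sequence[Sequence[float]],
            hidden_biases: Sequence[float],
            output_weights: Sequence[float],
            output_bias: float) -> float:
    """One forward pass: linear combination followed by a sigmoid at each layer."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


n_in = 6
n_hidden = hidden_neurons(n_in)
# Placeholder (untrained) parameters: every weight and bias set to zero.
zero_output = forward(
    inputs=[0.2] * n_in,
    hidden_weights=[[0.0] * n_in for _ in range(n_hidden)],
    hidden_biases=[0.0] * n_hidden,
    output_weights=[0.0] * n_hidden,
    output_bias=0.0,
)
```

Because the sigmoid saturates for large inputs, the predictor variables are normally rescaled to a common range before training.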

List of Performance Measures.
There are many metrics for evaluating machine learning algorithms, but for the purpose of this study, we focus on the following:
Classification Accuracy. Classification accuracy is what is usually meant by the term accuracy in machine learning performance measurement [53]. Mathematically, it is defined as the ratio of the number of correct predictions made by the machine learning algorithm to the total number of predictions made:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions made}}.$$
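The accuracy ratio can be computed directly from the actual and predicted class labels. The short sketch below uses a hypothetical function name and toy labels.

```python
from typing import Hashable, Sequence


def accuracy(actual: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Share of predictions that match the actual class labels."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equally long")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels: three of the four predictions are correct.
toy_accuracy = accuracy(["A", "A", "B", "B"], ["A", "A", "B", "A"])
```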

Methodology
At the data collection stage (Stage I), the raw data were collected from the banks. After collection, the data are preprocessed in Stage II before finally entering Stage III, where predictive model development takes place. It is during the preprocessing stage that the entire dataset is organized, transformed, or encoded into a form that can easily be used by the model. In this case, the financial data, such as the Cedi value of IT expenditure (I), fixed assets (A), total deposits (D), profit (R), rate of performing loans (%PL), and the number of employees (E) of the banks, were used to calculate the efficiencies of the various banks (DMUs) using the CRS technology. The efficiencies of the banks at both the deposit and investment stages were classified into classes (Class A: efficient and Class B: inefficient) based on their efficiency scores (an efficiency score of 1 unit or 100%). In real-life situations, it is very difficult for units or departments to attain 100% efficiency, and bank branches in Ghana are no exception [4].
This is also evident in the Ghanaian banking sector, where the central bank (Bank of Ghana) always sets a minimum capital requirement for banks and other financial institutions operating in Ghana [3,54,55]. This means that there is always a "cutoff point" that banks must meet in order to be efficient and remain competitive in the banking sector. Based on this, the authors considered banks with an efficiency value of 80% or more as efficient. The study, therefore, adopted the efficiency "cutoff point" (DMU efficiency ≥ 0.8) suggested by [4,25]. This efficiency classification (Class A and Class B) was used as the response variable. The banks' financial data, such as IT expenditure, fixed assets, etc., were then used as predictor variables to predict the efficiency class of each bank branch when building the models.
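The cutoff rule maps each DEA score to a class label. A minimal sketch follows; the function and constant names are hypothetical, while the 0.8 threshold is the one adopted in the study.

```python
EFFICIENCY_CUTOFF = 0.8  # cutoff point adopted in the study


def efficiency_class(score: float, cutoff: float = EFFICIENCY_CUTOFF) -> str:
    """Return "A" (efficient) if the DEA score meets the cutoff, else "B" (inefficient)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("a DEA efficiency score must lie in [0, 1]")
    return "A" if score >= cutoff else "B"


# Hypothetical scores for four branches.
labels = [efficiency_class(s) for s in (1.0, 0.8, 0.79, 0.45)]
```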
The efficiency classes and the predictor variables formed the dataset for our models. This final dataset was then randomly divided into two parts: 70% was used to train (build) and validate the model using K-fold cross-validation, and the remaining 30% (the test dataset) was used to test the models. Thus, in the case of the banks, 70% of the DMUs, selected randomly from the total dataset, were used to build and validate the model. This model was then used to predict the efficiency of the other 30% of banks (DMUs). During model building, the dataset also goes through rule extraction before the classifier is finally built.
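The random 70/30 split and the K-fold partition of the training part can be sketched with index lists. This is an illustration rather than the R code used in the study: the function names and the seed are hypothetical, and rounding the test size up is an assumption, chosen because it yields the 134 holdout branches reported for the 444 DMUs.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_test_split_indices(n: int, test_fraction: float = 0.3,
                             seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle 0..n-1 and hold back ceil(n * test_fraction) indices for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices: Sequence[int], k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) pairs; each index is validated exactly once."""
    if k < 2 or k > len(indices):
        raise ValueError("k must be between 2 and the number of samples")
    folds = [list(indices[i::k]) for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


train_idx, test_idx = train_test_split_indices(444)
splits = k_fold_indices(train_idx, k=10)
```

The test indices are set aside before any cross-validation, so the holdout branches never influence model building.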

Two-Stage DEA Model for Efficiency Analysis.
In this DEA model, the units under consideration were bank branches in Ghana, whose performance or productivity measures were grouped into inputs and outputs. Using the bank as an example to derive the model shown in Figure 2, the bank's business processes and activities are viewed as a dual-role process. The first stage (Deposit Stage) of the model consists of collecting funds (deposits, in Ghana Cedis) from customers, as an intermediate measure, using the bank's fixed assets, its workers (employees at each unit), and its IT infrastructure. In the next stage (Investment Stage), the banks use the deposits accumulated in stage I, together with their stage I efficiency scores, to invest in securities and to give loans to customers. Returns (profits) generated from the investment in securities and the percentage of performing loans [56][57][58], which is a good indicator of risk status, were used as the two outputs in stage II.
The DEA model used at the deposit stage (stage I) produces the stage I efficiency for each DMU, and the DEA model applied at the investment stage (stage II) gives the stage II efficiency for each DMU. The DEA model employed at the overall stage finally gives the overall efficiency for each DMU.
The proposed DEA framework for the banks' dual-role operation is depicted in Figure 2.
The variables used at the deposit stage, the investment stage, and the overall stage are described below. IT expenditure, the Cedi value of fixed assets, the number of staff at each branch, profits generated from investing deposits, the percentage of performing loans at the various bank branches, and the Cedi value of total deposits were obtained from the various DMUs. Specifically, the audited 2016 financial statements from each bank were used.
For each bank branch, the technical efficiencies in both stages and the corresponding overall efficiencies were analyzed using CCR DEA technology. The efficiency of stage I was calculated using IT expenditure (GH¢), fixed assets (GH¢), and the number of employees at each DMU as inputs, with deposits as the main output. For stage II, the stage I efficiency (Do) and the deposits (GH¢) realized from stage I were used as inputs, with the bank's profit after all necessary deductions and the percentage of performing loans as outputs.
The efficiency of the overall stage was calculated using fixed assets, the number of employees, IT expenditure, and the efficiency of stage II (No) as inputs, while the outputs were the bank's profit after all necessary deductions and the percentage of performing loans. The efficiency of each DMU at each stage was calculated using the DEA Two-Phase BuildHull Algorithm, which was implemented in Rminer studio Version 1.2-5 [59] with the Robust Data Envelopment Analysis (rDEA) package. The overall efficiency score of each DMU was categorized as either efficient (Class A) or inefficient (Class B).

Predictor Variables.
These variables, also called independent or experimental variables, are employed in statistical analysis to forecast or predict another variable, called the target or dependent variable [38,60,61]. For this study, the predictor variables were fixed assets, IT expenditure, number of employees, total deposits, percentage of performing loans, and profits accrued from investing the deposits.

Response Variables.
The overall efficiency scores of the bank branches, categorized as efficient (Class A) or inefficient (Class B), were used as the response variable for the predictive models.

Bank Efficiency Scores and Classes Using the Adapted DEA Two-Phase BuildHull Algorithm.
For each DMU (bank), the technical efficiencies in both stages (Deposit Stage and Investment Stage) and the corresponding overall efficiencies were analyzed using the CCR DEA Two-Phase BuildHull algorithm proposed by [62]. Using the 444 bank branches (DMUs), the efficiency of each bank branch at each stage was analyzed using the following scenario:
Scenario 1: When an efficient unit is defined as a unit with an efficiency score of 1 unit or 100%.
For banks' efficiency in terms of utilizing their resources to collect deposits from customers, only 14 (3.15%) bank branches were efficient (had 100% efficiency). Just 33 (7.43%) bank branches had an efficiency score of between 80% and 99%, 1 (0.23%) bank branch had an efficiency score of between 70% and 79%, 21 (4.73%) had an efficiency score of between 60% and 69%, 19 (4.28%) had between 50% and 59%, and finally 356 (80.18%) had an efficiency score below 50%. These 356 (80.18%) bank branches confirm that many Ghanaian banks are not efficient in using their resources to collect deposits from customers, as most banks were struggling to meet the minimum capital requirements set by the central bank (Bank of Ghana) in 2017 [3,54,55].
The result of the deposit stage efficiency is also shown in Figure 3.
For banks' efficiency with respect to investing customers' deposits, shown in Figure 4, only 1 bank (DMU200) was efficient in investing the deposits to generate profit, while only 1 bank (DMU219) had an efficiency score of between 80% and 100%. This result suggests that close to 99.5% of the Ghanaian bank branches considered in the study were not efficient in investing. This also confirms reports of crises that have hit the Ghanaian banking industry, with issues such as managers and boards of directors allegedly squandering depositors' money instead of investing it [63]. This has also led to about seven (7) universal banks collapsing in 2017 and 2018 [1,2]. For overall efficiency in the entire banking operations, also shown in Figure 5, 79 (17.79%) bank branches were efficient (had a 100% efficiency score), with the majority (290, representing 65.32%) having an efficiency score of between 80% and 99%. Four (0.9%) bank branches had an efficiency score of between 70% and 79%, 32 (7.21%) had an efficiency score of between 60% and 69%, and finally 39 (8.78%) branches had between 50% and 59%. In terms of overall efficiency, no bank branch had an efficiency score below 50%, as is evident in Figure 5.
This analysis means that even though most bank branches in Ghana do not achieve high efficiency in collecting deposits and investing them, they still enjoy high overall efficiency.
The results suggest that banks in Ghana should identify ways of improving their efficiencies in both the deposit stage and the investment stage and should not rely only on their overall efficiency scores as a means of measuring their performance and success.
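The reported shares follow directly from the branch counts. The sketch below recomputes them for the deposit stage and the overall stage and counts the branches that meet the 0.8 cutoff; the counts are those reported above, while the variable and function names are hypothetical.

```python
from typing import Dict

TOTAL_BRANCHES = 444

# Branch counts per efficiency band, as reported for each stage.
deposit_bands = {"100%": 14, "80-99%": 33, "70-79%": 1,
                 "60-69%": 21, "50-59%": 19, "<50%": 356}
overall_bands = {"100%": 79, "80-99%": 290, "70-79%": 4,
                 "60-69%": 32, "50-59%": 39, "<50%": 0}


def shares(bands: Dict[str, int], total: int = TOTAL_BRANCHES) -> Dict[str, float]:
    """Percentage of branches in each band, rounded to two decimals."""
    if sum(bands.values()) != total:
        raise ValueError("band counts do not add up to the number of branches")
    return {band: round(100 * count / total, 2) for band, count in bands.items()}


def count_meeting_cutoff(bands: Dict[str, int]) -> int:
    """Branches with a score of at least 80%, i.e. Class A under the cutoff."""
    return bands["100%"] + bands["80-99%"]


deposit_shares = shares(deposit_bands)
overall_shares = shares(overall_bands)
deposit_class_a = count_meeting_cutoff(deposit_bands)
overall_class_a = count_meeting_cutoff(overall_bands)
```

Under the cutoff, 47 branches are Class A at the deposit stage against 369 overall, which is the gap the discussion above draws attention to.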

Machine Learning Algorithm Results and Discussion.
In this study, machine learning algorithms were used in order to identify the best performing classification models. Three types of machine learning algorithms were employed: decision tree, random forest, and artificial neural network. To determine how accurate the models were with real-world data, we held back a subset of the dataset for testing purposes.
Thus, the dataset was split into 70% for training and validation and 30% for testing. For the performance analysis, the test dataset was used for assessment.

Comparative Analysis of the Machine Learning Algorithms.
To evaluate the performance of the three machine learning models used in the study, the overall accuracy, Kappa, sensitivity, specificity, and the level of significance (P value) were considered as evaluation measures. 10-fold cross-validation (CV) was applied to check for overfitting and to assess the performance of all predictive models. The mean values of the 10-fold CV for each measure are given in Figure 6, and the range of these values across all predictive models is also given. For the test dataset, the DT model performed better than the other two models, but the difference in measures between DT and RF was very small. However, NN had the lowest accuracy. After the analysis of the results of the three algorithms, the study suggested the following: for predicting the overall efficiency and performance of banks (where a DEA score of 0.8-1 is classified as efficient), the DT was the best. It gave the highest accuracy (100%) in predicting the overall efficiency of each bank, with a kappa value of 1 and a P value of 1.1e-11.
This was followed by RF with 98.5% accuracy, a kappa value of 0.95, and a P value of 0.00. The last algorithm in terms of prediction accuracy was the NN, which had an accuracy of 86.6% with a very low kappa value of −0.014 and a poor P value of 0.66 compared to the other two models.
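The NN result shows why accuracy alone can mislead on imbalanced classes: kappa corrects the observed agreement for the agreement expected by chance. The sketch below computes accuracy, kappa, sensitivity, and specificity from a two-class confusion matrix. The matrix itself is hypothetical: it assumes a holdout set of 117 efficient and 17 inefficient branches and a model that labels almost everything as efficient, chosen only so that the figures land near the reported 86.6% and −0.014. It is not the study's own confusion matrix.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Accuracy, Cohen's kappa, sensitivity, and specificity for a 2x2 confusion matrix.

    tp/fn: positive-class cases predicted correctly/incorrectly;
    fp/tn: negative-class cases predicted incorrectly/correctly.
    """
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Chance agreement: product of actual and predicted marginals, summed over classes.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    return {
        "accuracy": observed,
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical holdout of 134 branches, Class A (efficient) as the positive class.
nn_like = binary_metrics(tp=116, fn=1, fp=17, tn=0)
```

In this hypothetical matrix none of the inefficient branches is detected (specificity of 0), so the agreement is no better than chance even though most labels are correct.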

Conclusion
In this study, the authors combined DEA with three machine learning algorithms to analyze and predict the efficiency of bank branches in Ghana. DEA and its Two-Phase BuildHull algorithm were implemented in R studio using R code to assess the efficiency of the 444 bank branches at the deposit stage and the investment stage. The overall stage efficiency of each bank was also calculated and categorized as efficient (Class A) or inefficient (Class B) using the adopted "cutoff point" of 0.8 units or 80%.
This efficiency class designated by the CCR DEA was used as the response variable.
For the predictive models, we utilized three popular machine learning algorithms and compared them to each other using several performance metrics. Four hundred and forty-four (444) commercial bank branches in Ghana were involved in this study, where 70% of the bank branches were randomly selected to train and validate each of the three models. The proposed models were used to predict the efficiency of the remaining 30% of bank branches.
The best performing machine learning models (in terms of several performance measures) were determined using a holdout sample dataset.
The results suggested that the decision tree with its C5.0 algorithm correctly predicted all 134 branches in the holdout sample (30% of the banks). Thus the DT had an accuracy of 100% with a Kappa value of 1 and a P value of approximately 0.00 (1.1e-11), which shows how significant the DT model was.
The next best performing predictive model was the random forest algorithm, with a predictive accuracy of 98.5%, a kappa value of 0.95, and a P value of 0.00. Finally, the random forest model was followed by the neural network model, which predicted 116 (86.6% accuracy) out of 134 banks' efficiency classes correctly, but with a very low kappa value of −0.014 and a poor P value of 0.66 compared to the other two models.
Overall, the results of the study may have important implications for Ghanaian banks. In this analysis, we determined the efficiency of each bank at stage I (deposit efficiency), at stage II (investment efficiency), and, finally, overall. According to our analysis and findings, most banks (369, representing 83.1%) in Ghana were efficient in their overall banking operations using the "cutoff point." Even though many of these banks were efficient in their overall banking operations, their efficiency in collecting deposits (47 banks, representing 10.59%) and especially in investing the deposits (only 2 banks, representing 0.45%) was poor.
The study, therefore, suggests that bank managers and other stakeholders in Ghana take a second look at their efficiency and performance in collecting and investing deposits. This means that managers and other stakeholders should not depend only, or over-rely, on their overall efficiency. The study concluded that banks in Ghana can use the results of this study to predict their respective efficiencies. In particular, the study recommends the decision tree predictive model, as it was the best performing predictive model. Future studies can look at combining DEA with other leading machine learning algorithms to predict the efficiency of banks and compare the results with this study. Other factors that can also affect banks' efficiency and performance, such as the liquidity ratio, can be taken into consideration as predictor variables in future studies.

Proposed Framework of the Study.
The framework suggested by this study was used to build the predictive models. It consists of three stages: the data collection stage (Stage I), the data preprocessing stage (Stage II), and the predictive model development stage (Stage III). This means that the dataset for model development goes through three stages, Stage I, Stage II, and Stage III (Figure 1).

Figure 3: A graph showing the deposit efficiency scores of the 444 DMUs (the authors' construct).

Figure 5: A graph showing the overall efficiency scores of the 444 DMUs (the authors' construct).

Figure 6: The graph shows the performance of the three models (the authors' construct).
Analysis. Of the 134 banks (30%) that were used as the test dataset, the decision tree model predicted all of them correctly (100% accuracy) with a kappa value of 1 and a P value of 1.1e-11, which shows how significant the model was. The confusion matrix and detailed statistics of the prediction are shown as follows: