Automated Prediction of Employee Attrition Using Ensemble Model Based on Machine Learning Algorithms

Competent employees are a rare commodity for great companies. Retaining good, experienced employees is a persistent problem for company owners. Employee attrition can cost employers a great deal, as it takes considerable effort to compensate for the lost expertise and efficiency. For this reason, in this research, we present an automated model that can predict employee attrition based on different predictive analytical techniques. These techniques have been applied with different pipeline architectures to select the best champion model. Also, an autotuning approach has been implemented to find the best combination of hyperparameters for building the champion model. Finally, we propose an ensemble model for selecting the most efficient model subject to different assessment measures. The results show that no model can so far be considered ideal for every business context. Nevertheless, the chosen model was close to optimal for our requirements and adequately satisfied the intended goal.


Introduction
Currently, machine learning and data mining are among the most active research areas. Different data mining techniques are used in classification, clustering, and prediction [1,2]. Because of their importance, data mining and machine learning methods are applied in many fields, such as education, healthcare, banking, security systems, the mobile game industry, and human resource management [3,4]. Employee attrition is a drop in the number of workers of an organization, where the employees have left the business voluntarily or retired. In any organization, highly efficient employees are considered the most valuable asset [5]. Retaining the most marketable or high-performance employees is a big challenge in many organizations. The problem of employee turnover (attrition) has gained attention in many organizations because of its adverse effects, ranging from organizational performance and efficiency to disturbances in projects' progress and long-term growth strategies [6]. In fact, this problem forces organizations to spend more on human capital, recruitment, preparation, and development for the new staff [7].
For the reasons given above, organizations need to predict the level of attrition and keep their employees through more reasonable company policies and regulatory environments. The current research would help companies to know the level of satisfaction of their employees and obtain valuable information that helps control the attrition rate. In the current research, a machine-learning model founded on artificial neural networks and support vector machines was proposed to predict employee attrition and assist organizations in controlling the attrition rate. Section 2 of the paper offers a literature review of employee attrition and other prediction models using machine-learning methods. Section 3 describes the machine learning algorithms used in the proposed model. The dataset and experimental results of this study are discussed in Section 4. Lastly, the conclusion and future work are offered in Section 5. The main contribution of this work has several objectives. On the one hand, it addresses the challenge of employee attrition. On the other hand, it applies different machine learning techniques that create an ROI by helping enterprises understand the real causes of employee churn. Moreover, the proposed model can be used as an alert to the enterprise's human resource decision makers to prevent their employees from churning. In addition, it presents new outcomes supporting or opposing the current study and the other literature available on this particular domain.

Related Work
In this section, we present a literature survey of employee attrition models reported in previous research. Sisodia et al. [5] built a prediction model for the employee churn rate. They used five machine learning algorithms, namely linear support vector machine, C5.0 decision tree, k-nearest neighbor, Naïve Bayes classifier, and random forest; the random forest outperformed all other classifiers. Alao and Adeyemo [7] generated five different decision tree models and two rule sets. The output generated from both was used to develop a prediction model for new cases of employee attrition. Another study evaluating different machine learning algorithms for predicting employee attrition was presented by Zhao et al. [8]. Ten different algorithms were applied in that study to three different datasets. The datasets represent organizations with small-, medium-, and large-sized employee populations. The study concluded that no algorithm outperforms the others on the small dataset. On the medium dataset, extreme gradient boosting trees gave the highest accuracy, while on the large dataset, gradient boosting trees were the recommended algorithm. A prediction model for prioritizing the features with a high impact on employee attrition and its causes is presented by Yadav et al. [9]. They applied many machine learning techniques, and the decision tree achieved the highest accuracy in their experiment on experienced-employee data. In another study, Khare et al. [10] proposed logistic regression to develop a risk equation for predicting employee attrition based on the demographic data of separated and existing employees. Later, the same equation was applied to estimate attrition risk for the current workforce. The cluster with the highest risk was then examined to discover the reasons and help build a strategy for minimizing risk.
In another machine-learning-based employee attrition model, presented by Alduayj and Rajpoot [11], three experiments were conducted, each using three algorithms. The first experiment was on the original data, which was imbalanced. In this experiment, the SVM algorithm reported the best F1 score. In the second experiment, they applied an adaptive synthetic sampling method to overcome the class imbalance problem, and the performance of all methods improved. In the last experiment, they sampled the dataset manually, and this process led to lower performance. The study conducted by Zhu et al. [12] suggested multiple time series modeling techniques for identifying the best models to forecast employee turnover. Based on their statistical evaluation, they selected eight univariate models with acceptable $R^2$ values, with the dynamic regression model as the top prediction model. Fallucchi et al. [13] applied many machine learning techniques to predict the factors that may lead an employee to leave the company.
The Gaussian Naïve Bayes classifier gave the best recall, which reflects the classifier's ability to find the positive instances. A hybrid model for customer churn forecasting was given by Jamalian and Foukerdi [14]. In that model, the principal component analysis (PCA) algorithm was used for feature selection. The LOLIMOT and C5.0 algorithms were trained with feature sets of several sizes. The outputs of the classifiers were merged with weighted voting, and the hybrid model had a higher accuracy than the individual classifiers. Prediction models have also been presented in other fields, such as the model for paddy crop productivity in Arumugam's [15] study, in which the author proposed a plan for agriculture that may be of assistance to farmers. Table 1 summarizes the machine learning algorithms used in each of the works mentioned above.
The contribution of this work is to automate and support decision-making in an important problem in human resource management. Different predictive analytical techniques have been implemented with different pipeline architectures to select the best champion model to be deployed in the production environment. In addition, an autotuning technique is implemented to find the best combination of hyperparameters for building the champion model. Moreover, an ensemble model has been proposed to select the most efficient model subject to different assessment measures. Finally, the proposed models were measured and compared according to different assessment and fit statistic measures.

Proposed Model
Building a machine learning (ML) model in a real-world environment is performed through three phases: data, discovery, and deployment. The data phase is concerned with collecting the data, exploring the data, dividing the data, addressing rare-event issues in the case of an unbalanced dataset, managing missing values, handling extreme or unusual values, and finalizing the selection of essential features to be used by the model. The discovery phase tasks are to select an algorithm, improve the model, optimize the complexity of the model, and regularize and tune its hyperparameters. The deployment phase tasks are assessing the models, comparing the ML models, and scoring the champion ML model. The primary steps for predicting employee attrition in the proposed model are shown in Figure 1. Once the data is collected, it goes to the most important step in prediction models, the preprocessing step. In this step, different processes are carried out, such as imputation of the missing values of the dataset and feature transformations for skewed and high-kurtosis variables. Feature transformation helps the model generalize to new incoming data when the model is scored.
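As a rough illustration of the preprocessing step, the following Python sketch imputes missing values and transforms a skewed variable. The text does not state which imputation or transformation was applied, so median imputation and a log transform are assumptions made here for illustration, as are the function names and the sample values.

```python
import math
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def skewness(values):
    """Population skewness: third central moment over variance^(3/2).

    Assumes the values are not all identical (non-zero variance).
    """
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """log(1 + x) transform for non-negative, right-skewed variables."""
    return [math.log1p(v) for v in values]


# Made-up example values, not taken from the dataset used in the paper.
income = [1200.0, None, 1500.0, 1800.0, 20000.0]
filled = impute_median(income)
print(skewness(filled), skewness(log_transform(filled)))
```

The transform is fitted on nothing here, so it can be applied unchanged to new data at scoring time; an imputation value, by contrast, would normally be computed on the training partition only and reused.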

Material and Methods
We used a real dataset from the SAS (www.sas.com) library, containing 35 variables/columns, a mix of categorical and interval variables, and 1.5k rows. Table 2 shows the data preparation settings for the categorical and interval variables, including the interval cut-off threshold. If a numeric input has more levels than the interval cut-off, it is treated as an interval variable; otherwise, it is treated as nominal. The maximum class level threshold is used to reject a categorical variable if it has more class levels than the predefined threshold. If a variable has a higher percentage of missing values than the maximum percent missing threshold, the variable is rejected. The partitioning ratio is used for partitioning the dataset into training, validation, and testing partitions. The training dataset is used for preliminary model fitting. The validation data is used to find the sweet spot between overfitting and underfitting, that is, to optimize the complexity of the model. Validation data fine-tunes the models built on training data and determines whether additional training is required. The test dataset is used for a final evaluation of the model.
Stratified random sampling is used as the partitioning method. It first splits the population into small groups, or strata, whose members share similar characteristics with respect to the attrition target variable, and then samples from each stratum. Consequently, stratified sampling ensures that members of all subgroups are represented in every data partition.
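Stratified partitioning can be sketched in a few lines of Python. This is a minimal illustration, not the partitioning node used in the study: the 60/30/10 ratio, the function name, and the toy labels are assumptions, since the actual partitioning ratio is given only in Table 2.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.3, 0.1), seed=0):
    """Split row indices into partitions while preserving class proportions.

    labels    : target value of each row (e.g. "Yes"/"No" for attrition)
    fractions : share of each partition (train, validation, test); assumed values
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the stratum
        n = len(indices)
        start, cumulative = 0, 0.0
        for p, frac in enumerate(fractions):
            cumulative += frac
            # The last partition takes whatever is left in the stratum.
            end = n if p == len(fractions) - 1 else round(cumulative * n)
            parts[p].extend(indices[start:end])
            start = end
    return parts


# Toy target with a 20% event rate (illustrative only).
labels = ["Yes"] * 20 + ["No"] * 80
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```

Because each stratum is split separately, the rare "Yes" class keeps the same share in every partition, which a plain random split does not guarantee on small datasets.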

Proposed Model Technologies.
Various machine learning algorithms have been developed to learn from data referred to as training samples. The trained model analyzes new data and predicts the intended class. In this section, we describe the ML algorithms used for prediction.

Multilayer Perceptron Classifier (MLP).
The first paper describing how neurons might work was written by Warren McCulloch and mathematician Walter Pitts [16]. A multilayer perceptron is a feedforward artificial neural network model that maps input data to a set of suitable outputs. It has three kinds of layers, namely the input, hidden, and output layers. The input layer receives the signal to be processed. An MLP can have an arbitrary number of hidden layers between the input and output layers [17]. We used the backpropagation algorithm for training the MLP. Figure 2 shows a typical MLP neural network. The hidden layer is required for classifying datasets that are not linearly separable. The $j$-th output of a feedforward MLP with one hidden layer is as follows:

$$y_j(x) = \sum_{i} w^{(2)}_{ji}\, \phi_i(x) + b^{(2)}_j, \tag{1}$$

where $x$ is the input vector, $w^{(2)}_{ji}$ is the weight from hidden neuron $i$ to output neuron $j$, $b^{(2)}_j$ is the bias of the output neuron, and $\phi_i(x)$ is the output of hidden neuron $i$:

$$\phi_i(x) = f\Big(\sum_{k} w^{(1)}_{ik}\, x_k + b^{(1)}_i\Big), \tag{2}$$

where $f$ is the activation function, $w^{(1)}_{ik}$ is the weight from input $k$ to hidden neuron $i$, and $b^{(1)}_i$ is the bias of hidden neuron $i$.
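The forward pass of a feedforward MLP with a single hidden layer can be sketched as follows: each hidden neuron applies an activation to a weighted sum of the inputs plus its bias, and each output is a weighted sum of the hidden outputs plus the output bias. The sigmoid activation, the weight layout, and all names are assumptions for illustration; the text does not specify the architecture that was trained.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer MLP.

    x  : input vector
    w1 : w1[i][k] is the weight from input k to hidden neuron i
    b1 : b1[i] is the bias of hidden neuron i
    w2 : w2[j][i] is the weight from hidden neuron i to output neuron j
    b2 : b2[j] is the bias of output neuron j
    """
    hidden = [
        sigmoid(sum(w * xk for w, xk in zip(row, x)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + bias
        for row, bias in zip(w2, b2)
    ]


# Tiny hand-made network: 3 inputs, 2 hidden neurons, 1 output.
x = [0.5, -1.0, 2.0]
w1 = [[0.1, 0.2, -0.3], [0.0, -0.5, 0.4]]
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0]]
b2 = [0.2]
print(mlp_forward(x, w1, b1, w2, b2))
```

Backpropagation would adjust `w1`, `b1`, `w2`, and `b2` by propagating the loss gradient backwards through these same two steps; only the forward computation is shown here.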

Random Forest (RF).
A random forest is an ensemble classifier of decision trees built with two sources of randomization. First, each decision tree is trained on a bootstrap sample, that is, a random sample drawn with replacement from the original data and of the same size as the training dataset [18]. About 37% of the instances in each bootstrap sample are expected to be duplicates, because a given instance is left out of a sample of size $N$ with probability $(1 - 1/N)^N \approx e^{-1} \approx 0.37$, so about 37% of the original instances are missing and their places are taken by repeats. Attribute sampling is the second source of randomization. At each node split, a small subset of the input variables is chosen at random, and the best split is searched for only within that subset. The value suggested by Breiman [19] for this hyperparameter is $\lfloor \log_2(\text{no\_of\_selected\_features}) + 1 \rfloor$. For classification, the ensemble's final prediction is determined by majority voting. One advantage of random forest is that it is practically hyperparameter-free, or at the very least, the default hyperparameter setting performs well on average [20]. Other hyperparameters that can be tuned are those that govern the depth of the decision trees.
Overall, in a random forest, decision trees can grow until all their leaves are pure. The growth of a tree can be constrained by requiring a minimum number of instances in each node before or after the split, or by imposing a maximum depth [21].
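The two randomization sources can be checked numerically with a short Python sketch. It simulates one bootstrap sample to estimate the share of original instances that are left out, and evaluates the rule for the number of attributes tried at each split. The function names and the feature count of 34 are illustrative assumptions.

```python
import math
import random


def features_per_split(n_features):
    """Number of attributes tried at each split: floor(log2(n_features) + 1)."""
    return math.floor(math.log2(n_features) + 1)


def left_out_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the share of
    original instances that never appear in it."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    return 1 - len(set(sample)) / n


print(features_per_split(34))      # 34 input features is an assumed example
print(left_out_fraction(10_000))   # close to exp(-1)
print(math.exp(-1))
```

The left-out instances are the ones usually called out-of-bag; their number equals the number of repeated draws in the sample, which is why the same figure of roughly 37% describes both.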

Gradient Boosting (GB).
Gradient boosting is a regression algorithm similar to boosting [22]. The goal of gradient boosting on a given training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ is to find an approximation, $\hat{F}(x)$, of the function $F^{*}(x)$ that maps instances $x$ to their corresponding output values $y$, by minimizing the expected value of a given loss function $L(y, F(x))$. GB builds an additive approximation of $F^{*}(x)$ as a weighted sum of functions:

$$F_m(x) = F_{m-1}(x) + \rho_m h_m(x), \tag{3}$$

where $\rho_m$ is the weight of the $m$-th function, $h_m(x)$. These functions are the models of the ensemble. The approximation is built iteratively. First, a constant approximation of $F^{*}(x)$ is obtained as follows:

$$F_0(x) = \arg\min_{\alpha} \sum_{i=1}^{N} L(y_i, \alpha). \tag{4}$$

The subsequent models are required to minimize

$$(\rho_m, h_m(x)) = \arg\min_{\rho,\, h} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i)\big). \tag{5}$$

Each $h_m$ can be thought of as a greedy step in a gradient descent optimization for $F^{*}$. To accomplish this, every model $h_m$ is trained on a new dataset $D = \{(x_i, r_{mi})\}_{i=1}^{N}$, where the pseudoresiduals, $r_{mi}$, are obtained by the following:

$$r_{mi} = -\left[\frac{\partial L(y_i, F(x))}{\partial F(x)}\right]_{F(x) = F_{m-1}(x)}, \tag{6}$$

and the value of $\rho_m$ is then calculated by solving a line search optimization problem [21].
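The iterative procedure can be made concrete with a small Python sketch. To keep it short it assumes the squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, for which the pseudoresiduals are simply $y_i - F_{m-1}(x_i)$ and the line search has a closed form, and it uses one-dimensional regression stumps as base models. These choices, the function names, and the toy data are assumptions for illustration and not the configuration used in the study, which addresses a classification target.

```python
from statistics import mean


def fit_stump(xs, rs):
    """Least-squares regression stump on 1-D inputs, fitted to the pseudoresiduals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split thresholds
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:  # all inputs identical: fall back to a constant
        const = mean(rs)
        return lambda x: const
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def gradient_boost(xs, ys, n_rounds=20):
    """Additive model F_m = F_{m-1} + rho_m * h_m under squared-error loss."""
    f0 = mean(ys)  # constant minimizer of the squared-error loss
    models = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        denom = sum(v * v for v in hx)
        if denom == 0:  # nothing left to fit
            break
        # Closed-form line search for squared error.
        rho = sum(r * v for r, v in zip(residuals, hx)) / denom
        models.append((rho, h))
        preds = [p + rho * v for p, v in zip(preds, hx)]
    return lambda x: f0 + sum(rho * h(x) for rho, h in models)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (illustrative only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
model = gradient_boost(xs, ys)
print(mse(model, xs, ys))
```

For other losses, such as the log loss used with binary targets, only the residual computation and the line search change; the loop structure stays the same.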

Ensemble Model.
Ensemble methods build several models and combine them to produce improved results. In majority voting ensembles, every model makes a prediction for each test instance, and the final prediction is the class that receives the majority of the votes. An ensemble can also produce a new model by taking a function of the posterior probabilities (for class targets) or the predicted values (for interval targets) from several models. The algorithm used in majority voting works as follows:
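A minimal Python sketch of the two combination rules mentioned in this section is given below: majority voting over predicted classes, and averaging of posterior probabilities followed by a cut-off. The function names, the 0.5 cut-off, and the example predictions are assumptions for illustration; the averaging function is only one possible function of the posteriors.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine class predictions from several models.

    predictions : one list of predicted labels per model, all of equal length.
    With an odd number of models and a binary target, ties cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def average_posteriors(posteriors, cutoff=0.5):
    """Average the models' posterior probabilities of the event, then classify."""
    averaged = [sum(ps) / len(ps) for ps in zip(*posteriors)]
    return ["Yes" if p >= cutoff else "No" for p in averaged]


# Hypothetical predictions of three models for three employees.
votes = [
    ["Yes", "No", "No"],   # model 1
    ["Yes", "Yes", "No"],  # model 2
    ["No", "Yes", "No"],   # model 3
]
print(majority_vote(votes))
print(average_posteriors([[0.9, 0.2], [0.6, 0.3], [0.3, 0.4]]))
```

Voting ignores how confident each model is, whereas averaging posteriors lets a very confident model outweigh two lukewarm ones, so the two rules can disagree on borderline cases.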

Results Discussion
As shown in Figure 1 of the proposed model, different machine learning techniques have been implemented: gradient boosting, artificial neural networks, random forest, and ensemble models. Moreover, various performance measures have been used to find the most efficient machine learning technique, such as cumulative lift, lift, accuracy, and F1 score.
Cumulative lift is evaluated by sorting each partition in descending order of the predicted probability of the target event, P_AttritionYes, which is the predicted probability of the event "Yes" for the target attrition. The data is partitioned into 20 quantiles (demideciles, with 5% of the data in each), and the number of events in each quantile is calculated. The cumulative lift for a specific quantile is the ratio of the number of events in all quantiles up to and including the current quantile to the number of events that would be there at random or, equivalently, the ratio of the cumulative response percentage to the baseline response percentage. The cumulative lift at depth 10 covers the top 10% of the data (the first 2 quantiles), which would contain 10% of the events at random. Hence, cumulative lift shows how much more likely it is to observe an event in the top quantiles than when picking observations at random. Figure 3 shows the cumulative lift for the different algorithms in the train, validation, and test partitions.
Computational Intelligence and Neuroscience
The lift measure is likewise evaluated by sorting each partition in descending order of the predicted probability of the target event, P_AttritionYes. The data is segmented into 20 quantiles (demideciles, with 5% of the data in each), and the number of events in each quantile is calculated. Lift is the ratio of the number of events in a quantile to the number of events that would be there at random or, equivalently, the ratio of the response percentage to the baseline response percentage. With 20 quantiles, 5% of the events are expected to occur in each quantile at random. Thus, lift shows how much more likely it is to observe an event in each quantile than when choosing observations at random. The values of the lift measure for the different algorithms in the train, validation, and test partitions are shown in Figure 4.
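Lift and cumulative lift as described above can be computed with the following Python sketch. It sorts observations by the predicted event probability, cuts them into 20 equal-sized quantiles, and compares each quantile's event rate with the overall event rate. The function name and the toy data are assumptions for illustration; the sketch is not the assessment code of the tool used in the study.

```python
def lift_table(y_true, p_event, n_bins=20):
    """Return (quantile, lift, cumulative_lift) rows.

    y_true  : 1 for the event (attrition "Yes"), 0 otherwise
    p_event : predicted probability of the event for each observation
    """
    n = len(y_true)
    total_events = sum(y_true)
    if n < n_bins or total_events == 0:
        raise ValueError("need at least n_bins observations and one event")
    base_rate = total_events / n

    # Descending order of predicted probability.
    order = sorted(range(n), key=lambda i: p_event[i], reverse=True)

    rows, cum_events, cum_n = [], 0, 0
    for b in range(n_bins):
        chunk = order[b * n // n_bins:(b + 1) * n // n_bins]
        events = sum(y_true[i] for i in chunk)
        cum_events += events
        cum_n += len(chunk)
        lift = (events / len(chunk)) / base_rate
        cum_lift = (cum_events / cum_n) / base_rate
        rows.append((b + 1, lift, cum_lift))
    return rows


# Toy example: 100 observations, 20 events, perfectly ranked scores.
y = [1] * 20 + [0] * 80
p = [1 - i / 100 for i in range(100)]
for quantile, lift, cum_lift in lift_table(y, p)[:5]:
    print(quantile, round(lift, 2), round(cum_lift, 2))
```

With a perfectly ranked 20% event rate, the first quantiles reach the maximum possible lift of 1 / 0.2 = 5, and the cumulative lift always returns to 1 at the last quantile, where all data is included.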
Sensitivity measure: the ROC curve is a graph of sensitivity against 1-specificity based on the confusion matrix. These values are computed at different cut-off values. The Kolmogorov-Smirnov (KS) cut-off reference line is drawn at the value of 1-specificity where the largest difference between sensitivity and 1-specificity is observed for the VALIDATE partition, which eases the identification of the best cut-off to use when scoring one's data. Figure 5 shows the values of the sensitivity measure for the different algorithms in the train, validation, and test partitions. The Kolmogorov-Smirnov statistic measures the distance between the cumulative distribution function of a reference distribution and the empirical distribution function of a sample, or between the empirical distribution functions of two samples. In addition, when the p-value of the K-S test is lower than 0.05, the lack of fit is significant.
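The ROC points and the KS cut-off can be sketched in Python as follows: sensitivity and 1-specificity are computed at each cut-off, and the KS value is the largest gap between them. The cut-off grid in steps of 0.05 follows the text; the function names and the toy scores are assumptions for illustration.

```python
def roc_points(y_true, p_event, cutoffs):
    """Return (cutoff, sensitivity, one_minus_specificity) for each cut-off.

    Assumes the data contains at least one event and one nonevent.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for c in cutoffs:
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= c and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= c and y == 0)
        points.append((c, tp / positives, fp / negatives))
    return points


def ks_cutoff(points):
    """Return (ks_value, cutoff) at the largest sensitivity minus 1-specificity."""
    cutoff, sens, fpr = max(points, key=lambda t: t[1] - t[2])
    return sens - fpr, cutoff


cutoffs = [i / 20 for i in range(21)]  # 0, 0.05, ..., 1
y = [1, 1, 1, 0, 0, 0]                        # toy labels
p = [0.92, 0.81, 0.43, 0.62, 0.22, 0.11]      # toy predicted probabilities
ks, best_cutoff = ks_cutoff(roc_points(y, p, cutoffs))
print(ks, best_cutoff)
```

When several cut-offs share the same maximum gap, `max` returns the first one on the grid, so the reported cut-off is one of possibly several equally good choices.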
Accuracy measure: accuracy is the proportion of observations that are correctly classified as an event or nonevent, and it is calculated at different cut-off values. Cut-off values range between 0 and 1, in increments of 0.05. At each cut-off value, the predicted target classification is determined by whether P_AttritionYes, the predicted probability of the event "Yes" for the target attrition, is greater than or equal to the cut-off value. When it is, the predicted classification is the event; otherwise, it is a nonevent. When the predicted classification and the actual classification are both events (true positives, TP) or both nonevents (true negatives, TN), the observation is correctly classified. When the predicted and actual classifications disagree (false positives, FP, or false negatives, FN), the observation is incorrectly classified. Accuracy is calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{7}$$

Figure 6 shows the values of the accuracy measure for the different algorithms in the train, validation, and test partitions.
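The cut-off sweep can be sketched in Python: at each cut-off the confusion-matrix counts are tallied, and accuracy is computed from them. The same counts also give the F1 score discussed next, taken here in its standard form as the harmonic mean of precision and recall, 2TP / (2TP + FP + FN). Function names and toy data are assumptions for illustration.

```python
def confusion_counts(y_true, p_event, cutoff):
    """Return (tp, fp, tn, fn) when predicting the event if p >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_event):
        predicted_event = p >= cutoff
        if predicted_event and y == 1:
            tp += 1
        elif predicted_event and y == 0:
            fp += 1
        elif not predicted_event and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y = [1, 1, 1, 0, 0, 0]                        # toy labels
p = [0.92, 0.81, 0.43, 0.62, 0.22, 0.11]      # toy predicted probabilities
for cutoff in [i / 20 for i in range(21)]:    # 0, 0.05, ..., 1
    counts = confusion_counts(y, p, cutoff)
    print(cutoff, round(accuracy(*counts), 3), round(f1_score(*counts), 3))
```

On imbalanced targets such as attrition, accuracy can stay high at cut-offs where almost no events are detected, which is one reason to inspect F1 alongside it.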
F1 Score measure: the F1 score combines precision and recall (or sensitivity), which are classification measures based on the confusion matrix calculated at different cut-off values. Cut-off values range between 0 and 1, in increments of 0.05. At each cut-off value, the predicted target classification is determined by whether P_AttritionYes, the predicted probability of the event "Yes" for the target attrition, is greater than or equal to the cut-off value. If it is, the predicted classification is an event; otherwise, it is a nonevent. Figure 7 shows the values of the F1 score measure for the different algorithms in the train, validation, and test partitions.

Models Fit Statistics Discussion
Tables 3-5 show different fit statistic measures that are the basis for choosing the best, or champion, model to be deployed in the production environment. These measures are the Gini coefficient, misclassification rate, and average square error. The Gini coefficient is a statistic measuring the degree of discrimination in a population. It ranges between 0 and 1, where 0 represents perfect equality and 1 represents perfect discrimination [23]. A smaller Gini coefficient indicated a better model, which is gradient boosting in the test partition. The misclassification rate is a performance metric that reports the fraction of wrong predictions without distinguishing between positive and negative predictions [24]. A lower misclassification rate indicates a better model, which is the neural network model in the test partition.
There is no correct value for average square error (ASE). However, the lower the value, the better, and 0 means the model is perfect [25,26]. In our case, the best is the neural network model. A final point worth mentioning is that no model is best for all business cases and industries. However, we selected the model that satisfies our analytics and business goals.
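Two of the fit statistics discussed here, the misclassification rate and the average square error, can be sketched in Python as follows. The 0.5 cut-off, the function names, and the toy values are assumptions for illustration; the Gini coefficient is left out because the text does not give its formula.

```python
def misclassification_rate(y_true, p_event, cutoff=0.5):
    """Fraction of observations whose predicted class differs from the actual class."""
    wrong = sum(
        1 for y, p in zip(y_true, p_event) if (p >= cutoff) != (y == 1)
    )
    return wrong / len(y_true)


def average_squared_error(y_true, p_event):
    """Mean squared difference between the actual outcome (0/1) and the
    predicted event probability; 0 means a perfect model."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_event)) / len(y_true)


# Toy values (illustrative only, not results from the study).
y = [1, 0, 1, 0]
p = [0.8, 0.4, 0.3, 0.1]
print(misclassification_rate(y, p))
print(average_squared_error(y, p))
```

The misclassification rate depends on the chosen cut-off, while ASE is computed directly on the probabilities, so two models can rank differently on the two statistics.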

Conclusion and Future Work
Retaining good, experienced employees is a persistent problem for company owners. Employee attrition can cost employers a great deal, as it takes considerable effort to compensate for the lost expertise and efficiency. Hence, different machine learning techniques have been implemented, together with an ensemble model, to find the causes of this important business problem. Furthermore, multiple performance measures have been used to identify the most effective machine learning technique, such as cumulative lift, lift, accuracy, and F1 score. In addition, different model fit statistic measures were used, namely the Gini coefficient, misclassification rate, and average square error, which are the basis for choosing the best, or champion, model to be deployed in the production environment. The outcomes indicated that lower values of these measures reflect a better model. However, the findings revealed that no model can so far be considered ideal for every business context. Nevertheless, the chosen model was close to optimal for our requirements and adequately satisfied the intended goal.
Lastly, further studies should be conducted on the topic to contribute to a better understanding of it and to present new outcomes supporting or opposing the current study and the other literature available in this domain.

Data Availability
The data that support the findings of this paper are openly available at the SAS (www.sas.com) library.

Conflicts of Interest
The authors declare that they have no conflicts of interest.