Research Article Customer Churn Modeling via the Grey Wolf Optimizer and Ensemble Neural Networks

Customer churn is one of the key challenges for enterprises. Market saturation and increased competition have pushed companies to make every effort to identify customers who are likely to end their relationship with the company in a given period and become customers of a competitor. In recent years, many methods, including data mining, have been developed to predict customer churn and future customer behavior so that action can be taken early to prevent customers from leaving. This study proposes a hybrid system based on the fuzzy entropy measures with similarity classifier feature selection algorithm, the grey wolf optimization algorithm, and artificial neural networks to predict customer churn for companies that suffer losses from losing customers over time. The results are compared with those of other methods in terms of accuracy, recall, precision, and F_measure, and they indicate that the proposed method outperforms the other methods.


Introduction
The expansion of the business and commerce environment and the advent of various communication platforms such as the Internet have made it easy for customers to learn about similar services and products offered by different companies in the shortest time. Accordingly, in the event of dissatisfaction, customers can switch to another company, which raises the customer churn of the prior company [1]. In recent years, managers in organizations of all types and sizes, with the help of researchers in diverse fields, have used different methods to predict and prevent the loss of their customers. With the expansion of artificial intelligence, one of the efficient approaches that researchers have considered for predicting customer churn is data mining, including artificial neural networks, decision trees, logistic regression, support vector machines, and random forest [2].
Moreover, Amin et al. [3] showed that if companies can reduce customer losses by about 5%, they can increase profits by 25 to 95%. Some of the most prominent problems that customer losses cause for companies are as follows [4]:
(i) Financial losses due to the decline of profits from lost customers, as well as the high cost of acquiring new customers.
(ii) Negative effects on current customers, which may cause them to leave the company as well.
(iii) An opportunity for competing companies to expand their business by attracting those customers.
Thus, the main objective of the present study is to provide a hybrid method for predicting customer churn with high accuracy based on fuzzy entropy measures with similarity classifier, the grey wolf optimizer (GWO), and an ensemble neural network. The main questions of this research are as follows: (i) Is it possible to provide a hybrid method based on fuzzy entropy measures with similarity classifier, the grey wolf optimizer, and an ensemble neural network that can predict customer churn with high accuracy? (ii) To what extent can the use of ensemble neural networks instead of conventional neural networks improve customer churn prediction?
The rest of the paper is organized as follows. Section 2 examines the research background. Section 3 describes the research method. Section 4 presents and analyzes the results, and Section 5 draws conclusions.

Literature Review
In recent years, predicting and preventing customer losses has become very important for organizations, and researchers have used various methods to develop prediction systems. De Bock and Van den Poel [4] examined the application and performance of a hybrid classification method based on the generalized additive model in predicting customer churn. To compare and evaluate the efficiency of this hybrid model, the bagging method, the random forest algorithm, the random subspace algorithm, and logistic regression were used. Chen et al. [5] proposed a new framework for predicting customer churn using longitudinal data. They applied support vector machines, boosting, neural networks, decision trees, random forest, logistic regression, and the proportional hazards model. The dataset used in their study was from a food and telecommunication company. Sharma et al. [6] proposed a method for predicting customer churn using multilayer perceptron neural networks. The dataset used was from a U.S. mobile telecommunication company. The results indicate that neural networks can be used as an efficient method for predicting customer churn and that telecommunication companies can use the proposed method for this purpose. Coussement and De Bock [7] compared single data mining methods with hybrid methods, which combine several data mining methods, in predicting customer churn.
They compared methods such as the generalized additive model, the random forest algorithm, and the decision tree, and the results showed the superiority of hybrid methods. The data used in this study relate to 3729 customers of an online betting company. Tang et al. [8] examined the impact of using derived behavior information on customer attrition in the financial services industry. They used the orthogonal polynomial analysis method to obtain the derived information. A proportional hazards model was also used to predict customer attrition.
The dataset used in that article relates to the customers of a bank. Keramati et al. [9] proposed a hybrid system based on feature selection methods, artificial neural networks, support vector machines, decision trees, and K-nearest neighbor to predict customer churn. The data relate to 3150 customers of one of the telephone operators in Iran and were collected over 12 months. Moeyersoms and Martens [10] used highly correlated customer information features to predict customer churn. In this study, decision tree, support vector machine, and logistic regression classifiers were used to predict customer losses on customer datasets from different energy industries. Coussement et al. [11] evaluated the effect of data preprocessing methods on the efficiency of customer churn prediction methods. In this study, decision trees, logistic regression, Bayesian networks, support vector machines, stochastic gradient boosting, and bagging were used to predict customer churn. The data used relate to 30104 customers of a European telecommunication company. Yu et al. [12] proposed a hybrid method based on back-propagation neural networks and a particle swarm optimization algorithm to predict customer churn. The particle swarm optimization algorithm is used to obtain neural network parameters such as weights and bias values so that the neural network can be trained well on the data. The results of this study indicate that the neural network trained with the particle swarm optimization algorithm predicts customer churn better than the neural network trained with the error back-propagation algorithm.
Salvi et al. [13] examined the LSTM neural network and used it to predict the future trend of Brent oil prices based on previous Brent oil prices. In this study, four types of errors were calculated to check the accuracy of the model. The mean absolute error (MAE) and root mean square error (RMSE) were 1.1962 and 1.9164, respectively.
Moitra et al. [14] used a long short-term memory (LSTM) neural network instead of a convolutional neural network to predict the crude oil price. The results were promising and showed more accurate forecasts of crude oil prices in the coming days, and a hybrid model was presented for forecasting the crude oil price that used sophisticated network analysis and LSTM algorithms.
The research results showed that the model is more accurate, robust, and reliable. Jafarzadeh Ghoushchi et al. [15] provided an extended approach to the diagnosis of tumour location in breast cancer using deep learning. This study develops a new machine learning approach based on modified deep learning (DL) to diagnose the tumour location in breast cancer. In this study, the data obtained from the database (BCDRD01) are developed, resized, and divided into datasets. A simple architecture is used for the first group of experiments, one of which utilizes a weighted function to counter the class imbalance.
The results indicate that convolutional neural networks (CNNs) are an appropriate option for the separation of breast cancer lesions.
Raju et al. [16] developed an approach to forecasting demand in the steel industry using ensemble learning.
This study aims to introduce a robust framework for forecasting demand, including data preprocessing, data transformation and standardization, feature selection, cross-validation, and a regression ensemble framework. To maximize the coefficient of determination ($R^2$) and reduce the root-mean-square error (RMSE), the hyperparameters are set using the grid search method. Using a steel industry dataset, all tests are carried out under identical experimental conditions. In this context, the STACK1 (ELM + GBR + XGBR-SVR) and STACK2 (ELM + GBR + XGBR-LASSO) models performed better than the other models. Because it improves the performance of models and reduces the risk of decision-making, the ensemble method can be used to forecast demand in a steel industry one month ahead. Ofori-Ntow et al. [17] developed a hybrid ensemble intelligent model based on wavelet transform, swarm intelligence, and artificial neural networks for electricity demand forecasting. In this study, a three-level hybrid ensemble short-term load forecasting method consisting of the discrete wavelet transform (DWT), particle swarm optimization (PSO), and a radial basis function neural network (RBFNN) is proposed. The DWT is applied to decompose the data into a well-behaved series for forecasting, so that the data become stable before PSO is used. The statistical analysis revealed that the proposed method performed better based on MAPE, MAD, and RMSE, emphasizing its great potential. Table 1 summarizes the literature review.

Methodology
In this research, a hybrid system based on fuzzy entropy measures with similarity classifier, the grey wolf optimizer, and ensemble neural networks is proposed to predict customer churn. As can be seen from the diagram of the proposed system in Figure 1, the system consists of five main steps, which are reviewed below.

Collect Customer Data and Pre-Process the Data.
In the proposed method, a neural network is used as the machine learning tool that predicts customer churn in the customer classification stage. The neural network must learn from the data of previous lost and non-lost customers, which the company has collected during a certain period of time. Based on the patterns and relationships among customers, it can then be predicted which customers would probably leave the company in the future and which customers would remain loyal.
After the customer data are collected, they are preprocessed. The preprocessing step involves several different operations that prepare the data. One of these operations is to convert text, character, or string data to numeric data. Normalization is another preprocessing operation, performed before the data are used by machine learning algorithms to make predictions. Its purpose is to solve problems that large differences between the ranges of different features cause in some computational processes. That is to say, normalization prevents features with larger numeric ranges from dominating features with smaller ranges [18]. A further preprocessing operation deals with missing data. Customer records that contain missing values are removed from the dataset so that they do not interfere with the modeling process of the neural network predictor.
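The two numeric preprocessing operations described above, removal of incomplete records and normalization, can be illustrated with a minimal Python sketch. The text does not state which normalization formula is used, so min-max scaling to [0, 1] is an assumption here, and the function names and toy records are hypothetical.

```python
def drop_incomplete(rows):
    """Keep only the rows in which no value is missing (None or empty string)."""
    return [r for r in rows if all(v is not None and v != "" for v in r)]


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column is mapped to 0.0.

    ASSUMPTION: min-max scaling is one common choice, not necessarily the
    normalization used in the study.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Toy records: (tenure in months, monthly charge); one record has a missing value.
raw = [[1, 20.0], [72, 120.0], [36, None], [12, 70.0]]
clean = drop_incomplete(raw)       # the record with None is removed
scaled = min_max_normalize(clean)  # both columns now lie in [0, 1]
```

After scaling, a feature measured in months and a feature measured in currency units contribute on the same numeric range, which is the effect the normalization step is meant to achieve.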

Selection of Optimal Features and Removal of Redundant and Additional Features.
The feature selection process in the system is performed using the fuzzy entropy measures with similarity classifier.
This feature selection algorithm was proposed by Luukka [19]; it uses the fuzzy entropy criterion to select the optimal features and remove the redundant features of the data. At this stage, the customer dataset is given to the algorithm, and its output specifies the features that can be removed from the data.

Obtain Optimal Structures Using the Grey Wolf Algorithm.
The grey wolf algorithm is used to obtain two optimal structures, one for each of the neural networks in the ensemble that predicts customer churn. That is, the grey wolf algorithm starts at this stage and, during an optimization process, identifies two suitable structures or architectures, namely the number of layers and the number of neurons in each layer. One of the most important issues is how to encode the search agents in the grey wolf optimization algorithm. The grey wolf algorithm proposed by Mirjalili [20] is a continuous algorithm for solving continuous problems, while obtaining an appropriate structure for the neural network is a discrete problem. The solution must be a multi-element array of discrete numbers in which the number of elements represents the number of layers of the neural network and the value of each cell represents the number of neurons in the corresponding layer. Figure 2 presents the solution of the grey wolf algorithm as an array. According to the four-element array in Figure 2, a neural network with four hidden layers should be created, with 10 neurons in the first hidden layer, 20 in the second, 8 in the third, and 12 in the fourth. Another noteworthy point of this type of encoding is that if the value of one of the cells in the array is 0, that cell is removed from the array; thus, the number of neural network layers is reduced.
However, the grey wolf algorithm is a continuous algorithm, and when the search agents are initialized at the beginning of the optimization process, the values of each search agent are continuous (decimal) numbers. Because selecting the optimal structures for the artificial neural network is a discrete problem, the continuous solution obtained by the search agents after each iteration of the algorithm must be discretized. For this reason, the mapping function in (1) is used.

Figure 1: Diagram of the proposed system. Step 1: receive the customer data and preprocess them (normalization, deletion of missing data, and so on). Step 2: select features with the fuzzy entropy algorithm. Step 3: obtain two neural network structures with the GWO algorithm. Step 4: build the predictive model by training the first and second neural networks and obtaining the final output of the ensemble neural network. Step 5: classify and predict the customers who are prone to leave in the future.

Discrete Dynamics in Nature and Society

In (1), $\vec{X}^{(t)}_{i,j}$ represents the $j$th component of the $i$th search agent in the $t$th iteration of the grey wolf algorithm. In other words, $\vec{X}^{(t)}_{i,j}$ represents the $j$th element of the $i$th grey wolf. The mod function calculates the remainder, and it is used to limit the number of neurons to the range 0 to $h_n - 1$.
Here $h_n$ bounds the number of hidden layer neurons. Applying this function to any of the search agents (grey wolves) that hold continuous solutions yields discrete solutions. For example, if the value of the first element of one of the solutions is 0.15 and $h_n$ is equal to 21, the value 15 is obtained.
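The discretization and decoding steps can be made concrete with a short Python sketch. Since the mapping function itself is not reproduced in the text, the formula below is an assumption: it scales each component by 100, rounds, and takes the remainder modulo $h_n$, which is one mapping that agrees with the worked example (0.15 becomes 15 for $h_n = 21$). The names `discretize`, `decode_structure`, and `scale` are hypothetical.

```python
def discretize(position, h_n=21, scale=100):
    """Map each continuous component to an integer in 0 .. h_n - 1.

    ASSUMPTION: the factor scale = 100 is inferred from the worked example
    (0.15 -> 15 with h_n = 21); it is not taken from the source equation.
    """
    return [round(abs(x) * scale) % h_n for x in position]


def decode_structure(cells):
    """Drop zero-valued cells; the remaining values are neurons per hidden layer."""
    return [n for n in cells if n > 0]


wolf = [0.15, 0.42, 0.08]         # hypothetical continuous search agent
cells = discretize(wolf)          # [15, 0, 8], because 42 mod 21 = 0
layers = decode_structure(cells)  # [15, 8]: a network with two hidden layers
```

In this example the second cell maps to 0 and is removed, so a three-element search agent decodes to a network with only two hidden layers, as described for the encoding above.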
Another important point about this step of the method is the evaluation function of the search agents in the grey wolf algorithm. In each iteration of the grey wolf algorithm, an evaluation function is essential to determine which search agent holds the best solution. Accordingly, an evaluation function is used that reflects the accuracy of the system on the validation data and the training data.
That is to say, to evaluate each population vector, which represents a structure for the neural network, a neural network with that structure is constructed and trained on the training data, and after the training process, the prediction accuracy of the neural network is calculated on the validation data. The search agent that maximizes the value of the following relationship is selected as the best member of the population. Equation (2) shows the evaluation function for the population members.
In (2), the variables $V_{ACC}$ and $T_{ACC}$ indicate the classification accuracy on the validation and training datasets, respectively. The function is a weighted mean of these two accuracies in which the validation accuracy is given more weight than the training accuracy.
In this step, two optimal structures must be obtained. The first solution, the best structure, is given by the alpha wolf, which is the best search agent, and the second solution, the second-best structure, is given by the beta wolf, which is the second-best search agent.
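The evaluation and selection logic can be sketched as follows. The weights of the weighted mean are not stated in the text, so the value 0.7 for the validation accuracy is purely illustrative; the only property taken from the description is that validation accuracy weighs more than training accuracy. The candidate structures and accuracies are hypothetical.

```python
def fitness(v_acc, t_acc, w_val=0.7):
    """Weighted mean of validation and training accuracy.

    ASSUMPTION: w_val = 0.7 is illustrative; the text only states that the
    validation accuracy is more important than the training accuracy.
    """
    return w_val * v_acc + (1.0 - w_val) * t_acc


def alpha_and_beta(scored):
    """Return the best and second-best structures from (structure, fitness) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[0][0], ranked[1][0]


# Hypothetical candidates: (hidden-layer sizes, validation acc., training acc.)
candidates = [([5, 8, 10], 0.80, 0.84), ([12, 6], 0.81, 0.82), ([20], 0.78, 0.90)]
scored = [(s, fitness(v, t)) for s, v, t in candidates]
alpha, beta = alpha_and_beta(scored)
```

In the full method each fitness evaluation would require training a network with the candidate structure, which is why the number of agents and iterations directly determines the computational cost of this step.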

Building Predictive Model with Ensemble Neural Network.
The artificial neural network, one of the most powerful and best known machine learning tools, has been used to build the model for predicting customer churn. In the proposed system, two neural networks are used instead of one, and their solutions are combined to predict the loss of customers with high accuracy. Therefore, at this stage, two multilayer feedforward neural networks were created: one with the structure obtained by the alpha wolf and one with the structure obtained by the beta wolf. The neural networks were then trained on the dataset with the features obtained from the feature selection stage, and the final solution of the predictive model was calculated using (3), which is a weighted average.
$$\text{Output} = 0.5 \times net_1 + 0.5 \times net_2 \quad (3)$$
In (3), $net_1$ and $net_2$ are the outputs of the first neural network, built with the alpha wolf solution, and the second neural network, built with the beta wolf solution, respectively. Output represents the final solution of the ensemble, obtained by combining the solutions of the two neural networks with equal weights.
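As a small illustration of (3), the following Python sketch averages the outputs of two networks with equal weights. The conversion of the averaged score to a class label with a 0.5 threshold is an assumption, since the text does not describe how the final output is turned into a class; the scores are made-up values.

```python
def ensemble_output(net1, net2, w1=0.5, w2=0.5):
    """Combine two networks' outputs sample by sample, as in equation (3)."""
    return [w1 * a + w2 * b for a, b in zip(net1, net2)]


def to_labels(scores, threshold=0.5):
    """ASSUMPTION: scores at or above the threshold are assigned to class 1."""
    return [1 if s >= threshold else 0 for s in scores]


net1 = [0.9, 0.2, 0.6]   # hypothetical outputs of the alpha-structure network
net2 = [0.7, 0.4, 0.3]   # hypothetical outputs of the beta-structure network
combined = ensemble_output(net1, net2)
labels = to_labels(combined)
```

For the third sample the two networks disagree (0.6 and 0.3); the equal-weight average of 0.45 falls below the assumed threshold, so the ensemble sides with the second network.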

Classify and Predict the Customer Churn.
After the classification model is built, the proposed system is ready to predict customer churn. Given a set of data about the current customers of a company, the proposed system can classify the customers and predict which of them are likely to leave the company in the future. Once customers are identified as prone to leave, managers of organizations and companies can prevent their loss by taking preventive measures in accordance with the policies of their company.

Criteria for Evaluating the Proposed Method.
Predicting customer churn means classifying customers into two classes: customers prone to leave and customers loyal to the organization.
There are various criteria for evaluating the efficiency and performance of classification systems, among them accuracy, recall, precision, F-measure, and area under curve. To calculate these criteria, a matrix called the confusion matrix is used, which represents the performance of classification systems in data classification. Figure 3 shows the confusion matrix for two-class problems.
As shown in Figure 3, this matrix displays the performance of a classification system through four counts: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The evaluation criteria of the classification system can be calculated from these counts.
(i) Accuracy: One of the best known and oldest evaluation criteria for classification systems is classification accuracy, the ratio of correctly classified items to all items. It is calculated as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (4)$$
(ii) Recall: This criterion is the proportion of the actual members of class X that are correctly classified as members of class X and is calculated as equation (5). Another name for recall is sensitivity. For example, in the problem of predicting customer churn, if class X is the customer class of interest, this measure indicates how accurately the system classifies the customers of that class.
$$\text{Recall} = \frac{TP}{TP + FN} \quad (5)$$
(iii) Precision: This criterion indicates what percentage of the members identified as class X members actually belong to class X and is calculated as
$$\text{Precision} = \frac{TP}{TP + FP} \quad (6)$$
(iv) F-Measure: This criterion is the harmonic mean of precision and recall and is used to determine the efficiency of classification systems. It provides more informative results than the ordinary average of precision and recall. This criterion is calculated as
$$F\text{-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (7)$$
(v) Area Under Curve: This criterion is the area under the receiver operating characteristic (ROC) curve of the system, and the closer its value is to 1, the more favorably the final performance of the classification method is evaluated.
To calculate the results of the customer churn prediction system in this research, the system stores its classification results in the form of a confusion matrix after it is run, and the evaluation criteria mentioned above are calculated from this matrix to review and analyze system performance.
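The first four criteria follow directly from the four confusion matrix counts. The Python sketch below uses the standard definitions of accuracy, recall, precision, and F-measure; the counts in the example are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the standard two-class criteria from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
m = classification_metrics(tp=90, tn=50, fp=30, fn=10)
```

With these counts recall (0.90) is well above precision (0.75), and the F-measure (about 0.82) lies between them but closer to the smaller value, which is the behavior that distinguishes the harmonic mean from the ordinary average.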

Customer Data Collection.
A public dataset has been used to evaluate the efficiency of the proposed method for predicting customer churn. This dataset is from a U.S. telecommunication company and is available at https://www.kaggle.com/blastchar/telco-customer-churn. MATLAB version 2020a has been used to implement the customer churn prediction method in this research. The dataset contains 7043 telecommunication subscribers, each of which has 21 features. Of these 7043 records, 11 have missing values in some features, so deleting these 11 records leaves a dataset of 7032 customers. Thus, the problem has 21 dimensions. Of the total number of customers in this dataset, 1869 are lost customers, who left the company after a while, and 5174 are customers who remained loyal to the company. Table 2 shows some of the features of this dataset.
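The counts reported above can be checked with a few lines of arithmetic. The sketch below only uses the numbers given in the text; the derived churn rate and majority-class share are computed here for orientation and are not figures reported by the study.

```python
total_customers = 7043
churned = 1869
loyal = 5174

# The two class counts refer to the full dataset before cleaning.
assert churned + loyal == total_customers

churn_rate = churned / total_customers       # about 0.265
majority_share = loyal / total_customers     # about 0.735
remaining_after_cleaning = total_customers - 11
```

Roughly a quarter of the subscribers are churners, so the two classes are imbalanced; this is worth keeping in mind when reading accuracy figures, since a classifier that always predicts the majority class would already reach the majority share.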

Simulations Results.
After the data were read from the dataset, initial preprocessing was performed on them. This preprocessing involves converting the values of string variables to numeric values. Except for features 3, 6, 19, and 20, which have numeric values, all features have string values, so the values of these variables are converted to numeric values in the preprocessing step. For example, in feature 4, which has the values yes and no, the value yes is converted to 1 and the value no to 0.
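The string-to-numeric conversion can be sketched in Python as follows. The yes/no mapping follows the example in the text; how features with more than two string values are encoded is not stated, so assigning integer codes in order of first appearance is an assumption, and the function name is hypothetical.

```python
def encode_column(values):
    """Convert a column of strings to numbers.

    yes/no columns become 1/0, as in the text. ASSUMPTION: any other string
    column receives integer codes in order of first appearance.
    """
    distinct = []
    for v in values:
        if v not in distinct:
            distinct.append(v)
    if {v.lower() for v in distinct} <= {"yes", "no"}:
        return [1 if v.lower() == "yes" else 0 for v in values]
    codes = {v: i for i, v in enumerate(distinct)}
    return [codes[v] for v in values]


partner = encode_column(["Yes", "No", "Yes"])
contract = encode_column(["Month-to-month", "One year", "Two year", "One year"])
```

Integer codes impose an artificial order on categories that have none; one-hot encoding would avoid this at the cost of more input features, and the text does not say which option was chosen.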

Select the Feature Using the Fuzzy Entropy Algorithm.
In this step, before the algorithm is run, the first feature, Customer ID, is manually removed from the dataset. The reason is that, logically, an organization's customer identification number carries no information for classification, so removing this feature reduces the number of features to 20. Running the algorithm then removes two other features, which are features 1 and 5.

Obtain Two Optimal Structures Using Grey Wolf.
The grey wolf algorithm is applied to obtain the appropriate number of layers and the number of neurons in each layer in the proposed method. At this stage, the data are randomly divided into three categories: training, validation, and testing. The first category contains 70% of the data and is used to train the neural networks with the structures proposed by the grey wolf algorithm. First, the grey wolf optimization algorithm starts the optimization process and proposes a structure for the neural network.
Then, a neural network is created with this structure and trained on the training data. After the neural network is trained with the Levenberg-Marquardt algorithm [21], one of the best neural network training algorithms, its performance is validated with the validation data.
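The random three-way split can be sketched as below. Only the 70% training share is given in the text; dividing the remaining 30% equally between validation and testing is an assumption, as are the function name and the fixed seed.

```python
import random


def split_indices(n, train=0.70, val=0.15, seed=0):
    """Shuffle record indices and cut them into training, validation, and test parts.

    ASSUMPTION: the validation share of 0.15 (and hence a test share of about
    0.15) is illustrative; the text only fixes the training share at 70%.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train)
    n_val = round(n * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(7032)
```

Because the split is random, changing the seed changes which customers land in each part, which is consistent with the observation that different runs lead to different selected structures.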
For example, suppose that in the first iteration the best search agent, the alpha wolf, proposes a three-layer neural network with 5, 8, and 10 neurons in hidden layers one to three, respectively. In this case, a neural network is constructed with this structure and trained on the training data, and after the training process is complete, the accuracy on the validation data is calculated with the evaluation function to determine the fitness of the search agent. This process is repeated in each iteration of the algorithm until the maximum number of iterations is reached.
After the iterations are exhausted, because two neural networks are used in the proposed method, the two best solutions, held by the alpha and beta wolves, are taken as the two structures for the neural networks, and two neural networks are created to build the data classifier model.
It is noteworthy that all the parameters of the grey wolf algorithm are the same as those proposed in [20], except that the number of iterations is set to 50 and the number of search agents is set to 30. Additionally, the maximum number of layers is three, and the maximum number of neurons in each layer is 20. Figure 4 shows the best structures obtained by the grey wolf algorithm in five different runs of the proposed system.
As can be seen in Figure 4, the grey wolf algorithm has obtained different structures for the neural network in different runs. The reason is that each time the algorithm is run, the training and validation data are randomly selected, so the algorithm obtains different structures suited to the training data.
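For readers unfamiliar with the optimizer, the following Python sketch shows the basic grey wolf position update with 30 search agents and 50 iterations, the settings stated above. It is a simplified illustration, not the implementation used in the study: the objective `toy_fitness` is a hypothetical stand-in for the expensive step of decoding a position, training a network, and computing the evaluation function, and the search bounds are assumed to be [0, 1].

```python
import random


def gwo_maximize(fitness, dim, n_agents=30, n_iter=50, lo=0.0, hi=1.0, seed=0):
    """Return the alpha and beta positions found by a basic grey wolf optimizer."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_agents)]
    leaders = []  # best-so-far alpha, beta, and delta positions
    for t in range(n_iter):
        # Keep the three best positions seen so far as alpha, beta, delta.
        pool = sorted(leaders + wolves, key=fitness, reverse=True)
        leaders = [list(w) for w in pool[:3]]
        a = 2.0 - 2.0 * t / n_iter  # decreases linearly from 2 towards 0
        moved = []
        for wolf in wolves:
            position = []
            for j in range(dim):
                estimate = 0.0
                for leader in leaders:
                    big_a = 2.0 * a * rng.random() - a
                    big_c = 2.0 * rng.random()
                    distance = abs(big_c * leader[j] - wolf[j])
                    estimate += leader[j] - big_a * distance
                # Average the three leader-guided estimates and clip to bounds.
                position.append(min(hi, max(lo, estimate / 3.0)))
            moved.append(position)
        wolves = moved
    pool = sorted(leaders + wolves, key=fitness, reverse=True)
    return pool[0], pool[1]


def toy_fitness(x):
    """Hypothetical stand-in objective with its maximum (0.0) at x = (0.3, 0.3)."""
    return -sum((v - 0.3) ** 2 for v in x)


alpha, beta = gwo_maximize(toy_fitness, dim=2)
```

In the proposed system each position would have three components (one per possible hidden layer), and the returned alpha and beta positions would be discretized into the two network structures.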

Build a Classifier Model.
After the grey wolf algorithm is run in the previous step, the solutions of the alpha and beta wolves, which are recognized as the best search agents in the grey wolf algorithm, are used as the structures of the feedforward neural networks. In this step, the neural network modules are trained using the Levenberg-Marquardt algorithm.

Predict the Customer Churn.
After the model for predicting customer churn is built in the previous step, the efficiency and performance of the proposed system are evaluated. At this stage, the system performance is evaluated using the test data. The confusion matrix and the criteria of accuracy, recall, precision, and F_measure, as well as the area under curve, are applied. Figures 5-8 show the results of running the proposed method in terms of accuracy, precision, recall, and F_measure, respectively. As shown in Figures 5-8, the averages over 10 different runs of the proposed system in predicting customer churn are 80.84, 84.45, 91.08, and 87.64 for accuracy, precision, recall, and F_measure, respectively. In addition, the confusion matrices obtained from the best and worst performance of the proposed system on the test data are presented in Figures 9 and 10.
As can be seen from Figure 9, the accuracy of the proposed system in its best run is 81.4% on the test data, and its precision, recall, and F_measure are 84.46%, 91.92%, and 88.03%, respectively. Figure 10 shows the worst performance of the proposed system; the accuracy in this run on the test data is 0.80 (that is, 80%), and the precision, recall, and F_measure are 83.67%, 90.9%, and 87.14%, respectively.
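As a consistency check on the reported figures, the F_measure can be recomputed from the reported precision and recall with the standard harmonic-mean formula. The short Python sketch below uses only numbers quoted in the text; the dictionary and variable names are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent here)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F_measure) as quoted in the text.
reported = {
    "average of 10 runs": (84.45, 91.08, 87.64),
    "best run": (84.46, 91.92, 88.03),
    "worst run": (83.67, 90.90, 87.14),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}
max_gap = max(abs(recomputed[name] - f) for name, (_, _, f) in reported.items())
```

In all three cases the recomputed value agrees with the reported F_measure to within 0.01 percentage points, so the reported precision, recall, and F_measure are mutually consistent.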
Another criterion used to evaluate the performance of classifier systems is the area under curve. The closer the area under the receiver operating characteristic (ROC) curve is to 1, the better the system performance is considered. Figure 11 shows the ROC diagram of the 10 runs of the proposed system for the test data. Additionally, Figures 12 and 13 present the area under the ROC curve in the best and worst performance for training and test data, respectively.
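One way to compute the area under the ROC curve without drawing the curve is the pairwise-ranking formulation: the area equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The Python sketch below illustrates this on made-up labels and scores; it is not the procedure used to produce the figures.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 1, 0]                # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]    # hypothetical classifier outputs
area = auc(labels, scores)                 # 8 of the 9 pairs are ranked correctly
```

A value of 1 means every positive sample is scored above every negative one, while 0.5 corresponds to random ranking.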
Next, the results of the proposed hybrid system are compared with those of other methods that have been examined on this dataset. Table 3 presents this comparison. A noteworthy point is that a sampling method is used in these other methods, and their results have been examined both with and without sampling. The results in Table 3 for each method are the best of its results with and without the sampling method.
As can be seen in Table 3, the proposed method performs better than the other methods in terms of accuracy, precision, recall, and F_measure. The reasons for the superiority of the proposed method over the other methods are as follows:
(i) The grey wolf algorithm is used to select an appropriate structure and architecture for the neural networks that build the predictive model of the proposed system.
(ii) Combining the solutions of the two neural networks to obtain the final solution makes the proposed method perform better than the other methods.
Thus, the proposed method can be used as an efficient method for predicting customer churn in organizations and companies that face the challenge of losing their customers and suffer great financial losses as a result.

Conclusion
This research proposed a hybrid system based on fuzzy entropy measures with similarity classifier, the grey wolf algorithm, and artificial neural networks to predict customer churn for companies that suffer from the loss of their customers and the resulting financial losses. The use of this system can play a decisive role in the survival of companies in a competitive market. A company can use this system to identify the customers who might leave in the future for various reasons and take preventive measures based on customer interests and company policies.
First, the customer data of the company were collected. Then, various preprocessing methods, such as conversion of string data into numerical data and data normalization, were applied so that the data could be prepared for feature selection and training of the neural networks. Next, feature selection was performed on the dataset using the fuzzy entropy measures with similarity classifier. In the following step, the grey wolf optimization algorithm was run, and over the iterations of the optimization process two optimal structures were obtained for the neural networks (i.e., the number of layers and the number of neurons in each layer). These structures were used in the phase of building the classifier model to predict customer churn. Finally, after the classifier model was built, the proposed system was evaluated with test data.
The performance of the proposed system in predicting customer churn was examined on a customer dataset of a telecommunication company. The results showed that the proposed system performed better than other data mining methods on the evaluation criteria considered (accuracy, precision, recall, and F_measure). The contribution of this study is the use of the fuzzy entropy measures with similarity classifier to select the optimal features, the grey wolf optimization algorithm to obtain optimal structures for neural networks, and ensemble neural networks to build a model for predicting customer churn. Therefore, managers of various organizations, including banks, insurance companies, the telecommunications industry, stores, online games, online betting, various energy industries, food distributors, and many other businesses, can use the proposed system to identify the customers who are likely to leave, gain more profit by taking precautionary measures to retain those customers, and avoid other related losses such as financial losses.

Data Availability
The data are contained within the article itself. Upon request, the software file will be provided.

Conflicts of Interest
The authors declare that they have no conflicts of interest.