Hierarchical Neural Regression Models for Customer Churn Prediction

Because customers are the main asset of any industry, customer churn prediction has become a major task for companies that want to remain competitive. The literature reports that hierarchical data mining techniques offer better applicability and efficiency. This paper considers three hierarchical models for churn prediction that combine four data mining techniques: backpropagation artificial neural networks (ANN), self-organizing maps (SOM), alpha-cut fuzzy c-means (α-FCM), and the Cox proportional hazards regression model. The hierarchical models are ANN + ANN + Cox, SOM + ANN + Cox, and α-FCM + ANN + Cox. The first component of each model clusters the data into churner and nonchurner groups and filters out unrepresentative data or outliers. The second component is then trained on the clustered output to assign customers to the churner and nonchurner groups. Finally, the correctly classified data are used to build the Cox proportional hazards model. The models are evaluated on an Iranian mobile dataset. The experimental results show that the hierarchical models outperform the single Cox regression baseline model in terms of prediction accuracy, Type I and Type II errors, RMSE, and MAD. In addition, the α-FCM + ANN + Cox model performs significantly better than the other two hierarchical models.


Introduction
In today's competitive world, customer churn management (CCM) is an important task for every service provider that wants to build long-term, profitable relationships with specific customers [1,2]. Service providers in the telecommunications industry suffer when valuable customers are attracted away by competitors; this is known as customer churn. Recently, there have been many changes in the telecommunications industry, such as loyalty programs for more profitable customers [3]. Loyal customers are the most fertile source of data for decision making. These data reflect the customers' actual behavior and the factors that affect their loyalty. They can be used to evaluate the potential value of customers [3], to assess the risk that they will stop paying their bills, and to predict their future needs [4].
Moreover, because customer attrition results in lost income, customer churn management has received increasing attention throughout the marketing and management literature. It has also been shown that a small change in the retention rate has a considerable impact on income [5].
Effective customer churn management requires a more comprehensive and accurate churn prediction model. Recently, several customer churn prediction models have been presented in a number of domains, such as telecommunications [6][7][8], retail markets [9,10], subscription management [11,12], banking service providers [13], and wireless commerce [14]. In these studies, statistical and data mining techniques have been applied to build the prediction models.
The two main tasks of data mining techniques are describing remarkable patterns or relationships in the data and building a predictive model that the data follow [2].
The literature reports that hybrid data mining approaches, which combine clustering and classification techniques, perform better than single clustering or classification techniques. Hybrid approaches typically consist of two learning stages, in which the first preprocesses the data and the second produces the final prediction output [7]. Other hybrid data mining techniques for customer churn prediction use well-known metaheuristic algorithms (e.g., genetic algorithms) to train neural networks; these outperform traditional local-search gradient descent/gradient ascent neural networks that use the procedure of Rumelhart et al. [19] for updating connection weights [20][21][22].
In addition to predicting which class each customer belongs to (i.e., churned or nonchurned), companies are eager to know when, why, and with what probability their customers will switch their subscription. Knowing the factors that significantly affect churn behavior is more important than merely knowing the class of each customer. Companies need these factors to plan long-term strategies for decreasing the churn rate and, above all, to schedule and adopt the best marketing strategies based on when and why customers tend to end their relationship, because some companies incur marketing expenses at particular times without being aware of what their customers want. Moreover, knowing the effective factors and the probability of attrition enables companies to focus on the customers who are most likely to churn. This information can be extracted using survival analysis of customers. To determine the hazard probability function of the customers and the above-mentioned information, the Cox proportional hazards method is applied as the last part of the hierarchical methods, because the ANNs are not able to calculate the churn probability of the customers. Another reason for using the Cox regression model is the nature of our data. The customer churn data contain censored observations. Censoring occurs when a measurement is known to exceed some threshold, but not by how much. In this study, each customer who has not churned by the end of the observation period is treated as a right-censored observation. Therefore, the Cox regression model is applied to the customer data to cope with censoring.
However, few papers have studied hierarchical data mining techniques for customer churn prediction. Therefore, in this paper, several data mining techniques are combined to create hierarchical models of customer churn prediction. The hierarchical methods combine clustering techniques, that is, alpha-cut fuzzy c-means (α-FCM), self-organizing maps (SOM), and artificial neural networks (ANN); a classification technique, that is, ANN; and survival analysis, that is, the Cox proportional hazards regression model. The resulting combinations are α-FCM + ANN + Cox, SOM + ANN + Cox, and ANN + ANN + Cox. To evaluate the performance of the hierarchical models, an Iranian mobile dataset is used to compare them with the single Cox regression baseline model in terms of prediction accuracy, Type I and Type II errors, RMSE, and MAD. Other well-known techniques, such as Fuzzy ARTMAP [23] and LLMF [24], were also used to design further hierarchical methods (e.g., SOM + Fuzzy ARTMAP + Cox, ANN + Fuzzy ARTMAP + Cox, ANN + LLMF + Cox, SOM + LLMF + Cox, and α-FCM + LLMF + Cox), but only the three hierarchical techniques listed above are proposed and reported, based on their better performance. The main contributions of this paper are as follows.
(i) Treating nonchurned customers as censored data and using the Cox regression model, for the first time in the literature, for customer churn prediction.
(ii) Determining the important factors affecting customer churn in the Iranian telecommunications industry.
(iii) Determining the hazard and survival functions of each customer based on the effective factors.
(iv) Proposing new combinations of data mining techniques, comprising ANN, SOM, α-FCM, and Cox regression, as hierarchical methods.
(v) Applying the proposed hierarchical methods to a dataset from the Iranian telephony market.
(vi) Comparing the proposed hierarchical methods.
The rest of the paper is organized as follows. Section 2 describes the data mining techniques used in this paper. Section 3 describes the research methodology, and Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

Proposed Data Mining Techniques
To create effective and accurate customer churn prediction models, many data mining techniques have been considered in the marketing and management literature (e.g., [12,25]). The techniques used in this paper are as follows.

Alpha-Cut Fuzzy C-Means Clustering. Clustering is an unsupervised learning technique that partitions a set of data objects into groups (or clusters). In particular, no predefined classes are assumed [26].
Classical clustering assigns each observation to a single group (cluster), without considering the degree of distinction or similarity of the observation from all the other possible clusters. This type of clustering is often called hard or crisp clustering [5]. In contrast, fuzzy clustering methods, based on fuzzy set theory and the concept of membership functions, allow observations to belong to more than one cluster with different degrees of membership.
Fuzzy clustering of a set of $n$ observations $X = \{x_1, \ldots, x_n\}$ into $c$ clusters is characterized by $c$ membership functions $\mu_{ik}$, which in the standard formulation satisfy
\[ \mu_{ik} \in [0, 1], \qquad \sum_{k=1}^{c} \mu_{ik} = 1, \qquad i = 1, \ldots, n. \tag{1} \]
The membership function is calculated from the distance of the observations to the cluster centers. The best-known fuzzy clustering method is the fuzzy c-means technique (FCM), initially proposed by Dunn [27]. FCM alternates two steps, (a) calculating the cluster centers and (b) assigning the observations to these centers using a specific form of distance, in order to minimize a standard loss function (SLF):
\[ \mathrm{SLF} = \sum_{k=1}^{c} \sum_{i=1}^{n} \mu_{ik}^{m} \, d_{ik}^{2}, \tag{2} \]
where the cluster center $v_k$ and the membership of observation $i$ in cluster $k$ are calculated by (3) and (4), respectively:
\[ v_k = \frac{\sum_{i=1}^{n} \mu_{ik}^{m} \, x_i}{\sum_{i=1}^{n} \mu_{ik}^{m}}, \tag{3} \]
\[ \mu_{ik} = \left[ \sum_{l=1}^{c} \left( \frac{d_{ik}}{d_{il}} \right)^{2/(m-1)} \right]^{-1}, \tag{4} \]
where $d_{ik}$ is the distance metric for observation $i$ in cluster $k$, that is, the distance between $x_i$ and $v_k$, and $m > 1$ is the fuzziness index.
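The two alternating FCM steps, (a) updating the cluster centers and (b) updating the memberships, can be sketched as follows. This is a minimal illustration of the standard FCM iteration and not the implementation used in the study; the function name `fcm`, the random initialisation, the stopping tolerance, and the toy points are all assumptions made for the example.

```python
import math
import random

def fcm(points, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means on a list of equal-length tuples.

    Returns (centers, u), where u[i][k] is the membership of
    observation i in cluster k and every row of u sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumed initialisation: random memberships, each row normalised to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Step (a): each center is the membership-weighted mean of the data.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / w_sum
                for d in range(dim)))
        # Step (b): memberships follow from the relative distances.
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], centers[k]), 1e-12)
                    for k in range(c)]
            for k in range(c):
                new = 1.0 / sum((dist[k] / dist[l]) ** (2.0 / (m - 1.0))
                                for l in range(c))
                shift = max(shift, abs(new - u[i][k]))
                u[i][k] = new
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u

# Toy data: two well-separated groups of three points each.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, u = fcm(pts, c=2, m=2.0)
labels = [max(range(2), key=row.__getitem__) for row in u]
```

With m = 2, points near a center receive a membership close to 1 in that cluster, but rarely exactly 1, which is the property that motivates the alpha-cut variant.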

Self-Organizing Maps.
The self-organizing map (SOM), a neural network architecture proposed by Kohonen [28], has proved extremely efficient when the input data are high dimensional and complex. SOM is used to discover relationships in a dataset and to cluster data according to their similarity (i.e., similar expression patterns) when the nature of the classification cannot be predicted by the model creators, or when there may be more than one way to cluster the characteristics of a dataset [29]. Figure 1 shows an example of a 4 × 4 SOM.

Artificial Neural Networks.
Several classification techniques exist, including artificial neural networks (ANNs) and others [30], among which artificial neural networks are the most recently applied methods in the literature.
An ANN consists of nodes and the links between them. It takes a number of inputs and produces a single output through an internal weighting system. ANNs can be categorized into single-layer perceptrons and multilayer perceptrons (MLP). The multilayer perceptron consists of multiple layers of simple, two-state, sigmoid processing nodes, or neurons, that act together through an internal weighting system. In addition, the network contains one or more intermediary layers between the input and output layers. Such intermediary layers are called hidden layers, and the nodes embedded in them are called hidden nodes. Figure 2 illustrates a multilayer neural network.

Cox Proportional Hazards Model. According to Cox and Oakes [31], the Cox model is a modelling approach for analysing survival data. Its purpose is to explore simultaneously the effects of several variables on survival. The Cox model is a well-recognised statistical technique for analysing survival data. Survival analysis typically examines the relationship of the survival distribution to covariates. Most commonly, this examination entails the specification of a linear-like model for the log hazard. For example, a parametric model based on the exponential distribution may be written as follows:
\[ \log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}, \tag{5} \]
or, equivalently,
\[ h_i(t) = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}). \tag{6} \]
Equation (5) is a linear model for the log-hazard, and (6) is a multiplicative model for the hazard. In (5), $i$ is a subscript for observation, and the $x$'s are the covariates. The constant $\alpha$ in this model (not to be confused with the alpha-cut threshold of α-FCM) represents a kind of log-baseline hazard, since $\log h_i(t) = \alpha$ [or $h_i(t) = e^{\alpha}$] when all of the $x$'s are zero. Equation (6) is similar to parametric regression models based on other survival distributions.
The Cox model, in contrast, leaves the baseline hazard function $\alpha(t) = \log h_0(t)$ unspecified:
\[ \log h_i(t) = \alpha(t) + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}, \tag{7} \]
or, equivalently,
\[ h_i(t) = h_0(t) \exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}), \tag{8} \]
where (8) is semiparametric because, while the baseline hazard can take any form, the covariates enter the model linearly.
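A short sketch can make two properties of the Cox form concrete: the ratio of two customers' hazards does not depend on the baseline hazard, and right-censored customers still contribute information through the risk sets. The paper does not describe how the coefficients are estimated; the partial log-likelihood below is the standard criterion for the Cox model and is shown here as an assumption, restricted to data without tied event times. The covariates, coefficient values, and function names are hypothetical.

```python
import math

def cox_hazard(h0_t, beta, x):
    """Hazard of one customer: baseline hazard at time t times exp(beta . x)."""
    return h0_t * math.exp(sum(b * v for b, v in zip(beta, x)))

def partial_log_likelihood(beta, X, durations, events):
    """Cox partial log-likelihood, assuming no tied event times.

    events[i] is 1 if customer i churned at durations[i] and 0 if the
    customer is right-censored (still active when observation ended).
    """
    scores = [sum(b * v for b, v in zip(beta, x)) for x in X]
    total = 0.0
    for i, (t_i, churned) in enumerate(zip(durations, events)):
        if not churned:
            continue  # censored rows add no term of their own
        # Risk set: every customer still under observation at time t_i,
        # which includes customers censored at or after t_i.
        risk = [math.exp(scores[j])
                for j, t_j in enumerate(durations) if t_j >= t_i]
        total += scores[i] - math.log(sum(risk))
    return total

# Hypothetical covariates per customer: (monthly minutes / 100, complaints).
X = [(1.2, 0), (0.4, 3), (2.0, 1), (0.9, 0)]
durations = [2, 3, 5, 7]   # months under observation
events = [1, 0, 1, 0]      # 1 = churned, 0 = right-censored
beta = (-0.5, 0.8)         # illustrative coefficients, not fitted values

# The hazard ratio of two customers is the same for any baseline hazard.
ratio_early = cox_hazard(0.01, beta, X[1]) / cox_hazard(0.01, beta, X[0])
ratio_late = cox_hazard(0.20, beta, X[1]) / cox_hazard(0.20, beta, X[0])

pll_at_zero = partial_log_likelihood((0.0, 0.0), X, durations, events)
```

Because the baseline hazard cancels in every term, the coefficients can be estimated without specifying its form, which is what makes the model semiparametric.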

Research Methodology
3.1. Data Set. For this paper, we consider a CRM dataset provided by an Iranian mobile operator. Specifically, the dataset contains 3,150 subscribers, including 495 churners and 2,655 nonchurners, observed from September 2008 to August 2009. The subscribers had to be mature customers who had been with the mobile operator for at least 2 months. Churn was then determined by whether the subscriber left the company during the remaining 10 months. A churned customer is defined as a customer who has not made any contact with the operator (e.g., making a call, charging credit, or changing subscription).

α-FCM + ANN + Cox.
In the third hierarchical model, α-FCM, which is a clustering approach, is used for the data reduction task. In the fuzzy c-means (FCM) clustering algorithm, almost none of the data points has a membership value of 1. Moreover, noise and outliers may make it difficult to obtain appropriate clustering results from the FCM algorithm. Many studies in the literature have therefore addressed the FCM algorithm [32]. These studies can be divided into two categories. One extends the dissimilarity (or distance) measure $d(x_j, v_i)$ between the data point $x_j$ and the cluster center $v_i$ in the FCM objective function by replacing the Euclidean distance with other types of metric measures [33]. The other extends the FCM objective function by adding a penalty term [34]. One of the best methods for assigning a data point to exactly one cluster is the following: if the membership value $\mu_{ij}$ of the data point $x_j$ in the $i$th cluster is larger than a given value $\alpha$, then the point $x_j$ belongs exactly to the $i$th cluster, with membership value 1 in that cluster and $\mu_{i'j} = 0$ for all $i' \neq i$. To guarantee that no two of these $c$ cluster cores overlap, the value of $\alpha$ is restricted to the interval [0.5, 1] [35]. The cluster cores generated by FCM can be calculated by (9), which is equivalent to (10), where $m$ is the fuzziness index, whose value is set to 2. Interested readers are referred to [35] for more detail.
Then, the correctly clustered data are used to train the second-stage ANN model for customer classification. Finally, the hazard function is predicted using Cox regression, based on the correctly classified results from the ANN model.
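The alpha-cut rule can be illustrated with a few lines of code. This is a sketch under stated assumptions rather than the procedure of [35]: the function name `alpha_cut` is hypothetical, memberships are stored as `memberships[j][i]` for data point j and cluster i, and points that fall in no cluster core are returned as `None` on the assumption that the data reduction step treats them as unrepresentative.

```python
def alpha_cut(memberships, alpha=0.7):
    """Assign each data point to at most one cluster core.

    memberships[j][i] is the fuzzy membership of data point j in cluster i.
    A point joins cluster i only if its membership there exceeds alpha;
    otherwise it lies in no core and is labelled None.
    """
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0.5, 1] so that cores cannot overlap")
    labels = []
    for row in memberships:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(best if row[best] > alpha else None)
    return labels

# Three points, two clusters; the middle point is ambiguous.
fuzzy = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
crisp = alpha_cut(fuzzy, alpha=0.7)
kept = [j for j, label in enumerate(crisp) if label is not None]
```

Because the memberships of a point sum to 1, at most one of them can exceed a threshold of 0.5 or more, which is why the interval restriction rules out overlapping cores.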

Evaluation Method.
To evaluate the proposed churn prediction models, the prediction accuracy and the Type I and Type II errors are considered. They can be measured using the confusion matrix shown in Table 1. The rate of prediction accuracy is defined as the number of correctly predicted cases divided by the total number of cases in the confusion matrix.
The Type I error is the error of rejecting a null hypothesis when it is the true state of nature. In this paper, with the null hypothesis that the customer does not churn, it means that the customer has not churned although the model predicted a hazard for that customer greater than α (here, α is the alpha-cut used in the fuzzy c-means clustering method). The Type II error, on the other hand, is the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature. It means that the customer has churned although the model predicted a survival probability for that customer greater than α.
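The two error counts can be computed directly from the operational definitions above. The sketch below is an illustration only: it assumes that the model outputs a churn probability per customer and that the survival probability is its complement, and it takes accuracy as the share of cases that fall into neither error category. The function and variable names are hypothetical.

```python
def evaluate(churned, churn_prob, alpha=0.7):
    """Count Type I and Type II errors at a given alpha threshold.

    churned[i] is True if customer i actually churned.
    churn_prob[i] is the predicted churn (hazard) probability; the
    survival probability is assumed to be 1 - churn_prob[i].
    """
    type_i = type_ii = 0
    for actual, p in zip(churned, churn_prob):
        if not actual and p > alpha:
            type_i += 1    # active customer flagged as a churner
        elif actual and (1.0 - p) > alpha:
            type_ii += 1   # churner predicted to survive
    n = len(churned)
    return {
        "type_i": type_i,
        "type_ii": type_ii,
        "accuracy": 1.0 - (type_i + type_ii) / n,
    }

# Five illustrative customers.
actual = [False, False, True, True, True]
predicted = [0.9, 0.2, 0.1, 0.8, 0.5]
report = evaluate(actual, predicted, alpha=0.7)
```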
We also compare the performance of the proposed models with the pure Cox proportional hazards model in predicting the churn or survival probability of the customers. The observed outcome for each customer in the sample is either churn or survival (i.e., still active) by the end of the study period. We compute the deviation between observed and predicted outcomes (i.e., the probability of churn or survival as predicted by the model) for both the proposed and the pure Cox models. The Root Mean Squared Error (RMSE) and Mean Absolute Deviation (MAD) are calculated for comparing the models from the following quantities: $S_i^{ch}$ and $S_j^{Nch}$ are the survival probabilities of churned customer $i$ and nonchurned customer $j$, respectively; $N_{ch}$ and $N_{Nch}$ are the numbers of churned and nonchurned customers, respectively; $e_i^{ch}$ is the deviation of churned customer $i$ from zero (i.e., $e_i^{ch} = S_i^{ch} - 0$), and $e_j^{Nch}$ is the deviation of nonchurned customer $j$ from one (i.e., $e_j^{Nch} = 1 - S_j^{Nch}$); and $\bar{e}_{ch}$ and $\bar{e}_{Nch}$ are the means of the deviations of churned and nonchurned customers, respectively.
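One way to compute the two metrics from these definitions is sketched below. The exact formulas are not reproduced in this text, so the pooled forms used here are assumptions: RMSE is taken as the root of the mean squared deviation over all customers, and MAD as the mean absolute distance of each deviation from the mean deviation of its own group. The function name `rmse_mad` is hypothetical.

```python
import math

def rmse_mad(surv_churned, surv_nonchurned):
    """Pooled RMSE and MAD of predicted survival probabilities.

    Deviations follow the definitions in the text: a churned customer
    should have survival 0, a nonchurned customer survival 1.
    """
    e_ch = [s - 0.0 for s in surv_churned]
    e_nch = [1.0 - s for s in surv_nonchurned]
    n = len(e_ch) + len(e_nch)

    # Assumed form: root of the mean squared deviation over both groups.
    rmse = math.sqrt((sum(e * e for e in e_ch) + sum(e * e for e in e_nch)) / n)

    # Assumed form: absolute distance from each group's mean deviation.
    mean_ch = sum(e_ch) / len(e_ch)
    mean_nch = sum(e_nch) / len(e_nch)
    mad = (sum(abs(e - mean_ch) for e in e_ch)
           + sum(abs(e - mean_nch) for e in e_nch)) / n
    return rmse, mad

# Two churned and two nonchurned customers with illustrative predictions.
rmse, mad = rmse_mad([0.1, 0.3], [0.9, 0.7])
```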

Experimental Results
4.1. The Baseline. To create the Cox model, 2,350 observations are used for training and the remaining 800 for testing. Table 2 shows the prediction performance of the baseline Cox proportional hazards model in terms of Type I and II errors, accuracy, RMSE, and MAD. On average, the baseline Cox proportional hazards model provides about 84% accuracy, meaning that in 128 cases the Cox model was unable to predict the survival and hazard probability correctly with an alpha-cut value of 0.7. The Type I and II errors amount to 87 and 41 incorrectly predicted cases, respectively. The baseline Cox model also yields an RMSE of 0.083 and a MAD of 0.098.
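The reported baseline figures are mutually consistent, as a short check shows. The only assumption is that accuracy is measured on the 800 held-out cases.

```python
test_size = 800            # cases held out for testing the baseline
type_i, type_ii = 87, 41   # reported error counts

errors = type_i + type_ii
accuracy = 1 - errors / test_size
print(errors, f"{accuracy:.2%}")
```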

ANN + ANN + Cox.
In the first hierarchical model, which combines two ANN models and Cox regression, the first ANN model performs the data reduction task. We therefore run the ANN model with a set of different hidden layer and learning epoch settings. The results of the different combinations are given in Table 3; hidden-layer settings of 16 and 12 with 100 and 300 learning epochs are selected for the two ANN models, respectively. Finally, the accuracy and other performance metrics of the hierarchical ANN + ANN + Cox model are shown in Table 4. To show the high performance of the α-FCM + ANN + Cox hierarchical model, the accuracy, the Type I and II errors, and the RMSE and MAD metrics are illustrated in Figures 3, 4, and 5, respectively.

Conclusion
As customers are the main competitive advantage of any industry, customer churn prediction has become a major task for companies that want to remain competitive. Building an effective customer churn prediction model that provides an acceptable level of accuracy has therefore become a research problem for companies in recent years. The literature reports, over a number of different problem domains, that hierarchical data mining techniques, which combine two or more techniques, offer better applicability and efficiency in predicting customer attrition. In this paper, we consider three hierarchical data mining techniques, based on combinations of neural networks and a regression model, and examine their performance for the telecommunications industry. In particular, backpropagation artificial neural networks (ANN), self-organizing maps (SOM), alpha-cut fuzzy c-means (α-FCM), and the Cox proportional hazards model are considered. Consequently, the ANN + ANN + Cox, SOM + ANN + Cox, and α-FCM + ANN + Cox hierarchical models are developed, in which the first component filters out unrepresentative data or outliers. The corrected clustered output is then used to classify customers into churner and nonchurner groups.
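To evaluate the performance of the hierarchical models, an Iranian mobile dataset is considered. The experimental results show that the hierarchical models outperform the single Cox regression baseline model in terms of prediction accuracy, Type I and II errors, RMSE, and MAD. In addition, the α-FCM + ANN + Cox model performs significantly better than the SOM + ANN + Cox and ANN + ANN + Cox models. For future work, other prediction techniques can be applied, such as support vector machines, genetic algorithms, and logistic regression. Finally, churn datasets from other domains can be used for further comparison.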

Figure 3: The accuracy of hierarchical models.

Figure 4: The Type I and II errors of hierarchical models.

Figure 5: The RMSE and MAD errors of hierarchical models.
Development
3.2.1. The Baseline. Because the last part of every proposed hierarchical method is Cox regression, and the final aim of these methods is to determine better hazard and survival functions for customer churn prediction, we use the original dataset to create a Cox proportional hazards regression model as the baseline for comparison.
The first hierarchical model combines two ANN models and Cox regression, in which the first ANN performs the data reduction task, the second ANN performs churn classification, and the final Cox regression predicts the hazard function. Since no model achieves 100% accuracy, the first ANN model predicts some of the training data correctly and some incorrectly. Consequently, the incorrectly predicted data can be regarded as outliers, since the ANN model cannot predict them accurately. Then, the data correctly predicted by the first ANN model are used to train the second ANN model as the classification model. Finally, the correctly classified data from the second ANN are used by Cox regression to predict the hazard function.
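The filter-then-classify idea behind ANN + ANN + Cox can be sketched independently of the learner. In the example below a nearest-centroid rule stands in for the ANN purely to keep the code self-contained; it is not a neural network, and the helper names `filter_then_train` and `fit_nearest_centroid` are hypothetical. The final Cox stage is omitted.

```python
def fit_nearest_centroid(X, y):
    """Stand-in learner (NOT an ANN): predict the class of the closest centroid."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    centroids = {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }

    def predict(x):
        return min(
            centroids,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
        )

    return predict

def filter_then_train(fit, X, y):
    """Two-stage scheme: drop the rows the first model gets wrong, then refit."""
    first = fit(X, y)
    # Rows the first model cannot reproduce are treated as outliers.
    keep = [i for i, (x, label) in enumerate(zip(X, y)) if first(x) == label]
    second = fit([X[i] for i in keep], [y[i] for i in keep])
    return second, keep

# One-dimensional toy data; the last row carries a label that does not
# match its position and should be filtered out.
X = [(0.0,), (1.0,), (9.0,), (10.0,), (0.5,)]
y = [0, 0, 1, 1, 1]
classifier, kept_rows = filter_then_train(fit_nearest_centroid, X, y)
```

The rows in `kept_rows` are the ones that would be passed on to the second-stage classifier and, after classification, to the Cox regression.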

Table 2: The prediction performance of the baseline Cox proportional hazards model.

SOM + ANN + Cox.
We found that the 4 × 4 SOM performs best, providing the highest rate of accuracy for two clusters, that is, the churner and nonchurner clusters. The accurately clustered data are then used to train the classifier ANN; the results for different hidden layer and learning epoch settings are given in Table 5, and an ANN model with a hidden-layer setting of 16 and 200 epochs is selected as the classifier. Finally, the accuracy and other performance metrics of the hierarchical SOM + ANN + Cox model are shown in Table 6.

α-FCM + ANN + Cox.
The prediction performance of the ANN in the α-FCM + ANN + Cox model is reported in Table 7. Finally, Table 8 shows the performance metrics of the α-FCM + ANN + Cox hierarchical model. On average, the α-FCM + ANN + Cox hierarchical model provides about 95.49% accuracy with an alpha-cut of 0.7 and an ANN with a hidden-layer setting of 16 and 100 learning epochs. The Type I and II errors amount to 21 and 12 incorrectly predicted cases, respectively. The model also yields an RMSE of 0.031 and a MAD of 0.042.

Table 3: Prediction performance of ANN + ANN hierarchical models.

Table 4: Performance metrics of ANN + ANN + Cox hierarchical models.

Table 5: Prediction performance of ANN in SOM + ANN + Cox hierarchical models.

Table 6: Performance metrics of SOM + ANN + Cox hierarchical models.

Table 7: Prediction performance of ANN in α-FCM + ANN + Cox hierarchical models.

Table 8: Performance metrics of α-FCM + ANN + Cox hierarchical models.