Improved Expectation Maximization Algorithm for Gaussian Mixed Model Using the Kernel Method

Fraud activities have contributed to heavy losses suffered by telecommunication companies. In this paper, we attempt to use the Gaussian mixed model, a probabilistic model normally used in speech recognition, to identify fraud calls in the telecommunication industry. We look at several issues encountered when calculating the maximum likelihood estimates of the Gaussian mixed model using an Expectation Maximization algorithm. Firstly, we look at a mechanism for determining the initial number of Gaussian components and the choice of the initial values of the algorithm using the kernel method. We show via simulation that the technique improves the performance of the algorithm. Secondly, we develop a procedure for determining the order of the Gaussian mixed model using the log-likelihood function and the Akaike information criterion. Finally, for illustration, we apply the improved algorithm to real telecommunication data. The modified method will pave the way to introduce a comprehensive method for detecting fraud calls in future work.


Introduction
Every year telecommunication companies register losses amounting to millions of dollars due to fraud activities. Examples of such activities given by [1,2] include the use of a customer's line for Premium Rate Services without their knowledge, or autodialers with no intention to pay for the outgoing calls; the use of a PABX for international calls; an unregistered user with an assigned number accessing the network (such activity is called "stolen line unknown"); and international roaming manipulation. Vendors, seeing the above as an opportunity not to be missed, compete to provide data mining applications that can detect the said activities effectively using various methods such as OLAP, deviation-based outlier detection, and hidden Markov models. The focus of this paper is the Gaussian mixed model, henceforth GMM.
A GMM is best known for providing a robust speaker representation for the difficult task of speaker identification on short-time speech spectra [3]. Its function is extended here to detect fraud activities based on the number (as well as the length) of domestic and international calls made on a daily basis during office, evening, and night hours. Reference [4] presented three approaches to fraud detection in communication networks: neural networks with supervised learning, probability density estimation methods, and Bayesian networks. Information describing a subscriber's behavior kept in toll tickets was used. For example, supervised learning used summary statistics over the whole observed time period (especially the number of times fraud activities were recorded in the data). The two latter approaches used a subscriber's daily behavior. To improve the fraud detection system, they recommended combining the three presented methods together with the incorporation of rule-based systems.
The maximum likelihood estimate for a GMM is generally difficult to obtain directly, but it is made easier with the availability of the Expectation Maximization (EM) algorithm, which was first introduced by [5]. Since then, there has been a significant increase in its use, especially in finding maximum likelihood estimates for probabilistic models. For example, [6,7] developed an online system for detecting fraud calls using a hierarchical switching generative model. The model is trained by using the EM algorithm on an incomplete data set and is further improved by using a gradient-based discriminative method. In this paper, we propose an improved EM algorithm for GMM for detecting fraud calls in telecommunication.
This paper is organized as follows. Section 2 gives an introduction to GMM. We then describe the EM algorithm for a GMM, the kernel method, and eventually the proposed modified EM algorithm for GMM in Section 3. In Section 4, we study the performance of the modified algorithm in estimating the parameters and the effect of overlapping areas of Gaussian components in a GMM. In the next section, we propose graphical plots to identify the "best" number of components in a GMM. For illustration, an application of the improvement on a real data set is presented in Section 6.

Gaussian Mixed Model
Let $\mathbf{x} \in \mathbb{R}^d$ and let $K$ be the number of components, where each component has its own prior probability $a_k$ and probability density function with mean $\mu_k$ and covariance $\Sigma_k$, $k = 1, \ldots, K$. A Gaussian mixed model is then given by
$$f(\mathbf{x} \mid \theta) = \sum_{k=1}^{K} a_k\, \phi(\mathbf{x} \mid \mu_k, \Sigma_k),$$
where $\sum_{k=1}^{K} a_k = 1$. We next define the likelihood function and the log-likelihood function by
$$L(\mathbf{X} \mid \theta) = \prod_{i=1}^{n} f(\mathbf{x}_i \mid \theta), \qquad \ell(\mathbf{X} \mid \theta) = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} a_k\, \phi(\mathbf{x}_i \mid \mu_k, \Sigma_k)\Big),$$
respectively. The maximum likelihood estimation (m.l.e.) method aims at finding the $\theta$ that maximizes $\ell(\mathbf{X} \mid \theta)$; see [8]. The expression $\log(\sum_{k=1}^{K} a_k \phi(\mathbf{x}_i \mid \mu_k, \Sigma_k))$ in $\ell(\mathbf{X} \mid \theta)$ is difficult to maximize directly. We use the Expectation Maximization (EM) algorithm to overcome this problem.
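As a concrete illustration (not taken from the paper), the mixture density and its log-likelihood can be evaluated directly for univariate data; the function name and NumPy layout below are our own sketch.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of univariate data under a K-component
    Gaussian mixture; `weights` must sum to one."""
    x = np.asarray(x, dtype=float)[:, None]    # shape (n, 1)
    w = np.asarray(weights, dtype=float)       # shape (K,)
    mu = np.asarray(means, dtype=float)
    var = np.asarray(variances, dtype=float)
    # Component densities phi(x_i | mu_k, Sigma_k), shape (n, K)
    dens = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    # Mixture density f(x_i | theta) = sum_k a_k * phi_k, then log and sum
    return float(np.log(dens @ w).sum())
```

With $K = 1$ this reduces to the ordinary Gaussian log-likelihood, which gives a quick sanity check of the implementation.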

EM for Gaussian Mixed Model.
In the general setup of the EM algorithm given in [5], the authors considered an unobservable variable $x$ in sample space $\mathcal{X}$, which is indirectly observed through an observed variable $y$ in sample space $\mathcal{Y}$. Assuming that $f(x \mid \phi)$ is the sampling density depending on the parameter $\phi \in \Omega$, the corresponding family of sampling densities for $y$, say $g(y \mid \phi)$, can be derived from
$$g(y \mid \phi) = \int_{\mathcal{X}(y)} f(x \mid \phi)\, dx,$$
where $\mathcal{X}(y)$ is the subset of $\mathcal{X}$ mapped to $y$ under the mapping $x \to y(x)$ from $\mathcal{X}$ to $\mathcal{Y}$.

Define
$$Q(\phi' \mid \phi) = E\big[\log f(x \mid \phi') \mid y, \phi\big],$$
with the expectation assumed to exist for all pairs $(\phi', \phi)$ and $f(x \mid \phi) > 0$ for $\phi \in \Omega$. According to [5], the EM iteration consists of two steps, namely, the E-step and the M-step. At the $p$th iteration with the estimate of $\phi$ denoted by $\phi^{(p)}$, the E-step will compute $Q(\phi \mid \phi^{(p)})$ and the M-step will find a new estimate of $\phi$, say $\phi^{(p+1)}$, that maximizes $Q(\phi \mid \phi^{(p)})$.
The steps are repeated until convergence is achieved.
For the case of a GMM, we define the latent variables $z_i$, where $z_i \in \{1, 2, \ldots, K\}$ and $z_i = k$ if the $i$th sample is generated by the $k$th mixture component. By applying, amongst others, the Bayes formula $p(z \mid x) \propto p(x \mid z)\, p(z)$, where $p(z \mid x)$ is the posterior probability, $p(x \mid z)$ is the likelihood function, and $p(z)$ is the prior probability, the iteration simplifies to the following equations (see [9,10]):
$$w_{ik} = \frac{a_k\, \phi(\mathbf{x}_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} a_j\, \phi(\mathbf{x}_i \mid \mu_j, \Sigma_j)},$$
where $w_{ik}$ is the posterior probability that the $i$th observation comes from the $k$th component, and
$$a_k = \frac{1}{n}\sum_{i=1}^{n} w_{ik}, \qquad \mu_k = \frac{\sum_{i=1}^{n} w_{ik}\, \mathbf{x}_i}{\sum_{i=1}^{n} w_{ik}}, \qquad \Sigma_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^T}{\sum_{i=1}^{n} w_{ik}}.$$
Hence, the EM iteration for a GMM consists of the E-step, which computes the posterior probabilities $w_{ik}$, and the M-step, which updates $a_k$, $\mu_k$, and $\Sigma_k$ using the equations above.
The above steps (i.e., E-step and M-step) are repeated until convergence is achieved.
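The E- and M-steps above can be sketched for univariate data as follows; this is our own minimal implementation of the standard updates, with a simple log-likelihood stopping rule (the paper does not specify a particular convergence criterion).

```python
import numpy as np

def em_gmm(x, weights, means, variances, n_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture, starting from the
    supplied initial values; returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    w, mu, var = (np.asarray(a, dtype=float).copy()
                  for a in (weights, means, variances))
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior responsibilities w_ik via the Bayes formula
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        joint = dens * w
        mix = joint.sum(axis=1)
        r = joint / mix[:, None]
        # M-step: weighted updates of priors, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Stop when the log-likelihood no longer improves
        ll = np.log(mix).sum()
        if ll - prev < tol:
            break
        prev = ll
    return w, mu, var
```

On two well-separated clusters the fitted means recover the true component means closely, which matches the simulation behavior reported later in the paper.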

The Kernel Method.
The kernel method can be used to find the probability density estimate for univariate data; see, for example, [11]. Let $a < \min(X_i) - 3h$ and $b > \max(X_i) + 3h$; let $M = 2^r$ for some integer $r$; let $h$ be the bandwidth; let $\delta = (b - a)/M$; and let $t_j = a + j\delta$ be the $j$th grid point, where $j = 0, 1, \ldots, M - 1$. The density estimate at grid point $t_j$ is represented by the following equation:
$$\hat{f}(t_j) = \frac{1}{nh} \sum_{i=1}^{n} K\Big(\frac{t_j - X_i}{h}\Big),$$
where $K$ is the Gaussian kernel. To compute $\hat{f}(t)$ at a grid of points, a method which makes use of the Fourier transform is employed. Let $\tilde{f}(s)$ be the Fourier transform of the kernel density estimate $\hat{f}(t)$. It can be shown that $\hat{f}$ is the convolution of the (discretized) data with the kernel, so that $\tilde{f}(s)$ is the product of the Fourier transforms of the kernel and of the data.
We will use the following algorithm by [11] to discretize the data to very fine grids and to find $\hat{f}(t)$ by convolving the data with the kernel.
Step C. Find the sequence $\{\zeta_l^*\}$, where
$$\zeta_l^* = \exp\big(-\tfrac{1}{2} h^2 s_l^2\big)\, \zeta_l,$$
with $\{\zeta_l\}$ denoting the discrete Fourier transform of the discretized data. Here, $h = 0.9 A n^{-1/5}$, where $A = \min(\text{sd}, \text{IQR}/1.34)$, sd is the standard deviation, and IQR is the interquartile range. This bandwidth is chosen following [11], who claimed that it is useful for a wide range of densities.
Step D. Let $\hat{f}_j$ be the inverse discrete Fourier transform of $\{\zeta_l^*\}$. It can be shown that $\hat{f}(t_j) \approx \hat{f}_j$. We then identify any grid point $t_j$ whose density estimate, denoted by $\hat{f}(t_j)$, is greater than those of its nearest neighbors $t_{j+1}$ and $t_{j-1}$; in other words, $\hat{f}(t_j) > \hat{f}(t_{j+1})$ and $\hat{f}(t_j) > \hat{f}(t_{j-1})$. Refer to Figure 1, where the vertical line that touches $t_j$ and $\hat{f}(t_j)$ shows the location of the peak. Note that we may obtain more than one local maximum, which means that the data may consist of more than one Gaussian component. These results form a very important part of the improved EM algorithm for GMM to be described next.
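Steps C and D amount to smoothing the data with a Gaussian kernel at the reference bandwidth and then locating grid points that dominate both neighbors. The sketch below (our own) evaluates the kernel sum directly on the grid rather than through the FFT convolution; the resulting estimate is the same up to discretization error.

```python
import numpy as np

def kde_peaks(x, m=512):
    """Gaussian kernel density estimate on a grid and its local maxima.

    Uses the reference bandwidth h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)
    from the text. Returns (grid, density, peak locations).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h = 0.9 * min(x.std(ddof=1), iqr / 1.34) * n ** (-0.2)
    a, b = x.min() - 3 * h, x.max() + 3 * h
    t = np.linspace(a, b, m)                 # grid points t_j
    # f_hat(t_j) = (1/(n*h)) * sum_i K((t_j - x_i)/h), Gaussian kernel K
    u = (t[:, None] - x[None, :]) / h
    f = np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    # Local maxima: f(t_j) strictly above both nearest neighbors
    peaks = t[1:-1][(f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])]
    return t, f, peaks
```

The number of peaks returned serves as the initial number of components $K_0$ in Step 1 of the improved algorithm, and the peak locations serve as the initial means.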

Improved EM Algorithm for GMM.
A number of authors have highlighted the importance of identifying the right number, say $K$, of components in a GMM and subsequently choosing good initial values for the model parameters $\mu_k$ and $\sigma_k^2$, $k = 1, 2, \ldots, K$, in the EM algorithm. Reference [12] noted the difficulty of using log-likelihood-ratio statistics to test the number of components and subsequently suggested using a nonparametric bootstrapping approach. Similarly, [13] pointed out the same concerns and introduced an algorithm called the stepwise-split-and-merge EM algorithm to solve the said problem. In addition, [14] investigated the possibility of using the minimization of the Kullback-Leibler distance between the fitted mixture model and the true density as a method for estimating $K$, where the said distance was estimated using cross validation. Reference [15] viewed the mixture distribution as a contaminated Gaussian density and proposed a recursive algorithm called the Gaussian Mixture Density Decomposition algorithm for identifying each Gaussian component in the mixture. Other works on this topic can also be found, for example, in [16,17].
In this paper, we propose an improved EM algorithm for GMM which can perform both tasks: identifying the initial number of components and providing automatic initial values for the EM algorithm.The full improved EM algorithm for a GMM is now presented.
Step 1. The kernel method as described in Section 3.2 is used to determine the number, say $K_0$, of components and also the corresponding mean $\mu_k$ of each component, where $k = 1, 2, \ldots, K_0$. The initial estimates of the standard deviations $\sigma_k$ are set to unity while the prior weights $a_k$ are set to $1/K_0$.
Step 2. The EM algorithm for a GMM as described in Section 3.1 is executed to give the final estimates of the parameters $a_k$, $\mu_k$, and $\sigma_k$, $k = 1, 2, \ldots, K_0$. The log-likelihood function $\ell$ and the Akaike information criterion (AIC) are calculated using the said parameters.
Step 3. Step 2 is repeated for other possible numbers $K$ of components, with $\mu_k = 0$ and $\sigma_k = 1$ for the additional $K - K_0$ components and $a_k = 1/K$ for all components.
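The model-selection bookkeeping behind these steps can be sketched as follows; this is our own reading of the univariate case, where each component contributes a mean and a variance and the $K$ prior weights carry $K - 1$ free parameters, so $p = 3K - 1$.

```python
def aic(loglik, k):
    """AIC = 2p - 2*loglik for a univariate k-component GMM,
    with p = 3k - 1 free parameters (k means, k variances,
    k - 1 independent prior weights)."""
    return 2 * (3 * k - 1) - 2 * loglik

def select_k(logliks):
    """Choose the number of components minimising the AIC.

    `logliks` maps each candidate k to the fitted log-likelihood
    obtained from Step 2."""
    return min(logliks, key=lambda k: aic(logliks[k], k))
```

When adding a component raises the log-likelihood by less than the AIC penalty for the extra parameters, the smaller model is retained, which is exactly the stopping behavior described in Steps 3 and 4.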

Simulation
We use simulation to investigate the performance of the proposed modified algorithm.

Study of Performance Based on the Log-Likelihood Function.
We first look at the performance of the standard method, called Method 1, followed by that of the modified method, called Method 2. For Method 1, in place of Step 1 of the modified method, we assign the values zero and unity, respectively, to the means and variances of all components. We compare the performance by looking at the log-likelihood function via a simulation study. Following [20], we consider two cases with two components and one case with three components, with the true values of the parameters given in Table 1. For each case, we generate 100 samples of size 1000, where the chosen sample size reflects the large size of data sets found in the telecommunication industry, the focus of our interest. We then apply Method 1 and Method 2 to the simulated data. For each case, for better quality viewing, we plot only 50 values of the log-likelihood function for both methods on the same plot, as given in Figure 2. It can be seen that, for Samples 1 and 3, the proposed Method 2 clearly outperforms the standard Method 1, with the values of the log-likelihood function corresponding to Method 2 being always larger than those of Method 1. However, we see that some values overlap for Sample 2, though the proposed Method 2 still generally performs better. In this case, the prior probabilities $a_k$ are distinctly different from the chosen values of $a_k$ in Sample 1 while the other true values remain the same, which leads to different percentages of overlapping of the Gaussian components in the GMM. Hence, we will investigate the performance of the improved EM algorithm in estimating the parameters of the GMM by taking into account the effect of different percentages of overlapping between the components observed in the data.

The Effects of Different Overlapping Percentages on Performance.

The main objective here is to investigate the performance of the modified EM algorithm for different overlapping percentages of the components in the GMM. For simplicity, we restrict our attention to two components so that $\theta = (\theta_1, \ldots, \theta_6) = (a_1, a_2, \mu_1, \mu_2, \Sigma_{11}, \Sigma_{22})$ is to be estimated. Data are simulated using the simulation scheme described in Section 4.1.
After performing Steps 1 and 2, we find $d_i = \theta_i - \hat{\theta}_i$, where $\theta_i$ is the true value of the $i$th parameter and $\hat{\theta}_i$ is the EM estimate of the parameter, $i = 1, 2, \ldots, 6$. The sample mean $\bar{d}$ and standard deviation $s_d$ of the $d_i$ over the simulation replicates are computed using the usual formulas. The estimates are considered good if $\bar{d}$ is close to zero, indicating small biases observed in the simulation results, and $s_d$ is also close to zero, indicating that the parameter estimates are concentrated around their respective true values. We determine the area of overlapping between the two components for each model by using the misclassification concept given in [21], the details of which are provided in Appendix B. The formula to estimate the overlapping areas depends on the means and standard deviations of the components. The prior probabilities should not affect the estimates greatly as their sum equals unity.
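The bias summaries described above can be computed in a few lines; the replicate-by-parameter array layout below is our own convention, not the paper's.

```python
import numpy as np

def bias_summary(true_params, estimates):
    """Per-parameter mean and standard deviation of
    d_i = theta_i - theta_hat_i across simulation replicates.

    `estimates` holds one row per replicate and one column per
    parameter; `true_params` is a single vector of true values.
    """
    d = np.asarray(true_params, dtype=float) - np.asarray(estimates, dtype=float)
    return d.mean(axis=0), d.std(axis=0, ddof=1)
```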
We consider three cases for different combinations of the parameter $\theta$ which give different percentages of overlapping of the GMM components. The results are tabulated in Tables 2-4. Table 2 deals with case 1, where the true values $\mu_1 = 0$, $\mu_2 = 3.0$, and $\sqrt{\Sigma_{11}} = \sqrt{\Sigma_{22}} = 0.316$ are fixed but the true values of $a_1$ and $a_2$ are varied. In all cases, the percentage of overlapping is 0% as the separation of the means is rather large relative to the small values of dispersion. We can see that the values of the mean are close to zero with small standard errors less than unity for all parameters considered. On the other hand, Table 3 gives the results for case 2, where $\mu_1 = 0$, $\mu_2 = 1.0$, $\sqrt{\Sigma_{22}} = 0.707$, and $\sqrt{\Sigma_{11}} = 0.447$ are fixed but $a_1$ and $a_2$ are varied to give 25% overlapping. The bias is still considered small but is generally larger than that for case 1. In addition, the values are also more dispersed here. Finally, Table 4 shows the results of case 3, where $\mu_1 = 0$, $\mu_2 = 0.25$, $\sqrt{\Sigma_{11}} = 0.577$, and $\sqrt{\Sigma_{22}} = 1.414$ are fixed, giving 45% overlapping. As expected, the results deteriorate when the percentage of overlapping increases. We conclude that the modified EM algorithm for GMM performs well when the percentages of overlapping are small, but its performance is affected when the percentages increase. More comprehensive simulation results can be obtained from the authors upon request.

Determination of the Final Number of Components in the GMM
In the last two steps of the modified algorithm, we intend to confirm that the choice of the initial number $K_0$ of components in the GMM using the kernel method is final. This can be done by considering extra components in the model. For that, as stated in Section 3.3, we repeat Step 2 for other possible numbers $K$ of components and compare the resulting log-likelihood and AIC values.

Real Example: Phone Call Data
The call detail record, which was supplied by Telekom Malaysia Berhad (henceforth, TM), consists of calls made by customers who fell victim to fraud activities. Table 5 shows the format of the call detail record for each TM customer. We performed several steps on the original data in order to have the data in the desired format; that is, we grouped the real data according to service number, found the country that matches the country code as well as the dialed digits, and sorted the real data according to seize time. The column entitled "Seize time" gives the time when the call was made; the fourth column details the duration of the calls in the following format: hour (hh), minute (mm), and second (ss); and the fifth column is the result of converting the information in the fourth column into day format. We consider real data consisting of the converted duration of each call made by Customer A (referring to the fifth column of Table 5), whose identity is not revealed to ensure confidentiality, on March 31, 2011. Seventeen (17) calls were made and the data are displayed in Figure 3.
Step 1 of the improved EM algorithm for GMM identifies two initial components. The plots of the log-likelihood function and AIC in Figures 4(a) and 4(b) are the results of performing Steps 2, 3, and 4 of the improved EM algorithm for GMM, which reveal that the EM algorithm fails to achieve convergence when the number of components equals five or more. It can also be seen that a GMM with 2 components is identified as the "best" model, since the inclusion of more components not only fails to increase the value of the log-likelihood but also fails to decrease the value of the AIC. The final EM estimates for the two-component GMM are $\hat{a}_1 = 0.64$, $\hat{a}_2 = 0.36$, $\hat{\mu}_1 = -0.66$, $\hat{\mu}_2 = 1.17$, $\hat{\Sigma}_{11} = 0.07$, and $\hat{\Sigma}_{22} = 0.35$, and they represent the behavior of calls made by Customer A on March 31, 2011. In a future paper, we will show how the above information produced from the improved EM algorithm for GMM can be used in the process of detecting fraud activities in the telecommunication industry.

Conclusion
In this paper, we proposed a modified EM algorithm which can numerically identify the number of components of a GMM and estimate the parameters of the model using the kernel method. We showed via simulation that the performance of the algorithm is generally good but, as expected, is affected by increasing percentages of overlapping of the Gaussian components. We then used line plots of the log-likelihood and AIC values to identify the final number of GMM components. This number could clearly be determined via the concave-like shape of the AIC plot, which indicates that the AIC decreases to a minimum value and then increases as the number of components increases. Finally, the modified EM algorithm for GMM was tested on real telecommunication data. The results serve as testimony to the effectiveness of the improved EM algorithm for GMM and should be useful when considering the problem of fraud calls faced by telecommunication companies.

Figure 2 :
Figure 2: Plots of values of log-likelihood function.

Figure 3 :
Figure 3: Duration (in day format) is displayed in the histogram.

Figure 4 :
Figure 4: Plots of log-likelihood and AIC values.

Table 1 :
List of true values of the $a$'s, $\mu$'s, and $\Sigma$'s.
Step 4. The log-likelihood function and AIC values for $K = 1, 2, \ldots, 10$ are plotted. The final number of components $K^*$ is chosen when adding extra components to the model neither significantly increases the log-likelihood function nor significantly decreases the AIC.
We repeat Step 2 for other possible numbers $K$ of components, by setting $\mu_k = 0$ and $\sigma_k = 1$ for the additional $K - K_0$ components and $a_k = 1/K$. The final number of components $K^*$ is determined when adding an extra component neither increases the log-likelihood nor decreases the AIC values significantly. The changes can easily be seen on a line plot of the values.

Table 5 :
An extract from Customer A's call detail record.