Latent Clustering Models for Outlier Identification in Telecom Data

Collected telecom data traffic has boomed in recent years, due to the development of 4G mobile devices and other similar high-speed machines. The ability to quickly identify unexpected traffic data in this stream is critical for mobile carriers, as such data can be caused by either fraudulent intrusion or technical problems. Clustering models can help to identify issues by revealing patterns in network data, which quickly catches anomalies and highlights previously unseen outliers. In this article, we develop and compare clustering models for telecom data, focusing on those that manage time-stamp information. Two main models are introduced, solved in detail, and analyzed: Gaussian Probabilistic Latent Semantic Analysis (GPLSA) and the time-dependent Gaussian Mixture Model (time-GMM). These models are then compared with other clustering models, such as the Gaussian model and GMM (which do not use time-stamp information). We perform computations on both sample and real telecom traffic data to show that the efficiency and robustness of GPLSA make it the superior method for detecting outliers and providing results automatically, with little tuning or expertise required.


Introduction
High-speed telecom connections have developed rapidly in recent years, which has resulted in a major increase in data flow through networks. Beyond the issues of storage and management of this flow of data, a major challenge is how to select and use this mass of material to better understand a network. The detection of behaviors that differ from normal traffic patterns is a critical element, since such discrepancies can reduce network efficiency or harm network infrastructures. And because those anomalies can be caused by either a technical equipment problem or a fraudulent intrusion in the network, it is important to identify them accurately and fix them promptly. Data-driven systems have been developed to detect anomalies using machine learning algorithms and can automatically extract information from raw data to promptly alert a network manager when an anomaly occurs.
The data collected in telecom networks contains values for different features (related to network resource and usage) as well as time stamps, and those values can be modeled and processed to seek and detect anomalies using unsupervised algorithms. The algorithms use unlabeled data and assume that information about which data elements are anomalies is unknown (since anomalies in traffic data are rare and may take many forms). They do not directly detect anomalies but instead separate and distinguish data structures and patterns in order to group data, from which "zones of anomalies" are deduced. The main advantage of this methodology is the ability to quickly detect previously unseen or unexpected anomalies.
Another component to be taken into consideration for understanding wireless network data behavior is time stamps. This information is commonly collected when data are generated but is not widely used in classic anomaly detection processes. However, since network load fluctuates over the course of a day, adding time-stamp attributes to an evaluation model can allow us to discover periodic behaviors. For example, a normal value during a peak period may be an anomaly outside that period and thus remain undetected by an algorithm that does not take time stamps into account.
In this article, we use unsupervised models to detect anomalies. Specifically, we focus on algorithms combining both values and dates (time stamps) and introduce two new models to this end. The first is the time-dependent Gaussian Mixture Model (time-GMM), a time-dependent extension of GMM [1] that considers each period of time independently. The second is Gaussian Probabilistic Latent Semantic Analysis (GPLSA), derived from Probabilistic Latent Semantic Analysis (PLSA) [2], which processes values and dates together in a single machine learning algorithm. The latter algorithm is well known in the text-mining and recommender-system areas but has rarely been used in other domains such as anomaly detection. In this research, we fully implement these two algorithms in R [3] and test their ability to find anomalies and to adapt to new patterns on both sample and traffic data. We also compare the robustness, complexity, and efficiency of these algorithms.
The rest of the article is organized as follows: in Section 2, we present an overview of techniques to identify anomalies, with an emphasis on unsupervised models. In Section 3, we define different unsupervised anomaly detection models, including the two new models: GPLSA and time-GMM. In Section 4, those models are compared on a sample set to highlight the differences in behavior in a simple context. In Section 5, we discuss computations performed on real traffic network data. Finally, in Section 6, we draw conclusions about the adaptability and robustness of GPLSA.

Research Background
Anomaly detection is a broad topic with a large number of previously used techniques. For a comprehensive overview of those methods, we refer to [4].
Previous research focuses mainly on unsupervised, statistics-based methods such as clustering methods to perform anomaly detection [5][6][7][8]. A common assumption for statistics-based methods is that the underlying distribution is Gaussian [9], although mixtures of parametric distributions, where normal points and anomalies correspond to two different distributions [10], are also possible. In clustering methods, the purpose is to separate data points and to group together objects that share similarities, each group of objects being called a cluster. We usually define similarities between objects analytically. Many clustering algorithms exist, differing in how similarities between objects are measured (using distance measurement, density, or statistical distribution), but the most popular and simplest clustering technique is K-means clustering [11].
Advanced methods of detection combine statistical hypotheses and clustering, as seen in the Gaussian Mixture Model (GMM) [1]. This method assumes that all data points are generated from a mixture of K Gaussian distributions; parameters are usually estimated through an Expectation-Maximization (EM) algorithm, which iteratively increases the likelihood of the set [12]. Some studies have used GMM for anomaly detection problems, as described in [13][14][15]. Selecting the number of clusters K is not easy: although methods to automatically select a value of K do exist (a comparison between different algorithms is presented in [16]), the value is usually chosen manually by researchers and refined after performing computations for different candidate values.
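As an illustration of this selection problem, the following Python sketch (using scikit-learn on synthetic data, rather than the authors' R implementation) fits GMMs for several candidate values of K and compares them via the Bayesian Information Criterion, one common automatic criterion:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(10.0, 1.0, 200)]).reshape(-1, 1)

# Fit GMMs for several candidate K and keep the one with the lowest BIC.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(x)
          for k in (1, 2, 3, 4)}
best_k = min(models, key=lambda k: models[k].bic(x))
print(best_k)  # BIC favors the true number of clusters, 2
```

In practice, as noted above, such automatic criteria are often only a starting point that researchers refine manually.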
In telecom traffic data, time stamps are a component to be considered when seeking traffic anomalies. This information, referred to as contextual attributes in [4], can dramatically change the results of anomaly detection. For example, a value can be considered normal in a certain context (in a peak period) but abnormal in another context (in off-peak periods), and the differentiation can only be made clear when each value has a time stamp associated with it. An overview of outlier detection for temporal data can be found in [17], which comprises ensemble methods (e.g., [18, 19]), time-series models (e.g., with ARIMA or GARCH models in [20]), and correlation analysis [21, 22].
Clustering methods for temporal anomaly detection can automatically take into account and separate different types of behavior from raw time-series data, which allows for some interesting results. One way to incorporate time stamps is to consider the original GMM (i.e., a mixture of K Gaussian distributions) but to weight each distribution differently, depending on time. This method was first introduced for text-mining [2, 23] with a mixture of categorical distributions and named Probabilistic Latent Semantic Analysis (PLSA). Its continuous form (with Gaussian distributions), GPLSA, is used for recommender systems [24]. No published article applying GPLSA to anomaly detection has been found.
In the next section, we present five anomaly detection models for traffic data. The first three are classic models: the Gaussian model, time-dependent Gaussian, and GMM, which do not combine clustering and contextual detection and are expected to have several disadvantages. The two remaining models take both clustering and time stamps into consideration: the fourth model is a time-dependent GMM, where a GMM is independently determined for each time slot; the fifth is the Gaussian Probabilistic Latent Semantic Analysis (GPLSA) model, which optimizes all parameters related to clusters and time in a single algorithm.

Presentation of Models
In this section, five different models are defined: Gaussian, time-dependent Gaussian, GMM, time-dependent GMM, and GPLSA. We use the same notations for all:
(i) V is a traffic data set. This set contains N values indexed by i; N is usually large, from one thousand to one hundred million. Each value is a vector of R^D, where D is the number of features. Furthermore, each feature is assumed to be continuous.
(ii) W is the time-stamp set of classes. This set also contains N values. Since we expect a daily cycle, each value w_i corresponds to an hour of the day and consequently lies in {1, . . ., 24}.
(iii) U = (V, W) are the observed data.
For each model, the aim is to estimate parameters by maximum likelihood. When the direct calculation is intractable, an EM algorithm is used to find (at least) a local optimum of the likelihood. A usual hypothesis of independence is added, which is needed to compute the likelihood as a product over the set:
(H) The set of triplets (v_i, w_i, z_i) is an independent vector over the rows i, where z_i denotes the latent cluster of row i.
Note that if a model does not consider W or Z, we remove that set from the hypothesis.
The different models are shown in Table 1, grouped according to their ability to consider time stamps and clustering. In the following, the hypotheses of each model are listed in the form (Xn), where X identifies the current model paragraph and n is the hypothesis number.

Gaussian Model.
In the Gaussian model, the whole data set is assumed to come from a variable that follows a Gaussian distribution. Consequently, each part of the day has a similar behavior and there are no clusters. Mathematically (note that the same letter is used for a set and its variable):
(A1) Each variable V_i follows a Gaussian distribution with mean μ and variance Σ. Here, μ is a D-vector and Σ is a variance-covariance matrix of size D × D. They are both independent of i.
Parameters are easily estimated with empirical mean and variance.
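As a minimal illustration (a Python sketch on synthetic data; the paper's implementation is in R), this baseline fits the empirical mean and covariance and scores each point by its Gaussian log-density:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
V = rng.normal(5.0, 2.0, size=(1000, 1))        # N x D feature matrix
V[0] = 50.0                                      # plant one obvious outlier

# (A1): a single Gaussian for the whole set, fitted by empirical moments.
mu = V.mean(axis=0)
Sigma = np.cov(V, rowvar=False).reshape(1, 1)    # D x D covariance

# Score every point; the lowest log-likelihoods are outlier candidates.
loglik = multivariate_normal(mu, Sigma).logpdf(V)
print(int(np.argmin(loglik)))  # index of the planted outlier, 0
```

Because the model ignores time stamps entirely, a value that is only abnormal for its hour of the day would not be flagged by this score.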

Time-Dependent Gaussian Model.
A time component is added to this model, as opposed to the Gaussian model, which does not include one. Each time of the day is considered independently and follows its own Gaussian distribution. This allows us to take the dependence on time into account:
(B1) For each t ∈ {1, . . ., 24}, each conditional variable V_i such that W_i = t follows a Gaussian distribution with mean μ_t and variance Σ_t.
As for the Gaussian model, parameters are estimated with the empirical mean and variance for each class of dates.
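As an illustration (a toy Python sketch, not the authors' R code), the same value can be normal in one class of dates and anomalous in another once per-hour Gaussians are fitted:

```python
import numpy as np

rng = np.random.default_rng(2)
hours = rng.integers(1, 25, size=2000)           # time-stamp classes in {1, ..., 24}
# Hour-dependent signal: higher mean during peak hours (9:00-18:00 here).
values = np.where((hours >= 9) & (hours <= 18), 10.0, 2.0) + rng.normal(0, 1, 2000)

# (B1): one Gaussian per class of dates, fitted independently.
params = {t: (values[hours == t].mean(), values[hours == t].std())
          for t in range(1, 25)}

# A value of 10 is unremarkable at 12:00 but far out of range at 3:00.
z_peak = abs(10.0 - params[12][0]) / params[12][1]
z_off = abs(10.0 - params[3][0]) / params[3][1]
print(z_peak < 2 < z_off)  # True: context changes the verdict
```

This also exposes the model's weakness mentioned later: each class is fitted from only a fraction of the data.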

Gaussian Mixture Model.
Compared to the Gaussian model, in the GMM, data are assumed to come from a mixture of Gaussian distributions rather than a single Gaussian distribution. The number of clusters K is fixed in advance.
Therefore, each record belongs to an unknown cluster Z_i ∈ {1, . . ., K}. The task is to estimate both the probability of each cluster and the parameters of each Gaussian distribution. To solve this problem, the following decomposition is used:
P(V_i = v) = Σ_{k=1}^{K} P(Z_i = k) P(V_i = v | Z_i = k),
where each conditional variable V_i given Z_i = k is Gaussian with mean μ_k and variance Σ_k. The parameters can be successively updated with an EM algorithm (see [23] for details).

To solve this problem, the following decomposition is used (the assumption (E3) is used for the first factor of each term of the sum):
P(V_i = v | W_i = t) = Σ_{k=1}^{K} α_{k,t} P(V_i = v | Z_i = k), with α_{k,t} = P(Z_i = k | W_i = t).
The EM algorithm can be adapted to this case to iteratively increase the likelihood and estimate the parameters with exact update formulas. The complete calculus to derive these formulas is given in the Appendix. We let φ(⋅ | μ, Σ) denote the density of a Gaussian with parameters μ and Σ. Also, we define S_t as the set of indexes i where w_i = t.
Step 1. Initialize the parameters α^(0)_{k,t}, μ^(0)_k, Σ^(0)_k and set n = 0.

Step 2. For all i, k, compute the probability T^(n)_{i,k} that Z_i = k knowing V_i = v_i, W_i = w_i, and the current parameters:
T^(n)_{i,k} ∝ α^(n)_{k,w_i} φ(v_i | μ^(n)_k, Σ^(n)_k), normalized so that Σ_k T^(n)_{i,k} = 1.

Step 3. For all t, k, compute (here #S_t stands for the length of S_t)
q^(n)_{k,t} = (1 / #S_t) Σ_{i ∈ S_t} T^(n)_{i,k}.

Step 4. For all t, k, update α_{k,t} with α^(n+1)_{k,t} = q^(n)_{k,t}.

Step 5. For all k, update the means with
μ^(n+1)_k = Σ_i T^(n)_{i,k} v_i / Σ_i T^(n)_{i,k}.

Step 6. For all k, update the covariance matrix with (here ᵀ refers to the transpose)
Σ^(n+1)_k = Σ_i T^(n)_{i,k} (v_i − μ^(n+1)_k)(v_i − μ^(n+1)_k)ᵀ / Σ_i T^(n)_{i,k}.

Step 7. Let n = n + 1 and repeat Steps 2 to 7 until convergence at some iteration n★. At that iteration, the parameters are estimated.

Step 8. For each i, the chosen cluster is the k maximizing T^(n★)_{i,k}.

Step 9. For each i, the likelihood of this point under the estimated parameters is
Σ_k α^(n★)_{k,w_i} φ(v_i | μ^(n★)_k, Σ^(n★)_k).
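The steps above can be sketched compactly in Python for one feature (D = 1). This is an illustrative reimplementation on synthetic data, not the authors' R code, and the variable names (alpha, mu, var, responsibilities R) are our own:

```python
import numpy as np

def _gauss(v, mu, var):
    # K x n matrix of 1-D Gaussian densities phi(v_i | mu_k, var_k).
    return (np.exp(-0.5 * (v[None, :] - mu[:, None]) ** 2 / var[:, None])
            / np.sqrt(2 * np.pi * var[:, None]))

def gplsa_1d(v, w, K, T=24, n_iter=100):
    """EM for 1-D GPLSA: cluster weights alpha[k, t] depend on the time class."""
    # Step 1: initialize weights, means, and variances.
    alpha = np.full((K, T), 1.0 / K)
    mu = np.quantile(v, np.linspace(0.1, 0.9, K))
    var = np.full(K, v.var())
    for _ in range(n_iter):
        # Step 2: responsibilities P(Z_i = k | v_i, w_i, current parameters).
        R = alpha[:, w - 1] * _gauss(v, mu, var)
        R /= R.sum(axis=0, keepdims=True)
        # Steps 3-4: update time-dependent weights over each class S_t.
        for t in range(T):
            in_t = (w == t + 1)
            if in_t.any():
                alpha[:, t] = R[:, in_t].mean(axis=1)
        # Steps 5-6: update means and variances (weighted moments).
        Rk = R.sum(axis=1)
        mu = (R * v[None, :]).sum(axis=1) / Rk
        var = (R * (v[None, :] - mu[:, None]) ** 2).sum(axis=1) / Rk
    # Steps 8-9: cluster assignment and per-point log-likelihood.
    mix = alpha[:, w - 1] * _gauss(v, mu, var)
    return mix.argmax(axis=0), np.log(mix.sum(axis=0)), mu

# Toy daily pattern: a high-valued cluster that mostly appears in peak hours.
rng = np.random.default_rng(3)
w = rng.integers(1, 25, 3000)
peak = (w >= 9) & (w <= 18)
v = np.where(peak & (rng.random(3000) < 0.7), 10.0, 0.0) + rng.normal(0, 1, 3000)
clusters, loglik, mu = gplsa_1d(v, w, K=2)
print(np.sort(np.round(mu, 1)))  # two means, near 0 and near 10
```

Note how the weights are shared across the whole data set: the means and variances of each cluster are fitted from all rows, while only the mixing weights depend on the time class.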

Comparison of Models
All five models defined in Section 3 are implemented with R [3] in a framework that is able to perform computations and to show clustering and anomaly identification plots (using ggplot2 [25]). In this section, we apply our framework to a sample set to compare the models' abilities to detect anomalies and to check the robustness of the methods. The sample set is built to highlight the difference of behaviors between models in a simple and understandable context. Consequently, only one sample feature is considered in addition to time-stamp dates.
On this set, we observe that time-GMM and GPLSA are able to detect anomalies, and those methods are thus potential candidates for anomaly detection in a time-dependent context. Furthermore, we show that GPLSA is more robust and allows a higher level of interpretability of the resulting clusters.

Sample Definition.
The sample is built by superposing the three following random sets, where ε_i is an independent random variable for each i, sampled from the continuous uniform distribution on [0, 1], and where the deterministic part has a daily period. The first two functions are defined over the full 24-hour range, whereas the third is only defined from 0:00 to 15:00. Three anomalies are added to this set, defined, respectively, at 6:00, 12:00, and 18:00 with values −1.25, 0.5, and 1.65. The resulting set is shown in Figure 1.

Anomaly Identification.
All five models are trained and the likelihood of each point is computed for each model. Since we expect 3 anomalies to be found in this sample set, the 3 lowest likelihood values are defined as anomalies for each model. For the clustering processes, the chosen number of clusters is K = 5.
The results are shown in Figure 1. In (a), the whole data set is modeled as one Gaussian distribution and none of the expected anomalies are found. In (b), each period is fitted with its own Gaussian distribution, and only the anomaly at 18:00 is discovered. In (c), the whole set is clustered and only the anomaly at 6:00 is discovered. Finally, in (d), the time-GMM and GPLSA models are trained and obtain the same results: the 3 anomalies are successfully detected.
Thus, time-GMM and GPLSA are both able to detect the expected anomalies, contrary to the other methods.
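The detection rule used here, flagging the m points with the lowest likelihood, is simple to state in code (a sketch with illustrative likelihood values):

```python
import numpy as np

# Hypothetical per-point log-likelihoods from a trained model.
loglik = np.array([-1.2, -0.8, -9.5, -1.1, -7.3, -0.9, -12.0, -1.0])

# Flag the m points with the lowest likelihood as anomalies.
m = 3
anomalies = np.argsort(loglik)[:m]
print(sorted(anomalies.tolist()))  # [2, 4, 6]: the 3 lowest values
```

The rule is model-agnostic: only the likelihood values change from one model to another, which is what makes the five models directly comparable.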

Comparison between Time-GMM and GPLSA.
The same anomalies have been detected with time-GMM and GPLSA. However, they are detected differently. We offer a summary of the comparison in Table 2.
First, GPLSA evaluates time stamps and values at once; that is, all parameters are estimated at the same time. Second, the number of clusters in each class is soft for GPLSA (i.e., it can differ from the specified number of clusters for some classes of dates). This allows the model to automatically adapt the number of clusters depending on which clusters are needed. In time-GMM, each class has exactly the specified number of clusters. This is shown in Figure 2, where the first seven hours are plotted with identified clusters for time-GMM (a) and GPLSA (b).
Third, the model is trained with the whole data set for GPLSA, whereas only a fraction of the data is used for each time-GMM computation. If a class of dates contains few data points, this can cause a failure to correctly estimate the time-GMM parameters.
Fourth, the number of parameters to estimate is (T + 2) × K for GPLSA and (3K − 1) × T for time-GMM (with T the number of classes and K the number of clusters, in dimension D = 1). Consequently, there are fewer parameters to estimate with GPLSA.
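For concreteness, with T = 24 classes and K = 5 clusters in dimension D = 1, the gap is substantial. The counting convention below is our own reading (an assumption): GPLSA has K × T weights plus K means and K variances, while time-GMM has K means, K variances, and K − 1 free weights per class:

```python
# Parameter counts in dimension D = 1, with T time classes and K clusters.
def gplsa_params(T, K):
    return (T + 2) * K          # K*T weights + K means + K variances

def time_gmm_params(T, K):
    return (3 * K - 1) * T      # per class: K means, K variances, K-1 free weights

print(gplsa_params(24, 5), time_gmm_params(24, 5))  # 130 336
```

Fewer parameters for the same amount of data generally means more reliable estimates, which is the robustness argument made here.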
On the whole, GPLSA offers a better level of interpretability of the resulting clusters than time-GMM (first and second points), combined with higher robustness (third and fourth points).

Results and Discussion
In this section, anomaly detection is performed on real traffic network data. Based on the comparison of models done in Section 4, we select GPLSA to deduce anomalies and compare the results with time-GMM. In Section 5.1, the collected data set is described and preprocessed; we then apply GPLSA and show the results in Section 5.2, which specifically focuses on the behavior observed after applying the algorithm. Those results are compared with time-GMM results in Section 5.3. Finally, Section 5.4 highlights the ability of GPLSA to perform anomaly detection.

Data Description and Preprocessing.
Data have been gathered from a Chinese mobile operator. They comprise a selection of 24 traffic features collected for 3,000 cells in the city of Wuxi, China. The features are only related to cell sites and do not give information about specific users. They represent, for example, the average number of users within a cell or the total data traffic for the last quarter of an hour. The algorithm is trained over two weeks of data, with one value for each quarter of an hour and for each cell.
We discarded the rows of data containing missing values. Only values and time stamps were taken into consideration for computations, and the identification number of cells was discarded. Some features only take nonnegative values and have a skewed distribution; consequently, those features are preprocessed by applying a logarithm. To maintain interpretability, we do not apply feature normalization to the variables. We expect that GPLSA can manage this set, even though some properties of the model, such as the normality assumptions, are not verified.
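A preprocessing pipeline of this kind might look as follows. This is a hypothetical sketch with invented column names and values; the paper's pipeline is in R, and it applies a plain logarithm where log1p is used here to tolerate zeros:

```python
import numpy as np
import pandas as pd

# Hypothetical raw extract: cell id, time stamp, and two traffic features.
df = pd.DataFrame({
    "cell_id": [101, 101, 102, 102, 103],
    "timestamp": pd.to_datetime(["2016-04-15 05:00", "2016-04-15 05:15",
                                 "2016-04-15 05:00", "2016-04-15 05:15",
                                 "2016-04-15 05:00"]),
    "avg_users": [12.0, 15.0, np.nan, 9.0, 400.0],
    "traffic_mb": [3.1, 4.0, 2.5, 2.8, 250.0],
})

df = df.dropna()                                  # discard rows with missing values
df["hour"] = df["timestamp"].dt.hour + 1          # class of dates in {1, ..., 24}
df["log_traffic"] = np.log1p(df["traffic_mb"])    # tame skewed nonnegative features
X = df[["avg_users", "log_traffic"]].to_numpy()   # cell ids are not model inputs
print(X.shape)  # one row was dropped: (4, 2)
```

Keeping features on their original (or log) scale, rather than normalizing, preserves the direct interpretability of the fitted cluster means.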

Computations and Results
We used the GPLSA model for the feature corresponding to the "average number of users within a cell" and selected K = 3 clusters. Anomalies are the values with the lowest resulting likelihood, with the threshold computed to produce (on average) 2 alerts and 8 warnings each day. Visual results are shown in Figure 3.
In (a), the three clusters are identified, whereas, in (b), a different color is used for each class of dates. In (c), the different log-likelihood values are shown. Finally, in (d), the estimated probability α_{k,t} of being in each cluster k knowing W = t is plotted.
Anomalies are shown in (a), (b), and (c), and the extreme values related to each class of dates are correctly detected. In (a) and (d), identified clusters are shown in three distinct colors. The probability of being in each cluster varies across classes as expected, with a lower probability for the upper cluster during off-peak hours. Also, as shown in (a), the upper cluster has a symmetric shape and its mean value is relatively similar across dates.

Comparison with Time-GMM.
We compare the results obtained in Section 5.2 with time-GMM, using the same number of clusters K = 3 and the same number of alerts and warnings each day. Results are shown in Figure 4. In (a), the three clusters are identified for each class t (between 1 and 24), and in (b), the different log-likelihood values are shown.
We observe that time-GMM correctly detects most of the extreme values. Each class is related to a specific likelihood function and has its own way of representing the data. In Figure 4(a), we see that the cluster extents related to the highest values have a similar width for all classes (t = 1 to 24). Comparing Figure 4(b) with Figure 3(c), we observe a larger "bump" (located in green during off-peak hours) for time-GMM. For these reasons, and contrary to GPLSA, anomalies are overrepresented in some classes (e.g., 3 warnings are detected for t = 8 during the first two days) whereas others contain no anomalies for this time period (t = 6). These results endorse the higher interpretability and robustness of GPLSA over time-GMM.

Discussion.
According to the results, GPLSA is able to detect anomalies in a time-dependent context. We identified global outliers (e.g., in Figure 3(b) at Apr. 15 16:00 in red) as well as context-dependent anomalies (e.g., at Apr. 15 5:00 in orange). Off-peak periods are taken into consideration, and unusual values specific to those periods are detected.
The Gaussian hypothesis of GPLSA is not really constraining. As shown in Figure 3(a), clusters are adaptable and fit Gaussian distributions to the data. They are appropriate to represent the value distribution for each class of dates and each cluster.
Cluster adaptation is shown in Figure 3(d). The three clusters represent different levels of values. The upper cluster represents higher values, which are more probable during peak periods. The lower cluster represents lower values, with a roughly constant probability. The third cluster, in the middle, is also useful to obtain a good anomaly detection behavior (results with K = 2 clusters are unable to correctly detect anomalies). Regarding anomaly detection itself, a threshold indicating the number of alerts to be raised can be set. This method of detection is static and relatively simple. Improving it is possible and straightforward through likelihood computations: within a cell, an anomaly could be detected from a repetition of low likelihood scores.

Conclusion
In this paper, we presented and compared unsupervised models to detect anomalies in wireless network traffic and demonstrated the robustness, interpretability, and detection ability of the GPLSA model, as compared to other methods such as time-GMM. Anomaly detection was also performed and analyzed on real traffic data. We highlighted the adaptability of GPLSA in this context to detect anomalies, even those with new patterns that are difficult to predict manually. As a result, mobile operators gain a versatile way to identify and detect anomalies, which can reduce the cost of possible aftermaths (e.g., network failure).
This methodology could be improved further. Currently, once the model is computed, anomaly detection is based only on pointwise detection through likelihood values. A dynamic detection using consecutive likelihood values could increase the credibility of each alert and reduce the number of false alarms.
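Such a dynamic detection could, for instance, raise an alert only when the likelihood stays low over several consecutive points. The following is a hypothetical sketch, not part of the paper's implementation:

```python
import numpy as np

def persistent_alerts(loglik, threshold, run_length=3):
    """Alert only when the likelihood stays below threshold for
    run_length consecutive points, suppressing one-off false alarms."""
    low = loglik < threshold
    run = 0
    alerts = []
    for i, flag in enumerate(low):
        run = run + 1 if flag else 0
        if run >= run_length:
            alerts.append(i)
    return alerts

scores = np.array([-1, -9, -1, -9, -9, -9, -9, -1], dtype=float)
print(persistent_alerts(scores, threshold=-5))  # [5, 6]: only the sustained dip
```

The isolated low score at index 1 is ignored, while the sustained dip triggers alerts, which is exactly the credibility gain described above.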
Furthermore, the model is only trained on a fixed data set in this research. This could be extended by considering real-time stream data in an online context. Thus, new patterns could be incorporated quickly, improving responsiveness and anomaly identification.

Appendix

A. Notations

We define at each iteration n
θ^(n) := (α^(n)_{k,t}, μ^(n)_k, Σ^(n)_k ; k ∈ {1, . . ., K}, t ∈ {1, . . ., 24}). (A.1)
The estimated parameters θ^(n) are updated from θ^(n−1) iteratively using the EM algorithm. The algorithm stops when convergence of the related likelihood is reached. We use our hypotheses to express useful probabilities in terms of θ. We recall that φ(⋅ | μ, Σ) is the density of a Gaussian with parameters μ and Σ. Let v ∈ R^D, t ∈ {1, . . ., 24}, and k ∈ {1, . . ., K}. The probability P(Z = k | W = t) follows a discrete multinomial distribution proportional to α_{k,t} (where, for each t, the coefficients sum to 1 over all k).

B. Recall about EM
The chosen strategy to estimate the parameters θ is to find parameters that maximize the marginal likelihood L(θ; U) of the observed data U. As the direct computations are intractable, we use EM to update the parameters iteratively:
(1) Set some initial parameters θ^(0). For n from 0 until convergence, repeat the following steps (2) and (3).
(2) Expectation step: compute Q(θ | θ^(n)), the expectation of the complete log-likelihood log L(θ; U, Z) over Z given U and θ^(n).
(3) Maximization step: set θ^(n+1) to a value of θ maximizing Q(θ | θ^(n)).
A consequence of steps (2) and (3) is that the likelihood L(θ; U) will increase or remain constant at each step [26]. However, after convergence, the parameters can be stuck in a local maximum of the likelihood function.

C. Expectation Step of EM in the GPLSA Context
We assume we are at iteration n, and we want to update μ^(n)_k, Σ^(n)_k, and α^(n)_{k,t} for all k and t. From (B.3) and using hypothesis (H), Q(θ | θ^(n)) decomposes as a sum over the rows i. For each term, since u_i = (v_i, w_i) and using equations (A.2) and (A.3), the joint probability of (v_i, w_i, k) factorizes through α_{k,w_i} and φ(v_i | μ_k, Σ_k). We then define T^(n)_{i,k} as P(Z_i = k | U_i = u_i, θ^(n)), which is explicitly computable from (C.4).
Taken as a whole, we finally obtain an explicit formula for Q(⋅ | θ^(n)), which can be maximized.

D. Maximization Step of EM in the GPLSA Context
From the shape (C.6) of Q(⋅ | θ^(n)), we can separate the maximization of (μ_k, Σ_k) for each k from that of the weights (α_{k,t})_k for each t.
(1) For the Weights α_{k,t}. For each fixed time stamp t, we update (α_{k,t})_k. These are considered all together since there is a constraint: the sum over k has to be 1. We then compute the derivative with respect to α_{k,t}, remembering that Σ_{k=1}^{K} α_{k,t} = 1. To remove this constraint, we set α_{K,t} = 1 − (α_{1,t} + ⋅ ⋅ ⋅ + α_{K−1,t}) and rewrite the objective as a function of (α_{k,t})_{k<K}. By computing the Hessian matrix, we verify that the obtained extremum is a maximum.
(2) For the Means and Variances (μ_k, Σ_k). From (C.6), we can perform the computations for each fixed cluster k. Since some terms of the sum have no dependence on (μ_k, Σ_k), it suffices to maximize
Σ_i T^(n)_{i,k} log φ(v_i | μ_k, Σ_k),
which leads to the weighted mean and covariance updates of Steps 5 and 6.

Figure 1 :
Figure 1: Anomaly detection for 5 different models in the sample set defined in Section 4. The three values with the lowest likelihood are circled in orange. Each color represents a different time-stamp class (only 1 class for (a) and (c); 24 classes for (b) and (d)).

Figure 2 :
Figure 2: Identified clusters for 2 models in the sample set defined in Section 4, between 0:00 and 7:00. In (a), each class of one hour contains 5 clusters, and clusters are not related across hours. In (b), the whole set contains 5 clusters.

Figure 3 :
Figure 3: Anomaly detection with GPLSA on the traffic data set presented in Section 5. Plots are restricted to two days in (a), (b), and (c). Red and orange points are related to the lowest likelihood values obtained, with an average of 2 red points and 8 orange points each day.

Figure 4 :
Figure 4: Anomaly detection with time-GMM on the traffic data set presented in Section 5. Plots are restricted to two days. Red and orange points are related to the lowest likelihood values obtained, with an average of 2 red points and 8 orange points each day.