Telecom data traffic has boomed in recent years, driven by the development of 4G mobile devices and similar high-speed devices. The ability to quickly identify unexpected traffic in this stream is critical for mobile carriers, since such traffic can be caused by either fraudulent intrusion or technical problems. Clustering models can help identify issues by revealing patterns in network data, which makes it possible to catch anomalies quickly and highlight previously unseen outliers. In this article, we develop and compare clustering models for telecom data, focusing on those that make use of time-stamp information. Two main models are introduced, solved in detail, and analyzed: Gaussian Probabilistic Latent Semantic Analysis (GPLSA) and time-dependent Gaussian Mixture Models (time-GMM). These models are then compared with other clustering models, such as the Gaussian model and GMM (which do not use time-stamp information). We perform computations on both sample and telecom traffic data to show that the efficiency and robustness of GPLSA make it the superior method for detecting outliers and providing results automatically, with little parameter tuning or expertise required.
High-speed telecom connections have developed rapidly in recent years, which has resulted in a major increase in data flow through networks. Beyond the issues of storage and management of this flow of data, a major challenge is how to select and use this mass of material to better understand a network. The detection of behaviors that differ from normal traffic patterns is a critical element, since such discrepancies can reduce network efficiency or harm network infrastructures. And because those anomalies can be caused by either a technical equipment problem or a fraudulent intrusion in the network, it is important to identify them accurately and fix them promptly. Data-driven systems have been developed to detect anomalies using machine learning algorithms and can automatically extract information from raw data to promptly alert a network manager when an anomaly occurs.
The data collected in telecom networks contains values for different features (related to network resource and usage) as well as time stamps, and those values can be modeled and processed to seek and detect anomalies using unsupervised algorithms. The algorithms use unlabeled data and assume that information about which data elements are anomalies is unknown (since anomalies in traffic data are rare and may take many forms). They do not directly detect anomalies but instead separate and distinguish data structures and patterns in order to group data from which “zones of anomalies” are deduced. The main advantage of this methodology is the ability to quickly detect previously unseen or unexpected anomalies.
Another component to be taken into consideration for understanding wireless network data behavior is time stamps. This information is commonly collected when data are generated but is not widely used in classic anomaly detection processes. However, since network load fluctuates over the course of a day, adding time-stamp attributes in an evaluation model can allow us to discover periodic behaviors. For example, a normal value during a peak period may be an anomaly outside that period and thus remain undetected by an algorithm that does not take time stamps into account.
In this article, we use unsupervised models to detect anomalies. Specifically, we focus on algorithms combining both values and dates (time stamps) and introduce two new models to this end. The first one is the time-dependent Gaussian Mixture Model (time-GMM), which is a time-dependent extension of GMM [
The rest of the article is organized as follows: in Section
Anomaly detection is a broad topic with a large number of previously used techniques. For a broad overview of those methods, we refer to [
Previous research focuses mainly on unsupervised, statistics-based methods such as clustering to perform anomaly detection [
Advanced methods of detection combine statistical hypotheses and clustering, as seen in the Gaussian Mixture Model (GMM) [
In telecom traffic data, time stamps are a component to be considered when looking for traffic anomalies. This information, referred to as contextual attributes in [
Clustering methods for temporal anomaly detection can automatically take into account and separate different types of behavior from raw time-series data, which allows for some interesting results. One way to incorporate time stamps is to consider the original GMM (i.e., a mixture of
In the next section, we present five anomaly detection models for traffic data. The first three are classic models: the Gaussian model, the time-dependent Gaussian model, and GMM. None of them combines clustering with contextual detection, and each is expected to have several disadvantages. The two remaining models take both clustering and time stamps into consideration: the fourth model is a time-dependent GMM, where a GMM is determined independently for each time slot; the fifth model is the Gaussian Probabilistic Latent Semantic Analysis (GPLSA) model, which is solved by optimizing all parameters related to clusters and time in a single algorithm.
In this section, five different models are defined: Gaussian, time-dependent Gaussian, GMM, time-dependent GMM, and GPLSA. We use the following notation for all models: For clustering methods, we assume that each value is related to a fixed (although unknown) cluster, named
An example of traffic data retrieved is shown as follows:
For each model, the aim is to estimate parameters by maximum likelihood. When the direct calculation is intractable, an EM algorithm is used to find (at least) a local optimum of the likelihood. The usual independence hypothesis is added, so that the likelihood can be written as a product over the set: The set of triplets
The different models are shown in Table
Anomaly detection methods compared.
| No date | Date |
---|---|---|
No clustering | Gaussian | Time-Gaussian |
Clustering | GMM | (i) Time-GMM, (ii) GPLSA |
In the Gaussian model, the whole data set is assumed to come from a variable that follows a Gaussian distribution. Consequently, every part of the day is assumed to behave in the same way and there are no clusters. Mathematically (note that the same letter is used for the set and the variable), the following holds: Each variable
Parameters are easily estimated with empirical mean and variance.
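As a minimal sketch of this model (not the authors' implementation), the snippet below estimates the empirical mean and variance of one-dimensional values and ranks the records by log-likelihood. The function names and the toy values are assumptions made for illustration only.

```python
import math

def fit_gaussian(values):
    """Return the empirical mean and variance (maximum-likelihood estimates)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_logpdf(x, mean, var):
    """Log-density of the Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def lowest_likelihood(values, k):
    """Indices of the k values with the lowest log-likelihood under the fit."""
    mean, var = fit_gaussian(values)
    scores = [gaussian_logpdf(v, mean, var) for v in values]
    return sorted(range(len(values)), key=lambda i: scores[i])[:k]

# Hypothetical toy values; the record at index 4 lies far from the others.
values = [0.1, -0.2, 0.05, 0.0, 3.0, -0.1, 0.15]
print(lowest_likelihood(values, 1))  # [4]
```

Because a single distribution describes the whole day, this model can only flag values that are extreme with respect to the entire set.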
Unlike the Gaussian model, this model includes a time component. Each time of the day is considered independently and follows its own Gaussian distribution. This allows us to take time dependence into account: For each
As for the Gaussian model, parameters are estimated with empirical mean and variance for each class of dates.
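The per-class estimation can be sketched as follows, assuming one-dimensional values and integer labels for the classes of dates. The names and the toy data are hypothetical and serve only to illustrate the idea.

```python
import math
from collections import defaultdict

def fit_time_gaussian(values, classes):
    """Empirical mean and variance of the values within each class of dates."""
    groups = defaultdict(list)
    for v, c in zip(values, classes):
        groups[c].append(v)
    params = {}
    for c, vs in groups.items():
        mean = sum(vs) / len(vs)
        var = sum((v - mean) ** 2 for v in vs) / len(vs)
        params[c] = (mean, var)
    return params

def time_gaussian_logpdf(x, c, params):
    """Log-density of x under the Gaussian fitted for class c."""
    mean, var = params[c]
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical data: class 0 = off-peak, class 1 = peak.
# The last record has a peak-like value during the off-peak class.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1, 5.0]
classes = [0, 0, 0, 0, 1, 1, 1, 1, 0]
params = fit_time_gaussian(values, classes)
scores = [time_gaussian_logpdf(v, c, params) for v, c in zip(values, classes)]
worst = min(range(len(scores)), key=scores.__getitem__)
print(worst)  # 8
```

The value 5.0 is ordinary in the peak class but receives the lowest score when it occurs in the off-peak class, which a model without time stamps would not capture.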
Compared to the Gaussian model, in the GMM, data is assumed to come from a mixture of Gaussian distributions rather than one single Gaussian distribution. The number of clusters Each record belongs to a cluster Each variable
Therefore, each record belongs to an unknown cluster. The task is to estimate both probability for each cluster and the parameters of each Gaussian distribution. To solve this problem, the following decomposition is done:
Combining the models described in Sections For each For each
The GPLSA model is based on the classic GMM but introduces a novel link between data values and time stamps. In time-GMM, the different classes of dates are considered independently, whereas GPLSA introduces dependence between latent clusters and time stamps but only within those two variables. That is, in knowing latent cluster For each Each variable For all
To solve this problem, the following decomposition is done (the assumption (E3) is used for the first factor of the sum):
At time
For all
For all
For all
For all
For all
Let
For each
For each
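A compact sketch of an EM loop for a GPLSA-type model is given below. It assumes one-dimensional values, classes of dates labeled 0 to S−1, Gaussian parameters shared by all classes, and class-dependent mixing weights stored in `alpha[s][k]`. The quantile initialization, the variance floor `min_var`, the fixed iteration count, and all names are assumptions of this sketch, not taken from the article, and the code is not a reproduction of the article's exact update rules.

```python
import math
import random

def log_normal(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mixture(x, s, alpha, means, variances):
    """Return (log-likelihood of x given class s, cluster responsibilities)."""
    logs = [math.log(max(a, 1e-300)) + log_normal(x, m, v)
            for a, m, v in zip(alpha[s], means, variances)]
    top = max(logs)  # log-sum-exp trick to avoid underflow
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return top + math.log(total), [w / total for w in weights]

def fit_gplsa(values, classes, n_clusters, n_iter=100, min_var=1e-6):
    n, n_classes = len(values), max(classes) + 1
    ordered = sorted(values)
    # Initialization (assumption): means at evenly spaced quantiles.
    means = [ordered[(2 * k + 1) * n // (2 * n_clusters)]
             for k in range(n_clusters)]
    centre = sum(values) / n
    spread = sum((x - centre) ** 2 for x in values) / n
    variances = [max(spread, min_var)] * n_clusters
    alpha = [[1.0 / n_clusters] * n_clusters for _ in range(n_classes)]
    class_sizes = [classes.count(s) for s in range(n_classes)]
    for _ in range(n_iter):
        # E step: responsibility of each cluster for each record, given its class.
        resp = [log_mixture(x, s, alpha, means, variances)[1]
                for x, s in zip(values, classes)]
        # M step: cluster means and variances are estimated from all records.
        for k in range(n_clusters):
            weight = sum(r[k] for r in resp)
            if weight > 0:
                means[k] = sum(r[k] * x for r, x in zip(resp, values)) / weight
                var = sum(r[k] * (x - means[k]) ** 2
                          for r, x in zip(resp, values)) / weight
                variances[k] = max(var, min_var)
        # M step: mixing weights are re-estimated separately within each class.
        alpha = [[0.0] * n_clusters for _ in range(n_classes)]
        for r, s in zip(resp, classes):
            for k in range(n_clusters):
                alpha[s][k] += r[k] / class_sizes[s]
    return alpha, means, variances

# Hypothetical sample: class 0 (off-peak) only contains the low cluster,
# class 1 (peak) contains both the low and the high cluster.
rng = random.Random(0)
values = [rng.gauss(0.0, 0.1) for _ in range(200)]
classes = [0] * 200
values += [rng.gauss(0.0, 0.1) for _ in range(100)]
values += [rng.gauss(5.0, 0.1) for _ in range(100)]
classes += [1] * 200
alpha, means, variances = fit_gplsa(values, classes, n_clusters=2)
off_peak_score = log_mixture(5.0, 0, alpha, means, variances)[0]
peak_score = log_mixture(5.0, 1, alpha, means, variances)[0]
```

In this toy setting the value 5.0 scores far lower in the off-peak class than in the peak class, because the mixing weight of the high cluster is close to zero off-peak. With a single class of dates the same loop reduces to a plain one-dimensional GMM fit.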
All five models defined in Section
In this set, we observe that time-GMM and GPLSA are able to detect anomalies within the set, and those methods are then potential candidates for anomaly detection in a time-dependent context. Furthermore, we show that GPLSA is more robust and allows a higher interpretation level of resulting clusters.
The sample is built by superposing the three following random sets:
Three anomalies are added on this set, defined, respectively, at 6:00, 12:00, and 18:00 with values −1.25, 0.5, and 1.65. The resulting set is shown in Figure
Anomaly detection for 5 different models (panels: Gaussian; time-Gaussian; GMM; time-GMM and GPLSA) in the sample set defined in Section
All five models are trained and the likelihood of each point is computed for each model. Since we expect 3 anomalies to be found in this sample set, the 3 lowest likelihood values are defined as anomalies for each model. For the clustering process, the chosen number of clusters is
The results are shown in Figure
Thus, unlike the other methods, time-GMM and GPLSA are both able to detect the expected anomalies.
The same anomalies have been detected with time-GMM and GPLSA. However, they are detected differently. We offer a summary of the comparison in Table
Comparison between time-GMM and GPLSA.
| Time-GMM | GPLSA |
---|---|---|
Cluster number | Fixed number of clusters at each date | Number of clusters can adapt to each date |
Cluster relations | No relation between clusters of each date | Homogeneity of clusters across dates |
Interpretation level | Low | High |
Data used | Only part of the data is used at each date | All data is used for each date |
Nb. of param. | | |
Robustness | Medium | High |
First, GPLSA evaluates time stamps and values at once; that is, all parameters are estimated at the same time. Consequently, consecutive dates can share similar clustering behaviors. With time-GMM, parameters are trained independently for each class of dates, and no relation exists between the clusters of different classes.
Second, the number of clusters in each class is soft for GPLSA (i.e., for some classes of dates it can differ from the specified number of clusters). This allows the model to automatically adapt the number of clusters to those actually needed. In time-GMM, each class has a specified number of clusters. This is shown in Figure
Identified clusters for 2 models (panels: time-GMM; GPLSA) in the sample set defined in Section
Third, GPLSA is trained on the whole data set, whereas each time-GMM computation uses only a fraction of the data. If a class of dates contains only a small amount of data, the time-GMM parameters may not be estimated correctly.
Fourth, the number of parameters needed for estimation is
Overall, GPLSA offers a higher interpretation level of the resulting clusters than time-GMM (first and second points), combined with higher robustness (third and fourth points).
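To make the fourth point concrete, here is one way of counting free parameters for one-dimensional data. It assumes that each Gaussian contributes a mean and a variance and that mixing weights summing to one contribute one fewer free parameter than there are clusters. This counting convention, the function names, and the example sizes are assumptions of this sketch and may differ from the convention used in the article.

```python
def n_params_time_gmm(n_clusters, n_classes):
    # Each class has its own GMM: K means, K variances, K - 1 free mixing weights.
    return n_classes * (3 * n_clusters - 1)

def n_params_gplsa(n_clusters, n_classes):
    # K means and K variances are shared by all classes;
    # each class adds K - 1 free mixing weights.
    return 2 * n_clusters + n_classes * (n_clusters - 1)

# Hypothetical sizes: 3 clusters and 24 classes of dates.
print(n_params_time_gmm(3, 24))  # 192
print(n_params_gplsa(3, 24))     # 54
```

Under these assumptions, the number of Gaussian parameters in GPLSA does not grow with the number of classes of dates, whereas in time-GMM it does.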
In this section, anomaly detection is performed on real traffic network data. Based on the comparison of models done in Section
Data have been gathered from a Chinese mobile operator. They comprise a selection of 24 traffic features collected for 3,000 cells in the city of Wuxi, China. The features relate only to cell sites and give no information about specific users. They represent, for example, the average number of users within a cell or the total data traffic over the last quarter of an hour. The algorithm is trained over two weeks of data, with one value per quarter of an hour and per cell.
We discarded the rows of data containing missing values. Only values and time stamps were taken into consideration for computations, and the identification number of cells was discarded. Some features take only nonnegative values and are skewed; these features are preprocessed by applying the logarithm. To maintain interpretability, we do not normalize the features. We expect GPLSA to handle this set even though some assumptions of the model, such as normality, are not satisfied.
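These preprocessing steps could look roughly like the sketch below. The record layout (dictionaries with `cell_id`, `timestamp`, and feature keys) is hypothetical, and `log1p` (the logarithm of 1 + x) is used instead of a plain logarithm as an assumption, so that zero values remain defined.

```python
import math

def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

def preprocess(rows, log_features):
    """Drop rows with missing values, drop the cell identifier,
    and log-transform the selected nonnegative, skewed features."""
    cleaned = []
    for row in rows:
        if any(is_missing(v) for v in row.values()):
            continue  # discard rows containing missing values
        out = {k: v for k, v in row.items() if k != "cell_id"}
        for name in log_features:
            out[name] = math.log1p(out[name])  # assumption: log(1 + x)
        cleaned.append(out)
    return cleaned

# Hypothetical records; the second one has a missing value and is dropped.
rows = [
    {"cell_id": 1, "timestamp": "00:15", "users": 0.0},
    {"cell_id": 2, "timestamp": "00:15", "users": None},
    {"cell_id": 3, "timestamp": "00:30", "users": math.e - 1},
]
cleaned = preprocess(rows, ["users"])
```

No normalization is applied, in line with the choice to keep the features interpretable.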
We used the GPLSA model for the feature corresponding to the “average number of users within cell” and selected
Anomaly detection with GPLSA (panels: (a) values as a function of dates, with clusters identified; (b) values as a function of dates, with classes identified; (c) log-likelihood of values of the set as a function of dates; (d) probability to be in a cluster knowing date class) from traffic data set presented in Section
In (a), the three clusters are identified, whereas, in (b), a different color is used for each class of dates. In (c), the different log-likelihood values are shown. Finally, in (d), the estimation of the probability
Anomalies are shown in (a), (b), and (c) and the extreme values related to each class of dates are correctly detected. In (a) and (d), identified clusters are shown in three distinct colors. The probability to be in each cluster varies across class as expected, with a lower probability in the upper cluster during off-peak hours. Also, as shown in (a), the upper cluster has a symmetric shape and the mean value is relatively similar across dates.
We compare results obtained in Section
Anomaly detection with time-GMM (panels: values as a function of dates, with clusters identified, where cluster identification is independent for each time slot; log-likelihood of values of the set as a function of dates) from traffic data set as presented in Section
We observe that time-GMM correctly detects most of the extreme values. Each class has its own likelihood function and its own way of representing the data. We see that the cluster extents related to the highest values have a similar width for all classes on Figure
According to the results, GPLSA is able to detect anomalies in a time-dependent context. We identified global outliers (e.g., on Figure
The Gaussian hypothesis in GPLSA is not very restrictive. As shown in Figure
Cluster adaptation is shown in Figure
Regarding anomaly detection itself, a threshold indicating the number of alerts to be raised can be set. This method of detection is static and relatively simple. It can be improved in a straightforward way through likelihood computations: inside a cell, an anomaly could be declared when low likelihood scores are repeated.
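Both rules can be sketched in a few lines, assuming the log-likelihood scores of one cell are available in time order. The function names, the threshold, and the run length are hypothetical choices made for illustration.

```python
def lowest_k_alerts(log_likelihoods, k):
    """Static rule: flag the k records with the lowest log-likelihood."""
    order = sorted(range(len(log_likelihoods)),
                   key=lambda i: log_likelihoods[i])
    return sorted(order[:k])

def repeated_low_alerts(log_likelihoods, threshold, run_length):
    """Flag index i when the last run_length scores up to i are all below threshold."""
    alerts, run = [], 0
    for i, score in enumerate(log_likelihoods):
        run = run + 1 if score < threshold else 0
        if run >= run_length:
            alerts.append(i)
    return alerts

# Hypothetical scores for one cell, in time order.
scores = [-1.0, -9.0, -1.2, -8.0, -8.5, -9.1, -1.1]
print(lowest_k_alerts(scores, 2))            # [1, 5]
print(repeated_low_alerts(scores, -5.0, 3))  # [5]
```

In this example the isolated low score at index 1 triggers the static rule but not the repetition rule, which only fires once three consecutive low scores have been observed.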
In this paper, we presented and compared unsupervised models for detecting anomalies in wireless network traffic and demonstrated the robustness, interpretability, and detection ability of the GPLSA model compared with other methods such as time-GMM. Anomaly detection was also performed and analyzed on real traffic data. We highlighted the adaptability of GPLSA in this context to detect anomalies, even those with new patterns that are difficult to predict manually. As a result, mobile operators can have a versatile way to identify and detect anomalies, which would reduce the cost of possible consequences (e.g., network failure).
This methodology could be improved. Currently, once the model is computed, anomaly detection is based only on pointwise detection through likelihood values. A dynamic detection based on consecutive likelihood values could increase the credibility of each alert and reduce the number of false alarms.
Furthermore, in this research the model is trained only on a fixed data set. This could be extended to real-time streaming data processed in an online setting, so that the model is updated quickly as new patterns appear, improving responsiveness and anomaly identification.
We recall that
Observed values are
Latent values are
We recall the different hypotheses for GPLSA: The set of triplets For each Each variable For all
Unknown parameters of the model are grouped together into
Initial estimated parameters
We define at each iteration
We use our hypotheses to express useful probabilities using
From (E3), we know that
Also, from (E1),
The chosen strategy to estimate parameters
As the direct computations are intractable, we use EM to update parameters iteratively: Set some initial parameters Perform the expectation step (E step): Perform the maximization step (M step):
which can be rewritten as
A theoretical reason to update the expected value function
We assume we are in step
For the left term, since
For the right term, using (
Then we define
Seen in the whole,
From the shape (
For each
We let for all
Since
Finally, we compute the derivative with respect to
By differentiation,
If we want this value (
Now, using the constraint
This follows for all
By computing the Hessian matrix, we find that the obtained extremum is the maximum value.
We obtain the same formula as for GMM and then give the update rules with
The authors declare that there is no conflict of interests regarding the publication of this paper.