An Improved Clustering Method for Detection System of Public Security Events Based on Genetic Algorithm and Semisupervised Learning

The occurrence of a series of events is always associated with news reports, social networks, and Internet media. In this paper, a detection system for public security events is designed, which clusters relevant text data so that the relevant departments can evaluate and handle such events. Firstly, texts are mapped into three-dimensional space using the vector space model. Then, to overcome the shortcomings of the traditional clustering algorithm, an improved fuzzy c-means (FCM) algorithm based on an adaptive genetic algorithm and semisupervised learning is proposed. In the proposed algorithm, the adaptive genetic algorithm is employed to select optimal initial clustering centers. Meanwhile, motivated by semisupervised learning, the guiding effect of prior knowledge is used to accelerate the iterative process. Finally, simulation experiments are conducted from the two aspects of qualitative and quantitative analysis, which demonstrate that the proposed algorithm performs well in terms of clustering centers, clustering results, and time consumption.


Introduction
With the rapid development of communication and transmission modes, a growing number of media reports and public opinions concerning public security events are spreading through a variety of channels, which gives these events a higher level of interest and transparency. Various kinds of public security events, such as bus bombings, violence caused by extreme nationalism, and knife attacks on campus, cause serious damage to people's lives and property and have an extremely negative influence on social stability and national unity.
The occurrence of a series of events is not accidental. Some are probably individual behaviors that ferment under the influence of social factors such as massive media coverage, as well as the inspiration and emotional impact brought by the propagation of Internet public opinion. Once events caused by individual behaviors form a pattern, the dangers are no less than those of any group event.
It is undeniable that in contemporary society the widespread use of the Internet, mobile networks, and social platforms, while providing much convenience to people's lives, has to some extent become a hotbed for the spread and flooding of various kinds of negative information. Finding, in a timely manner, events that are spreading through the Internet and other channels and that may cause significant disruption is an important research subject.
Whether trying to solve the problem of being data rich but knowledge poor, or monitoring network public opinion, we need the help of topic detection and tracking (TDT) technology [1]. TDT research, which addresses event-based organization of broadcast news, has been an issue in the fields of natural language processing and data mining from the beginning. The goal of TDT is to detect new … to address this complexity [14]. In [15], a hybrid fuzzy K-harmonic means (HFKHM) clustering algorithm based on improved possibilistic C-means clustering and K-harmonic means was presented, which solves the noise sensitivity problem of K-harmonic means and improves the memberships of the improved possibilistic C-means clustering. In [16], Ding and Fu proposed a kernel-based fuzzy c-means clustering algorithm that optimizes fuzzy c-means clustering by combining an improved genetic algorithm with the kernel technique.
In this paper, a system for detecting public security events is introduced, which integrates key techniques of data mining and machine learning with the discovery and tracking of hot topics. Then, the shortcomings of single-pass-based methods, as well as the tendency of the FCM algorithm to fall into local optima, are formulated. To optimize the effectiveness of the system and to minimize deviation, an effective and practical framework of a modified FCM clustering method based on an adaptive genetic algorithm and semisupervised learning (AG-SL-FCM) is proposed for the public security events detection system. The proposed algorithm employs the adaptive genetic algorithm to optimize the determination of initial clustering centers and applies the guidance of prior knowledge from semisupervised learning to accelerate convergence. Specifically, our main contributions are as follows.
(1) Accurate clustering of news reports is essential for detection. Therefore, a scheme for a detection system for public security events is put forward, in which Chinese words are extracted to approximate the content of each text and are mapped as a vector model in three-dimensional space. After preprocessing, clustering analysis is carried out by the system.
(2) With regard to the shortcomings of slow convergence and the tendency to fall into local optima, we divide the clustering analysis into two subproblems: the determination of initial clustering centers and the iterative optimization of the objective function. In the former, the global optimization ability of the genetic algorithm is used to avoid local optima. In the latter, the idea of semisupervised learning is used to improve accuracy and speed up the iterative process.
(3) The closed-form expression of the detection scheme for the system, in which three keywords are used as three metric parameters, is derived. Experiments then show that the system functions well with the algorithm and the scheme proposed in this paper.
The rest of the paper is organized as follows. Section 2 presents the system model of the whole detection system. Section 3 describes the improved clustering algorithm in detail. Simulation results are given in Section 4. Finally, Section 5 provides the conclusion.

System Design and Model
In this section, we present the detailed design of the detection system for public security events, as shown in Figure 1. The concrete steps are summarized as follows.
(1) A semantics-based search engine is used to collect news reports from various news sites, tweets from microblogs, and public opinions from forums.
(2) All of the above document streams are first partitioned into segments by ICTCLAS, a word segmentation tool developed by the Chinese Academy of Sciences and one of the most popular tools for this task.
(3) The TF-IDF method [17] is adopted to calculate the weight of each keyword, and the classical vector space model (VSM) [18] is used to extract the main points of each sample. The samples are then mapped into three-dimensional space after dimensionality reduction.
(4) Data mining algorithms cluster the samples; this is the core part of the system and the main content of our research.
The main model of each part of the detection system is introduced as follows.
2.1. Search Engine Model. Web crawlers, the main component of the search engine, are applications that aim to crawl as fast as possible and to collect as many webpages associated with predefined themes as possible. They can collect the whole Internet block by block and integrate the sampling results of different sectors to improve the acquisition coverage rate and the page utilization rate.
To overcome the disadvantages of the traditional general web crawler, the capture scheme employed in this paper is based on a dynamic topic base [19]. It can update the topic base intelligently during crawling, reduce misjudgment of the topic relevance of anchor text, and filter URLs better. Meanwhile, during the search, the topic relevance of webpages only needs to be judged against the concept set, not the whole Internet.

Text Representation Model.
As is well known, texts are character strings formed by meaningful symbols connected in order, and they possess two basic characteristics: the frequency of the words that constitute them and the order in which these words are connected. Texts can therefore be expressed by the frequency and interrelation of feature items. Representing the sequence information of feature items requires a directed pointer structure, so the whole text would become a complicated graph or net. In contrast, a single vector is enough to represent the frequency information of the characteristic items in a text.
VSM, first put forward by Salton in 1968, represents texts as vectors and has long been the most classic model for text representation [17, 18]. Its idea originates from the fact that all texts and queries contain some independent properties, each expressed by a characteristic item that reveals their content and each regarded as one dimension of the vector space; thus texts and queries can be expressed as collections of these attributes, ignoring the complicated relations among paragraphs, sentences, and words. A text is expressed as a vector as follows:

$$d_j = \left(w_1(d_j), w_2(d_j), \ldots, w_k(d_j)\right), \qquad (1)$$

where $k$ is the number of characteristic items used when extracting characteristics and $w_i(d_j)$ is the weight of the $i$th characteristic item in document $d_j$. The weight combines the frequency of a characteristic word in a document (term frequency) with the number of documents in which the characteristic word occurs (document frequency). Thus, the weight of a characteristic word is computed as follows (TF-IDF model):

$$w_i(d_j) = tf_{ij} \times \log\frac{N}{n_i}, \qquad (2)$$

where $tf_{ij}$ is the frequency with which the $i$th characteristic item occurs in $d_j$, $N$ is the total number of documents, and $n_i$ is the number of documents in which the $i$th characteristic item occurs. Texts and queries can each be expressed as a vector, and the similarity between them can be measured by the distance between the vectors.
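The weighting and distance comparison described above can be illustrated with a minimal sketch. It assumes the common form of TF-IDF (raw term count times log(N / n_i)); the function names, the toy documents, and the three-word vocabulary are hypothetical and are not taken from the paper's dataset.

```python
import math

def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF weight vector per document (docs are token lists)."""
    n_docs = len(docs)
    # n_i: number of documents that contain characteristic item i
    doc_freq = [sum(1 for d in docs if term in d) for term in vocabulary]
    vectors = []
    for d in docs:
        vec = []
        for term, n_i in zip(vocabulary, doc_freq):
            tf = d.count(term)  # frequency of the item in this document
            idf = math.log(n_docs / n_i) if n_i else 0.0  # unseen items get weight 0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

def euclidean(a, b):
    """Distance between two weight vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical segmented documents and a three-keyword vocabulary.
docs = [
    ["explosion", "bus", "explosion"],
    ["campus", "attack"],
    ["bus", "campus", "explosion"],
]
vocabulary = ["explosion", "campus", "attack"]
vectors = tfidf_vectors(docs, vocabulary)
```

With three keywords as the vocabulary, each document becomes a point in three-dimensional space, which is the representation that the clustering step operates on.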

Fuzzy C-Means Algorithm Model.
Fuzzy c-means (FCM) [16], with a strict theoretical basis in fuzzy mathematics, is an unsupervised clustering algorithm based on a single objective function. Suppose the total number of samples to be clustered is $n$, $c$ is the number of clusters, and $u_{ij}$ denotes the fuzzy membership of the $i$th sample point in the $j$th cluster. The membership degree matrix is given by the following formula:

$$U = [u_{ij}]_{n \times c}, \quad u_{ij} \in [0, 1], \quad \sum_{j=1}^{c} u_{ij} = 1. \qquad (3)$$

Each row of matrix $U$ holds the memberships of one sample in each class. With the fuzzy exponent $q$ added, the objective function, which is in essence the weighted distance between each point and the clustering centers, is expressed as follows:

$$J_q(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{q} d_{ij}^{2}, \qquad (4)$$

where $d_{ij} = \lVert x_i - v_j \rVert$ is the Euclidean distance between the $i$th sample $x_i$ and the $j$th clustering center $v_j$. The iterative process of the FCM algorithm is a search for the minimum value of $J_q(U, V)$. Typically, the fuzzy exponent $q$ takes the value of 2. The fuzzy membership is updated as follows:

$$u_{ij} = \left[\sum_{l=1}^{c} \left(\frac{d_{ij}}{d_{il}}\right)^{2/(q-1)}\right]^{-1}. \qquad (5)$$

The clustering center adjustment formula is defined as follows:

$$v_j = \frac{\sum_{i=1}^{n} u_{ij}^{q} x_i}{\sum_{i=1}^{n} u_{ij}^{q}}. \qquad (6)$$

The two steps mentioned above, evaluation of fuzzy membership and recomputation of cluster centers, are executed repeatedly until the cluster centers no longer change. They are the necessary conditions for $J_q(U, V)$ to reach a minimum, which can be proved using the Lagrange multiplier.
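The alternation between membership evaluation and center recomputation can be sketched as follows. This is a minimal illustration of standard FCM with Euclidean distance, not the authors' MATLAB implementation; the parameter name `q` for the fuzzy exponent, the stopping tolerance, and the toy samples are assumptions.

```python
import math
import random

def fcm(samples, c, q=2.0, max_iter=100, tol=1e-6, centers=None, seed=0):
    """Standard fuzzy c-means; returns (membership rows, cluster centers)."""
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    # Initial centers: supplied by the caller, or c samples drawn at random.
    v = [list(p) for p in (centers or rng.sample(samples, c))]
    u = []
    for _ in range(max_iter):
        # Membership step: u_ij = 1 / sum_l (d_ij / d_il)^(2 / (q - 1)).
        u = []
        for x in samples:
            d = [max(math.dist(x, vj), 1e-12) for vj in v]  # avoid division by zero
            u.append([1.0 / sum((d[j] / d[l]) ** (2.0 / (q - 1.0)) for l in range(c))
                      for j in range(c)])
        # Center step: each center is the u^q-weighted mean of all samples.
        new_v = []
        for j in range(c):
            w = [u[i][j] ** q for i in range(n)]
            total = sum(w)
            new_v.append([sum(w[i] * samples[i][a] for i in range(n)) / total
                          for a in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(v, new_v))
        v = new_v
        if shift < tol:  # centers no longer change
            break
    return u, v

# Two well-separated toy groups in three-dimensional space.
samples = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
           (1.0, 1.0, 1.0), (0.9, 1.0, 1.0), (1.0, 0.9, 1.0)]
u, v = fcm(samples, 2, centers=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])
```

The `centers` argument is where externally optimized initial centers could be supplied; with random initialization the result depends on the seed, which is the sensitivity that motivates a global search for the starting point.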

Proposed Framework
Our research combines the ideas of the adaptive genetic algorithm and semisupervised learning with the FCM algorithm to obtain an improved FCM algorithm, the AG-SL-FCM algorithm. It makes full use of the high efficiency of the FCM algorithm, the global search ability of the genetic algorithm, and the guidance of prior knowledge, which ensures the efficiency and precision of clustering.

The Determination of Initial Clustering Centers Based on Adaptive Genetic Algorithm.
The basic idea of the adaptive genetic algorithm (AGA) [20] can be summarized as follows. First, an initial solution group is produced; then, excellent individuals are selected from the solution group according to some index. Next, genetic operators are applied to them to generate a new generation of candidates. Finally, this process is repeated until some convergence criterion is satisfied. Compared with many other optimization algorithms, the genetic algorithm, with its global search performance and robustness, has demonstrated unique advantages and is thus widely used in combinatorial optimization, pattern recognition, machine learning, and image processing.
The problem of being prone to fall into a local optimum, formulated above, can be effectively resolved by AGA. As a directed random search technique, AGA adopts a natural evolution strategy and works out a solution by continually evolving a population of candidate solutions [21]. The evolution process mainly includes selection, crossover, and mutation. Therefore, a method for determining the initial clustering centers based on AGA is proposed as the global search method of the system.
The key components of the method are depicted as follows.
The FCM algorithm is applied to each of the $m$ subsets to split it into $c$ categories and to compute its clustering centers. $V_i$ denotes the clustering centers of the $i$th subset. $(V_1, V_2, \ldots, V_m)$ is the initial solution group of the genetic algorithm; namely, the population size is $m$ and the individuals are the clustering centers of each subset. The genetic algorithm is then used to optimize the population in order to obtain clustering centers close to the global optimum.
(1) Initial Population. Assume that the training set has $n$ samples, the number of clustering centers is $c$, and each sample is a $k$-dimensional vector. These $n$ samples are evenly split into $m$ subsets in which the distribution of the different categories is consistent with that of the original samples. The initial population is a set of potential solutions whose number of individuals is $m$, and the initial population $P_0$ is represented as follows:

$$P_0 = \{V_1, V_2, \ldots, V_m\}, \qquad (7)$$

where $V_i$, a matrix of dimension $c \times k$, holds the clustering centers of the $i$th subset (individual).
(2) Encoding. Each individual of the population is a matrix of dimension $c \times k$ whose rows are clustering centers. In this paper, to avoid complexity and improve efficiency, a real-number coding strategy in concatenated form is adopted to link the $c$ sets of parameters representing the clustering centers, which shortens the chromosome and improves the global optimization ability and convergence speed of the algorithm. The result of encoding the clustering centers is represented as follows:

$$V_i = (v_{11}, v_{12}, \ldots, v_{1k}, v_{21}, \ldots, v_{2k}, \ldots, v_{c1}, \ldots, v_{ck}), \qquad (8)$$

where $v_{jl}$ is the $l$th component of the $j$th clustering center $v_j$.
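As a small illustration of the concatenated real-number coding, the sketch below flattens a set of c centers of dimension k into one chromosome of length c·k and restores it. The helper names `encode` and `decode` and the example values are hypothetical.

```python
def encode(centers):
    """Concatenate c cluster centers (each of length k) into one chromosome."""
    return [value for center in centers for value in center]

def decode(chromosome, c, k):
    """Split a chromosome of length c * k back into c centers of length k."""
    return [chromosome[j * k:(j + 1) * k] for j in range(c)]

# Hypothetical individual: c = 3 centers in k = 3 dimensions.
centers = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
chromosome = encode(centers)
restored = decode(chromosome, c=3, k=3)
```

Because the genes are real numbers rather than bit strings, crossover and mutation can act directly on coordinates of the centers.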
(3) Fitness Calculation. Fitness is employed to evaluate the performance of each individual in the population; a better individual has a higher fitness. As a performance criterion, the fitness function maps the value of the objective function to a fitness value, which plays an important role in the evolutionary process. For the FCM algorithm, optimal clustering results correspond to the minimal value of the objective function. Motivated by simulated annealing [22], the fitness function (9) applies a scale transformation to the objective value that is controlled by $T(t)$, the annealing temperature at evolutional generation $t$, where $t$ is the current generation number. In (10), which defines $T(t)$, $T_0$ is the initial annealing temperature and equals $2 \cdot t_{\max}$; $t_{\max}$ is the maximum number of generations; $\gamma$ denotes the annealing temperature coefficient, whose value is slightly less than 1. As the number of evolutional generations increases, the differences in fitness between individuals of the same generation become more evident, which provides more opportunities to select better individuals.
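The exact fitness formula is not reproduced here, so the following is only a sketch under stated assumptions: an exponential scaling exp(-J / T(t)) of the FCM objective value J with geometric cooling T(t) = T0 · gamma^t, where T0 = 2 · t_max follows the text. The function names, the default `gamma`, and the objective values are hypothetical.

```python
import math

def annealing_temperature(t, t_max, gamma=0.95):
    """Assumed geometric cooling; T0 = 2 * t_max as stated in the text."""
    return 2.0 * t_max * gamma ** t

def scaled_fitness(objective_values, t, t_max, gamma=0.95):
    """Map objective values (lower is better) to fitness (higher is better)."""
    temp = annealing_temperature(t, t_max, gamma)
    best = min(objective_values)
    # Shifting by the best value keeps exp() well conditioned and leaves
    # the fitness ratios between individuals unchanged.
    return [math.exp(-(j - best) / temp) for j in objective_values]

# The same two objective values, early and late in a 50-generation run.
early = scaled_fitness([10.0, 12.0], t=0, t_max=50)
late = scaled_fitness([10.0, 12.0], t=49, t_max=50)
```

Under these assumptions the fitness gap between the two individuals widens as the temperature falls, which matches the behavior described above: selection pressure is mild at first and becomes stronger in later generations.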
(4) Survivor Selection. The selection strategy, which plays an important role in algorithm performance, adopts the rule of survival of the fittest from evolution theory: high-potential individuals (parents) tend to produce better ones (offspring). Individuals in the population are selected to undergo crossover and mutation for reproduction using roulette wheel selection [23]. This selection is based on the probability distribution derived from the fitness obtained by (9). According to the theory of roulette wheel selection, a better individual has a higher chance of being selected. Consequently, after survivor selection, the better individuals survive and the worse ones are eliminated. In this approach, elite individuals are retained: they do not take part in crossover or mutation but enter the next generation directly. Poor individuals do not take part in crossover either, but they do undergo mutation, with a higher mutation probability than normal individuals. The probability distribution corresponding to the fitness function is then computed, and the remaining individuals of the current group are drawn from it to undergo crossover and mutation, so as to improve the average fitness of the group. The selection probability function is defined as follows:

$$P(V_i) = \frac{f(V_i)}{\sum_{j=1}^{m} f(V_j)}, \qquad (11)$$

where $f(V_i)$ is the fitness value of individual $V_i$.
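A minimal sketch of this selection scheme is given below. It assumes the usual fitness-proportional form of roulette wheel selection; how many individuals count as elite or poor is not stated in the text, so `n_elite` and `n_poor` are hypothetical parameters, as are the function names.

```python
import random

def selection_probabilities(fitness):
    """Fitness-proportional probabilities: f_i divided by the sum of all f."""
    total = sum(fitness)
    return [f / total for f in fitness]

def roulette_select(population, fitness, n_select, rng):
    """Draw n_select individuals with probability proportional to fitness."""
    return rng.choices(population, weights=selection_probabilities(fitness), k=n_select)

def split_elite_and_poor(fitness, n_elite=1, n_poor=1):
    """Return index lists (elite, poor, rest), ordered from best to worst."""
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    elite = order[:n_elite]                      # copied unchanged to the next generation
    poor = order[len(order) - n_poor:]           # mutation only, no crossover
    rest = order[n_elite:len(order) - n_poor]    # candidates for roulette selection
    return elite, poor, rest

probs = selection_probabilities([1.0, 3.0])
elite, poor, rest = split_elite_and_poor([0.2, 0.9, 0.5, 0.1])
picks = roulette_select(["a", "b"], [1.0, 99.0], 1000, random.Random(0))
```

Keeping the elite outside the genetic operators guarantees that the best centers found so far are never destroyed by crossover or mutation.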
(5) Crossover Operation. The crossover operation is employed to reproduce new individuals (offspring) by swapping segments of genetic information between two individuals (parents). In AGA, an adaptive crossover strategy is used to improve the search performance of the algorithm. It is realized by assigning a crossover probability to each crossover operation. The adaptive crossover probability $p_c$ is defined in terms of the following quantities: $f_{\max}$ and $\bar{f}$ are, respectively, the maximum and the average fitness of the current generation; $f_c$ denotes the higher fitness of the two selected individuals; $k_1$ and $k_2$ are coefficients fixed at initialization. When $f_c$ is higher than $\bar{f}$, a lower crossover probability is adopted to reduce the chance of destroying an excellent individual. When $f_c$ is lower than $\bar{f}$, a higher crossover probability is used to raise the chance that new and improved individuals emerge.
The crossover operation used in the proposed algorithm is arithmetic crossover [24]. Assume that $(V_a, V_b)$ is a parent crossover pair; the crossover operation produces the $j$th bit of the offspring $V_a'$ and $V_b'$ as a weighted combination of the $j$th bits of the two parents, where the number of crossover bits among chromosomes is a random integer in the range $[1, ck]$.
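The following sketch illustrates one way to realize the adaptive probability and the arithmetic crossover. The probability rule is assumed to take the widely used Srinivas and Patnaik form (linear decrease above the average fitness, constant below it); this form, the default coefficients, the single random weight `lam`, and the function names are assumptions rather than the paper's stated formulas.

```python
import random

def adaptive_crossover_prob(f_c, f_max, f_avg, k1=0.8, k2=0.9):
    """Lower probability for above-average pairs, constant k2 otherwise."""
    if f_c >= f_avg and f_max > f_avg:
        return k1 * (f_max - f_c) / (f_max - f_avg)
    return k2

def arithmetic_crossover(parent_a, parent_b, rng):
    """Blend a random number of genes (between 1 and the chromosome length)."""
    length = len(parent_a)
    n_bits = rng.randint(1, length)              # number of crossover bits in [1, c*k]
    positions = rng.sample(range(length), n_bits)
    lam = rng.random()                           # mixing weight in [0, 1)
    child_a, child_b = list(parent_a), list(parent_b)
    for j in positions:
        child_a[j] = lam * parent_a[j] + (1.0 - lam) * parent_b[j]
        child_b[j] = lam * parent_b[j] + (1.0 - lam) * parent_a[j]
    return child_a, child_b

rng = random.Random(1)
a, b = [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]
child_a, child_b = arithmetic_crossover(a, b, rng)
```

Because each offspring gene is a convex combination of the parent genes, the children always stay within the range spanned by their parents, which suits real-coded cluster centers.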
(6) Mutation Operation. Similar to gene mutation in genetics, the mutation operation is used to change some genetic information of an individual. Each individual in the population has a probability of mutation. A low probability may hinder the production of new individuals, which is not conducive to the evolution of the worse individuals. However, a high probability of mutation reduces the search performance of the random search algorithm and may work against retaining the better individuals. Thus, an adaptive mutation probability is employed to ensure the search ability. The adaptive mutation probability $p_m$ of individual $V_i$ is defined in terms of the following quantities: $f_i$ is the fitness of the individual $V_i$; $k_3$ and $k_4$ are predefined parameters. When the fitness of an individual is higher than the average fitness of the current generation, a lower mutation probability is adopted. When the fitness of the individual is lower than the average fitness, a higher mutation probability is used. In the proposed algorithm, the mutation operation is Gaussian mutation [25]. Each individual in the population should undergo the mutation operation.
Assume that $V_i$ is an individual selected for mutation; the mutation operation produces the $j$th bit of the offspring $V_i'$, where $r$ is a random integer in the range $[1, m]$ other than $i$. The number of mutated bits of a chromosome is a random integer in the range $[1, ck]$.
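A sketch of the adaptive mutation step is shown below. The source formula refers to a second, randomly chosen individual and is not reproduced here; the sketch swaps in a plain zero-mean Gaussian perturbation instead, and the probability rule is assumed to mirror the Srinivas and Patnaik form. The function names, the default coefficients, and `sigma` are hypothetical.

```python
import random

def adaptive_mutation_prob(f_i, f_max, f_avg, k3=0.05, k4=0.1):
    """Lower probability for above-average individuals, constant k4 otherwise."""
    if f_i >= f_avg and f_max > f_avg:
        return k3 * (f_max - f_i) / (f_max - f_avg)
    return k4

def gaussian_mutation(chromosome, rng, sigma=0.05):
    """Perturb a random number of genes (between 1 and the chromosome length)."""
    length = len(chromosome)
    n_bits = rng.randint(1, length)              # number of mutated bits in [1, c*k]
    positions = rng.sample(range(length), n_bits)
    child = list(chromosome)
    for j in positions:
        child[j] += rng.gauss(0.0, sigma)        # zero-mean Gaussian noise
    return child

rng = random.Random(2)
parent = [0.5] * 6
child = gaussian_mutation(parent, rng)
changed = sum(1 for x, y in zip(parent, child) if x != y)
```

A small `sigma` keeps mutated centers close to their parents, so mutation refines the search locally while crossover and selection handle the global exploration.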
After selection, crossover, and mutation, a new population is generated. The same operations are repeated until the maximum number of evolutional generations is reached. Finally, clustering centers close to the real optimal clustering centers are obtained. These centers serve as the initial values for the FCM algorithm, which is applied to the original $n$ samples to obtain globally optimal clustering centers for the subsequent semisupervised iteration.
In summary, with the genetic algorithm integrated into the FCM algorithm, the basic steps for determining the initial clustering centers in the AG-SL-FCM algorithm are as follows.
(1) Set the genetic algorithm parameters, such as the clustering number $c$, the population size $m$, the crossover probability, the mutation probability, and the maximum generation $t_{\max}$.
(2) Set the generation counter $t = 0$. Divide the training set of $n$ samples into $m$ subsets and compute the clustering centers of each subset using fuzzy c-means, which yields the $m$ individuals of the initial population.
(3) Set up the individual objective function and carry out fitness evaluation.
(4) Apply selection, crossover, and mutation to the current population to generate offspring.
(5) Compute the fitness of the offspring and add them to the parent group. Then, remove the individuals with low fitness.
(6) If the maximum generation $t_{\max}$ is reached, output the individual with the highest fitness in the current population and end the algorithm. Otherwise, return to step (4) to continue evolving.

The Process of FCM Iteration Based on Semisupervised Learning.
In this paper, prior knowledge is used to improve the basic unsupervised FCM algorithm, which places the method in the domain of semisupervised learning. $p$ ($p \in [0, 100]$) is the percentage of known samples among all samples. This prior knowledge acts as a guidance signal that accelerates the convergence of the iteration, which gives the semisupervised part of the modified method [26].
The main idea of the semisupervised method is to utilize the information of known samples for optimization. Adding the effect of this prior knowledge accelerates the convergence of the algorithm.
At first, the distance from each sample of prior knowledge to the initial clustering centers is computed and denoted as $d_{ij}$. Then, the membership $f_{ij}$ of each sample of prior knowledge is obtained. Thus the new objective function $J$ can be established, in which $b_i$ is a Boolean indicator of whether sample $x_i$ belongs to the prior knowledge. In (17), $f_{ij}$ is the membership computed among this part of prior knowledge, and $u_{ij}$ is the membership value for … = 0 in the whole cycle. The impact factor of prior knowledge $\alpha$ is proportional to the ratio of the amount of prior knowledge $p$ to the total number of samples. $d_{ij}$ is defined as the Mahalanobis distance [27] from sample $x_i$ to clustering center $v_j$, which can be expressed as

$$d_{ij}^{2} = (x_i - v_j)^{T} A^{-1} (x_i - v_j),$$

where $A$ is the correlation (covariance) matrix among the characteristic vectors of the samples. Introducing the Lagrange multiplier and minimizing the objective function $J$ as an unconstrained problem yields the update of the membership degree matrix, and the adjustment formula of the clustering centers follows in the same way, where $c$ is the number of clusters and $q$ is the fuzzy exponent, which equals 2. The whole AG-SL-FCM algorithm is described in Algorithm 1.
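To make the role of the prior knowledge concrete, the sketch below uses the well-known partially supervised FCM update of Pedrycz and Waletzky for fuzzy exponent 2. Whether the paper's update has exactly this form is an assumption; the sketch also uses Euclidean rather than Mahalanobis distance for brevity, and the function name, the `labels` layout, and the toy data are hypothetical.

```python
import math

def semisupervised_memberships(samples, centers, labels, alpha):
    """Membership update with prior knowledge (fuzzy exponent fixed at 2).

    labels[i] is None for an unlabeled sample, otherwise a list of c prior
    memberships for sample i. alpha weights the prior knowledge.
    """
    c = len(centers)
    u = []
    for x, prior in zip(samples, labels):
        d2 = [max(math.dist(x, v) ** 2, 1e-12) for v in centers]
        b = 0.0 if prior is None else 1.0        # Boolean indicator of prior knowledge
        f = prior or [0.0] * c
        row = []
        for j in range(c):
            unsupervised = 1.0 / sum(d2[j] / d2[l] for l in range(c))
            row.append(((1.0 + alpha * (1.0 - b * sum(f))) * unsupervised
                        + alpha * b * f[j]) / (1.0 + alpha))
        u.append(row)
    return u

samples = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
centers = [(0.0, 0.0), (1.0, 1.0)]
labels = [None, None, [1.0, 0.0]]                # third sample is known to be class 0
u_guided = semisupervised_memberships(samples, centers, labels, alpha=1.0)
u_plain = semisupervised_memberships(samples, centers, labels, alpha=0.0)
```

In this example the third sample is equidistant from both centers, so without prior knowledge its memberships are split evenly; with the label and `alpha = 1.0` its membership shifts toward the known class while each row still sums to 1. Setting `alpha = 0.0` recovers the unsupervised update.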

Simulation Results and Analysis
In order to verify the effectiveness of the detection system proposed in this paper, we compare the AG-SL-FCM algorithm with the traditional FCM algorithm [16] through experiments. The simulation was carried out on a computer with an Intel Pentium dual CPU (3.20 GHz), and the proposed algorithm was simulated in the MATLAB environment. At first, the FCM algorithm and the AG-SL-FCM algorithm are each applied to the experimental dataset. Three keywords are defined as metric parameters, in which word 1 is violent terrorist attacks, word 2 is campus chopping, and word 3 is explosion. The obtained clustering centers are shown in Table 1. Through prior knowledge we can verify that the sum of variance of the known samples is smaller in AG-SL-FCM. Therefore, after optimization by the genetic algorithm, the distribution of clustering centers in the AG-SL-FCM algorithm is closer to the ideal.
The convergence and the iterative process of the two algorithms are shown in Figure 2. It indicates that the convergence of the AG-SL-FCM algorithm is only slightly slower than that of the FCM algorithm. When the initial clustering centers are determined, the iterative process of the genetic algorithm increases the time consumption. However, the guidance of prior knowledge accelerates the optimization of the objective function. Therefore, the overall iterative speed does not slow down noticeably.
In Figures 3-7, different colors are adopted to represent different topic categories: red represents violent terrorist attacks, green represents campus chopping, and blue represents explosion. Three keywords are adopted as characteristic parameters to judge event categories. With a comprehensive analysis of these three parameters, 2000 sample points of public security events are divided into three categories so that they can be judged visually and correctly. Figures 3 and 4 show the clustering results of the FCM algorithm and the AG-SL-FCM algorithm, respectively, in three-dimensional space, in which the three coordinate axes represent the weights of the three indicators computed through the equations in Section 2. The experimental results show that the clustering centers obtained by the AG-SL-FCM algorithm are closer to the ideal than those of the FCM algorithm, which means that the AG-SL-FCM algorithm obtains better clustering results than the FCM algorithm.
In Figures 5, 6, and 7, since the effectiveness of AG-SL-FCM has been verified in the previous experiments, further study of AG-SL-FCM is conducted in two-dimensional space. From Figures 6 and 7, it can be concluded that, owing to the complexity and semantic relevance of the Chinese language (the scope covered by "violent terrorist attacks" partly overlaps with the other two parameters), the classification of samples differs when different indicators are used. On the basis of the parameters "violent terrorist attacks" and "explosion" we obtain a relatively accurate determination of event types, whereas under the indicators "violent terrorist attacks" and "campus chopping" the detection results remain misaligned. From Figure 7, we can also see that a large number of points gather in the range of 0.3 to 0.6 on the "violent terrorist attacks" axis, which may imply that some events have had a considerable impact on public opinion and could latently induce instability, especially in a country such as China, where malignant events caused by nationalism occur from time to time.

Conclusions
In this paper, an improved clustering method based on a genetic algorithm and semisupervised learning (AG-SL-FCM) is proposed for a detection system of public security events. In the proposed algorithm, the adaptive genetic algorithm is employed to select optimal initial clustering centers. Meanwhile, motivated by semisupervised learning, the guiding effect of prior knowledge is used to accelerate the iterative process. The simulation results reveal that, compared with the FCM algorithm, the improved algorithm not only realizes fuzzy clustering of data effectively but also has relatively good convergence speed. With this algorithm, the proposed detection system can discriminate event types from massive news reports and blog messages, improve the efficiency of detecting the spread of public opinion associated with public security events, and provide the relevant departments with a reliable basis for preventing and handling public security events.

Figure 1: Flow chart of the system.

Figure 2: The convergence process of objective function in FCM algorithm and AG-SL-FCM algorithm.

Figure 3: The distribution of sample sets in AG-SL-FCM algorithm under the influence of the three parameters.

Figure 5: The distribution of sample sets in AG-SL-FCM algorithm under the influence of "violent terrorist attacks" and "campus chopping."

Table 1: The clustering centers of FCM and AG-SL-FCM.