Railway Fault Text Clustering Method Using an Improved Dirichlet Multinomial Mixture Model

Railway signal equipment fault data (RSEFD) are one of the subjects of in-depth traffic big data analysis throughout the life cycle of intelligent transportation. In the course of daily operation and maintenance, the railway electrical maintenance department records equipment malfunction information in natural language. The data are highly specialized, short, and unbalanced across categories, and manual analysis and processing is inefficient. Effectively mining the information contained in these fault texts is therefore important for supporting on-site operation and maintenance. We propose a railway fault text clustering method using an improved Dirichlet multinomial mixture model, called ICH-GSDMM. In this method, first, a railway signal terminology thesaurus is established to overcome inaccurate segmentation of RSEFD. Second, the traditional chi-square statistic is improved to overcome the learning difficulties caused by the imbalance of RSEFD. Finally, the Gibbs sampling algorithm for the Dirichlet multinomial mixture model (GSDMM) is modified using the improved chi-square statistical method (ICH) to overcome the symmetry problem of the word Dirichlet prior parameters in the traditional GSDMM. Compared with the traditional GSDMM model and the GSDMM model based on chi-square statistics (CH-GSDMM), the quantitative experimental results show that the internal evaluation indices of clustering performance of the GSDMM model based on improved chi-square statistics (ICH-GSDMM) improve greatly, and its external evaluation indices are also the best, with the exception of the external index NMI on data set DS2. In addition, the diagnostic accuracy for several minority categories in RSEFD improves considerably, demonstrating the method's efficacy.

1. Introduction
The intelligent transportation system (ITS) is the development trend of future transportation systems and has received increasing attention. In-depth analysis of traffic big data over the whole life cycle is becoming one of the key scientific and technical problems in China's intelligent transportation. At present, this analysis is at an early stage: the data are not broad enough, the applications are not deep enough, and data integration and intelligence need further improvement. Fully exploiting the value of massive data covering the entire life cycle of the transportation field has become a basic research topic and promotes the construction of a new generation of intelligent transportation systems [1]. Railway signal equipment fault data (RSEFD) are one part of this massive life-cycle data and have likewise received increasing attention.
Railway signal equipment generally refers to track circuits, signals, turnouts, and other equipment related to train operation, and this equipment is the basis for ensuring the safe operation of trains. In daily operation, maintenance personnel record the fault phenomenon, the handling process of the equipment failure, and the fault diagnosis results in natural language, and store the fault data as text in paper or electronic files. With the increase of railway mileage and operation, a large amount of RSEFD has accumulated.
These data have long been stored in unstructured textual form, which is not conducive to computer processing and understanding [2]. During routine maintenance of railway signal equipment, maintenance workers must frequently learn from the handling experience recorded in a large number of existing equipment fault records, which they query and analyze manually. This results in low data processing efficiency and little intelligent use of the information [3,4]. Effectively reducing the search space, improving discovery efficiency, and mining the large amount of valuable fault identification and diagnosis information contained in the fault text [5,6], as well as establishing the association between fault feature words and fault classes, will make fault identification more effective and make similar situations easier to handle in the future [7]. Railway personnel manually classify the severity and domain reflected by the textual semantics of railway faults based on professional knowledge [8]. Because railway text data are unstructured and personnel records are irregular and random [3], extracting accurate fault information from unstructured natural language remains a challenge. The topic model is a traditional text clustering method that mines the semantic information of text well and is widely used. As the most popular topic model [9], LDA is often used for text clustering and has been applied successfully to long texts. Short texts, however, contain few words and are sparse. Because short texts lack repeated words, it is challenging for the traditional LDA topic model to screen relevant feature words. Moreover, the context in short texts is very limited, which makes semantic-based feature word extraction difficult.
When traditional topic modeling techniques are applied to RSEFD, the characteristics of short texts must be considered, and feature extraction algorithms carried over from long texts are often ineffective.
GSDMM [10] can automatically infer the number of clusters, balances the completeness and homogeneity of clustering results well, and converges quickly, which makes it more effective than LDA at extracting hidden topics from short texts [11]. The GSDMM model assumes that the parameter of the word Dirichlet prior distribution is symmetric; that is, the same Dirichlet prior is given to all words, and all words are treated equally when the model is generated. In practice, different words contribute differently to the clustering of topics, so GSDMM should consider a global weighting metric for each word [12], and the parameters of the Dirichlet prior distribution should differ from word to word.
To address the challenges posed by the symmetric assumption on the word Dirichlet prior parameter of the GSDMM model, chi-square statistics are introduced. The chi-square statistic tests the significance of the relationship between the value of a variable and a class [13]. The importance of different words to different classes can be well distinguished by the chi-square statistic (CHI). The larger the chi-square statistic value of a feature item in a specific class, the more representative the word is of that class. Chi-square statistics greatly alleviate the sparseness of feature words in short text data sets. However, chi-square statistics also have shortcomings. The traditional chi-square statistical algorithm does not take into account whether feature words are uniformly distributed within the class, and it ignores some features that rarely appear in the specified category but represent this category well [14-16]. The imbalance of fault data categories affects the performance of feature extraction algorithms and also causes serious difficulties for most clustering models and classifier learning algorithms, which assume a relatively balanced data distribution [7].
To solve the above problems, and to further improve both the mining quality of the hidden information in fault text and the clustering of railway signal fault text, this paper proposes an ICH-GSDMM model for railway fault text clustering. The main contributions are summarized as follows:
(1) A professional word segmentation dictionary for the railway signal field is constructed. The natural language of signal fault text is highly specialized, and general text segmentation tools segment some professional vocabulary poorly. This dictionary effectively improves the word segmentation accuracy of signal fault text, so that feature words represent text semantics better and text clustering improves.
(2) A feature word extraction method based on prior knowledge from improved chi-square statistics is proposed. This method filters out the feature words of each category based on the relationship between feature words and categories, which effectively alleviates the problem of loose topics in short texts and greatly reduces the inaccurate feature word extraction caused by imbalanced data categories.
The remainder of this paper is organized as follows. Section 2 reviews the literature on topic models and chi-square statistics. Section 3 explains the feature word extraction algorithm with improved chi-square statistics. Section 4 elaborates the GSDMM and ICH-GSDMM models. Section 5 presents the experimental data and analysis. Section 6 summarizes the paper and proposes future work.

Related Work
Extracting hidden fault information from fault text for clustering and equipment fault type identification has been the main line of earlier work on railway signal fault text. For example, the authors of [4] used the TF-IDF algorithm for feature word extraction and then integrated multiple classifiers by voting to learn fault text classification. The authors of [17] applied Word2vec to generate word vectors and the SMOTE algorithm to balance the amount of data, and finally used convolutional neural networks to classify fault texts automatically. The authors of [18] put forward a method for fault text classification based on Word2vec and parallel convolutional neural networks. Based on high-speed rail signal equipment fault text, the authors of [3,19] adopted the PLSA model and the labeled-LDA topic model, respectively, for feature extraction and fault text clustering, so as to realize fault diagnosis of on-board equipment in high-speed rail signaling systems. In [7], to classify fault text, the authors presented a syntactic feature extraction approach based on enhanced chi-square statistics and a semantic feature extraction method based on an LDA topic model with prior knowledge. These methods usually represent the text as a vector by calculating the word frequency or semantic information of the feature words in the fault text, and then calculate similarity to realize clustering or classification. Topic modeling approaches make it possible to cluster enormous amounts of unlabeled data efficiently. A topic model is an unsupervised machine learning model that belongs to the soft clustering methods and can effectively extract semantic information from text to mine the topics of the clustered text. In the LDA model [20], each text is assumed to be a mixture of topics, with each topic consisting of a set of related words that usually convey some semantic information [9,21].
Since railway signal fault text is short text, it contains few repeated words and the data are sparse, which leads to unsatisfactory estimation by LDA of the topic distribution of the text and the topic distribution of words. The GSDMM proposed by the authors of [10] is more suitable for short text clustering. Compared with other topic clustering methods, the short text topic vectors generated by GSDMM are of better quality, the clustering results have good completeness and homogeneity, convergence is fast, and the sparse and high-dimensional problems of short texts can also be handled. The GSDMM model is the Dirichlet multinomial mixture (DMM) model solved with the collapsed Gibbs sampling algorithm, and it assumes that each document is represented by only one topic. The authors of [22] adopted the GSDMM method for short text clustering in the field of web services, and the performance study showed that GSDMM is a more effective clustering method than other traditional topic modeling methods. The authors of [23] first used the GSDMM topic model to generate the topic vector of each text and then applied the AGNES algorithm to analyze the clustering effect of the topic vectors. The results showed that the GSDMM topic model achieves better clustering quality for service text. The authors of [24] proposed the FGSDMM+ algorithm, which uses multiple runs of the collapsed Gibbs sampling algorithm to complete online text clustering. The final clustering performance shows that the FGSDMM+ algorithm clusters data better than the GSDMM and FGSDMM algorithms.
The authors of [25] put forward an adaptive Dirichlet multinomial mixture clustering model (e-GSDMM), which uses a hyperparameter tuning algorithm to capture temporal dynamics automatically and thereby obtain the temporal variation of topic and word distributions for short texts. The clustering results show that e-GSDMM outperforms existing GSDMM methods on short text streaming data. In summary, there are currently few studies that improve on the assumption in the GSDMM model that the word Dirichlet prior distribution is symmetric. The larger the chi-square statistic value of a feature item in a specific class, the more representative the word is of that class. Chi-square statistics are often used for feature selection [26,27]. Because basic chi-square statistics have shortcomings, several researchers have improved them. The authors of [15] proposed a modified chi-square statistic for feature selection, based on the word frequency of feature items and their distribution within and between classes, and confirmed its efficacy. To address the problem that chi-square statistics leave some classes without attributes, the authors of [28] balanced the number of feature words selected for each class by improving the chi-square statistical algorithm and combined it with an SVM classifier to improve the performance of an Arabic text classification model. The above research on chi-square statistics in text classification models also illustrates their effectiveness in the field of text classification. Based on these considerations, this paper studies railway signal fault text clustering based on ICH-GSDMM.

Feature Extraction Based on Improved Chi-Square Statistics
The purpose of introducing chi-square statistics is to effectively extract the fault feature words of each category and reduce the impact of fault category imbalance on text clustering.

Chi-Square Statistics.
Chi-square statistics (CH) measure the degree of correlation between words and classes, under the assumption that a word $w_i$ and a class $c_i$ conform to a $\chi^2$ distribution with one degree of freedom. The higher the $\chi^2$ statistic value of the word $w_i$ for a category $c_i$, the greater the correlation between them and the smaller their independence. The chi-square statistic is defined as [7]

$$\chi^2(w_i, c_i) = \frac{N\left[f(w_i, c_i)\,f(\bar{w}_i, \bar{c}_i) - f(w_i, \bar{c}_i)\,f(\bar{w}_i, c_i)\right]^2}{f(w_i)\,f(\bar{w}_i)\,f(c_i)\,f(\bar{c}_i)}$$

where $N$ is the number of documents in the corpus, $\bar{w}_i$ indicates that the word $w_i$ is not included, $\bar{c}_i$ indicates the categories other than class $c_i$ in the corpus, $f(\cdot,\cdot)$ counts the texts with the given combination of word occurrence and class membership and thus reflects the relevance between the word $w_i$ and class $c_i$, $f(w_i)$ is the number of texts in the corpus that contain the word $w_i$, $f(\bar{w}_i)$ is the number of texts that do not contain the word $w_i$, $f(c_i)$ is the number of texts that belong to class $c_i$, and $f(\bar{c}_i)$ is the number of texts that do not belong to class $c_i$.
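As an illustration of the statistic described above, the following minimal Python sketch computes the standard chi-square value of one word/class pair from document counts. The function name and argument names are hypothetical, and the two-by-two contingency form is assumed to match the definition cited from [7].

```python
def chi_square(n_docs, n_word_class, n_word, n_class):
    """Chi-square statistic of a word/class pair from document counts.

    n_docs       -- N, number of documents in the corpus
    n_word_class -- documents of the class that contain the word
    n_word       -- documents that contain the word
    n_class      -- documents that belong to the class
    """
    a = n_word_class               # word present, in class
    b = n_word - n_word_class      # word present, other classes
    c = n_class - n_word_class     # word absent, in class
    d = n_docs - n_word - c        # word absent, other classes
    denom = n_word * (n_docs - n_word) * n_class * (n_docs - n_class)
    if denom == 0:
        return 0.0                 # degenerate margins carry no evidence
    return n_docs * (a * d - b * c) ** 2 / denom


# A word found in 20 of 100 documents, 18 of them inside a class of 30.
score = chi_square(n_docs=100, n_word_class=18, n_word=20, n_class=30)
```

A word that is spread over the classes in proportion to their size yields a value of zero, while a word concentrated in one class yields a large value.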

Improved Chi-Square Statistics.
For clarity, we refer to a class with a small number of texts as a minority class and a class with more texts as a majority class. Traditional chi-square statistics consider only the number of documents that contain a feature word, not the frequency of the feature word within these documents, which is a disadvantage for corpora with uneven data distribution. The notion of frequency is introduced to overcome unreliable feature word extraction caused by the small amount of text in the minority class. The ideas of interclass concentration and intraclass dispersion are introduced to overcome the problem that the standard chi-square statistic increases the weight of feature words that appear rarely in the given class but commonly in other classes [16].
To facilitate understanding, we define $K$ as the number of categories in a corpus. The intraclass dispersion of the feature word $t$ in category $C_i$ is calculated from $df_{ij}^{t}$, the frequency of $t$ in the text $d_{ij}$, and $cf_i^{t}$, the frequency of $t$ in category $C_i$. $\overline{cf^{t}} = \frac{1}{K}\sum_{i=1}^{K} cf_i^{t}$ is the mean value of $cf_i^{t}$ over all categories, and $tf_i^{t}$ is the interclass concentration of $t$ in category $C_i$. The improved chi-square statistic (ICH), formula (5), is calculated from the traditional chi-square statistic together with these quantities.

Feature Word Extraction.
This paper first selects a fixed number of words as important feature words representing each category according to the ICH. This filtering method effectively improves the quality of feature word extraction for minority classes and reduces the clustering problems caused by class imbalance in the corpus. We define the improved chi-square statistic value of a feature word as its ICH value and the traditional chi-square statistic value as its CHI value. The feature word extraction method based on ICH feature selection is given in Algorithm 1. Its inputs are the RSEFD set S, the fault term dictionary Ω, the fault category set C, and the threshold c, which is the number of important words in each category.
Algorithm 1 first initializes five empty sets (lines 1-2): FS is the corpus set, which holds the word set after data preprocessing; FI is the ICH value set; FI′ is the normalized FI set; Fw_c is the priori ICH value set; and FS′ is the important feature word set. According to the fault term dictionary Ω, the corpus set FS is obtained by preprocessing the RSEFD set S, for example by word segmentation and stop-word removal (lines 3-4). Then the ICH values of all words for each category in the corpus set FS are calculated according to formula (5) and stored in the ICH value set FI (lines 6-9). To facilitate comparison of the relationships between different fault feature words and different categories, the ICH value of each word in the set FI is normalized according to formula (6) (line 10), where K is the number of categories in the RSEFD set S, $w_i$ is a feature word in the corpus set FS, and $c_j$ ($1 \le j \le K$) is a category in the maintenance data set S. Next, FI′ is filtered according to the threshold c to obtain the priori ICH value set Fw_c (lines 11-13). Finally, the important feature word set FS′ is obtained from the priori ICH value set Fw_c (lines 14-15).
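The ranking and filtering steps of Algorithm 1 (lines 11-15) can be sketched as follows. This is only an illustration: the per-category scores are assumed to be already computed and normalized, the function name and data layout are hypothetical, and ties are broken alphabetically, which the paper does not specify.

```python
def select_feature_words(scores, c):
    """Keep the c highest-scoring words of every category.

    scores -- {category: {word: score}}, e.g. normalized ICH values
    c      -- number of important words kept per category
    Returns (prior, feature_words): the retained scores per category
    and the union of the retained words.
    """
    prior = {}
    feature_words = set()
    for category, word_scores in scores.items():
        # Sort by descending score; break ties by word for determinism.
        ranked = sorted(word_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        prior[category] = dict(ranked[:c])
        feature_words.update(prior[category])
    return prior, feature_words


# Hypothetical scores for two categories.
example = {
    "C2": {"track": 0.9, "circuit": 0.5, "lamp": 0.1},
    "C5": {"track": 0.2, "switch": 0.8, "point": 0.7},
}
prior, words = select_feature_words(example, c=2)
```

Here `prior` plays the role of the set Fw_c and `words` the role of FS′.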

Clustering Algorithms
This section first introduces the traditional GSDMM model and its implementation algorithm and then explains the ICH-GSDMM model proposed in this paper.

GSDMM Model.
GSDMM is a DMM model solved with the collapsed Gibbs sampling algorithm, and it is a probabilistic generative unsupervised model. Under the assumption that each document corresponds to exactly one topic, GSDMM adopts an iterative Gibbs sampling algorithm to approximate the model and finally generates the topic distribution of documents. Figure 1 shows a graphical representation of the process by which DMM generates documents.
In the DMM model, $\alpha$ is the topic Dirichlet prior distribution parameter, $\beta$ is the word Dirichlet prior distribution parameter, $\theta$ is the topic distribution matrix of the documents, and $\varphi$ is the topic distribution matrix of the words. $\theta$ and $\varphi$ satisfy $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$ and $\varphi \mid \beta \sim \mathrm{Dir}(\beta)$, where $\theta_{k,d}$ is the probability of document $d$ on topic $k$, with $\sum_{k} \theta_{k,d} = 1$ for each document $d$, and $\varphi_{k,w}$ is the probability of word $w$ on topic $k$, with $\sum_{w} \varphi_{k,w} = 1$ for each topic $k$. The topic of each document $d$ follows a multinomial distribution parameterized by $\theta$. The process of document generation by the DMM model can be described as follows: it first selects a mixture cluster $k$ according to formula (8) and then generates the document $d$ from that cluster. Different algorithms can then be used to solve the model and obtain the probability that a topic $k$ generates a document $d$. GSDMM is an approximate solution of the DMM model by collapsed Gibbs sampling. The Gibbs sampling algorithm repeatedly resamples topic assignments according to formula (12), from which $\theta$ and $\varphi$ are obtained and the topic of each document is finally deduced.
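To make the sampling procedure concrete, the following Python sketch implements collapsed Gibbs sampling for the DMM model in its commonly published form with a symmetric word prior, as in [10]. It is not the authors' code. The default β = 0.08 follows the parameter setting of this paper; the values of `K`, `alpha`, `iters`, and `seed`, and all function names, are assumptions made for illustration.

```python
import math
import random


def gsdmm(docs, K=8, alpha=0.1, beta=0.08, iters=15, seed=0):
    """Collapsed Gibbs sampling for the Dirichlet multinomial mixture.

    docs -- list of token lists; returns one cluster id per document.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})   # vocabulary size
    D = len(docs)
    m = [0] * K                                 # documents per cluster
    n = [0] * K                                 # words per cluster
    nw = [dict() for _ in range(K)]             # word counts per cluster

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a document from cluster k."""
        m[k] += sign
        n[k] += sign * len(doc)
        for w in doc:
            nw[k][w] = nw[k].get(w, 0) + sign

    z = [rng.randrange(K) for _ in docs]        # random initial clusters
    for doc, k in zip(docs, z):
        move(doc, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            move(doc, z[d], -1)                 # take document d out
            log_p = []
            for k in range(K):
                lp = math.log(m[k] + alpha) - math.log(D - 1 + K * alpha)
                seen = {}
                for i, w in enumerate(doc):
                    j = seen.get(w, 0)          # earlier occurrences of w in d
                    lp += math.log(nw[k].get(w, 0) + beta + j)
                    lp -= math.log(n[k] + V * beta + i)
                    seen[w] = j + 1
                log_p.append(lp)
            top = max(log_p)                    # stabilise before exponentiating
            weights = [math.exp(lp - top) for lp in log_p]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(doc, z[d], +1)                 # put d back under its new cluster
    return z
```

The number of clusters found after a run is the number of distinct values in the returned list, which is typically smaller than the initial `K` because empty clusters are rarely repopulated.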

ICH-GSDMM Model.
In this section, we explain the ICH-GSDMM model proposed in this paper, which introduces frequency, interclass concentration, and intraclass dispersion into the traditional chi-square statistics. First, the important feature words $W_{imp}$ of each category are screened out according to the threshold c. Then, the ICH values of the important feature words of each category are mapped to $[\lambda_1, \lambda_2]$ and used as the Dirichlet prior of these important words, namely $\beta_1'$, while the Dirichlet prior $\beta_2'$ of the remaining feature words is set to $\lambda_1$.
In the ICH-GSDMM model, the probability of document d selecting cluster k is as follows:

Input: maintenance data set S, fault term dictionary Ω, fault class set C, threshold c
Output: feature word set FS′, priori chi-square set Fw_c
begin
(1)  Initialize the parameters
     for s_i ∈ S do
(4)    FS ← FS ∪ word set obtained by word segmentation of s_i according to Ω
     end
(6)  for w_i ∈ FS do
(7)    for c_j ∈ C do
(8)      FI_i ← compute χ²(w_i, c_j) by formula (5)
       end
(9)    FI ← FI ∪ FI_i
     end
(10) FI′ ← normalization of FI by formula (6)
(11) for c_j ∈ C do
(12)   for w_i ∈ FI′ do
(13)     Fw_c ← Rank(χ′²(w_i, c_j), c_j, c)
       end
     end
(14) for w_i ∈ Fw_c do
(15)   FS′ ← FS′ ∪ w_i
     end
end
ALGORITHM 1: ICH feature selection.

Mathematical Problems in Engineering
where χ′max is the maximum value in the Fw_c set and χ′min is the minimum value in the Fw_c set. Table 1 displays the symbols in the ICH-GSDMM model, and Algorithm 2 describes the main steps of the ICH-GSDMM model.
First, $n_{z_{d_i}}$, $m_{z_{d_i}}$, and $n_{z_{d_i}}^{w_i}$ are initialized (line 1). Then, the important feature word set FS′ and the priori chi-square set Fw_c are obtained by calling Algorithm 1. Next, the correction parameter β′ of the Dirichlet prior distribution in the GSDMM model is obtained according to formula (13) (line 3).
The topic of each document in the corpus is then initialized (lines 4-8). Lines 9-18 are the iterative calculation process of GSDMM based on the collapsed Gibbs sampling algorithm according to formula (13). Finally, the document-topic distribution matrix of the corpus is obtained according to the ICH-GSDMM model.
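One plausible reading of the mapping of ICH values to the interval [λ1, λ2] is a min-max rescaling, sketched below in Python. This is an assumption, not the paper's formula (13): the function name, the per-word (rather than per-category) layout of the prior, and the default values of `lam1` and `lam2` are all hypothetical.

```python
def map_prior(ich_values, vocabulary, lam1=0.08, lam2=0.5):
    """Build a per-word Dirichlet prior from ICH values.

    ich_values -- {word: ICH value} for the important words only
    vocabulary -- iterable of all words in the corpus
    lam1, lam2 -- bounds of the target interval (hypothetical defaults)
    """
    beta = {w: lam1 for w in vocabulary}        # remaining words get lam1
    if not ich_values:
        return beta
    lo, hi = min(ich_values.values()), max(ich_values.values())
    span = hi - lo
    for w, x in ich_values.items():
        # Min-max rescaling; equal values all map to the upper bound.
        scaled = (x - lo) / span if span else 1.0
        beta[w] = lam1 + scaled * (lam2 - lam1)
    return beta


# Hypothetical ICH values for three important words out of four.
beta = map_prior({"track": 2.0, "switch": 4.0, "lamp": 3.0},
                 ["track", "switch", "lamp", "the"], lam1=0.1, lam2=0.5)
```

Under this reading, a sampler with a symmetric prior would use the word-specific value in place of the single β for each word, and the sum of all word-specific values in place of the vocabulary size times β.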

Evaluation Metrics.
The indicators used to evaluate the performance of clustering algorithms can generally be divided into two categories: internal and external. Internal evaluation does not require ground-truth labels; it evaluates the clustering effect by using similarity measures to assess intraclass and interclass relationships. External evaluation requires ground-truth labels; whether the clustering is reasonable is judged by analyzing the relationship between the cluster labels and the ground-truth labels.

Internal Evaluation
(1) Silhouette Coefficient. The silhouette coefficient (SC) measures the separation distance between clusters. The SC of a single cluster $k$ is

$$SC_k = \frac{1}{N}\sum_{i=1}^{N}\frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the average distance of element $i$ from the other elements in the same cluster, $b_i$ is the average distance from element $i$ to the elements of the nearest other cluster, and $N$ is the total number of elements in cluster $k$. The final silhouette coefficient score is the mean of $SC_k$ over all clusters. The higher the value of SC, the better the clustering performance.
(2) Davies-Bouldin Index (DBI). DBI relates the distances within clusters to the distances between clusters, and it is defined as

$$DBI = \frac{1}{N}\sum_{i=1}^{N}\max_{j \ne i}\frac{\sigma_i + \sigma_j}{d(x_i, x_j)}$$

where $N$ is the number of clusters, $x_i$ and $x_j$ are the $i$th and $j$th cluster centers, respectively, $d(x_i, x_j)$ is the distance between them, and $\sigma_i$ and $\sigma_j$ are the average distances from all points in the $i$th and $j$th clusters to their center points, respectively. DBI values reflect how similar texts are within the same cluster and across different clusters. The lower the DBI value, the better the clustering algorithm.
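The two internal indices can be sketched in plain Python as follows. The sketch assumes that each text has already been turned into a numeric vector (for example a document-topic vector, which is an assumption here), uses Euclidean distance, averages the silhouette per cluster first as described above, and requires at least two clusters with distinct centers. All function names are hypothetical.

```python
import math


def _dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _group(points, labels):
    """Collect the points of each cluster label."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return clusters


def silhouette(points, labels):
    """Mean over clusters of the per-cluster mean silhouette value."""
    clusters = _group(points, labels)
    per_cluster = []
    for label, members in clusters.items():
        total = 0.0
        for p in members:
            if len(members) == 1:
                continue                      # singleton contributes 0
            a = sum(_dist(p, q) for q in members) / (len(members) - 1)
            b = min(sum(_dist(p, q) for q in other) / len(other)
                    for k, other in clusters.items() if k != label)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
        per_cluster.append(total / len(members))
    return sum(per_cluster) / len(per_cluster)


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (sigma_i + sigma_j) / d(x_i, x_j)."""
    clusters = _group(points, labels)
    centers = {k: [sum(col) / len(m) for col in zip(*m)]
               for k, m in clusters.items()}
    sigma = {k: sum(_dist(p, centers[k]) for p in m) / len(m)
             for k, m in clusters.items()}
    worst = [max((sigma[i] + sigma[j]) / _dist(centers[i], centers[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)
```

Note that some libraries average the silhouette over all elements rather than over clusters, so values can differ when cluster sizes are unequal.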

External Evaluation.
The external evaluation indices NMI, AMI, and ARI all require ground-truth labels and cluster labels.
(1) Normalized Mutual Information. Normalized mutual information (NMI) normalizes the mutual information MI(X, Y) by the entropies H(X) and H(Y), where $X = \{x_1, x_2, \ldots, x_N\}$ is the cluster division after clustering and $Y = \{y_1, y_2, \ldots, y_N\}$ is the real category division. H(X) and H(Y) denote the entropy of X and Y, respectively, and MI(X, Y) is the mutual information between X and Y.
(2) Adjusted Mutual Information. Adjusted mutual information (AMI) corrects the mutual information for chance agreement by means of E(MI(X, Y)), where E(.) is the expectation of MI(X, Y).

(3) Homogeneity (H)
Homogeneity is defined as

$$H = 1 - \frac{H(C \mid K)}{H(C)}, \qquad H(C \mid K) = -\sum_{c}\sum_{k}\frac{n_{c,k}}{n}\log\frac{n_{c,k}}{n_k}, \qquad H(C) = -\sum_{c}\frac{n_c}{n}\log\frac{n_c}{n}$$

where $H(C)$ is the category division entropy, $H(C \mid K)$ is the conditional entropy of the category division given the clustering, $n$ is the total number of texts in the corpus, $n_c$ is the number of texts in category $c$, $n_k$ is the number of texts in cluster $k$, and $n_{c,k}$ is the number of texts of class $c$ that are assigned to cluster $k$.
Homogeneity expresses the goal that each cluster contains elements of only one true group. A cluster is perfectly homogeneous if all elements in a cluster have the same external label.

(4) Completeness (C)
The variable definitions of completeness are similar to those of homogeneity; completeness is defined through the conditional entropy of the cluster distribution given the external class labels. Completeness expresses the goal that all members with the same ground-truth label are assigned to one cluster.
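Homogeneity and completeness are symmetric to each other, which the following Python sketch makes explicit. It assumes the usual entropy-based definitions, with completeness taken as one minus the conditional entropy of the clusters given the classes divided by the cluster entropy. The function names are hypothetical, and a score of 1.0 is returned when the normalizing entropy is zero.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label sequence (natural logarithm)."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(A | B) for two equally long label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_count = Counter(b)
    return -sum(n_ab / n * math.log(n_ab / b_count[y])
                for (_, y), n_ab in joint.items())


def homogeneity(classes, clusters):
    """1 - H(C | K) / H(C): each cluster should hold a single class."""
    h_c = entropy(classes)
    return 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c


def completeness(classes, clusters):
    """1 - H(K | C) / H(K): each class should fall into a single cluster."""
    h_k = entropy(clusters)
    return 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
```

Splitting every class into separate clusters keeps homogeneity at its maximum while lowering completeness, and merging everything into one cluster does the opposite.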

Classification Correct Rate.
To compare classification accuracy, we introduce the classification correct rate (CCR) [31]. The formula for CCR is

$$CCR = \frac{1}{n}\sum_{d=1}^{n}\delta(y_d, y'_d)$$

where $n$ represents the total number of texts in the cluster, and $y_d$ and $y'_d$ represent the class labels of document $d$ and the highest-ranked label among the predicted class labels, respectively. $\delta(\cdot)$ is an indicator function: when classifying a multilabel data set, $\delta(y_d, y'_d) = 1$ if $y'_d$ is in $y_d$, and 0 otherwise. The larger the CCR value, the better the clustering performance. Classification accuracy thus provides a good assessment of the performance of clustering models.
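Reading the indicator as 1 whenever the highest-ranked predicted label is among the labels of the document, CCR reduces to a simple hit rate, as in the following sketch. The function name and the input layout are hypothetical, and the step that maps cluster ids to class labels is assumed to have been done beforehand.

```python
def ccr(true_labels, predicted):
    """Fraction of documents whose top predicted label is a true label.

    true_labels -- per document, a single label or a collection of labels
    predicted   -- per document, the highest-ranked predicted label
    """
    hits = 0
    for y, y_pred in zip(true_labels, predicted):
        # Wrap a single label so membership is tested against a set.
        y_set = y if isinstance(y, (set, frozenset, list, tuple)) else {y}
        if y_pred in y_set:
            hits += 1
    return hits / len(predicted)


# Four documents, the third one is assigned to the wrong class.
rate = ccr([{"C1"}, {"C2", "C5"}, "C3", {"C6"}], ["C1", "C5", "C2", "C6"])
```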

Experimental Data Set.
The experimental data set DS1 is a Chinese data set, an RSEFD set collected by a railway company in China from 2016 to 2020, with a total of 1527 samples. To better test the clustering performance of the ICH-GSDMM model proposed in this paper, the English data set DS2 is also introduced. DS2 is provided by https://github.com/pokarats/gsdmm and contains a total of 20000 records. Table 2 shows examples from data sets DS1 and DS2. Table 3 describes each fault category of data set DS1 and its proportion of the whole data set. It can be seen from Table 3 that the RSEFD set DS1 is a typical imbalanced data set. Track circuit fault (i.e., C2) and Switch fault (i.e., C5) are the majority classes, while LKJ fault (i.e., C3) and Cab signal fault (i.e., C6) are the minority classes. The classification accuracy of every fault category plays a key role in ensuring the safety and efficiency of the railway system. The data set DS2 contains 20 categories, each with 1000 samples.

Experimental Setup and Results.
The experimental machine is configured with an i7-10510U processor, 16.0 GB of RAM, and the Windows 10 operating system, and the program is written in Jupyter Notebook.
This section is organized in two parts. The first part describes the parameter settings of each topic model. The second part evaluates and analyzes the clustering performance of GSDMM, CH-GSDMM, and ICH-GSDMM.

Parameter Setting.
The β value of the prior Dirichlet distribution affects the performance of GSDMM. According to [10], when β is in the range [0.08, 0.1], the GSDMM model has high homogeneity and completeness, so this paper selects β = 0.08.

Analysis and Discussion
(1) Internal Evaluation. Tables 4 and 5 report the SC, CH, and DBI results of the three topic models for 20, 40, and 60 clusters, respectively.
[...] are the same as the NMI and AMI scores of the GSDMM model, and the scores of the remaining external evaluation indices H and C of the ICH-GSDMM model are the best among the three models. On data set DS2, the overall external evaluation result of ICH-GSDMM is better than those of the CH-GSDMM and GSDMM models.
(3) CCR Analysis. The CCR value is the average of the CCR values of each category in data sets DS1 and DS2. Table 10 shows the CCR scores on data sets DS1 and DS2. The CCR score of the ICH-GSDMM model is the highest at 0.614, followed by CH-GSDMM and GSDMM. The CCR scores of each class in data sets DS1 and DS2 are shown in Figure 2.
In Figure 2(a), the data set DS1 contains 7 ground-truth labels, C0∼C6. Because C1 (ATP fault) has little correlation with other classes, its CCR value reaches 1.0, which is better than in the CH-GSDMM and GSDMM models. Except for class C2, the CCR scores of the classes of data set DS1 in the ICH-GSDMM model are better than those in the GSDMM and CH-GSDMM models. The reason for the lower CCR score of class C2 may be that C2 (Track circuit fault) concerns a basic ground equipment system for railway signals, belongs to the majority classes of data set DS1, and has a greater correlation with class C2, C3, C4, and C5. Compared with the GSDMM model, the CCR scores of the [...]
       end
     end
   end
(19) Return the result of topic distribution of each document $\vec{Z}$
end
ALGORITHM 2: The ICH-GSDMM algorithm.
From the analysis of the CCR performance of each class in Figure 2, it can be seen that the overall performance of the ICH-GSDMM model is still the best among the three models.

(4) Effect Analysis of the Number of Clusters.
To study the effect of the number of iterations on the number of clusters discovered by the ICH-GSDMM, CH-GSDMM, and GSDMM models, we set the initial cluster number parameter K to 20 for data set DS1 and to 40 for data set DS2. Figure 3 displays the number of clusters discovered by the three models at different iterations. Figure 3(a) shows that the number of clusters discovered by the ICH-GSDMM, CH-GSDMM, and GSDMM models decreases rapidly and remains stable after approximately 9, 15, and 7 iterations, respectively. In order of closeness to the actual number of clusters, the models rank ICH-GSDMM, CH-GSDMM, and GSDMM. Figure 3(b) shows that the number of clusters discovered by the ICH-GSDMM and CH-GSDMM models drops rapidly after about 6 iterations, while that of the GSDMM model drops rapidly after about 17 iterations, and the number of clusters finally discovered by the GSDMM model differs most from the actual number of categories in data set DS2. Both the ICH-GSDMM and CH-GSDMM models discover the number of clusters faster, and the ICH-GSDMM model finds the number of clusters closest to the actual number after about 28 iterations. Data set DS2 contains far more documents than data set DS1 (20000 versus 1527), which may be the reason why the number of clusters discovered by the ICH-GSDMM model in Figure 3(b) did not remain stable for a long time.

Conclusion
Compared with traditional topic modeling techniques, the GSDMM model is more suitable for short text clustering. However, the GSDMM model assumes that the Dirichlet prior distribution of words is symmetric, i.e., all words are given the same prior distribution. All words are therefore treated equally when the model is constructed, which is not realistic. To solve this problem, we proposed the ICH-GSDMM model. The improved chi-square statistics (ICH) method introduces frequency, interclass concentration, and intraclass dispersion into the traditional chi-square statistical (CH) method. The ICH-GSDMM model uses the ICH method to generate the Dirichlet prior distribution of the important words of each category in the corpus and thereby modifies the traditional GSDMM model. Finally, we evaluated the internal and external clustering performance of the traditional GSDMM model, the CH-GSDMM model, and the proposed ICH-GSDMM model. The results indicate that the internal evaluation indices of the ICH-GSDMM model improve greatly. The external evaluation indices also improve, except for NMI on data set DS1. For the imbalanced data set DS1, the classification accuracy of minority classes improves significantly, which also verifies the effectiveness of the model. Future work will further optimize the calculation of the word Dirichlet prior distribution in the GSDMM model and evaluate the impact of the number of important words in each category on the clustering effect, in order to improve the ICH-GSDMM model and its external evaluation performance.

Data Availability
All data, models, and code generated or used during the study appear in the submitted article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.