A Case Retrieval Strategy for Traffic Congestion Based on Cluster Analysis

To improve retrieval efficiency, this paper applies case-based reasoning (CBR) to the retrieval of traffic congestion cases and clusters the case database before retrieval so as to narrow the scope of case retrieval. For case clustering, the K-means algorithm, which performs well in text clustering, is selected to cluster traffic congestion edge cases. Because descriptions of traffic congestion share a certain similarity, the K-means algorithm is optimized to generate a more accurate clustering. The edge cases are clustered into microcase clusters of traffic congestion and then divided into different traffic congestion categories according to the distance to the cluster centers. Experimental results show that dividing the clustered case base into several microcase bases improves accuracy and shortens retrieval time, and provides a new idea for the retrieval step of case-based reasoning.


Introduction
Reference [1] proposes a segmentation method for retrieval: similar case groups of different levels are first formed according to the importance of events, and the degree of similarity is then calculated according to the new event levels and the related case groups. Reference [2] clusters associated cases to improve the success of case retrieval to a certain extent. Reference [3] proposes an intracase crossover algorithm to improve the processing of parallel data and the efficiency of case retrieval. Reference [4] puts forward a cleaning algorithm for regression filtering, which shortens the time for case retrieval. In Reference [5], a GRNN optimization method is used to improve the efficiency of CBR retrieval, realize the self-learning and self-growth of field problem diagnosis, and effectively avoid the low matching degree and slow convergence of traditional CBR algorithms. Reference [6] proposes a sememe-based set similarity matching algorithm (CMSBS) for analyzing cases with high similarity to the current case; experiments show that the algorithm performs better in terms of matching cases and matching accuracy. In Reference [7], the case similarity calculation methods for 5 different attributes are analyzed, and a mode combining subjective and objective weights is put forward. Reference [8] adopts a combination of local and global similarity calculation methods for different types of traffic congestion and also proposes a mode for updating and preserving the traffic congestion case database. In Reference [9], a traffic emergency decision-making method is designed; a case database for traffic-aided decision-making is established, a similarity calculation method based on global and local features is designed, and a case retrieval strategy is given.
In Reference [10], the weighted information degree is used to model the traffic route horizontally, and a new method for sampling the weighted competition value for a single demand level is proposed. In Reference [11], a microsimulation that characterizes the flow interaction is created using the sumo-jade toolchain, ensuring that emergency vehicles arrive as quickly as possible. In Reference [12], a hierarchical structure for representing historical cases is developed. Reference [13] evaluates a strategy for optimizing the performance of the road network by combining real-time traffic information with predicted traffic information and adopting a heuristic dynamic traffic assignment (DTA) model combined with case-based reasoning for instance detection. Reference [14] uses case-based reasoning to calculate the shortest traffic path and obtain the optimal solution.
Case-based reasoning in traffic safety has also been widely studied, but mainly for rail transit or large road networks. Those studies therefore mostly use a combined model of rule-based reasoning and case-based reasoning to regulate data analysis. Most studies of urban road congestion concern congestion prediction; relatively few address the timely dredging of congestion, and even fewer address decision support systems for urban road traffic congestion dredging based on case-based reasoning. For case retrieval in particular, the methods are often complicated. In this paper, a retrieval strategy based on text clustering is used to improve the retrieval step of case-based reasoning. Experiments show that the method is both feasible and advantageous.

Enumeration Property Calculation.
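Enumeration data are unstructured data and are compared mainly by Boolean calculation. The value can be 0 or 1, where 1 means the two values are the same and 0 means they are different. Let the $k$-th attribute of $C_i$ and $C_j$ be an enumeration attribute; then

$$\mathrm{sim}(C_{ik}, C_{jk}) = \begin{cases} 1, & C_{ik} = C_{jk} \\ 0, & C_{ik} \neq C_{jk} \end{cases}$$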

Numerical Attribute Calculation.
The distance between two different cases in the traffic congestion database is reflected by the difference between the values of the same numerical attribute in the two cases. In the similarity formula, $\max_k$ and $\min_k$ denote the maximum and minimum values of attribute $k$ in the case base, respectively.
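The two local similarity measures described so far can be sketched as follows. The enumeration rule follows the text directly. The numerical rule is an assumption: it uses the range-normalized absolute difference, a common choice in CBR systems that is consistent with the roles of $\max_k$ and $\min_k$ described above but is not confirmed as the exact formula of this paper. The function names are hypothetical.

```python
def enum_similarity(a, b):
    """Enumeration attribute: 1 if the two values are the same, 0 otherwise."""
    return 1.0 if a == b else 0.0


def numeric_similarity(a, b, min_k, max_k):
    """Numerical attribute: 1 minus the range-normalized absolute difference.

    Assumption: this normalization is a common CBR convention, not
    necessarily the formula used in the paper.
    """
    if max_k == min_k:
        return 1.0  # the attribute is constant in the case base
    return 1.0 - abs(a - b) / (max_k - min_k)


print(enum_similarity("signal failure", "signal failure"))
print(numeric_similarity(30.0, 50.0, 0.0, 100.0))
```

Both functions return values in [0, 1], so they can be combined later into a global similarity with attribute weights.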

Attribute Calculation of Numerical Interval Type.
Numerical interval data can be regarded as fuzzy intervals. Suppose $G_{ik}$, the $k$-th attribute of $C_i$, is of numerical interval type; then $G_{ik}$ is represented as the fuzzy interval $[G_{ik}^{-}, G_{ik}^{+}]$, where $G_{ik}^{-}$ and $G_{ik}^{+}$ are the lower and upper limits of the interval, respectively. Similarly, the $k$-th attribute of $C_j$ is represented as the fuzzy interval $[G_{jk}^{-}, G_{jk}^{+}]$. In the similarity formula for the $k$-th attribute of $C_i$ and $C_j$, $D(C_{ik} - C_{jk})$ denotes the average Euclidean distance between the $k$-th attributes of $C_i$ and $C_j$.
The similarity calculation is divided into two steps: once the similarities of the attributes of the different data types have been computed, the similarity among cases is calculated. First, the improved cluster analysis algorithm is used to cluster more than 660 cases in the database. Clustering is carried out according to the attribute value of the traffic congestion cause index.

Selection of Case Library Samples.
In a CBR (case-based reasoning) system, the case library, as an important component of the system, is represented as a set. Assume that the case library $C = (c_1, c_2, c_3, \ldots, c_n)$ is a nonempty finite set composed of $n$ cases, where $c_i$ ($1 \le i \le n$) represents one case of the set. The case library can be divided into $m$ grid units, and every case $c$ is assigned to one grid unit; the grid units are of the same size, and there is no critical case between grids. However, this initial division classifies the cases only coarsely and does not take critical cases into consideration, which makes the subsequent clustering inaccurate. Taking this into account, this paper adopts a modified K-means algorithm: Min-clusters are introduced for the critical cases, which are reclassified and clustered according to their feature values. That is, the critical cases are assigned to the $m$ grid clusters in a second clustering pass, so that the target case can find more similar cases and case treatment can yield an optimal solution.
During case retrieval, set the target case and select the most similar case by computing its matching degree with the elements of Set-C, thus determining the answering case. Meanwhile, store the target case in the case library. The more similar $c_i$ in case library $C$ is to the target case, the better $c_i$ answers it. Hence, users need to find, among the source cases, the case most similar to the target one. The weight of a case in the case library can be calculated from users' feedback on cases, and the best solution is determined based on weight. To select the initial values for case clustering, on the premise of grid division, put cases of higher weight into the same grid cell $p_i$ ($1 \le i \le m$), refine and cluster them a second time with the improved K-means algorithm to obtain cases of higher weight, classify $p_i$ to obtain $p_i'$, and then store it among the cases of higher weight after the second clustering.
Definition 1. Source cases in the case library whose similarity to the target case is greater than the specified threshold value sim are the source cases similar to the target case. A quadruple is adopted to represent the cases: A = (case, area case, tackle case, sim case).
Here, case represents any one of the cases in Case-C; area case is the set of all elements in the cluster that takes case as its sample case. Hence, the set of cases similar to the target case can be regarded as the area within which the distance to the target case is ≤ sim case. Tackle case is the set of answering cases, and each element of a tackle case is represented by a two-tuple T = (t case, count), where t case is the answering case and count is the frequency with which t case has been given as an answer; sim case is the set of cases that conform to the definition. Since the similar cases that are output must meet the definition, the similarity between the target case and the answering case is ensured. The weight of case $c_i$ is defined in terms of two quantities: count (case), the sum of count over all the elements in the tackle case, and count (global case), the sum of the frequencies of all the answering cases in Set-C. Finally, the weight value $q$ of $c_i$ is calculated. The purpose of case clustering is to divide the cases into several grids and store similar cases in each grid. When a target case is mapped to a certain area and its similarity there is found to be relatively high, that area is the case cluster used to generate the solution for the target case.
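The weight of a case described above can be illustrated with a small sketch. It assumes that the weight $q$ is the ratio of count (case) to count (global case); the text names these two quantities, but the ratio itself is an assumption here, and `case_weight` is a hypothetical helper name.

```python
def case_weight(tackle_case, global_count):
    """Weight q of a case, assumed to be count(case) / count(global case).

    tackle_case  : list of (t_case, count) two-tuples, as in Definition 1
    global_count : total answer frequency over all answering cases in Set-C
    """
    if global_count <= 0:
        return 0.0  # no feedback recorded yet
    case_count = sum(count for _, count in tackle_case)
    return case_count / global_count


# Two answering cases, answered 3 times and 1 time, out of 16 answers overall.
print(case_weight([("c7", 3), ("c9", 1)], 16))
```

Under this assumption the weight lies in [0, 1], and cases whose clusters have supplied more accepted answers rank higher.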
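First, calculate the similarity between all elements in case library $C$, generally using the Euclidean distance. The similarities of all elements in $C$ are represented by the following similarity matrix:

$$S = \begin{pmatrix} \mathrm{sim}_{11} & \mathrm{sim}_{12} & \cdots & \mathrm{sim}_{1n} \\ \mathrm{sim}_{21} & \mathrm{sim}_{22} & \cdots & \mathrm{sim}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{sim}_{n1} & \mathrm{sim}_{n2} & \cdots & \mathrm{sim}_{nn} \end{pmatrix}$$

where $0 \le \mathrm{sim}_{ij} \le 1$; $\mathrm{sim}_{ij} = 1$ when $i = j$, and $\mathrm{sim}_{ij} < 1$ when $i \ne j$. The $i$-th row (or $i$-th column) of the matrix contains all the similarities between $c_i$ and the other cases.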
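A minimal sketch of building such a similarity matrix is given below. The text only states that the Euclidean distance is used; the mapping from distance $d$ to similarity, here $1/(1+d)$, is an assumption chosen so that the diagonal equals 1 and all entries lie in $(0, 1]$. Cases are assumed to be numeric feature vectors of equal length.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_matrix(cases):
    """n x n matrix with sim_ij = 1 / (1 + d_ij) (assumed mapping)."""
    n = len(cases)
    return [
        [1.0 / (1.0 + euclidean(cases[i], cases[j])) for j in range(n)]
        for i in range(n)
    ]


S = similarity_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
for row in S:
    print([round(x, 3) for x in row])
```

The matrix is symmetric, so in practice only the upper triangle needs to be computed and stored.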
Matching the target case against every case in case library $C$ would cost considerable time. The cases in the case library are therefore clustered by similarity, and cases of higher similarity are classified into one grid, according to the following rules.
Rule 1: Combine two cases if the similarity between them is greater than the specified threshold value.
If the similarity between two cases $sc_i$ and $sc_j$ exceeds the specified threshold value sim, the two cases are regarded as the same, and $sc_i$ and $sc_j$ are combined into one case.
Rule 2: A case need not be stored if the case density in the grid unit is greater than the specified threshold value.
Set the density threshold value as $P$. $P$ is the maximum number of cases stored in the area, that is, the density of cases in the grid is controlled by $P$; if the grid is saturated, newly added cases are not stored, which controls the number of cases and limits misrepresentation inside the grid.
Rule 3: If the proportion of noise cases in the case library exceeds $S$, start clustering; if there is no intersection between cases, stop clustering. The effect of the clustering inside the grid unit is shown in Figure 1.
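The three rules can be sketched as a storage check for one grid unit. This is an illustration under assumptions, not the paper's implementation: a grid unit is modeled as a plain list, "combining" two near-identical cases is simplified to keeping the case already stored, and the names `try_store` and `should_recluster` are hypothetical.

```python
def try_store(grid, new_case, similarity, sim_threshold, density_cap):
    """Try to store new_case in grid (a list). Returns True if it was stored."""
    # Rule 2: a saturated grid unit accepts no further cases.
    if len(grid) >= density_cap:
        return False
    # Rule 1: a case more similar than the threshold to a stored case is
    # treated as the same case (assumption: the stored one is kept).
    for stored in grid:
        if similarity(stored, new_case) > sim_threshold:
            return False
    grid.append(new_case)
    return True


def should_recluster(noise_count, total_count, s):
    """Rule 3 (start condition): noise ratio in the case library exceeds S."""
    return total_count > 0 and noise_count / total_count > s


same = lambda a, b: 1.0 if a == b else 0.0  # toy similarity for the demo
grid = []
print(try_store(grid, "case-a", same, 0.9, 2))  # stored
print(try_store(grid, "case-a", same, 0.9, 2))  # merged with existing case
print(should_recluster(3, 10, 0.2))
```

Checking the density cap before the similarity scan avoids computing similarities for a grid unit that cannot accept the case anyway.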

Source Case 2nd Time Clustering
The traditional K-means algorithm can be described as follows: given the clustering quantity $K$, randomly select $K$ elements from the set to be clustered as the initial samples, and adjust the centroids through continuous iteration until clustering is complete. However, the cases in the case library share some common features, that is, the boundaries between cases are relatively fuzzy, so the result of retrieval depends strongly on the target case. Therefore, the clusters in a case library should include as many elements relevant to the target case as possible, so as to improve the success rate of retrieval. The K-means algorithm transfers data between stations and occupies abundant network resources, so the boundaries between data clusters are not clear. Moreover, internal data may be leaked during transfer. Introducing Min-clusters on top of K-means can therefore both greatly improve the efficiency of data clustering and reduce the possibility of a data breach. The improved algorithm interprets the system framework of the K-means algorithm from another perspective: it regards the main station as the center point and the center points of the $k$ clusters that have already been classified as margin points; this framework is called the center-margin structure. In this framework, each marginalized node deals only with the partial data near it, analyzes that data, and submits the analysis result directly to the center point, which performs the second round of treatment and analysis and finally obtains the result of data clustering.
The system framework is shown in Figure 2. Because there is no data interaction between the marginalized nodes in this framework, each marginalized node communicates only with the center point, and no metadata is transferred in the whole system. Thus, the heavy loss of metadata during transfer is reduced, and the breach of metadata during transfer is prevented. Hence, data clustering efficiency is greatly improved.
The first grid clustering has already divided the case library into several grids, each storing cases of higher similarity. However, some marginal cases between grids still cannot be classified into the corresponding case library. The improved K-means algorithm is therefore used to perform a second clustering of the marginal cases in the grids and to further reduce case noise in each grid.

The 2nd Time Clustering Algorithm.
In a distributed clustering environment, because the nodes differ, there is generally a time difference in clustering the original data, and the K-means algorithm is generally adopted to perform data clustering. However, the smaller the number of nodes selected in the K-means algorithm, the more unstable the clustering result, and this instability accumulates at each marginalized node, which finally makes the data transferred to the center node inaccurate. To avoid this situation, Min-clusters are introduced at the marginalized nodes to cluster the original data.

Theorem 1. A Min-cluster created from clustering is a subset of the source case.
The proof is by contradiction. Assume that there are $n$ cases $C_1, C_2, \ldots, C_n$ in one grid and that this classification is adjacent to the case $C'$. All the original data points in $C_1, C_2, \ldots, C_n$ are relatively far from the case $C'$. Since the clustering of the original data in $C_1, C_2, \ldots, C_n$ meets the definition of a Min-cluster, these Min-clusters are relatively far from the centroid of $C'$, so the second clustering of the Min-clusters will not assign them to $C'$. The other cases can be proved in the same way, which proves the theorem.
Take the marginal case $S_n$ as an example, and assume that the original data set in this case is $N$.
(1) Step 1: Select $k$ random data points from the $N$ data points as the initial center points; based on the established center points, $k$ Min-clusters form naturally. Each Min-cluster is a 1d + 3 dimensional vector of the form ($CF1_x$, $n$, class_id).
(2) Step 2: Calculate the distance from every data point to the $k$ center points and add each point to the cluster whose center is nearest, thus forming the Min-clusters.
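The two steps above can be sketched for a single marginal node as follows. The sketch assumes that $CF1_x$ is the centroid of the Min-cluster (consistent with its later use as "center point multiplied by the data quantity"), that data points are numeric tuples, and that class_id is a (node, index) pair; the function name, the `seed` parameter, and the class_id layout are hypothetical.

```python
import math
import random


def form_min_clusters(points, k, node_id, seed=0):
    """Summarize the points of one marginal node as Min-clusters.

    Returns a list of (CF1_x, n, class_id) tuples, where CF1_x is the
    centroid (assumption), n the member count, and class_id = (node_id, index).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # Step 1: k random initial centers
    members = [[] for _ in range(k)]
    for p in points:                           # Step 2: nearest-center assignment
        nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
        members[nearest].append(p)
    summaries = []
    for idx, group in enumerate(members):
        if not group:
            continue                           # possible only with duplicate centers
        dim = len(group[0])
        centroid = tuple(sum(p[d] for p in group) / len(group) for d in range(dim))
        summaries.append((centroid, len(group), (node_id, idx)))
    return summaries


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
for mc in form_min_clusters(data, 2, "S1"):
    print(mc)
```

Only these compact summaries, not the original data points, would then be sent to the center node, which matches the center-margin structure described earlier.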

Mathematical Problems in Engineering
The Min-clusters formed at marginal cases are finally transferred to the case center nodes, which have already been formed, for fusion.
Min-clusters have different weight values: the more nodes a Min-cluster contains, the higher its weight and the greater its chance of becoming a classification center. Furthermore, Min-clusters may overlap; hence, the Min-clusters are not equal, and a general clustering algorithm cannot be applied. Data clustering at the case center point therefore adopts a weight-based K-means clustering algorithm, which takes the centroid of each Min-cluster as a data object of the center node and assigns different weights according to the $n$ value of each Min-cluster. The algorithm steps are as follows:
Input: the Min-cluster sets $\{C_1, C_2, \ldots, C_m\}$ from $m$ cases, where each Min-cluster set $C_j$ includes the $k'$ Min-clusters $\{c_{j1}, c_{j2}, \ldots, c_{jk'}\}$ obtained by clustering at that node.
Output: the result of clustering of the whole case set.
(1) Step 1: Process the $m \times k'$ Min-clusters, select the Min-clusters whose centers are equal, and adjust the value of $n$.
(2) Step 2: Select $k$ clusters that have high weights and are relatively far from each other as the initial clustering centers. The distribution of data clusters is not even; if the centroids $A$ and $B$ of nodes $C_1$ and $C_2$ do not coincide but are very near, the weights of $C_1$ and $C_2$ are equivalent. If the initial centroids were selected by weight ranking alone, clusters with both $A$ and $B$ as centroids would result, and the clustering would not be accurate. Hence, this paper sets a threshold value to ensure proper separation of the centroids while performing weight ranking.
(3) First, rank the $m \times k'$ Min-clusters by weight; second, calculate the average distance between any two Min-clusters.
(4) Third, take the top-ranked Min-cluster as the first initial centroid; then compare the distance of each subsequent candidate Min-cluster with the set threshold value. A Min-cluster can be regarded as an initial centroid only when its distance is > the threshold value.
(5) Finally, select $k$ clusters that have high weight values and are far from each other as the initial clustering centers.
(6) Step 3: Assign each Min-cluster to the nearest case center according to distance, and update the case center and the quantity of original data it covers.
(7) Because a Min-cluster is itself a small-scale data set and differs from the data source previously included in the case, treat the Min-clusters as "original data" and calculate their geometric mean to determine the center of the classification, instead of only averaging data points. According to $CF1_x \cdot n$, that is, the Min-cluster center point multiplied by the data quantity included in the Min-cluster, record the data quantity, sum the results, and average them to obtain the center after the case update.
(8) Step 4: If the final case center does not change, proceed to Step 5; otherwise, return to Step 3.
(9) Step 5: Output the clustering results.
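The steps above can be sketched as a weight-based K-means over Min-cluster summaries. This is a minimal illustration under stated assumptions: each Min-cluster is a ($CF1_x$, $n$, class_id) tuple with $CF1_x$ taken as its centroid, the updated center is the count-weighted mean $\sum CF1_x \cdot n / \sum n$ (one reading of item (7)), and the separation threshold `min_dist` is passed in by the caller, since the text does not state how it is derived from the average distance. All function and parameter names are hypothetical.

```python
import math


def select_initial_centroids(min_clusters, k, min_dist):
    """Step 2: walk the Min-clusters in descending weight order and keep a
    centroid only if it is farther than min_dist from every kept centroid."""
    ranked = sorted(min_clusters, key=lambda mc: mc[1], reverse=True)
    chosen = []
    for centroid, _, _ in ranked:
        if all(math.dist(centroid, c) > min_dist for c in chosen):
            chosen.append(centroid)
        if len(chosen) == k:
            break
    return chosen


def weighted_kmeans(min_clusters, k, min_dist, max_iter=100):
    """Steps 3 to 5: assign Min-clusters to the nearest center and update each
    center as the count-weighted mean of its members until nothing changes."""
    centers = select_initial_centroids(min_clusters, k, min_dist)
    assignment = []
    for _ in range(max_iter):
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(mc[0], centers[c]))
            for mc in min_clusters
        ]
        new_centers = []
        for c, old in enumerate(centers):
            group = [mc for mc, a in zip(min_clusters, assignment) if a == c]
            total = sum(n for _, n, _ in group)
            if total == 0:
                new_centers.append(old)        # keep an empty center in place
                continue
            new_centers.append(tuple(
                sum(cf[d] * n for cf, n, _ in group) / total
                for d in range(len(old))
            ))
        if new_centers == centers:             # Step 4: centers are stable
            break
        centers = new_centers
    return centers, assignment                 # Step 5


mcs = [((0.0, 0.0), 5, "a"), ((0.0, 2.0), 5, "b"),
       ((10.0, 10.0), 4, "c"), ((0.5, 0.0), 6, "d")]
print(weighted_kmeans(mcs, 2, 3.0))
```

In the example, the two heaviest Min-clusters lie close together, so ranking by weight alone would pick two nearly coincident centroids; the distance check skips the second one and selects the distant Min-cluster instead.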
After the second clustering inside the grid, the cases in the grid become more accurate. By clustering the marginal cases, the similarity between cases is compared, the cases inside the grid are reclassified, the case centers are determined again, the conditions for case retrieval are enriched, and the target case can be matched better, as shown in Figure 3.
From Figure 2, it can be seen that the marginal cases are greatly reduced after the second clustering, and source cases that are more complete and independent of each other are generally formed inside the grid.

Case Retrieval Strategy
The success rate of case solving in a CBR intelligent system depends to a large extent on the quantity and similarity matching of the cases in the case library. Based on the two rounds of clustering of the case library (as shown above), map the target case to one of the grid units and then, based on the result of matching against the cases in the sample set $S$, retrieve the best-matching cases in that grid. The case retrieval process is shown in Figure 3.
The case retrieval process is as follows:
(1) The newly created target case is matched sequentially against the elements of the sample case set $S$, and the similarities $\mathrm{sim}_1, \mathrm{sim}_2, \ldots, \mathrm{sim}_n$ between the cases are calculated.
(2) Compare $\mathrm{sim}_1, \mathrm{sim}_2, \ldots, \mathrm{sim}_n$ with the similarity threshold value and extract the set $S'$ of sample cases that reach the threshold. If $S'$ is empty, the target case is stored and marked as a noise case; if $S'$ is not empty, extract the sample case $s'$ that is the most similar according to $\mathrm{sim}_1, \mathrm{sim}_2, \ldots, \mathrm{sim}_n$.
(3) Store $s'$ in the temp list and match the elements of $s'$ with the target case, acquire the most similar solution set simcase', sort it by degree of similarity, and output it.
(4) Judge whether the recommendation is successful according to the users' feedback. If successful, examine the target case and the selected cases in the temp list and ascertain whether the storage conditions are met. If Rule 1 and Rule 2 are met, store the target case; if the recommendation failed, store the target case on the premise that Rule 1 is met; otherwise, do not store it.
(5) If no case is stored in the grid cell after the second clustering, finish retrieval; otherwise, judge whether Rule 3 is met: finish if it is met; otherwise, cluster again and return to step (3).
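Steps (1) to (3) of the retrieval process can be sketched as a two-stage lookup: first against the sample cases, then inside the cluster of the best sample. The sketch assumes one sample case per cluster, stored in parallel lists, and a caller-supplied `similarity` function returning values in [0, 1]; the function name and data layout are hypothetical, and the feedback and storage logic of steps (4) and (5) is left out.

```python
def retrieve(target, sample_cases, clusters, similarity, sim_threshold):
    """Return the cases of the best-matching cluster, most similar first,
    or None when no sample case reaches the threshold (noise case)."""
    # Step (1): similarity between the target and every sample case.
    sims = [similarity(target, s) for s in sample_cases]
    # Step (2): keep the samples that reach the threshold (the set S').
    candidates = [i for i, v in enumerate(sims) if v >= sim_threshold]
    if not candidates:
        return None  # caller stores the target and marks it as a noise case
    best = max(candidates, key=lambda i: sims[i])  # the sample case s'
    # Step (3): rank the members of the chosen cluster by similarity.
    return sorted(clusters[best], key=lambda c: similarity(target, c), reverse=True)


sim = lambda a, b: 1.0 / (1.0 + abs(a - b))  # toy similarity on scalars
samples = [1.0, 10.0]
clusters = [[0.5, 1.0, 2.0], [9.0, 10.0, 12.0]]
print(retrieve(9.8, samples, clusters, sim, 0.5))
print(retrieve(100.0, samples, clusters, sim, 0.5))
```

The target is compared with one sample per cluster plus the members of a single cluster, rather than with every case in the library, which is where the reduction in retrieval scope comes from.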

Experimental Analysis
The weight calculation of traffic congestion feature attributes can borrow the retrieval idea of web search engines. The traditional method has been abandoned. This study takes text classification as an example and mainly uses the vector space model (VSM) to represent text.
First, the text is divided into morphemes (word segmentation), and then the eigenvalues are selected and their weights are calculated. Finally, a set of multidimensional traffic congestion feature attribute vectors is formed.
Second, a table of the attributes of traffic congestion cases is established to integrate the attributes of the various traffic congestion cases, and the attributes are divided into different options. Table 1 is formed by analyzing the traffic congestion text data, which were collected by the research team members from an economic development zone of a city.
All the cases from the database go through data preprocessing, and then the indicators are integrated and decomposed. A table of characteristic statistics of traffic congestion cases has been established. In this table, attributes are presented as multiple, diversified contents (for example, plane intersections show different shapes of intersections) or visibility on hazy days, as shown in Table 5.
There are 70 feature items, which were decomposed from the cases. The computer used in the experiment is configured with a 3.5 GHz Pentium IV CPU, 4 GB of memory, and a 250 GB, 7200 rpm IDE hard disk.
According to the attribute content of traffic congestion, these contents can be divided into seven categories, which can be represented as $F(S_a, S_b, S_c, S_d, S_e, S_f, S_g, S_h)$, where each element represents the following attributes, and the data type of each category is shown in Table 4. The causes and types of traffic congestion have been expounded in detail. In the database of traffic congestion cases, the causes of traffic congestion are taken as the focus of the first clustering, and frequent congestion and occasional congestion are then taken as the focus of the secondary clustering. After the cluster simplification, according to the data types given by the attributes, we combine local similarity calculation with global similarity calculation. In the calculation of local similarity, different methods are adopted for different data types.
In the expression of case knowledge, an eigenvector is established for the eigenvalue attributes of each case, and the angle between two eigenvectors is calculated using the law of cosines. All the weights of the feature values are positive, so the cosine of the angle between two feature vectors lies between 0 and 1. If the cosine is close to 1, the angle between the two vectors is small, and the feature values they represent are close. Conversely, if the cosine is close to 0, the angle is large, and the correlation between the two cases is small. For two feature vectors $x$ and $y$ with components $x_k$ and $y_k$, the formula is as follows:

$$\cos\theta = \frac{\sum_{k} x_k y_k}{\sqrt{\sum_{k} x_k^2}\,\sqrt{\sum_{k} y_k^2}}$$

The experiment compared unclustered system 1 with clustered system 2. All cases were divided into 8 sets, and each set was clustered with $K = 4$. The sets were arranged from low to high according to the number of clustered cases. The retrieval time and the success rate were analyzed separately. For each set, the retrieval time was averaged over 10 runs. The retrieval results are shown in Tables 3 and 2.
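The cosine measure between two case feature vectors described above can be sketched in a few lines. The vectors are assumed to be equal-length lists of non-negative feature weights; returning 0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: an empty vector matches nothing
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # no shared features
```

Because the measure depends only on direction, two case descriptions with the same proportions of feature weights are treated as equally close regardless of text length.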
Comparing the results above, the retrieval test was carried out on the 8 cases to be tested, which were selected only from the test case base. The retrieval time of system 2 (clustered) increases slowly and linearly as the number of retrieved cases increases. In addition, the retrieval time of system 1 (unclustered) is almost the same as that of the clustered system, while the success rate of the clustered system is consistently higher and more stable: the retrieval success rate of system 1 is both lower and less stable than that of system 2.

Conclusion
This paper proposes a traffic congestion case retrieval strategy based on cluster analysis. Through research on the relevant clustering analysis algorithms, the shortcomings of the K-means algorithm in clustering are addressed.
The concept of the Min-cluster is then introduced: marginal cases are regarded as Min-clusters, clustering is performed at the margin, neighboring cases are selected based on the clustering effect, the cases with higher matching similarity are taken as the new center points, data are transferred directly to the new case, and the centroid of the new case is then adjusted. Thus, the number of cases at the case margins inside the grid is greatly reduced, so the chances of success in target case retrieval are greatly improved, and a test shows that the success rate of case library retrieval after the second clustering is also greatly improved. The method improves the success rate of target case retrieval, expands the scope of case solutions in the decision-making system, and enhances the reliability and flexibility of decision-making selection. The next step is to further optimize the case set structure and the relevant parameters and to improve the learning ability of the system.
The authors and their colleagues are trying to apply the optimized algorithm to the daily management of traffic congestion relief. Experiments show that clustering the traffic congestion case set improves retrieval accuracy and retrieval time.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.