K-Means Cluster for Seismicity Partitioning and Geological Structure Interpretation , with Application to the Yongshaba Mine ( China )

1School of Resources and Safety Engineering, Central South University, Changsha, China 2Department of Building Structures and Geotechnical Engineering, University of Seville, Seville, Spain 3State Key Laboratory of Coal Mine Disaster Dynamics and Control, Chongqing University, Chongqing, China 4State Key Laboratory of Resources and Safe Mining, CUMT, Xuzhou, China 5College of Resources and Environmental Science, Chongqing University, Chongqing, China


Introduction
Microseismic monitoring has been widely used in mining [1][2][3], where seismicity partitioning is an important step in geological structure interpretation and seismic hazard assessment [4].Although this task can be implemented by an expert visually, the expert knowledge-based cluster is nonquantitative, subjective, and hard to interpret for a large amount of data [5].Intelligent cluster analysis, taking advantages of spatial, temporal, and other features of seismic events, is more highlighted due to its advances in extracting useful knowledge and finding hidden information from numerous seismic data [4].Motivated by these facts, various cluster methods have been introduced/proposed for seismicity partitioning, among which -Means cluster is the most popular one.However, it is still hard to determine the -Means cluster number, and few works have been done to compare it with other cluster methods.
The general outline of this paper is shown in Figure 1.Firstly, the -Means cluster, taking advantages of seismic event location features (, , ), was applied to cluster 1516 seismic events ( > −1.5) obtained from the Yongshaba mine (China).Then, to test the effectiveness of the -Means cluster, the Gaussian mixture model (GMM) and the self-organizing maps (SOM) were selected as comparisons.In addition, the Silhouette and Krzanowski-Lai-(KL-) combined S-KL index was proposed to obtain possible optimum cluster numbers and to compare the cluster methods.
Results show that the -Means cluster obtains a best cluster "quality" with higher S-KL indexes on the whole and continuous seismic clustering areas.The optimal number of clusters for detailed geological structure interpretation is confirmed as eleven clusters, and seismic hazard assessment shows that C5 and C7 ( = 11) have a high mean moment magnitude (  ), and C1, C2, C3, and C4 ( = 11) have a relatively high   , where special attention is needed when mining.In addition, C7 ( = 11) is the most shear-related area with a mean S-wave to P-wave energy ratio ( Es/Ep ) of 41.21, and we found that two areas probably have faults or caves, and two faults may be falsely inferred by mine geologists.

State of the Art
The commonly used seismicity partitioning methods include the -Means cluster, the hierarchical cluster, the self-organizing maps (SOM), the fuzzy cluster, the Gaussian mixture model (GMM), the density-based clustering algorithm (DBSCAN), and some other cluster means, which have been listed in Table 1.Hierarchical cluster has been applied for seismicity partitioning since the 1990s, and Davis and Frohlich [11], Frohlich and Davis [12], Wardlaw et al. [13], Hudyma [14], Hudyma and Potvin [15], Hashemi and Mehdizadeh [16], and Hashemi and Karimi [17] selected the hierarchical clustering technique using single-link analysis/Ward's method to evaluate spatial and temporal properties of earthquake catalogues.However, the hierarchical cluster has a tendency to form seismic events into linear groups [12].SOM is another widely used seismic cluster method, and Zamani and Hashemi [5] introduced a self-organized tectonic zoning for Iran, and then Zamani et al. [18] chose Wilk's Lambda criterion and a relative discrepancy of Wilk's Lambda to determine the optimum tectonic zoning number; and Mojarab et al. [19] discussed the effect of SOM input parameters.Fuzzy clusters have also been used in seismic clusters, Ansari et al. [4] and Benitez et al. [21] used the Gath and Giva (GG) fuzzy cluster for Iran and South West Colombia, respectively; while Monem and Hashemy [20] applied the fuzzy -means (FCM) and the Gustafson-Kessel (GK) clusters to the Ghazvin canal irrigation network.In addition, the density-based algorithms [22][23][24] and the TriGen-based method [25] have also been applied for the seismic cluster.A comparison of using different seismic clusters can be found in Leśniak and Isakow [26] and Konstantaras et al. [27].
-Means cluster analysis can be applied to the hypocentral distribution of observed earthquakes in any region [6], which has been widely used in seismic cluster.Weatherill and Burton [6] implemented the -Means cluster to provide a catalogue of earthquakes for Aegea, and results demonstrate that from 20 to 30 clusters are the most appropriate number in capturing the hypocentral distribution and fault type.Rehman et al. [7] applied the -Means cluster to estimate seismic hazard and risk in Pakistan and found that 19 clusters were the best to include the sufficient knowledge of subsurface geological structure and geophysical information.Morales-Esteban et al. [8] proposed an adaptive Mahalanobis -Means algorithm to study the seismic catalogues of the Croatia and Iberian Peninsula, and the result showed the ability of this method to discover not only circular but also elliptical shapes.Ramdani et al. [9] took advantages of earthquake centroids of the -Means cluster to find subduction evidence beneath the Gibraltar Arc and Andean regions.Besheli et al. [10] used the -Means cluster to partition Iran into six main clusters and to study the importance of foreshock precursors for different zones.
It can be seen that most seismic clusters are focused on earthquake data, and few works have been done on mine seismic clusters.Therefore, the cluster adaptively to mine seismic events should be discussed.Though the -Means cluster has been widely used now, it is still hard to determine the -Means cluster number and few works have been done to compare it with other cluster methods.In addition, Table 1 shows that most optimum cluster numbers are below 20, and the Silhouette index and the Krzanowski-Lai (KL) index are the most frequently used indexes to evaluate cluster "quality."In this paper, the -Means cluster was applied to partition 1516 seismic events ( > −1.5) obtained from the Yongshaba mine (China) and the GMM and SOM clusters were selected as comparisons, and the cluster results with cluster numbers from 2 to 20 have been evaluated by the Silhouette and Krzanowski-Lai-(KL-) combined S-KL index.

𝐾-Means
Cluster.The -Means cluster is a hard partitioning algorithm proposed by Hartigan and Wong [28].A dataset of  seismic events with  dimensions consists of an  ×  matrix (1), and the -Means partitions the dataset into  clusters with each element allocated to a particular cluster.
where each row represents one seismic event and each column represents one variable.

Shock and Vibration
The -Means cluster is an iterative process with  initial cluster centroids, and each datum is allocated to its nearest cluster centroid.Then, the mean of each group is calculated to obtain a new cluster centroid.The procedure terminates when no datum changes clusters or the iteration number reaches a preset maximum.The details of the -Means cluster are as follows.
Step 2. Calculate the Euclidean distance between each datum and each cluster centroid, and the data is allocated to its nearest cluster centroid.
Step 4. Has any datum changed its cluster?
(2) No: output cluster each datum allocates to and list the cluster centroids.
Step 5. Has it reached the maximum iteration number?
(2) Yes: output cluster each datum allocates to and list the cluster centroids.

Cluster
Validity Indexes.Assessment of the "quality" of a partition is an important consideration in cluster analysis.
From Section 2, it is known that the Silhouette and the KL indexes are the most widely used, and their brief introductions are given as follows.

Silhouette Index.
The Silhouette provides a measure of separation quality between clusters, and the higher the Silhouette index, the better the cluster results [29].For an event  belonging to cluster   , the Silhouette width is defined as where   is the average distance between event  and other events in cluster   and   is the minimum distance between event  and events in cluster X −   .The average Silhouette width   and the Silhouette index are defined by ( 3) and (4), respectively,

Krzanowski-Lai Index.
The KL index produces the traces of the within-clusters matrices for two consecutive clusters, and the higher the KL index, the better the cluster results [30].The KL index is defined as where DIFF() = ( − 1) 2/  −1 −  2/   and  is the number of features in the dataset. is the sum of all point-to-point distances of observations within each cluster, summed across all clusters.

Application of 𝐾-Means Cluster
4.1.Data Set.We used 1516 seismic events as the cluster dataset, which were obtained from the Yongshaba mine (26 ∘ 38  N, 106 ∘ 37  E) from 01/01/2014 to 04/30/2014, with moment magnitude larger than −1.5 (Note: the Institute of Mine Seismology (IMS) system of the Yongshaba mine can monitor a seismic event whose moment magnitude is as small as −3; however, events too small usually have a large location error due to unclear P arrivals and less triggered sensors.In this paper, we chose a moment magnitude larger than −1.5 for seismic cluster).The distribution of geological structures and sensors of Yongshaba mine is shown in Figure 2(a).It shows that the IMS system contains 28 sensors, including two triaxial sensors (T1 and T2) and twenty-six uniaxial sensors (sensors from 1 to 26), and the sensors have a sampling frequency of 6000 Hz.There are seven main faults, namely, F331, F310, F316, F313, F350, F302, and F309, in which the F350 is inferred by mine geologists.In addition, there are 18 small faults, which are marked from I to XVIII, and the faults V, VI, X, and XI are inferred by mine geologists.The symbols A and B represent two caves.
Figure 2(b) shows the two view distributions ( and  views) of the seismic dataset.The color and size of the circle represent seismic moment magnitude.It can be seen that the main seismic moment magnitude is 0.0∼0.5, then 0.5∼1.0, and −0.5∼0.0.There are some seismic moment magnitudes smaller than −0.5 (It can be interpreted as that there are many small moment magnitude seismic events not recorded by the system, though the system can monitor a seismic moment magnitude as small as −3) and few seismic moment magnitudes are larger than 1.0.
It is easy to see that the main faults are distributed in the north, then in the south, and fewer faults in the middle, which are in a good agreement with the distribution of seismic events.Therefore, we believe that there should be a good relationship between seismic events and geological structures which needs further study.

Possible Cluster Number Selection.
To compare the cluster "quality" and select possible cluster numbers, the most commonly used indexes Silhouette and KL are selected, and their values of -Means, GMM, and SOM clusters are shown in Figure 3, in which the seismic event location (, , ) is selected as cluster input variables.It is easy to note that the Silhouette values within 4 clusters are higher than those of cluster numbers larger than 4 and that the -Means cluster has a higher Silhouette index than that of the GMM and the SOM clusters on the whole.The three clusters all have a high KL value when  = 2, and there are some local maximum KL values.Therefore, it is hard to determine possible cluster numbers by considering both Silhouette and KL indexes.To solve this problem, a Silhouette and KL combined S-KL index (6) is proposed, which takes advantage of the product of the normalized Silhouette index and KL index, and the normalization is calculated by (7).The relatively large S-KL values correspond to possible optimum cluster numbers.Generally, the -Means cluster has higher S-KL indexes than those of the GMM and the SOM clusters, which means that the -Means obtains a best cluster "quality" among the three clusters, and the cluster numbers 2, 3, 6, 9, 11 and 15 are selected as possible cluster numbers for further study.

S-KL
4.3.-Means Cluster Results.Seismicity partitioning with few clusters produces a simple and general classification defining the most basic structure of the area under investigation; however, it is hard to interpret a detailed geological structure, while for a too large cluster number, there may be some clusters that are hard to interpret [7].Therefore, a balanced cluster number should be selected as the optimum cluster number to interpret the detailed geological structure.Five typical variables, number of seismic events (), maximum moment magnitude ( max ), mean moment magnitude (  ), mean S-wave to P-wave energy ratio ( Es/Ep ), and number of events with moment magnitude >0.5 ( 0.5 ), are selected for seismic hazard assessment and seismic source mechanism analysis, where  Es/Ep is a measure to evaluate the seismic source mechanism, and the higher the  Es/Ep is, the more the geological structure is shear-related [14].
The cluster results and typical variables of 2, 3, 6, 9, 11, and 15 clusters are shown in Figure 4, and the clustering process diagram and geological structure interpretation of 11 clusters are shown in Figure 5 for better understanding.Figures 4  and 5 show that the -Means has a stable cluster result (there are some similar zones for different cluster numbers), and the number of events in each cluster has no significant difference, which indicates that the -Means has a good cluster performance.
When the cluster number is less than 6, the cluster results are in a good agreement with regional faults (the cluster results match with known geological structures).In the twoclass division (Figure 4(a)), the cluster divides the dataset into north zone (C1, concentrated faults) and south zone (C2, less faults), and C1 has many more  0.5 events and a higher   than that of the C2; that is to say that the north zone is more active and dangerous than the south zone.Then the dataset was divided into north zone (C1), middle zone (C2), and south zone (C3), looking for a relation between the clustering and the existing geology.C2 has a nearly equal  0.5 to C3, though there are much more seismic events in C3, so C2 should also be paid attention when mining.In addition, C2 is more shear-related than C3.For the six-partition division (Figure 4(c)), the clusters C1, C2, and C3 with concentrated faults have a higher   and more  0.5 than C4, C5, and C6, and C5 and C6 are less shear-related.
When the cluster number is larger than 9, the cluster results mainly correspond to the detailed geological structures (most clusters for  = 9 just have few faults except the cluster C4).For the nine-partition division (Figure 4(d)), C2, C3, and C4 have relatively large  Es/Ep and   , while C1 and C5 have a relatively large  Es/Ep or   .C6 has a relatively large  Es/Ep , while C7 and C9 have both low  Es/Ep and   .However, there still are many faults in C4, and a more detailed cluster should be studied.The 11-partition division (Figure 4(d)) separates C4 ( = 9) into clusters C5, C6, and C7; a detailed geological structure interpretation is shown in Figure 5.It can be seen that C7 has the highest  Es/Ep and   , with values of 0.33 and 41.21, respectively; thus special attention should be paid to the C7 area.Further study shows that there should be faults or caves in areas a and b, and the cluster analysis also confirms that there should be faults X∼XII, F350a, and F350b.However, there are probably no faults for VI and the bottom F309, which needs further study, while for  = 15, the clusters C3, C4, C5, and C6 are so connected, which makes it hard for geological structure interpretation, and the clusters C14 and C15 contain areas too large to find detailed geological structure information.In conclusion, the authors believe that the 11 clusters are the best for a detailed geological structure interpretation.

Cluster Comparisons
For a better comparison of the -Means, GMM, and SOM clusters, the same cluster number of  = 11 is selected and a possible optimum cluster number near 11 is also discussed in this analysis ( = 13 for the GMM cluster and  = 12 for the SOM cluster), and the cluster results are shown in Figure 6.
It can be observed that there are some discontinuous zones for both GMM and SOM clusters.For the GMM cluster: (1)  = 11, C1 surrounds C2 and C3, and C6 is isolated, while C8 and C9 are mixed together; (2)  = 13, C1 also surrounds C2 and C3, and C8 is isolated.For  = 11 and  = 12 of the SOM cluster, the clusters C1, C2, C3, and C4 are mixed together, while C7 is isolated.In addition, there are small seismic event number clusters for both GMM and SOM clusters.For example, C6 and C11 ( = 11) for the GMM cluster have small seismic event numbers.In conclusion, the results of the GMM and SOM clusters cannot be considered satisfactory, and they partition some areas with no meaningful information.

Discussions
The above analysis has shown that the -Means cluster is effective in seismicity partitioning for geological structure interpretation.The GMM cluster is on the basis that the  clusters follow the Gauss distributions, and its aim is to make the probability function maximum.The GMM cluster can be affected by the seed number, and the GMM cluster results ( = 11) of seed = 10, 25, 50, and 100 are shown in Figure 7(b).It is easy to observe that all the four GMM cluster results have discontinuous zones or cross-section, such as the C1, C2, C9, and C10 when seed number is equal to 10.This can be interpreted as that the neighboring seismic event locations sometimes are not in the Gauss distribution and     where  is the number of seismic events,  max is the maximum moment magnitude,   is the mean moment magnitude,  Es/Ep is the mean S-wave to P-wave energy ratio, and  0.5 is the number of events with moment magnitude >0.5.
the close seismic events may not be subjected to the same Gauss distribution.The SOM converts high dimension input space into a low dimension representation (typically two dimension), which has a very good image of visualizing.The SOM cluster can be affected by the learning rate, ordering Epochs, and lattice width and height.The lattice width × height is set to 1 × 11; then we randomly selected ten combinations of learning rate (0.1, 0.3, 0.5, 0.7, and 0.9) and ordering Epochs (2000, 3000, 5000, and 10000) to test the SOM cluster ( = 11).However, we obtain the same result with Figure 6(c); this may be due to the fact that the input variable just has a few dimensions (three dimensions), which makes the SOM cluster find an optimum result with different parameters.However, research has shown that the SOM may  only be valid for zoning high seismic activity areas [25], and, for low seismic activity zones, it may have bad cluster results.The -Means cluster can be affected by the initial clustering centers, and the typical -Means cluster results ( = 11) are shown in Figure 7(a).It is easy to see that most -Means cluster zones are the same, though there may be some difference using different initial clustering centers: there may be some seismic events that are not in the same cluster, for example, C7 and C8 between Figures 7(a1) and 7(a2); the clusters may be divided into different clusters or combined into one cluster, for example, C1 and C2 in Figure 7(a1) are equal to C1 in Figure 7(a3), and C10 in Figure 7(a1) is equal to C6 and C10 in Figure 7(a3).However, there is no assumption of Gauss distribution for the -Means cluster, and it partitions closer events to one cluster.This makes the -Means cluster obtain continuous cluster zones and help us Shock and Vibration be eleven clusters, and we found that two areas probably have faults or caves, and two faults may be falsely inferred by mine geologists.Furthermore, seismic hazard assessment shows that C5 and C7 ( = 11) have a high mean moment magnitude (  ), and C1, C2, C3, and C4 ( = 11) have a relatively high   , where special attention is needed when mining.In addition, C7 ( = 11) is the most shear-related area, with a mean S-wave to P-wave energy ratio ( Es/Ep ) of 41.21.In conclusion, the -Means cluster provides an effective way for mine seismicity partitioning, geological structure interpretation, and seismic hazard assessment.

Additional Points
Highlights. (i) The -Means cluster has been applied for seismicity partitioning, geological structure interpretation, and seismic hazard assessment.(ii) The possible optimum cluster number is determined by the proposed Silhouette and Krzanowski-Lai-(KL-) combined S-KL index.(iii) The -Means cluster has a better seismicity partitioning "quality" for geological structure interpretation than that of the Gaussian mixture model (GMM) and the self-organizing maps (SOM).

Figure 1 :
Figure 1: General outline of this paper.

Figure 2 :
Figure 2: Distributions of geological structures, sensors, and seismic events of the Yongshaba mine.(a) The distributions of geological structures and sensors.The solid lines represent determined faults, while the dashed lines represent inferred faults by mine geologists.The symbols A and B represent two caves, and the triangles represent the location of sensors.The numbers near the sensors are the sensor numbers, T1 and T2 are triaxial sensors, and the sensors from 1 to 26 are uniaxial ones.(b) The distributions of seismic events.The color and the size of the circle represent the seismic moment magnitude.

Figure 3 :
Figure 3: Cluster indexes for cluster "quality" comparison and possible cluster number selection of the -Means cluster, the GMM cluster, and the SOM cluster.(a) Silhouette index; (b) KL index; (c) S-KL index.

Figure 4 :
Figure 4: Cluster results and typical variables of -Means with different cluster numbers.(a)  = 2; (b)  = 3; (c)  = 6; (d)  = 9; (e)  = 11; (f)  = 15.The solid lines represent determined faults, while the dashed lines represent inferred faults by mine geologists.The symbols A and B represent two caves, and the triangles represent the location of the sensors.The statistics under (a)-(f) are the typical variables for each cluster, where  is the number of seismic events,  max is the maximum moment magnitude,   is the mean moment magnitude,  Es/Ep is the mean S-wave to P-wave energy ratio, and  0.5 is the number of events with moment magnitude >0.5.

Figure 5 :Figure 6 :
Figure 5: -Means clustering process diagram and geological structure interpretation of 11 clusters for Figure 4.

Table 1 :
Summary of typical seismicity partitioning methods.