Search Space Reduction for Determination of Earthquake Source Parameters Using PCA and k -Means Clustering

,


Introduction
In the past decades, interferometric synthetic aperture radar (InSAR) has been a powerful method to acquire geophysical features such as surface deformation or topography by comparing the phases of at least two complex-valued SAR images obtained from different location or time.Since SAR provides high-resolution images, InSAR can measure the surface deformation with centimetric or even millimetric accuracy.This accurate deformation map enables the observation of ocean and ground surface changes, the measurement of ice drift and glacier elevations, and the analysis of seismic deformation or volcanic activities [1].
Surface deformation acquired from InSAR measurements provides essential information for studying earthquakes and volcanic activities.To derive the characteristics of an earthquake, we can estimate the source geometries of an earthquake by performing parameter inversion that minimizes the L2 norm of residuals between measured and modeled displacement calculated from a dislocation model [2].In many cases, the root-mean-square error (RMSE) values of residuals are used for the evaluation of the inversion result.
A dislocation model provides the mathematically calculated coseismic displacement of the earth's surface.The most popular dislocation model is a rectangular dislocation model (Okada model) [3] that assumes isotropic homogeneous half-space and finite rectangular source.Other dislocation models that provide closed analytical expression include the prolate spheroid model (Yang model) [4] and the spherical source model (Mogi model) [5].
In general, since the dislocation model cannot express the surface displacement linearly, estimating source geometries is a nonlinear inverse problem that can be solved by a nonlinear least square method.The nonlinear least square method iteratively searches local minima based on the partial derivatives of the objective function that indicates the misfit between observed and synthetic data.Hence, the solution may vary according to the starting point of parameters, and thus, we cannot assure that the solution is a global minimum [6].To avoid local minima and describe uncertainties, the Monte-Carlo restarts are often used to solve the problem [7].
In solving a nonlinear least square problem with the Monte-Carlo restarts, we should designate a search space by setting boundary conditions of each parameter to pick random starting points.Since a nonlinear least square method iteratively explores the parameter search space, the search space size significantly affects the accuracy and execution time of the algorithm.Previous studies [8,9] used the initial range of search space relying on the focal mechanism from seismologists.However, the accuracy of seismological studies can influence the result of an inverse problem.Besides, a dislocation model with many source parameters describing fault leads to bad data visualization and thus analyzing the distribution of the results becomes quite tricky.
This paper presents a new search space reduction algorithm based on machine learning techniques to come up with the challenges mentioned above.Our proposed method proceeds as follows: (1) Initially, assume a wide range of search space and perform the Monte-Carlo restarts of five thousand iterations with randomly initialized starting points (2) Then, choose the samples that preserve maximum variance and have a low RMSE and project them to the two-dimensional (2D) space using PCA (3) Decide hyperparameter k based on the distribution of data on the 2D space and cluster samples using k -means clustering (4) Choose the cluster with the lowest mean RMSE and reduce the search space using the cluster (5) Perform additional Monte-Carlo restarts of five thousand iterations in the reduced search space and determine the final source parameters Our search space reduction algorithm was compared with the conventional approach that uses geological studies to define the search space in the same study area, the 2017 Pohang earthquake.The evaluation was performed in terms of the size of search space, RMSE calculated from determined source parameters, a 95% confidence interval of each source parameter, and the average iteration number of a nonlinear least square.
This paper is organized as follows.Section 2 introduces the related work and background of our approach.In Section 3, our machine learning-based search space reduction algorithm is described in detail.Section 4 presents the results and discussion of our approach in the study area, Pohang, Korea.Finally, Section 5 concludes the paper.

Related Work
2.1.Interferometric SAR and Earthquake Source Parameter Determination.InSAR is the method that extracts geophysical features such as surface topography and deformation by comparing the phase offsets of at least two complex-valued SAR images [1].Comparing SAR images measured at different time instants, InSAR can retrieve the surface displacement at specific time instants, e.g., the surface deformation induced from earthquakes or volcanic activities, landslides [10], thaw-derived slope failure [11], and glacier elevations and changes [12].However, conventional InSAR has a limitation that it can measure only the line-of-sight (LOS) component of displacement [13].To overcome the limitation and measure 3D surface displacement accurately, the offset tracking method [14] and stacking InSAR and multiple aperture interferometry (MAI) [15] were proposed.
The source geometries of an earthquake can also be estimated by performing parameter inversion that finds the best approximate source parameters having the minimum misfit between the measured displacement and the synthetic 3D surface displacement of a theoretical dislocation model.Typical dislocation models are as follows.Mogi's model [5] creates a 2D displacement map in radial and tangential direction with four input parameters assuming ideal semiinfinite elastic crust and spherical source.The Yang model [4] generates a 3D displacement map in east, north, and vertical directions with eight source parameters assuming finite prolate spheroid source in an elastic half-space.Okada's model [3] builds a 3D displacement map similar to the Yang model, but it assumes a finite rectangular source and uses ten source parameters.All these dislocation models are nonlinear because the relation between the source parameters and surface displacement cannot be expressed as linear equations.
To estimate the source parameters using a nonlinear dislocation model, we can use the nonlinear least square method that minimizes the Euclidean norm of residuals between the measured and the modeled displacement.Let us denote the dislocation model as G, source parameters as m, and observed displacement as d, respectively.Then, the objective function of the nonlinear least square method is formulated as Equation (1) [6].
Source parameters m can have boundary constraints.As for the parameter indicating an angle, the range of parameters is physically limited.For example, the parameter Strike that means the horizontal angle of the fault should be on  3 Journal of Sensors the range ½0 °, +360 °.Also, seismological studies such as the focal mechanism can be used to establish additional boundary conditions.Equation (2) shows the objective function of a nonlinear least square with lower bound l and upper bound u.
The nonlinear least square method iteratively explores the search space from a specific starting point to local minima, so the method cannot derive global minima.To avoid local minima and describe uncertainties of parameters, we can use the Monte-Carlo restarts that generate many initial starting points and then solve the nonlinear least square problem for each starting point.Obtaining results from multiple starting points helps to find the global minimum and show the probabilistic descriptions such as standard deviation and histogram.Many studies employed the Monte-Carlo restarts to obtain the best-fit source parameters of earthquakes or volcanic activities [8,9,[16][17][18].Besides, Bayesian estimation of source parameters can be used to determine the source parameters [19,20].
Our approach used stacking InSAR and multiaperture InSAR (MAI) [15] and assumed a wide initial search space.Then, we performed Monte-Carlo restarts for randomly initialized starting points to get data that includes the evidence of the initial search space.

Machine Learning Applications for Remote Sensing.
Machine learning has been used widely in remote sensing [21].It can be classified into supervised learning and unsupervised learning by whether a given set of training data includes the desired solution called labels, or not.In supervised learning, the label y corresponding to the data x exists, and the problem that the data x is provided without the label is called unsupervised learning.Typical supervised learning algorithms include the support vector machine (SVM), random forest (RF), artificial neural network (ANN), and k -nearest neighbor (KNN).In remote sensing, these algorithms have been used for estimating PM2.5 concentrations [22], modeling aboveground biomass of maize [23], and land cover classification [24].
Unsupervised learning techniques have also been used broadly in remote sensing.Principal component analysis (PCA), one of the representatives among them, reduces the dimension of the given dataset.PCA finds a lower dimensional hyperplane that preserves the maximum variance to maintain the maximum amount of original information [25].In remote sensing, PCA has been used to process hyperspectral images acquired from airborne or spaceborne   k-means clustering is another popular unsupervised learning algorithm that groups data into the k clusters.The parameter k is a hyperparameter that should be defined manually, and data are grouped by minimizing the sum of the squared error of distances between the centroid of the cluster and the data points in the cluster [31].In this paper, we used PCA to project the data in the initial search space to 2D latent space and performed k-means clustering to cluster the data.

Proposed Approach
This paper is aimed at reducing the parameter search space for the determination of earthquake source parameters with    5 Journal of Sensors better computation time efficiency and data visualization.Figure 1 outlines our proposed search space reduction algorithm.Figure 1(a) illustrates the workflow that determines the source parameters using the Monte-Carlo restarts, and Figure 1(b) presents the proposed method that reduces the parameter search space included in Figure 1(a).
As the first step of our approach, we set a wide initial search space.Second, we perform the Monte-Carlo restarts of five thousand iterations to associate the RMSE and source parameters in the initial search space.In performing Monte-Carlo restarts, we randomly subsample measured deformation to reduce computation time.Third, we reduce the parameter search space using the PCA and the k-means clustering.In doing so, we first apply PCA to choose the samples that preserve the maximum variance of the original information with a low RMSE and perform the k-means clustering with varying k ranging from 2 to 5. Finally, we reduce the search space based on the data in the cluster having the lowest mean RMSE.
3.1.Datasets and Dislocation Model.Our approach is applied to the 2017 Pohang earthquake [32].To acquire an accurate 3D surface deformation, we applied the stacking InSAR and the MAI methods [15] to four SAR images obtained from the CSK and ALOS2.The size of the measured displacement is 18:84 × 15:54 km, and the pixel size is 30 m 2 .
Our study used the Okada model [3] as the dislocation model G.The Okada model assumes a finite rectangular source and isotropic homogeneous half-space.Table 1 describes the physical source parameters defining a rectangular source, and Figure 2 shows the fault geometry of the Okada model, respectively.The parameter E indicates the horizontal, i.e., E-W direction, distance of the upper-left point of the satellite image and fault, and N indicates the vertical one.The upper-left point of the measured deformation map is located at 36 °10 ′ 57 ″ N, 129 °17 ′ 03 ″ E. We fix the value of parameter Open, i.e., tensile component, to 0 and optimize the other nine parameters to determine the source parameter values.

Monte-Carlo
Restarts in the Initial Search Space.As the first step, our algorithm conducts the Monte-Carlo restarts for a wide initial search space described in Table 2. Since the parameters Strike, Dip, and Rake represent angles, the lower and upper bounds for them are defined as the whole range physically available.Other parameters illustrating the length unit of the fault geometry are defined as large enough to cover the maximum movement of the fault possible in the study area.
Then, the Monte-Carlo restarts of five thousand iterations are performed to extract the ½data point, RMSE pairs that will be used in the next machine learning step.Figure 3 shows the procedure of the Monte-Carlo step.We randomly subsample 20% of measured deformation data d samp and define loose termination tolerance of the nonlinear least method to save the computational burden.In solving the nonlinear least square problem, we aim to minimize the objective function kGðmÞ samp − d samp k 2 to retrieve the best-  Journal of Sensors fit parameters with minimum residuals.As a result, optimization result m * i and corresponding RMSE are calculated in each iteration.The nonlinear least square problem is solved by using the lsqnonlin() function of the MATLAB R2019a with the trust-region-reflective algorithm [33].Table 3 specifies the option parameters used in the function.Option parameters not specified are set to the default values.

Search Space Reduction Using Machine Learning
Approach.This study uses the PCA and the k-means clustering to reduce the search space.At first, we perform the PCA to reduce the dimensionality of the Monte-Carlo results, as described in Figure 4. Before the PCA fitting, the results obtained from the Monte-Carlo step are sorted in the ascending order of the RMSE value to select well-optimized   7 Journal of Sensors samples.Then, we perform the PCA fitting iteration for the sorted data in the index range of ½1, i where 1 ≤ i ≤ 5000 and calculate the explained variance ratio for the PCA results.The explained variance ratio is the ratio of the variance of projected data on the reduced dimensions to the total variance of the original information.Also, the explained variance ratio is regarded as a metric to evaluate the usefulness of the principal components extracted by the PCA, and thus a higher explained variance ratio means the information loss is kept smaller.For this reason, our algorithm chooses the samples showing the highest explained variance ratio for further processing.
PCA enables visualization of high-dimensional data in the low-dimensional space while preserving information of the original data as much as possible.In general, a 2D scatter matrix is used to visualize high-dimensional data, but each element of the scatter matrix represents only partial information.Furthermore, since the number of two-dimensional scatter plots of p-dimensional parameters is p , visualizing high-dimensional data using a scatter matrix is cumbersome [34].For example, visualizing the results of Monte-Carlo restarts by a scatter matrix requires 36 plots for the Okada model with nine parameters.To overcome this problem, this paper uses the PCA visualizing the ninedimensional data on the 2D hyperplane preserving the original information so that we can analyze the ½data, RMSE pairs with just one 2D scatter plot.
For the data projected to the 2D space, we cluster the data using the k-means clustering.k-means clustering identifies k number of centroids and allocates data points to the nearest cluster, where k is a hyperparameter that a user should define manually.Our algorithm applies k-means clustering to the data on the 2D plane for varying k to find the most appropriate k that well divides projected samples with similar RMSE and position to each other.Once deciding the parameter k, we reduce the search space for each source parameter using the data cluster showing the lowest mean RMSE for the given k.
If the PCA preserves most of the variance of the Monte-Carlo results and all source parameters are on the same scale, the distances between original data points in the latent 2D space can be directly assessed [35].It means that the k -means clustering using the projected data can group the data in the adjacent location of the original nine-dimensional space because this algorithm minimizes the squared error between the centroid of the cluster and data points in the cluster.In summary, by using the PCA and the k-means clustering, we can extract the similarity of the nine-dimensional data on the 2D space with better data visualization.
Before the PCA fitting, each parameter is scaled to have a standard normal distribution Nð0, 1Þ.PCA finds the principal components maximizing the variance.If parameters are unscaled, the principal component loading vector will be biased to a parameter showing a large variance.For example, the PCA fitting for the unscaled Monte-Carlo results may lead the principal component to have a significant loading for the Strike parameter that has the largest parameter search space among the source parameters of the Okada model.Furthermore, the distances between data points in the 2D space cannot be assessed.In this paper, we implemented our machine learning-based algorithm using the Scikit-learn package [36] in Python.

Experimental Results and Discussion
To evaluate the effectiveness of our machine learning-based search space reduction algorithm, this paper applied the proposed machine learning-based algorithm to the 2017 Pohang earthquake case [32].The evaluation was performed in terms of the size of search space, RMSE values derived from determined source parameters, a 95% confidence interval of each source parameter, and the number of iterations for each nonlinear least square of the Monte-Carlo restarts.The results were compared with those of the conventional approach, which refers to the seismological studies.

Principal Component Analysis and k-Means Clustering.
As mentioned in Section 3, we first conducted dimensionality reduction using PCA from nine-dimensional source parameters of the Okada model to the 2D space.Figure 5 illustrates the explained variance ratio for the number of samples used for PCA fitting.As described in Section 3, the data samples were presorted in the ascending order of the RMSE values.The result presents that the samples with high RMSE values interfere with the dimensionality reduction.The result also shows that the number of samples equals 193 produces the highest explained variance ratio, i.e., 96.3%.It means that the projection results with only 193 samples maintain most of the variance of the original data while keeping a low RMSE value, so the relation between those samples in the 2D space can be directly evaluated.
Table 4 shows the principal components (PCs) of each parameter, and Figure 6 presents the data projected into 2D space, respectively.With PCA, we can easily visualize the distribution of samples with only one 2D scatter plot.As shown in Figure 6, the samples with low RMSE values are clustered to the left side, while samples with high RMSE values are distributed from the center of the plot to the right side.   5 also shows the mean RMSE and standard deviation of each cluster obtained for k = 3.

Search Space Reduction.
We then reduced the search space using the minimum and maximum values of data in each cluster.Tables 6 and 7 show the search space derived from the seismological studies [37] and three reduced search spaces determined from our proposed algorithm for each source parameter.Figure 8 illustrates the results that compare the reduced search spaces derived from our approach with the empirically defined search space referring to seismological approaches.In Figure 8, the red bar shows empirically defined search space, and the others illustrate the search spaces determined by the three clusters.The blue horizontal line denotes the determined best-fit source parameters acquired by Monte-Carlo restarts of ten thousand iterations in the empirically defined search space.Figure 8 shows that, for all the parameters, the reduced search spaces for Cluster 1 include the source parameters determined from the Monte-Carlo restarts on the empirical parameter search.In other words, the reduced search space having the lowest mean RMSE is in good agreement with the conventional research results based on seismological studies while effectively reducing the search space.For this reason, the rest of the experiments were conducted using the search space from Cluster 1.
Table 8 summarizes the reduction of search space for each source parameter after applying the proposed approach.From the results, the search space size for each parameter decreased by about 55.1~98.1%,compared with the search space from the conventional approach.
4.3.Determination of Source Parameters.As the final step, we conducted additional five thousand iterations of Monte-Carlo restarts in the reduced search space based on Cluster 1 and determined the source parameters as the best-fit parameter of the results of the Monte-Carlo restarts with a 95% confidence interval.Table 9 summarizes the source parameters determined and RMSE of our approach and the conventional approach.The results from two approaches seem similar to each other.However, it can be remarked that our machine learning approach produces more reliable results than the conventional one because a 95% confidence interval of our approach is much smaller than the conventional approach.The interval size reduction was about 60~80.5% for each parameter.
Figure 9 illustrates the distribution of the Monte-Carlo results derived from reduced search space using our machine learning-based approach.Two blue vertical lines of each histogram represent the lower and upper bounds of each parameter, and the red vertical line indicates the best-fit source parameter.In the reduced search space, all parameters Figure 10 compares the synthetic model's deformation maps using the parameters acquired from our approach with those obtained from InSAR measurements.Each row of Figure 10 describes the deformation of E-W, N-S, and depth direction component.The first column and the second column in Figure 10 show the measured surface deformation map for each direction component using InSAR and modeled deformation map from the Okada model.The third column shows a residual between the measured displacements and the modeled displacements.It can be concluded that the determination of source parameters using the Okada model and our reduced search space fit the measured displacement correctly because RMSE values for E-W, N-S, and depth components were 0.1863, 0.1544, and 0.0829 cm.4.4.Computational Efficiency.For the computational efficiency validation, the average iterations of a nonlinear least  As shown in Table 10, the average iterations of a nonlinear square for the reduced search space produced by our approach are much smaller than that of the conventional approach.

Conclusions and Future Works
This paper presented a new search space reduction algorithm using machine learning techniques for the earthquake source parameter determination.Our algorithm proceeded in the order of the Monte-Carlo restarts, PCA, k-means clustering, and search space reduction.PCA leads to better visualization of the Monte-Carlo results using only one 2D scatter plot, so we can find the hyperparameter k of the k-means clustering that groups data for them to have similar RMSE and close location to each other.We compared the proposed approach with the conventional approach that defines the parameter search space by referring to seismological studies in terms of size of the search space, RMSE, a 95% confidence interval of Monte-Carlo results, and the average iteration number of nonlinear least square from the Monte-Carlo restarts.The experimental results for the 2017 Pohang earthquake test case showed that our approach achieves significant reductions in search space size and a 95% confidence interval size for all source parameters, i.e., about 55.1~98.1% and 60~80.5%,respectively.Also, the average iteration number of a nonlinear least square from our reduced search space was much smaller than that of a conventionally defined search space.Consequently, the results demonstrated that the proposed approach effectively reduces the parameter search space size and the computational burden.
Our future challenge is to develop the automatic search space reduction algorithm that uses the hierarchical clustering algorithm for determining the number of clusters.

Figure 2 :
Figure 2: Fault geometry of the Okada model.

Figure 3 :
Figure 3: Data acquisition in the initial search space using the Monte-Carlo restarts.
results, i = 1, n = 5000 i == n?Sort data in the ascending order of RMSE Fit the PCA using data of the 1~ i th row Calculate the explained variance ratio of the PCA

Figure 7 :
Figure 7: Results of k-means clustering for varying k.

Figure 8 :
Figure 8: Comparison of search spaces between conventional approach and our approach.

8
Journal of Sensors Then, we performed k-means clustering for k = 2 ~5 on the 2D hyperplane.Figures 7(a)-7(d) illustrate the k-means clustering results for each k, respectively.It was observed that the result for k = 2 could not distinguish the samples in the upper center with medium RMSE from those in the bottom right side with high RMSE.The results for k = 4, 5 also divide low-RMSE samples in the left corner.As a result, in our test case, data samples are best partitioned when k = 3.Table

Figure 10 :
Figure 10: Final results of source modeling.

Table 1 :
Source parameters of the Okada model.

Table 2 :
Initial search space of source parameters.

Table 4 :
Principal components of each parameter.

Table 5 :
Statistics of each cluster.

Table 7 :
Reduced search space of three clusters.

Table 8 :
Search space size derived from conventional seismological studies and our proposed algorithm.

Table 6 :
Search space determined from seismological approaches.

Table 9 :
Determined parameters and RMSE of conventional approach and our approach with a 95% confidence interval.

Table 10 :
Average number of iterations for nonlinear least square from the Monte-Carlo restarts.