Self-Tuning Inference Model for Settlement in Shield Tunneling: A Case Study of the Taipei Mass Rapid Transit System’s Songshan Line

Constructing tunnels in urban spaces usually relies on shield tunneling. Because of the numerous uncertainties involved in underground construction, appropriate monitoring systems are required to prevent disasters. This study collected the settlement monitoring data for Tender CG291 of the Songshan Line of the Taipei Mass Rapid Transit (MRT) system and examined influential factors to identify the correlations between predictor variables and settlement outcomes. An inference model based on the symbiotic organisms search-least squares support vector machine (SOS-LSSVM) was proposed and trained on the collected data. Moreover, because the dataset used for this study contained far fewer entries at the alert level than at the safe level, its classes were imbalanced, which could compromise classification accuracy. This study therefore also employed probability distribution data balance sampling methods to enhance forecast accuracy. The results showed that SOS-LSSVM exhibited the most favorable accuracy compared with four other artificial intelligence-based inference models. The proposed model can thus serve as an early warning reference in tunnel design and construction work.


Introduction
The rapid development of urban areas in recent decades has led municipal authorities around the world to relocate urban transportation infrastructure underground [1-4]. In Taiwan, underground public transport systems have been developed and expanded in major cities such as Taipei, Taoyuan, and Kaohsiung. In Taipei alone, all heavy-load transportation routes through the city have been relocated underground, and five major metro lines with a total operating mileage of 131.2 km have been constructed and in operation since 1996. Underground construction work is mostly conducted in small, confined spaces subject to numerous uncertainties, making it much more difficult than aboveground work [5-7]. Moreover, with the exception of subway stations, most below-ground subway infrastructure is built using the shield tunneling method, in which a tunnel boring machine (TBM) simultaneously excavates the soil ahead, removes the excavated material, and installs a supporting shield structure to stabilize the newly excavated tunnel section. However, underground construction is not only challenging but also risky. During shield tunneling, factors such as changes in stress, tail void closures, disturbed soil compaction, and lining segment deformation can displace lateral soil layers, causing the ground to settle, bulge, or shift laterally [8]. Therefore, while the TBM is in operation, a safety monitoring system must be active. This system collects site data and supervises TBM maneuvers to prevent excessive ground settlement, which can damage existing urban infrastructure and buildings and trigger disastrous accidents [3, 5, 9-11].
However, the settlement data generated by the safety monitoring system alert users to settling that has already occurred and are thus useful only for developing and implementing post-deformation remedies that prevent a situation from worsening [2, 11]. Shield tunneling safety would benefit greatly from a database, built from limited monitoring data and soil layer parameters, that could be used to predict settlement conditions, provide early warnings of deformation, and increase reaction times [1, 12-14]. With this aim in mind, the settlement monitoring data for Tender CG291 of the Songshan Line of the Taipei MRT system were collected in this study. In these monitoring data, safe-level entries far outnumber alert-level entries, creating an imbalanced dataset. Classification models based on ordinary classification techniques can exhibit serious bias in class forecasting when processing imbalanced data [15], which renders inference models based on artificial intelligence (AI) unable to classify the scarce class with accuracy. For this reason, effectively processing imbalanced data to prevent forecasting bias is critical for AI-based inference models.
Few researchers have developed AI models for use as autonomous integrated systems to predict ground settlement in tunnel construction. Thus, in this study, a novel combination of the symbiotic organisms search-least squares support vector machine (SOS-LSSVM) and data balancing methods is proposed to help predict settlement and help project decision-makers prevent geotechnical disasters. The developed model is at the forefront of efforts to integrate metaheuristics, AI techniques, and data balancing methods to automatically and accurately predict shield-tunnel settlement. For this purpose, factors influencing settlement were investigated, and historical monitoring data were gathered for training the AI model. This prediction model is expected to be useful for design and construction agencies in predicting settlement, thereby helping them adopt preventive measures. The objectives of this study are as follows: (1) Identifying influential factors for settlement in shield tunneling: the literature on settlement estimation was reviewed for possible influential factors, which were further tested using statistical methods in SPSS. (2) Conducting resampling for imbalanced data: two methods were applied to the imbalanced data, probability distribution data balance sampling (PDDBS) and the synthetic minority oversampling technique (SMOTE). (3) Establishing a model for settlement prediction: the proposed settlement prediction model for shield tunneling was developed using SOS-LSSVM. (4) Verifying the effectiveness of the proposed model: the prediction results of SOS-LSSVM and four other AI-based models were compared to determine the best performer based on prediction accuracy. In addition, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to evaluate the classification accuracy of the data balanced by PDDBS and SMOTE.
Thus, the proposed model was verified to solve the data imbalance problem effectively.
In this study, SOS-LSSVM was integrated with the data balance sampling method to create a shield-tunnel settlement prediction system optimized to help prevent ground-settlement-related disasters during tunnel construction. The system, based in the construction control center, uses automatically collected and wirelessly transmitted monitoring data to forecast tunnel settlement status in real time. When predicted settlement levels exceed the warning value, engineers may take appropriate actions to prevent disaster.

Causes of Settlement in Shield Tunneling.
In Taiwan, shield tunneling has been in use for over 31 years since its debut in 1976; through the years, TBMs have seen considerable improvements, from the most primitive open-face manual types to the later mechanical, slurry pressure balanced, and earth pressure balanced types. Because of the lack of slurry deposit yards or facilities, the Rapid Transit System in Taipei mostly employs earth pressure balanced TBMs, except for the Xindian Line (CH22), which uses two slurry pressure balanced machines. Shield tunneling can cause ground settlement that negatively impacts adjacent structures [5, 16]. The soil layer and surface displacements caused by shield tunneling are related to the type and diameter of the TBM, the excavation depth, site conditions, soil properties, and the groundwater level. When a TBM is advancing, if the thrust force against the tunnel face is lower than the static earth pressure of the soil layers, the soil releases its stresses and rushes toward the tunnel face because the soil layers are under active earth pressure. This leads to ground loss and results in settlement. If the thrust force is equal to the static earth pressure of the soil layers, the tunnel face remains static. Furthermore, if the thrust force is greater than the static earth pressure of the soil layers, the soil along the tunnel face is pressed forward, causing the ground to bulge. Ground settlement during shield TBM tunneling develops in the following steps: (1) before and during tunnel face excavation, (2) during the passage of the shield skin plate, and (3) after installation of the segmental lining and backfill grouting [17].
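The pressure-balance rule described above can be sketched as a simple decision function. The function name, units, and tolerance below are illustrative assumptions for exposition, not part of the original study.

```python
def face_condition(thrust_pressure_kpa: float,
                   static_earth_pressure_kpa: float,
                   tol_kpa: float = 1.0) -> str:
    """Classify the tunnel-face condition from the pressure balance.

    Thrust below the static earth pressure implies active earth pressure and
    ground loss (settlement); above it, the face is over-pressurized (heave);
    within tolerance, the face is static.
    """
    diff = thrust_pressure_kpa - static_earth_pressure_kpa
    if diff < -tol_kpa:
        return "settlement risk"  # soil rushes toward the face -> ground loss
    if diff > tol_kpa:
        return "heave risk"       # soil pressed forward -> ground bulging
    return "static"
```
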
According to previous studies, various factors contribute to ground settlement, including geometrical factors, geological factors (e.g., the strength characteristics and the overconsolidation ratio of the soil), and shield operational parameters [4, 7, 8, 11-14, 18]. Fargnoli et al. concluded that the face support pressure, grouting pressure, machine stoppage time, and installation time for one ring of tunnel lining are essential parameters for predicting surface settlement [2]. Luo et al. also indicated that the groundwater condition is an important factor because shield tunneling causes pore water pressure variation [18]. The grouting fill factor and grouting pressure were identified as the most influential parameters when an AI-based algorithm was applied to predict settlements [14].

AI-Based Algorithm Applications for Predicting Settlements.
Establishing a settlement prediction model is necessary for underground construction safety. Analytical, empirical, and numerical methods have been proposed to predict settlement and other tunnel deformations. The most important weakness of these methods is that they fail to consider all parameters contributing to settlement (e.g., ground conditions, operational parameters, and tunnel geometry) [14]. Also, because the process surrounding shield TBM tunneling is complicated, most studies could not provide statistically meaningful relationships between volume loss and operational parameters [17].
Recently, some researchers have successfully used AI-based algorithms, such as artificial neural networks (ANNs), fuzzy logic (FL), support vector machines (SVMs), and gene expression programming (GEP), to establish models for predicting the settlement induced by shield tunneling [7, 14]. Wang et al. successfully applied an adaptive relevance vector machine (aRVM) to predict real-time settlement development [9]. Bouayad and Emeriault proposed a methodology that combines principal component analysis (PCA) with an adaptive neuro-fuzzy inference system (ANFIS) to model the nonlinear behavior of ground surface settlements induced by an earth pressure-balanced TBM [7].
The symbiotic organisms search-least squares SVM (SOS-LSSVM) was developed by Cheng and Prayogo [19] and has proved reliable in prediction tasks [20-22]. SOS-LSSVM uses an advanced metaheuristic to search for optimal parameters and identify the correlations between input and output variables in historical case data to establish inference models. Previous studies have also shown that the SOS method exhibits excellent performance [19, 23, 24]. In addition to SOS-LSSVM, this study applied the backpropagation neural network (BPNN), least squares support vector machine (LSSVM), evolutionary least squares support vector machine inference model (ELSIM) [25, 26], and SVM to estimate settlements for comparison.

Strategies against Data Imbalance.
Data imbalance refers to one class of samples in a dataset overwhelming another class; this has serious consequences for classification. Generally, the term "minority" (MI) refers to the class of scarce samples in the dataset, and "majority" (MA) refers to the dominant class [27]. For example, when a dataset contains 95% majority-class samples and 5% minority-class samples, an inference model will tend to classify all samples as the majority class and achieve 95% accuracy; however, its accuracy for the minority class will be 0%. This bias is caused by the characteristics and limitations of AI methods, which require a large amount of evenly distributed data for training and testing to achieve satisfactory forecasting results.
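The majority-class bias described above is easy to reproduce numerically. The sketch below uses made-up labels in the same 95%/5% proportions and a degenerate classifier that always predicts the majority class.

```python
import numpy as np

# Toy labels: 95% "safe" (0), 5% "alert" (1), mirroring the text's example.
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

overall_acc = (y_pred == y_true).mean()           # 0.95 -> looks good
minority_acc = (y_pred[y_true == 1] == 1).mean()  # 0.0  -> useless for alerts

print(overall_acc, minority_acc)  # prints 0.95 0.0
```
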
Once the distribution of imbalanced data is skewed, an AI-based inference model trained on them will produce correspondingly skewed results. The major measures for solving the data imbalance problem are undersampling and oversampling. In addition, this study introduces a sampling method that utilizes probability distributions to balance data and improve classification accuracy.

Undersampling.
Undersampling is a technique that decreases the number of MA samples to balance a training dataset, reducing the MA class until it is the same size as the MI class. Undersampling can outperform oversampling for training on imbalanced data; however, it can eliminate potentially useful training samples and hence lower the performance of the classifier.
Excessive MA samples can be eliminated through random selection to balance the two classes. To avoid the uncertainty of random undersampling, Kubat and Matwin proposed an alternative undersampling approach that they considered more appropriate: to mitigate data imbalance, they removed the redundant data in the MA class, followed by the borderline samples close to the boundary between the MA and MI classes, as well as the noisy data [28].
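A minimal sketch of random undersampling as described above, assuming binary labels and a NumPy feature matrix (the function name and defaults are illustrative, not from the study):

```python
import numpy as np

def random_undersample(X, y, majority_label=0, seed=42):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]
```
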

Oversampling.
Oversampling increases the number of MI samples to balance a training dataset, enlarging the MI class until it is the same size as the MA class. As an approach to data imbalance, it is highly popular and effective for training. However, because oversampling introduces closely replicated samples into the dataset, the result is often a lengthy training time or even overtraining.
In addition to random oversampling, the synthetic minority oversampling technique (SMOTE) was used in this study. Unlike random oversampling, which duplicates MI samples to expand the sample size, SMOTE generates synthetic samples by linear interpolation between two nearby samples. Specifically, SMOTE calculates the difference between an MI sample and one of its nearest neighbors, multiplies that difference by a random value between 0 and 1, and adds the result to the original sample to generate a new MI sample.
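The interpolation step described above can be sketched as follows. This is a simplified SMOTE, assuming a small numeric minority-class matrix and Euclidean nearest neighbours; parameter names are illustrative.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=1):
    """Generate n_new synthetic minority samples by linear interpolation
    between each sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to all other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # random value in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because each synthetic point lies on the segment between two minority samples, it stays inside the range of the observed minority values in every dimension.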

Establishing Settlement Inference Model for Shield Tunneling
This section describes how the critical influential factors for settlement in shield tunneling were identified. These factors serve as the input variables for the proposed model, which uses SOS-LSSVM and relies on historical case data for training and testing to determine the optimal mapping between input and output variables, thereby predicting tunnel settlement. The flowchart is illustrated in Figure 1.
Step 1. Identify preliminary influential factors
Review studies on shield tunneling and list the causes attributed to settlement. Those mentioned most frequently are identified as preliminary influential factors. Then, apply statistical tests in SPSS to the preliminary influential factors to determine which to include.
Step 2. Collect and establish the case dataset
Collect case data according to the required input and output variables and thereby establish a complete case dataset that provides the input data.

Structural Control and Health Monitoring
Step 3. Balance the dataset
A total of 999 data records were collected for the present study, of which 75 were at the alert level; the data were therefore imbalanced. To overcome this problem, this study proposes a new data balancing method: probability distribution data balance sampling (PDDBS). There are two types of PDDBS, oversampling and median sampling, as shown in Figure 2. PDDBS oversampling balances a dataset by increasing the MI samples to the same number as the MA samples. By contrast, PDDBS median sampling simultaneously increases MI samples and decreases MA samples to the median total sample size to balance the dataset [29].
(1) PDDBS oversampling procedure (Figure 2)
Step a: select one attribute from the dataset and calculate its sample size and R(MI), the number of samples that must be added to the MI class:
R(MI) = n_i(MA) - n_i(MI).
Step b: divide the MI class n_i(MI) into k intervals.
Step c: calculate the probability of each interval, as shown in Figure 3. The sample X is converted to the standard normal variable
Z_ij = (X_ij - μ_Xi) / σ_Xi,
and the probability P of an interval is the standard normal probability mass between the interval bounds.
Step d: calculate the number of samples S that must be added in each interval (Figure 3):
S = P × R(MI).
Step e: generate the values and add them to the MI class. Each of the S samples in Step d is generated as
X_new = X_ij(L) + r(0~1) × (X_ij(U) - X_ij(L)).
Step f: examine whether the sample sizes are balanced, that is, whether the classes in the dataset are equal in size. If not, balance them again; if they are, the dataset is considered balanced.

(2) PDDBS median sampling procedure (Figure 2)
Step a: select one attribute from the dataset and calculate its sample size, R(MI), and R(MA). The number of samples that must be added to the MI class is
R(MI) = n_i(C) - n_i(MI),
and the number of samples that must be removed from the MA class is
R(MA) = n_i(MA) - n_i(C).
Step b: divide the MI class n_i(MI) into k1 intervals and the MA class n_i(MA) into k2 intervals, with k1 and k2 determined from the respective sample sizes.
Step c: calculate the probability of each interval, as shown in Figures 4 and 5, converting sample X to the standard normal variable as in the oversampling procedure.
Step d: calculate the number of minority-class samples S1 to be added and the number of majority-class samples S2 to be removed for each interval. The number of samples S1 (Figure 4) that must be added in each interval is
S1 = P × R(MI),
and the number of samples S2 that must be removed in each interval is
S2 = P × R(MA).
Step e: generate values and add samples. The S1 samples from Step d are generated as in the oversampling procedure, and the S2 samples calculated in Step d are removed directly from the majority class.
Step f: confirm that the sample sizes are balanced and the classes in the dataset are equal in size. If not, balance them again.
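For one attribute, the PDDBS oversampling steps above can be sketched as below. This is a simplified, single-attribute reading of the procedure, assuming a normal distribution fitted to the minority values and uniform draws within each interval, so details may differ from the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def pddbs_oversample(x_min, n_required, k=5, seed=7):
    """PDDBS-style oversampling sketch for one attribute: split the minority
    values into k intervals (Step b), weight each interval by the probability
    mass of a fitted normal distribution (Step c), take S = P x R(MI) samples
    per interval (Step d), and draw uniform values inside each interval
    (Step e)."""
    rng = np.random.default_rng(seed)
    x_min = np.asarray(x_min, dtype=float)
    mu, sigma = x_min.mean(), x_min.std(ddof=1)
    edges = np.linspace(x_min.min(), x_min.max(), k + 1)
    # probability of each interval under the fitted normal distribution
    p = np.diff(norm.cdf(edges, loc=mu, scale=sigma))
    p = p / p.sum()  # renormalize to the observed value range
    counts = np.round(p * n_required).astype(int)  # Step d: S = P x R(MI)
    new = [rng.uniform(edges[i], edges[i + 1], counts[i]) for i in range(k)]
    return np.concatenate(new)
```
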

In the preceding equations, i denotes an influential factor, i = (1, 2, 3, . . ., m); X_ij = the value of the i-th factor in the j-th case; μ_Xi = the mean sample value of factor i; σ_Xi = the standard deviation of factor i; R(MI) = the number of samples to be added to MI; R(MA) = the number of samples to be subtracted from MA; S1 = the number of increased samples; S2 = the number of decreased samples; X_ij(L) = the minimum value in the set; X_ij(U) = the maximum value in the set; r(0~1) = a random variable ranging from 0 to 1; and n_i(C) = the median of the sample size of factor i.
Four methods, namely, PDDBS oversampling, PDDBS median sampling, SMOTE oversampling, and SMOTE median sampling, were implemented to thoroughly examine their performance on imbalanced classification, given their respective advantages and disadvantages [30-32]. PDDBS provides larger numbers of replicated minority samples but increases the likelihood of overfitting, while SMOTE reduces the risk of overfitting but tends to exclude helpful information. Also, while oversampling minimizes information loss and generates equal numbers of minority- and majority-class samples, the process may overfit the classifier. Finally, although the use of the median in median sampling to penalize differences in nominal features associated with typical differences in continuous feature values provides an effective theoretical basis for removing noisy and redundant samples, its sampling performance on the same datasets may be poor.
Step 4. Establish the inference model and compare the forecast results
Feed the case data to SOS-LSSVM to establish the inference model.

Step 5. Compare the results
The prediction accuracy of the proposed model was compared with that of other AI-based inference models to determine its forecasting ability. Furthermore, the best model's ROC and AUC under the various dataset balancing methods were compared to examine the classification performance.
Step 6. System development and implementation
In this step, the proposed model is developed and implemented as an integrated system that engineers may use in smart decision-making related to preventing and resolving ground-settlement problems.

Identifying Influential Factors.
As listed in Table 1, the dataset had ten possible influential factors (X1-X10). Based on the findings of previous shield tunneling studies [33, 34], soil shear strength is the primary factor affecting tunnel settlement, with lower strength values associated with a higher risk of settlement. The parameters of soil shear strength may be derived from the cohesion force (C) and the internal friction angle (φ). Groundwater level variation during shield tunnel construction [35] is another factor affecting tunnel settlement, with Liu et al. (2023) finding a positive correlation between the groundwater level and downward movement in the tunnel [36]. Tunnel geometry (e.g., depth of the tunnel center line and tunneling distance) [34], chamber pressure, total thrust force, tunneling speed, backfill quantity, excavated soil quantity, and water pressure [32] are also significant factors in tunnel settlement.
Although SOS-LSSVM can process large quantities of data, factors that are not correlated with the output can interfere with training and result in excessive errors. Therefore, an objective method was required to analyze the correlation between the factors and settlement and thus select significantly correlated parameters for the inference model. In statistics, "correlation" refers to the strength and direction of the linear relationship between two variables; hence, it also indicates the degree to which two variables are mutually independent. In this study, SPSS 22.0 was used to analyze the ten factors with Pearson's correlation coefficient, Kendall's tau-b, and Spearman's rho and determine the correlation between the input and output variables.
The results of the correlation analyses of the ten influential factors are presented in Table 2: eight of the ten factors were accepted and two were rejected as input variables. Factors that exhibited a significant correlation under at least two of the three tests at the 0.01 significance level (two-tailed) were selected for the models. The depth of the tunnel center line met the requirements of the correlation test only under the Pearson test, with Kendall's tau-b and Spearman's rho showing no significant correlation (p > 0.01). For the quantity of excavated soil, none of the correlation tests identified a significant correlation with soil settlement at the 0.01 level. Based on these findings, the depth of the tunnel center line and the quantity of excavated soil were not included as input variables in the model. The eight other factors showed significant correlations (p < 0.01) with soil settlement and were thus included as input variables.
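The two-of-three selection rule can be sketched with SciPy's three correlation tests. The factor and settlement arrays below are synthetic stand-ins for the study's columns, not the actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical factor and settlement columns (stand-ins for X1..X10 and Y).
factor = rng.normal(size=200)
settlement = 0.8 * factor + rng.normal(scale=0.5, size=200)

r_p, p_pearson = stats.pearsonr(factor, settlement)
tau, p_kendall = stats.kendalltau(factor, settlement)
rho, p_spearman = stats.spearmanr(factor, settlement)

# Keep a factor only if it is significant at the 0.01 level (two-tailed)
# under at least two of the three tests.
significant = sum(p < 0.01 for p in (p_pearson, p_kendall, p_spearman))
keep_factor = significant >= 2
```
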

Case Study and Data Collection.
The datasets used in this study comprise data from construction projects of the Taipei Mass Rapid Transit (MRT) system in Taiwan. A total of 999 settlement monitoring records were collected, each covering ten input variables and one output variable. Tender CG291 of the Taipei MRT system, covering G16 Zhongshan Station to G14 Beimen Station, was used as the case study, as shown in Figure 6. The shield tunnel employed a two-section configuration with a total length of 1861 meters. The first section extends from Beimen Station (G14) to Tianshui Road Extension Station (G15), and the second extends from Tianshui Road Extension Station (G15) to Zhongshan Station (G16). In terms of geological properties, the route lies in what is classified as Zone T2 (Tamsui River Zone 2), and the tunnel runs primarily in the strata between Songshan Formation 3 and Songshan Formation 4. The strata are evenly layered, and the groundwater level is approximately 2.7-3.5 m underground (EL 99.2-100 m). The monitoring system used was the settlement reference point-shallow subsurface type (SSI). The monitoring setups are summarized in Table 3.

As shown in Table 4, 75 of the 999 samples in the dataset were at the alert level, while the remaining 924 were at the safe level, making the dataset imbalanced. Thus, the dataset was balanced separately using PDDBS and SMOTE by modifying the number of samples in the majority and minority classes. Table 5 shows the number of modified samples by method. Balancing changed the majority-to-minority ratio from 12.32 to approximately 1 by increasing the number of minority samples and reducing the number of majority samples. PDDBS oversampling and PDDBS median sampling generated 924 and 499 minority-class samples, respectively, while SMOTE oversampling and SMOTE median sampling generated 918 and 495, respectively. Based on the balanced datasets, SOS-LSSVM was then applied to predict the settlement.

Data Testing Using SOS-LSSVM.
After PDDBS and SMOTE were applied to balance the dataset, SOS-LSSVM was used to test the balanced dataset. Then, the two data balancing methods were compared on the basis of the same algorithm. In addition to SOS-LSSVM, this study also provides the estimation results produced by BPNN, LSSVM, ELSIM, and SVM for comparison.

Data Preprocessing.
Data preprocessing involves scaling the entire dataset, which significantly affects the model outcomes. Preprocessing is required before training to scale the data into an equivalent range. Although SOS-LSSVM is able to process large quantities of data and identify the nonlinear mapping between input and output values, learning speed and accuracy are seriously compromised when the variable ranges are very large. Therefore, before training, the input and output values must be scaled to prevent the model from becoming temporarily unstable or failing to converge. In this study, a normalization method was used to transform the inputs into the 0-1 range by linear scaling. The following equation shows the normalization function applied to the datasets:

X_norm = (X - X_min) / (X_max - X_min),

where X_norm is the normalized value, X is the actual value, and X_max and X_min are the maximum and minimum values, respectively.
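A minimal implementation of this linear 0-1 scaling, applied column-wise to a feature matrix:

```python
import numpy as np

def min_max_normalize(X):
    """Linearly scale each column of X into the 0-1 range:
    X_norm = (X - X_min) / (X_max - X_min)."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```
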

Tenfold Cross-Validation.
Tenfold cross-validation, recommended to reduce bias and obtain reliable accuracy in statistical analysis [37], was implemented in this study to divide the datasets randomly into ten folds of approximately equal size to evaluate the learning model's performance.
Ninety percent of the dataset was used for training and 10% as validation data for testing. The process was repeated ten times, and the final result was calculated as the average of the tenfold results.
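A sketch of the fold construction, assuming the 999-record dataset mentioned in this study:

```python
import numpy as np

def tenfold_indices(n_samples, seed=0):
    """Shuffle sample indices and split them into ten folds of roughly
    equal size; each fold serves once as the 10% test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, 10)

folds = tenfold_indices(999)     # 999 records, as in this study
sizes = [len(f) for f in folds]  # folds of 100 or 99 samples
```
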

Inference Settlement Evaluation and Error Indices.
This study proposes SOS-LSSVM as the inference model, which only requires setting the number of iterations for SOS and the search range of the LSSVM parameters. The LSSVM serves as a supervised-learning-based predictor that accurately builds the relationship between the input and output variables. The SOS algorithm is used as a metaheuristic to search for the optimal parameters of the LSSVM. This hybrid system enhances the learning process through the mutualism, commensalism, and parasitism phases. The search stops when the stopping condition is fulfilled; otherwise, the model proceeds to the next iteration. The training and testing data were fed to SOS-LSSVM, and the output values were then denormalized to compute the mean performance indices, which indicate the accuracy of the forecasts. Error indices are detailed in this section: errors are inevitable in forecasts, so effective indices are required to appraise them, determine the accuracy of an inference model, and compare it with other inference models.

Table 1 (excerpt):
X2 Groundwater level (input): the level at which the soil is saturated
X3 Depth of tunnel center line (input): the distance between the tunnel center line and the soil surface
X4 Tunneling distance (input): the distance of the segmental joint set in the shield tunnel
X5 Chamber pressure (input): the pressure applied to counterbalance the lateral earth pressure
X6 Total thrust force (input): a reactive force exerted on the shield during tunneling
X7 Tunneling speed (input): the speed of the cutting head at total torque
X8 Backfill quantity (input): the amount of material diffused behind the segmental lining
X9 Quantity of excavated soil (input): the amount of soil excavated during the tunneling process
X10 Water pressure (input): the pressure of the water distributed on the tunnel wall
Y Settlement (output): subsidence due to tunneling activity
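The SOS search loop described above (mutualism, commensalism, and parasitism phases) can be sketched on a toy objective. The objective below merely stands in for the cross-validated LSSVM error as a function of its hyperparameters, and the population size, iteration count, and bounds are illustrative assumptions, not the study's settings.

```python
import numpy as np

def sos_minimize(f, bounds, pop_size=20, iters=100, seed=0):
    """Minimal symbiotic organisms search (SOS) sketch over a real-valued
    search space, with greedy acceptance in every phase."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    pop = rng.uniform(lo, hi, (pop_size, dim))
    fit = np.array([f(x) for x in pop])

    for _ in range(iters):
        best = pop[fit.argmin()].copy()
        for i in range(pop_size):
            j = rng.choice([k for k in range(pop_size) if k != i])
            # Mutualism: both organisms move toward the best solution.
            mutual = (pop[i] + pop[j]) / 2
            bf1, bf2 = rng.integers(1, 3, size=2)   # benefit factors 1 or 2
            xi = np.clip(pop[i] + rng.random(dim) * (best - mutual * bf1), lo, hi)
            xj = np.clip(pop[j] + rng.random(dim) * (best - mutual * bf2), lo, hi)
            for k, x in ((i, xi), (j, xj)):
                fx = f(x)
                if fx < fit[k]:
                    pop[k], fit[k] = x, fx
            # Commensalism: organism i benefits from j; j is unaffected.
            x = np.clip(pop[i] + rng.uniform(-1, 1, dim) * (best - pop[j]), lo, hi)
            fx = f(x)
            if fx < fit[i]:
                pop[i], fit[i] = x, fx
            # Parasitism: a mutated copy of i tries to displace j.
            parasite = pop[i].copy()
            mask = rng.random(dim) < 0.5
            parasite[mask] = rng.uniform(lo, hi, dim)[mask]
            fp = f(parasite)
            if fp < fit[j]:
                pop[j], fit[j] = parasite, fp
    return pop[fit.argmin()], fit.min()

# Toy objective standing in for the LSSVM cross-validation error.
x_best, f_best = sos_minimize(lambda x: ((x - 0.3) ** 2).sum(),
                              bounds=[(-5, 5), (-5, 5)], iters=60)
```
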
In this study, five such indices were used (as listed in Table 6): mean absolute percentage error (MAPE), correlation coefficient (R), mean absolute error (MAE), root mean squared error (RMSE), and reference index (RI). RI served as the index for comprehensive evaluation. These performance measures allowed for more accurate results and a fairer test [24].
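Four of the five indices can be computed directly as below; RI, the combined reference index, is then derived from these across competing models, following the definition in Table 6 (not reproduced here).

```python
import numpy as np

def error_indices(y_true, y_pred):
    """Compute the four point-error indices: MAPE, MAE, RMSE, and R."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    e = y_true - y_pred
    mape = np.mean(np.abs(e / y_true)) * 100.0   # assumes no zero targets
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return {"MAPE": mape, "MAE": mae, "RMSE": rmse, "R": r}
```
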
This study also used ROC curves and AUC values to evaluate classification accuracy. The ROC curve is a coordinate-based diagram of a classifier's sensitivity that has gained increasing popularity in machine learning and data mining. It is built on four basic outcomes: (1) true positive (TP), (2) false positive (FP), (3) true negative (TN), and (4) false negative (FN).
Only the true positive rate (TPR) and false positive rate (FPR) are required to plot the ROC curve. TPR is the rate at which actual positive samples are accurately identified as positive, whereas FPR is the rate at which actual negative samples are erroneously identified as positive. The ROC space thus takes FPR as the x-axis and TPR as the y-axis, and the ROC curve is made up of (FPR, TPR) coordinate points. The perfect classification point is (0, 1), and the AUC value is the area under the ROC curve.
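A minimal sketch of computing (FPR, TPR) points for a set of score thresholds, assuming binary labels and continuous classifier scores; the AUC is the area under the curve traced by these points.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return the (FPR, TPR) coordinate for each threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pts = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return pts
```
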

Training and Testing Results.
Through the training and testing of SOS-LSSVM, the following information was obtained. The SOS-LSSVM training and testing results on the dataset, both before and after balancing, were compared to determine the accuracy of SOS-LSSVM. The results of BPNN, LSSVM, ELSIM, and SVM are also presented as references to illustrate the relative accuracy of SOS-LSSVM. The training and testing results of SOS-LSSVM under the various indices are shown in Figure 7. Both the PDDBS and SMOTE methods increased the accuracy of settlement prediction: compared with the original data, both methods improved the MAPE, MAE, RMSE, and R values. Overall, this study provides an alternative approach to data balancing that enhances the accuracy of SOS-LSSVM. The average values of the performance evaluation are shown in Table 7. In terms of MAPE and MAE, SMOTE median sampling was superior, achieving the smallest error values in both training and testing. SMOTE oversampling achieved the highest correlation (R) value for testing (0.988), while PDDBS oversampling achieved the best training performance, with a correlation (R) of 0.996 and an RMSE of 0.7852. It should be noted that although the original data exhibited acceptable settlement estimation performance, the accuracy of classifying the settlement condition as safe or alert remained unknown.

Comparison of Inference Models and Resampling Methods.
To determine whether SOS-LSSVM is superior to other AI-based inference models, the dataset was balanced separately with PDDBS and SMOTE and then used to train BPNN, LSSVM, ELSIM, and SVM. The corresponding RI values were calculated for comparison, and the performance of each algorithm was ranked accordingly. Based on Tables 8 and 9, the RI values indicated that SOS-LSSVM was superior to the other AI-based inference models, regardless of whether the data were left unbalanced or balanced by either PDDBS or SMOTE.
Unexpectedly, the RI values also indicated that SMOTE was superior: data balanced by SMOTE yielded higher RI values than data balanced by PDDBS. This is because RI is an index of general performance rather than a specific index of classification accuracy. This result highlights the need to introduce the ROC curve and AUC to evaluate the performance of SOS-LSSVM in classifying the settlement status (i.e., safe or alert).
The ROC curves are shown in Figures 8 and 9, and the average FPR, TPR, and AUC values are shown in Table 10. Judging from the average AUC values, using the original imbalanced settlement data yielded the lowest classification accuracy. Both PDDBS oversampling and PDDBS median sampling achieved slightly higher classification accuracy than SMOTE in training and testing, and for both resampling methods, oversampling outperformed median sampling. These results support a positive relationship between the amount of sample data added to the minority class and the classification accuracy. Thus, the proposed resampling method is a competitive alternative to currently popular approaches, both solving the problem of data imbalance and enhancing the accuracy of AI-based forecasting.
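The AUC values discussed above can be computed directly from classifier scores via the Mann-Whitney formulation, which interprets AUC as the probability that a randomly chosen positive (alert) sample is scored above a randomly chosen negative (safe) sample. A minimal sketch (illustrative; the function name is an assumption):

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic.

    labels: 1 for the alert (positive) class, 0 for the safe (negative) class.
    scores: the classifier's continuous outputs, higher meaning more alert-like.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # count all positive/negative score pairs; ties count half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of safe and alert cases, which is why the average AUC is a fairer basis than RI for comparing the resampling methods.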

System Development and Implementation. The application of the developed shield-tunnel settlement prediction system is shown in Figure 10. In this system, data loggers with built-in sensors record settlement data at regular intervals; these data are transferred to a computer storage system in the control center via wireless Internet, where they are preprocessed and transformed for further analysis. The SOS-LSSVM system features a graphical user interface that allows users to interact easily with the algorithm, automatically training a model and performing prediction analysis on the input data.
The performance indices are defined as MAPE = (1/n)Σ|(y − y′)/y| × 100%, MAE = (1/n)Σ|y − y′|, and RMSE = √[(1/n)Σ(y − y′)²], where y is the actual value, y′ is the predicted value, and n is the number of data samples.

Structural Control and Health Monitoring
The results provide engineers with a decision-making tool for creating project-specific, real-time monitoring solutions. The system can be integrated into routine tunneling management, providing a centralized, convenient platform that incorporates state-of-the-art technologies, including data mining techniques and artificial intelligence.

Conclusions
Settlement monitoring and estimation are essential for underground construction. This study applied SOS-LSSVM as the basis of an inference model for settlement in shield tunneling of the Taipei Mass Rapid Transit system, using historical case data for training and influential factors of settlement as input variables. This study contributes an AI system, integrated with metaheuristic and data-balancing methods, that predicts ground settlement and facilitates appropriate response measures to prevent urban geotechnical hazards. The proposed model markedly outperforms the other AI models (BPNN, LSSVM, ELSIM, and SVM) and accurately predicts settlement, helping engineers anticipate settlement status over the course of tunnel construction projects.
The PDDBS and SMOTE methods were applied to solve the data imbalance problem. Tenfold cross-validation was used to evaluate the performance of the developed model, showing that SOS-LSSVM combined with data balancing by PDDBS and SMOTE achieved the highest RI values in both training and testing. In addition, the ROC curve and AUC were used to assess combinations of SOS-LSSVM with the various data balancing methods (PDDBS and SMOTE) in terms of their ability to accurately classify the settlement status as either safe or alert. A comparison of average AUC values demonstrated that the classification accuracy of PDDBS was higher than that of SMOTE, and that PDDBS oversampling was superior to PDDBS median sampling. These results demonstrate that the proposed method can effectively balance an imbalanced dataset and enhance AI-based forecast accuracy.
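The tenfold cross-validation used above partitions the samples into ten folds, each serving once as the test set while the rest train the model. A generic numpy-based sketch of the split procedure (illustrative only; the function name and seed are assumptions, not the authors' implementation):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Shuffle n sample indices and split them into k folds; each fold
    serves once as the test set while the remaining folds form the
    training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Averaging the performance indices over the k held-out folds gives a less optimistic estimate of generalization than a single train/test split, which matters for a safety-critical warning system.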
This study is a pioneering effort to develop an autonomous system that integrates monitoring sensors for data collection, wireless data transmission, and settlement prediction for disaster warning and prevention. The system was tested on data from a real MRT construction project in Taiwan to demonstrate its novelty and practicality in real-world applications. Given the limited time available for structural analyses and the limited technical expertise available in the field, most project-site engineers struggle to make the timely and correct decisions necessary to effectively prevent soil-settlement disasters. The system developed in this study can be used easily and quickly by engineers to facilitate appropriate preemptive actions and thereby avoid disasters, construction failures, and the associated losses of property and life.
The findings of this study suggest two directions for future research. First, differences in soil characteristics, as well as the quantity and completeness of the data, may directly influence the accuracy and reliability of the estimation results; future researchers are therefore advised to collect other types of settlement monitoring data for training and testing. Second, the PDDBS method was compared only with SMOTE; future studies should include more resampling methods to establish their relative effectiveness. Furthermore, only one imbalanced dataset was used in the present study. Future researchers are advised to collect a wider variety of imbalanced data for training and testing more resampling methods, to compare their classification accuracy, and to improve the practicality and accuracy of the inference models.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.