Application of Partial Least Squares-Discriminate Analysis Model Based on Water Chemical Compositions in Identifying Water Inrush Sources from Multiple Aquifers in Mines

Mine water inrush seriously threatens the safety of coal mine production. Quick and accurate identification of mine water inrush sources is of great significance to preventing mine water hazards. This paper combined partial least squares-discriminate analysis (PLS-DA) with inrush water chemical composition to identify the source of water inrush from multiple aquifers in mines. The Renlou Coal Mine in the Linhuan mining area was selected for this study, and seven conventional water chemical compositions from 54 water samples in three aquifers were collected and tested, of which 45 water samples were used to establish the PLS-DA discriminant model, and nine were used to test the prediction effect. To improve model accuracy and predictive ability, the hierarchical clustering analysis method was used to eliminate seven unqualified water samples to reduce the errors caused by improper data. PCA and PLS-DA methods were used to analyze and process the remaining water sample data, and on the basis of PCA analysis, the remaining 38 water samples were used to establish the PLS-DA discriminant model. The model was validated using permutation and external prediction tests. The research shows the following results: (1) Both PCA and PLS-DA methods can distinguish water samples from three different water sources, but the classification effect of PLS-DA was better than PCA because it can strengthen the difference of water chemical composition between different water sources. (2) The correct discrimination rate of the PLS-DA discriminant model was as high as 100%, and permutation tests showed that the model was not overfit. External validation found that the model had good stability and discrimination. (3) HCO₃⁻ and total dissolved solids (TDS) were the most important differential marker compositions that affected the discrimination results based on Variable Importance for the Projection (VIP) scores.
The discriminant model established in this study combined the advantages of principal component analysis and multiple regression analysis, providing a new method for accurately identifying the sources of water inrush in mines.


Introduction
Coal is an important basic resource for the long-term rapid and stable development of the national economy in China [1]. As mining depth has increased, water inrush disasters have become more frequent, which poses a serious threat to the safety of coal mine production [2,3]. Quickly and accurately identifying the sources of mine water inrush is important for the prevention and control of coal mine water inrush disasters, and it is also a top concern in mine water disaster management research [4]. Many scholars have proposed various methods to identify the source of water inrush in mines, such as groundwater chemistry [5,6], trace elements and isotopes [7,8], water temperature [9,10], and groundwater level dynamic observations [11]. After comparing groundwater chemistry with other methods, Wu et al. [12] concluded that the water chemistry discrimination method had more advantages in practical applications. Because groundwater chemistry can reflect the essential characteristics of groundwater, and can accurately, quickly, and economically identify water sources, it has been more commonly used to identify water inrush sources in mines [13].
At present, mathematical statistical methods and machine learning methods are typically applied when using the chemical composition of groundwater to identify water inrush sources, such as fuzzy mathematical theory [14,15], grey relational analysis [16,17], back propagation neural network (BP neural network) [18], Fisher's discriminant method [19], Bayes' discriminant method [20], distance discriminant method [21], extension identification method [22], support vector machine (SVM) [23], and extreme learning machine (ELM) [24]. The application of these mathematical and machine learning methods enriches the content of mine water source discrimination theory, improves identification accuracy, and demonstrates good practicability and effectiveness. However, most of these discriminant methods have not considered the complicated information superposition problem between water chemistry indicators, a problem that results in misdiscrimination by the established model in practical application, and their recognition accuracy still needs to be improved [25]. Therefore, some scholars have adopted the principal component analysis (PCA) method in water source discriminant analysis and obtained better results [26]. PCA can extract and compress the information in the hydrochemical data of different water sources, transform the original data into mutually independent new data without information superposition, and eliminate the effects caused by information superposition between indicators so that the characteristics of different water sources can be described more effectively [27].
In this paper, a new promising method (partial least squares-discriminate analysis (PLS-DA)) is presented, and this method can effectively solve the problem of multicollinearity between multiple variables and is reliable especially when there is a high degree of correlation between them. PLS-DA is a supervised multivariate statistical method that integrates the basic functions of PCA, canonical correlation analysis and multiple regression analysis [28,29]. Similar to PCA, PLS-DA is also a multidimensional vector analysis method based on dimensionality reduction. However, different from PCA, the PLS-DA method performs orthogonal decomposition of the measurement matrix while also performing orthogonal decomposition of the response matrix. In other words, PLS-DA can preset classifications and add grouping variables for supervised analysis to further strengthen the differences between groups [29]. Its advantage is that it can remove the influence of uncontrolled variables on data analysis as much as possible, further mine the information in the data, and quantify the degree of component difference caused by characteristic ions [30]. Barker and Matthew [30] used statistical theory to show that PLS-DA performed good classification. In recent years, PLS-DA has been widely used for screening pharmaceutical ingredients; tracing the origins of wine, meat, etc.; and identifying and classifying tea and navel oranges [31]. However, few studies have used it to identify water inrush in mines. Yan et al. [32] used laser-induced fluorescent (LIF) technology to obtain the fluorescence spectrum of inrush water sources and used it as an indicator for PLS-DA discrimination with good effect. However, it is difficult to obtain the fluorescence spectrum of the inrush water sources using this technology for all mines, and the test cost is relatively high. 
This paper used the conventional ion compositions of the inrush water as indicators to establish a PLS-DA discriminant model and further broaden the application range of PLS-DA in identifying mine water inrush sources. Seven conventional water chemical compositions from water samples of three aquifers were used as indicators in this study and the hierarchical clustering analysis method was used to eliminate the unqualified water samples. The PCA method was used to analyze the remaining water sample data, and then the PLS-DA discrimination model based on chemical compositions of inrush water was established. Permutation and external verification tests demonstrated model stability and discriminative ability.

Description of the Study Area
The Renlou Coal Mine is located in the Linhuan mining area of northern Anhui Province, China. Its geographic location is shown in Figure 1. The mine field is located in the middle of the Huaibei Plain, and the terrain is flat. There is only a small, artificially dredged seasonal river in the mine field and its flow is controlled by rainfall. The average annual rainfall in the study area is 820 mm, mostly concentrated from June to September, and the maximum rainfall in July is 268.5 mm. The annual average temperature is 14.3°C, with the lowest temperature in January of -23.2°C and the highest in July of 41°C. The maximum evaporation occurs from June to August, with a multiyear average evaporation of 1774 mm.
The Renlou Coal Mine is located in the southeast wing of the Tongting Anticline, and the stratigraphic occurrence in the area is relatively gentle, generally 13° to 20°. At present, the primary mined coal seams are No. 7₂, No. 7₃, and No. 8₂. Water inrush is an important threat to the safe production of the Renlou Coal Mine, where 21 inrushes occurred from January 1989 to February 2013. The water inrush duration at some points was long with a large amount of water. For example, during the excavation process of working face 7₂22, the maximum instantaneous water influx reached 34,570 m³/h due to the karst collapse column connected to other aquifers, causing the entire well to be flooded. Therefore, accurate identification of water inrush sources is very important for the prevention and control of water disasters in the Renlou Coal Mine.
There are multiple groundwater aquifer layers in the minefield. From top to bottom, there are loose pore aquifers, coal-measure formation sandstone fractured aquifers, Taiyuan formation limestone karst fractured aquifers, and Ordovician karst fractured aquifers. Of these, the fourth aquifer in the loose layer (referred to as the "fourth aquifer") may enter the mine through a crack or a vertical guide channel and affect production, and it is also the main hidden water hazard in shallow coal mining. The sandstone fractured aquifer (referred to as the "coal-bearing sandstone aquifer") is mainly stored in the structural fissures of sandstone layers as static reserves. Due to the influence of geological structure, the fissures are unevenly developed. When the fissures are developed or connect with other aquifers, the water output will increase, causing other aquifers to become indirect water sources for the main coal seams. The limestone karst fissure aquifer of the Taiyuan Formation (referred to as the "limestone aquifer") is the main water source for the mine. The average distance between the aquifer and the No. 8₂ coal seam floor is about 140 m. However, due to the development of hidden karst water-conducting subsidence columns and water-conducting faults in the minefield, these passages may cause limestone aquifer water to enter the mine. The water content of the Ordovician karst fissure aquifer is very rich, but the aquifer is nearly 290 m away from the No. 8₂ coal floor. Therefore, it does not have hydraulic connections with the mine under normal conditions and does not directly threaten the safety of the mine.
Table 1: Test data for water samples.

Materials and Methods
Table 1 column headings: Num; Ca²⁺ (mg/L); Mg²⁺ (mg/L); Na⁺+K⁺ (mg/L); Cl⁻ (mg/L); SO₄²⁻ (mg/L); HCO₃⁻ (mg/L); TDS (mg/L); Water source type.
Of the 45 water samples used for modeling, nine were from the fourth aquifer, 16 were from the coal-bearing sandstone aquifer, and 20 were from the limestone aquifer. Of the nine samples used for validation, two were from the fourth aquifer, three were from the coal-bearing sandstone aquifer, and four were from the limestone aquifer. The water sample locations are shown in Figure 1. Samples were collected through underground drainage holes or surface hydrological observation holes.
Samples from the underground drainage holes were collected directly in the mine, and samples from the surface hydrological observation holes were collected with a self-made deep-water sampler. When collecting samples, a 2.5 L polyethylene bucket was rinsed with sampling water three times before a sample was taken.
After that, the samples were kept in a clean place to prevent contamination and placed in a low-temperature environment to inhibit the oxidation-reduction reaction and biochemical effects. Water sample chemical testing included K⁺+Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻, and TDS. HCO₃⁻ was tested using the acid-base titration method, Cl⁻ and SO₄²⁻ were tested using ion chromatography, Ca²⁺ and Mg²⁺ were tested using EDTA titration, K⁺+Na⁺ was tested by flame atomic absorption spectrophotometry, and TDS was calculated according to the mass concentration of each component. The water sample test data are shown in Table 1.

Hierarchical Clustering Analysis.
Hierarchical clustering analysis is an unsupervised identification method that can group samples based on the data itself without known category information. The basic idea is to first treat n samples as n classes, and then specify the distance between samples and the distance between classes. Then, the two classes with the smallest distance are selected and merged into a new class, and the distance between the new class and the other classes is calculated. The number of classes is reduced in this way until all samples are clustered into one class. The result is a classification system from small to large that reflects the close relationships between individuals or groups and can be represented by a cluster dendrogram [33]. Classes with stronger correlations are merged first, and the degree of affinity between each newly merged class and the other classes is then considered before the next merge, so that differences within categories are as small as possible and differences between categories are as large as possible.
Generally, R-type and Q-type cluster analyses are used. The R-type cluster analysis classifies variables, and the Q-type cluster analysis classifies samples [34]. The mathematical basis of hierarchical clustering analysis can be found in the literature [35,36]. In this paper, Ward's method was used to perform Q-type clustering analysis on the original water samples, with the squared Euclidean distance as the metric for the relationships between them, using the statistical software IBM SPSS Statistics 26. Finally, the cluster dendrogram of the original water samples was obtained. The data were screened to eliminate water samples that did not meet the requirements.
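The merging rule described above can be illustrated with a minimal sketch of Ward's criterion, in which the cost of merging two classes is the size-weighted squared Euclidean distance between their centroids. This is not the SPSS procedure used in the study: the function names `ward_linkage` and `cut_into_groups` and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np

def ward_linkage(X):
    """Naive agglomerative clustering with Ward's criterion.

    Returns a list of merges (cluster_a, cluster_b, cost, new_size), where the
    cost is the increase in within-cluster sum of squares caused by the merge.
    """
    X = np.asarray(X, dtype=float)
    # Each sample starts as its own cluster: id -> (centroid, size)
    clusters = {i: (X[i].copy(), 1) for i in range(len(X))}
    merges, next_id = [], len(X)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            ca, na = clusters[a]
            for b in ids[pos + 1:]:
                cb, nb = clusters[b]
                # Size-weighted squared Euclidean distance between centroids
                cost = na * nb / (na + nb) * np.sum((ca - cb) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        ca, na = clusters.pop(a)
        cb, nb = clusters.pop(b)
        clusters[next_id] = ((na * ca + nb * cb) / (na + nb), na + nb)
        merges.append((a, b, float(cost), na + nb))
        next_id += 1
    return merges

def cut_into_groups(n_samples, merges, n_groups):
    """Replay the first n_samples - n_groups merges and label each sample."""
    members = {i: [i] for i in range(n_samples)}
    for step, (a, b, _, _) in enumerate(merges[: n_samples - n_groups]):
        members[n_samples + step] = members.pop(a) + members.pop(b)
    labels = np.empty(n_samples, dtype=int)
    for label, idx in enumerate(members.values()):
        labels[idx] = label
    return labels

# Synthetic data (NOT the samples in Table 1): two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(6, 7)),
               rng.normal(5.0, 0.3, size=(6, 7))])
merges = ward_linkage(X)
labels = cut_into_groups(len(X), merges, n_groups=2)
```

In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be preferred; the explicit loop is shown only to make the merge cost visible.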
Partial Least Squares-Discriminate Analysis.
PLS-DA is a supervised multivariate statistical method that integrates the basic functions of principal component analysis, canonical correlation analysis, and multiple regression analysis [37] and is capable of compressing data and extracting characteristic information. The principle of PLS-DA is to separately train the characteristics of different samples, generate a training set, and test the reliability of the training set. This method can group the required observation variables in advance and perform statistical analysis on the data according to the nature of the groups, and the key variables that affect the grouping can be learned [38].
Based on PLS regression, PLS-DA takes the class membership information, coded in an auxiliary matrix, as input when constructing the factors, uses the independent variable matrix X and the categorical variable Y from the training set samples to establish a regression model, and determines the sample category based on its predicted PLS value. It also reduces the dimensionality of the high-dimensional data matrix to a lower-dimensional space. Similar to PCA, the new variables obtained are not related to each other, but the difference is that PLS introduces the information from category matrix Y into matrix X while decomposing the independent variable matrix X, and then performs orthogonal decomposition. This processing can effectively eliminate any useless noise in the independent variable matrix X and any useless information in category matrix Y. Using this method to analyze mine water inrush sources can eliminate overlapping parts of the water chemical information, which solves the multiple correlation problem and makes the data more accurate and reliable to ensure the best calibration model [30]. The mathematical basis of partial least squares (PLS) regression can be found in the literature [39]. The specific steps of the PLS-DA analysis method are as follows, and this method can be completed with SIMCA 14.1 software.
(1) Establish categorical variables of training set samples.

(2) Decompose the independent variable matrix X and the category matrix Y at the same time, and ensure their principal components are linearly correlated to the highest degree. The model can be expressed as

$$X = TP^{T} + E, \qquad Y = UQ^{T} + F,$$

where T and U are the respective score matrices of X and Y; P and Q are the respective loading matrices of X and Y; E and F are the respective fitting residual matrices of X and Y.

(3) Conduct linear regression on T and U to obtain the regression factor B:

$$U = TB.$$

(4) According to the loading matrix P, obtain the score vector $t_{test}$ of the sample $x_{test}$ to be tested during prediction, and then obtain the predicted value $Y_{P}$ according to the following formula:

$$Y_{P} = t_{test} B Q^{T}.$$

(5) Determine the type of sample to be tested according to the following rules. When $Y_{P} > 0.5$ and deviation $D < 0.5$, it belongs to this category; when $Y_{P} < 0.5$ and deviation $D < 0.5$, it does not belong to this category; when deviation $D \geq 0.5$, it is uncertain [40].

Results and Discussion
There were three main reasons for screening the original water samples. First, in order to reduce the impact of external human factors (such as possible contamination of water samples during sampling, storage and testing, measurement deviations during testing, etc.), it was necessary to screen the original water samples to eliminate unqualified samples and avoid large errors in the discrimination results [33].
Second, although the water chemical composition within an aquifer may vary with local hydrogeological conditions, it should maintain a dynamic balance through a series of physical and chemical reactions. Therefore, samples from the same aquifer generally have similar water chemistry characteristics. However, due to factors such as hydraulic connections between different aquifers and groundwater movement, the water chemical composition of samples from the same aquifer sometimes differs greatly. Such abnormal compositions cannot reflect the hydrochemical characteristics of the groundwater in that aquifer. We therefore had to identify the water samples that best represented the water chemical composition of each aquifer in order to establish a high-precision discrimination model for water inrush sources [25].
Third, PLS-DA model performance may deteriorate in the presence of abnormal sample values. In order to reduce the influence of abnormal samples on the PLS-DA model, the original water samples were also screened [41].
Before screening the water samples, we plotted the 45 original water samples on a Piper trilinear diagram (Figure 2). Figure 2 shows that, among the 45 original water samples from the three water sources in the study area, some samples of the same type were scattered and deviated significantly from the center of their group in the Piper trilinear diagram. These samples should be regarded as abnormal water samples and excluded. Hierarchical clustering analysis is a commonly used unsupervised agglomerative clustering method that can be used for this task [33]. In this paper, the ion contents of the 45 original water samples were used as the analysis variables, and the Q-type cluster analysis of the original water samples was completed with SPSS software. A clustering dendrogram for the samples was obtained (Figure 3). According to the distance of each original water sample in the dendrogram, the category of each original water sample was compared, and the water samples numbered 2, 3, 4, 17, 26, 38, and 41 were excluded. The remaining 38 original water samples were kept for subsequent analysis and modeling, as shown in Table 1.

Correlation Analysis of Water Chemical Compositions.
Ion concentrations in the water samples of each aquifer reveal the chemical characteristics of different groundwater sources and are the basis for distinguishing water from each aquifer. These kinds of hydrochemical components are not completely independent in groundwater, but are related to each other to a certain degree. However, most prior studies have not considered this connection [42].
This paper used Python to draw heat maps of the correlation coefficients between the water chemical compositions of water samples from the three aquifers (Figure 4). There were both positive and negative correlations as well as strong and weak correlations between the evaluated ions, and some ions had strong correlations with each other. For example, the correlation coefficients between Mg²⁺ and SO₄²⁻, as well as between Cl⁻ and Na⁺+K⁺, Ca²⁺, HCO₃⁻, and TDS in the water samples from the fourth aquifer were all greater than 0.8 (Figure 4(a)). The correlation coefficients between Na⁺+K⁺ and TDS, as well as between Cl⁻ and both Na⁺+K⁺ and TDS in the water samples from the coal-bearing sandstone aquifer were all greater than 0.8 (Figure 4(b)). Finally, the correlation coefficients between Mg²⁺ and Cl⁻, as well as between Ca²⁺ and both Mg²⁺ and Cl⁻ in the water samples from the limestone aquifer were all greater than 0.8 (Figure 4(c)). This indicated that the hydrogeological information reflected by the water chemical compositions overlapped. If this kind of information overlap is not considered in water source identification, it causes information redundancy, which can lead to serious multicollinearity, reduce the accuracy of the mine water inrush source identification model, and lead to poor judgment [43].
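The screening for strongly correlated indicator pairs can be sketched as follows. This is only an illustration of the idea, not the script used to produce Figure 4: the helper `strong_pairs`, the variable labels, and the synthetic concentrations are all assumptions, and the 0.8 threshold is taken from the text above.

```python
import numpy as np

names = ["Ca", "Mg", "Na+K", "Cl", "SO4", "HCO3", "TDS"]

def strong_pairs(data, names, threshold=0.8):
    """Return (name_i, name_j, r) for variable pairs with |r| > threshold."""
    # Pearson correlation between columns (variables)
    r = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

# Synthetic concentrations in mg/L (NOT the values in Table 1)
rng = np.random.default_rng(1)
n = 12
cl = rng.uniform(100, 300, size=n)
data = np.column_stack([
    rng.uniform(20, 80, n),            # Ca
    rng.uniform(10, 40, n),            # Mg
    0.9 * cl + rng.normal(0, 5, n),    # Na+K, deliberately tied to Cl
    cl,                                # Cl
    rng.uniform(200, 600, n),          # SO4
    rng.uniform(300, 500, n),          # HCO3
])
# Illustrative TDS taken as the plain sum of the six components
data = np.column_stack([data, data.sum(axis=1)])

pairs = strong_pairs(data, names)
```

Pairs returned by such a check are the ones whose information overlaps; a heat map of the full matrix `r` conveys the same content graphically.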
PCA of the Training Samples.
PCA is an unsupervised multivariate statistical method and one of the most commonly used dimensionality reduction methods. Through orthogonal transformation, data on multiple indicators are converted into a small set of new, linearly uncorrelated comprehensive variables. PCA is helpful for analyzing hydrochemical data and can be used to examine the variation among compositions and among samples [44].
The water chemistry data of the 38 water samples were imported into the SIMCA 14.1 software for principal component analysis. The analysis results show that the eigenvalues of the first two principal components were greater than 1 (3.85 and 2.53 for the first and second principal components, respectively), and the cumulative contribution rate reached 91.1%, which means that two principal components can adequately reflect the hydrochemical information of the training samples [33]. Therefore, using the first and second principal components as the abscissa and ordinate, respectively, the PCA score plot (Figure 5) and PCA loading plot (Figure 6) of the three different water sources were obtained. The PCA score plots can explain the variation among sample sources, and loading plots can explain the variation among compositions.
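The eigenvalue criterion and the cumulative contribution rate can be reproduced with a short sketch. It assumes unit-variance scaling of the seven variables, so that the eigenvalues sum to seven; the scaling option actually used in SIMCA is not stated in the text, and the function `pca_eigen` and the synthetic data are hypothetical. Under that assumption the reported figures are mutually consistent, since (3.85 + 2.53) / 7 ≈ 0.911.

```python
import numpy as np

def pca_eigen(X):
    """Eigenvalues, explained fractions and scores for unit-variance-scaled data."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale each variable
    corr = np.corrcoef(Z, rowvar=False)                # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvals / eigvals.sum(), Z @ eigvecs

# Synthetic data (NOT Table 1): 38 samples, 7 variables driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(38, 2))
mixing = rng.normal(size=(2, 7))
X = latent @ mixing + 0.3 * rng.normal(size=(38, 7))

eigvals, explained, scores = pca_eigen(X)
n_keep = int(np.sum(eigvals > 1.0))        # components with eigenvalue > 1
cumulative = float(explained[:n_keep].sum())

# Consistency check on the figures quoted in the text
reported_cumulative = (3.85 + 2.53) / 7
```

Plotting the first two columns of `scores` against each other gives the kind of score plot discussed in the text.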
It can be seen from the PCA score plots (Figure 5) that the first principal component scores of the water samples of the fourth aquifer water, coal-bearing sandstone aquifer water, and limestone aquifer water ranged from -56.32 to -5.54, -47.56 to 8.22, and 2.00 to 50.60, respectively. The second principal component scores of the water samples of the fourth aquifer water, coal-bearing sandstone aquifer water, and limestone aquifer water ranged from -47.64 to -22.45, -3.00 to 56.67, and -23.24 to 0.20, respectively. Therefore, PCA can roughly divide water samples from three different water sources into three categories. However, it was impossible to distinguish the three water sources based on the first or second principal component alone.
It can be seen from the PCA loading plot (Figure 6) that HCO₃⁻, Na⁺+K⁺, and TDS were far from the origin, indicating that these three water chemical composition variables played a greater role in water source identification. The first principal component was mainly composed of TDS, SO₄²⁻, and Cl⁻, and the second principal component was mainly composed of HCO₃⁻, Na⁺+K⁺, and TDS.

PLS-DA Discriminant Model Establishment Based on Water Chemical Compositions for Mine Water Inrush Sources.
On the basis of principal component analysis, the PLS-DA method was used to further analyze the water chemistry data of different water sources to discover and screen out the characteristic water chemical components, and establish the PLS-DA discrimination model based on water chemical compositions for mine water inrush sources.

Determining Classification Variable Values.
The PLS-DA discriminant model for mine water inrush sources is a PLS-based regression model between the classification variables and the ion component content of the water samples. This paper used SIMCA 14.1 software to establish and analyze the PLS-DA model. Taking 38 water samples as the training set, first the classification variable values of the training set samples were assigned. The classification variable group Y was manually set according to the water sample category, as shown in Table 2. Then, the PLS method was used to perform regression analysis on the content of the seven ion components for the training set samples and the classification variable Y, and a model of the ion components and the classification variable Y was established.
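A compact sketch of this modeling step is given below: the class labels are dummy-coded into Y, a PLS2 regression of Y on the autoscaled composition data is fitted, and the 0.5 rule on the predicted value is applied. It is an illustrative reimplementation, not the SIMCA 14.1 algorithm: the SVD-based weight extraction, the function names `fit_plsda` and `predict_plsda`, and the synthetic data are assumptions, and the deviation check on D is omitted because its definition is not given in the text.

```python
import numpy as np

def fit_plsda(X, labels, n_classes, n_components):
    """Fit a PLS2 regression of one-hot class codes on autoscaled X."""
    X = np.asarray(X, dtype=float)
    Y = np.eye(n_classes)[labels]            # dummy coding, class 0 -> (1, 0, 0)
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean = Y.mean(axis=0)
    Xr, Yr = (X - x_mean) / x_std, Y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: dominant left singular vector of X'Y
        u, _, _ = np.linalg.svd(Xr.T @ Yr, full_matrices=False)
        w = u[:, 0]
        t = Xr @ w                           # X scores for this component
        tt = t @ t
        p = Xr.T @ t / tt                    # X loadings
        q = Yr.T @ t / tt                    # Y loadings
        Xr = Xr - np.outer(t, p)             # deflate X
        Yr = Yr - np.outer(t, q)             # deflate Y
        W.append(w)
        P.append(p)
        Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q).T
    coef = W @ np.linalg.inv(P.T @ W) @ Q.T  # maps scaled X to centered Y
    return {"x_mean": x_mean, "x_std": x_std, "y_mean": y_mean, "coef": coef}

def predict_plsda(model, X, threshold=0.5):
    """Predicted class codes and assigned class (-1 when no single class wins)."""
    Z = (np.asarray(X, dtype=float) - model["x_mean"]) / model["x_std"]
    y_pred = Z @ model["coef"] + model["y_mean"]
    above = y_pred > threshold
    # Assign a class only if exactly one predicted value exceeds the threshold
    assigned = np.where(above.sum(axis=1) == 1, above.argmax(axis=1), -1)
    return y_pred, assigned

# Synthetic data (NOT Table 1): three classes, seven variables
rng = np.random.default_rng(0)
centers = np.zeros((3, 7))
centers[1, 0] = 6.0
centers[2, 1] = 6.0
labels = np.repeat([0, 1, 2], 15)
X = centers[labels] + rng.normal(0, 0.5, size=(45, 7))

model = fit_plsda(X, labels, n_classes=3, n_components=2)
y_pred, assigned = predict_plsda(model, X)
accuracy = float(np.mean(assigned == labels))
```

Because each row of the dummy-coded Y sums to one, the predicted values for a sample also sum to one, which is why a value above 0.5 for one class implies values below 0.5 for the others.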

Determining the Number of Principal Components.
When modeling, the appropriate number of principal components must be determined. Generally speaking, increasing the number of principal components can extract more information, but using too many principal components will introduce some redundant information [40]. When selecting the number of principal components, therefore, the cumulative explanatory power (expressed by R²X(cum)) and the prediction accuracy of the model (expressed by the cumulative cross-validated Q²(cum)) should be considered [40,45]. Table 3 shows the relevant statistical results for models built with different numbers of principal components. When there were five principal components, the cumulative cross-validated value Q²(cum) began to decrease, so the prediction accuracy of the model decreased. Therefore, the appropriate number of principal components was four.
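The selection rule used here, keeping components until Q²(cum) first decreases, can be written as a small helper. The function name `choose_n_components` and the numbers in the example are hypothetical; they are not the values in Table 3 and only mimic a case where the fifth component lowers Q²(cum).

```python
def choose_n_components(q2_cum):
    """Number of components kept before Q2(cum) first decreases.

    q2_cum[k] is the cumulative cross-validated Q2 of a model with k + 1 components.
    """
    for a in range(1, len(q2_cum)):
        if q2_cum[a] < q2_cum[a - 1]:
            return a           # the model with a components was the last improvement
    return len(q2_cum)         # Q2(cum) never decreased

# Hypothetical Q2(cum) values for 1 to 5 components (NOT the values in Table 3)
q2_cum = [0.40, 0.65, 0.80, 0.85, 0.84]
n_components = choose_n_components(q2_cum)
```

Q² is commonly defined as one minus the ratio of the cross-validated prediction error sum of squares to the total sum of squares, so a drop in Q²(cum) means the added component fits noise rather than structure that generalizes.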

Analysis of Discriminant Model Results.
The model quality parameters were R²X(cum) = 0.979, R²Y(cum) = 0.889, and Q²(cum) = 0.848, indicating good model fit [46]. In the model space, the first and second principal component scores for the water samples are shown in Figure 7. Each point in the PLS-DA model score map represents a water sample, and the degree of aggregation reflects the similarity between them. The results of PLS-DA analysis were consistent with the results of PCA analysis. All data points were within the 95% confidence interval, and the water samples of the three aquifers had obvious clustering. However, the number of water samples in the fourth aquifer was relatively small, and the distribution dispersion was relatively large. At the first principal component t[1], the limestone aquifer water samples were easily distinguished from fourth aquifer water samples and coal-bearing sandstone aquifer water samples, but it was impossible to further accurately distinguish the water samples between the fourth aquifer and coal-bearing sandstone aquifer. At the second principal component t[2], the fourth aquifer samples were easily distinguished from coal-bearing sandstone aquifer water samples and limestone aquifer water samples, but it was impossible to further accurately distinguish between the sandstone and limestone samples. The schematic diagram of the PLS-DA model (Figure 8) showed that the three-dimensional space diagram could clearly distinguish the water samples from the three different sources. Figure 9 shows the loading scatter plot of the PLS-DA model, which clearly demonstrates the relationship between the characteristic variable X and the categorical variable Y, reflecting the contribution of each water chemical composition variable on the score plot. The blue dotted DA (100), DA (010), and DA (001) in Figure 9 represent the positions of the Y values for the three water source categories in the scatter plot, and each green point represents an ion variable.
The farther the point is from the origin, the greater the weight value, or the greater the effect of determining the sample difference [47]. It can be seen in Figure 9 that TDS, HCO₃⁻, and Na⁺+K⁺ were far from the origin, indicating that these three water chemical composition variables played a greater role in water source identification. On the first principal component t[1], the loading values of HCO₃⁻, SO₄²⁻, and Cl⁻ were larger, and on the second principal component t[2], the loading values of TDS, HCO₃⁻, and Na⁺+K⁺ were larger. Therefore, the first principal component mainly reflected the content characteristics of HCO₃⁻, SO₄²⁻, and Cl⁻ in different water sources, and the second principal component reflected the content characteristics of TDS, HCO₃⁻, and Na⁺+K⁺ in different water sources.
Compared with PCA, PLS-DA can quantify the differences between water sources caused by the characteristic chemical compositions of the water. To further analyze the effect of each water chemical composition variable X on the categorical variable Y, a VIP score plot was created (Figure 10). It summarizes the importance of the variables for explaining X and correlating to Y. VIP scores can quantify the contribution of each variable in the PLS-DA model to the classification. The larger the VIP value, the more obvious the difference of the variable in different water source categories. When the VIP value of a variable is greater than 1.0, it indicates a higher than average contribution of the variable to the overall model with a statistically significant impact on the water sample classification, and the variable can be used as a difference marker composition [47]. When the value is less than 0.5, it indicates that the variable is unimportant in the process of model classification and discrimination. The interval between 0.5 and 1 is a gray area, where the importance level depends on the size of the data set. Figure 10 shows that, among the explanatory water chemical composition variables X, two variables had VIP scores greater than 1, namely HCO₃⁻ and TDS, indicating that HCO₃⁻ and TDS played an important role in distinguishing the three different types of water sources. The VIP scores for Cl⁻, Na⁺+K⁺, and SO₄²⁻ were between 0.9 and 1.0, indicating that these three ion variables played roles in distinguishing the three different water source types to some degree. The VIP score for Mg²⁺ was the lowest among the seven hydrochemical components, indicating that Mg²⁺ played the least important role in discrimination. The statistical results of the PLS-DA discriminant model for 38 water samples are shown in Table 4 and Figure 11.
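The VIP calculation can be sketched with the commonly used formula, in which each variable's squared normalized weights are averaged over components with the Y variance explained by each component as the averaging weight. This is an assumed textbook form, not necessarily the exact SIMCA implementation; the function `vip_scores` and the random input matrices are hypothetical stand-ins for the scores T, weights W, and Y loadings Q of a fitted model.

```python
import numpy as np

def vip_scores(T, W, Q):
    """VIP score of each X variable.

    T: (n_samples, A) X scores; W: (n_vars, A) X weights; Q: (n_classes, A) Y loadings.
    """
    T, W, Q = (np.asarray(M, dtype=float) for M in (T, W, Q))
    n_vars = W.shape[0]
    # Y sum of squares explained by each component: (t't) * (q'q)
    ss = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)
    # Squared weights after normalizing each weight vector to unit length
    w_norm_sq = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(n_vars * (w_norm_sq @ ss) / ss.sum())

# Random stand-ins for a fitted model with 38 samples, 7 variables, 3 classes, 4 components
rng = np.random.default_rng(3)
T = rng.normal(size=(38, 4))
W = rng.normal(size=(7, 4))
Q = rng.normal(size=(3, 4))

vip = vip_scores(T, W, Q)
mean_squared_vip = float(np.mean(vip ** 2))
```

With this definition the squared VIP scores average exactly one across variables, which is why a score above 1.0 is read as an above-average contribution.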
Table 4 shows good correlations between the water sample hydrochemical composition variables and the categorical variables established by PLS regression. The correlation coefficients R between the actual values of the categorical variables and the predicted values of the model were 0.8876, 0.9608, and 0.9778, respectively. The root mean square error of estimation (RMSEE) measures the average fitting error of the model on the training set samples used to build it. The root mean square error of cross validation (RMSEcv) is an important parameter in internal cross validation and measures the accuracy of the predictions for the training set samples. The discriminant rate of all training set samples was 100%, indicating that the model fit well. Figure 11 shows regression curves for the PLS predicted values and actual values of the classification variables for all training samples. The straight lines are the regression curves of the model prediction and classification results. The three models clearly distinguished the three types of water source samples: the water sample points scattered along the line where the actual values were equal to 1 were clearly separated from the points of the other two water sources along the line where the actual values were equal to 0. The PLS-DA model established in this study had high reliability and can be used to test and discriminate new water samples.

PLS-DA Discriminant Model Validation for Mine Water Inrush Sources

4.4.1. Permutation Test.
Statistical inference was used to further validate the established PLS-DA discriminant model. Two hundred permutation tests were performed: in each test, the class labels Y were randomly permuted while the measured data X were kept unchanged, the model was rebuilt, and its R² and Q² values were recorded (Figure 12). The reliability of the model and the degree of overfitting were then judged from the intercepts, on the Y coordinate axis, of the regression lines fitted to the calculated R² and Q² values. The larger the value of Q², the better the predictive ability of the model, and the larger the value of R², the stronger the explanatory ability [48]. The permutation tests (Figure 12) showed that all R² and Q² values of the permuted models (Y-axis data, on the left) were lower than the R² and Q² values of the original models (on the far right). The intercepts of the Q² regression lines were all negative, indicating that, although the three PLS-DA discriminant models differed in predictability, none was overfitted and all had good predictive ability [49]. Therefore, they can be used for discriminant analysis of the various types of water sources.
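The logic of the permutation test can be sketched as follows. This is an illustration under stated assumptions, not the SIMCA procedure used in this study: the model is a one-component, single-response PLS fit, Q² is computed by leave-one-out cross validation, the data are synthetic, and all function names are hypothetical.

```python
import random

def fit_predict(X_train, y_train, X_new):
    """Fit a one-component PLS1 model and predict y for the rows of X_new."""
    n, p = len(X_train), len(X_train[0])
    x_mean = [sum(r[j] for r in X_train) / n for j in range(p)]
    y_mean = sum(y_train) / n
    E = [[r[j] - x_mean[j] for j in range(p)] for r in X_train]
    f = [v - y_mean for v in y_train]
    # Weight vector w is proportional to X'y; scores are t = Xw.
    w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5 or 1.0
    w = [v / norm for v in w]
    t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(f[i] * t[i] for i in range(n)) / (sum(v * v for v in t) or 1.0)
    return [y_mean + q * sum((r[j] - x_mean[j]) * w[j] for j in range(p))
            for r in X_new]

def r2_q2(X, y):
    """R2 from the full fit and Q2 from leave-one-out cross validation."""
    n = len(y)
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    fitted = fit_predict(X, y, X)
    ss_res = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    press = 0.0
    for i in range(n):
        pred = fit_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], [X[i]])[0]
        press += (y[i] - pred) ** 2
    return 1 - ss_res / ss_tot, 1 - press / ss_tot

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5

def intercept(xs, ys):
    """Intercept of the least-squares line ys ~ xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (v - my) for x, v in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx

# Synthetic two-class data: only the first variable separates the classes.
rng = random.Random(0)
y = [0.0] * 12 + [1.0] * 12
X = [[3.0 * c + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for c in y]

r2_real, q2_real = r2_q2(X, y)
xs, r2s, q2s = [1.0], [r2_real], [q2_real]  # the real model sits at x = 1
for _ in range(200):
    y_perm = y[:]
    rng.shuffle(y_perm)                      # scramble the class labels
    r2_p, q2_p = r2_q2(X, y_perm)            # rebuild the model
    xs.append(abs(correlation(y, y_perm)))
    r2s.append(r2_p)
    q2s.append(q2_p)

r2_intercept = intercept(xs, r2s)
q2_intercept = intercept(xs, q2s)
print(round(r2_real, 3), round(q2_real, 3),
      round(r2_intercept, 3), round(q2_intercept, 3))
```

Each permuted model is plotted against the absolute correlation between its shuffled labels and the original labels, so the intercept estimates the R² or Q² that would be obtained from labels carrying no real class information.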

External Validation.
The actual predictive ability of the model was further checked with an external validation set. The validation set was composed of nine water samples that were not involved in the modeling: two fourth aquifer water samples, three coal-bearing sandstone aquifer water samples, and four limestone aquifer water samples. The predicted values Yp of the categorical variables for the validation set water samples were calculated using SIMCA 14.1 software, and the prediction results were evaluated according to the rules described above. The results are shown in Table 5.
The accuracy of this model for the validation set water samples was 100%.
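How predicted dummy values are turned into class assignments and an accuracy figure can be sketched as below. The decision rule used in this study is defined earlier in the paper and is not reproduced here; the sketch assumes a common PLS-DA convention in which a sample is assigned to a class when exactly one predicted value exceeds 0.5. The predicted values are hypothetical and are not those of Table 5.

```python
CLASSES = [
    "fourth aquifer",
    "coal-bearing sandstone aquifer",
    "limestone aquifer",
]

def assign_class(y_pred, threshold=0.5):
    """Return the index of the single class whose predicted dummy value
    exceeds the threshold, or None if zero or several classes qualify.
    The 0.5 threshold is an assumed convention, not the paper's stated rule."""
    hits = [k for k, v in enumerate(y_pred) if v > threshold]
    return hits[0] if len(hits) == 1 else None

def accuracy(predictions, true_labels):
    """Fraction of samples assigned to their true class."""
    assigned = [assign_class(p) for p in predictions]
    correct = sum(a == t for a, t in zip(assigned, true_labels))
    return correct / len(true_labels)

# Hypothetical predicted dummy values for four samples (one row per sample,
# one column per class in the order of CLASSES).
y_pred = [
    [0.92, 0.05, 0.03],
    [0.10, 0.81, 0.09],
    [-0.04, 0.12, 0.92],
    [0.45, 0.40, 0.15],   # ambiguous: no value above the threshold
]
true_labels = [0, 1, 2, 0]

acc = accuracy(y_pred, true_labels)
print(acc)
```

Leaving ambiguous samples unassigned, rather than forcing them into the class with the largest value, makes uncertain identifications visible instead of counting them as confident ones.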

Discussion
This study combined the PLS-DA method with water chemical compositions to establish a discriminant model for identifying water inrush sources in mines, which effectively addressed the low discrimination accuracy that arises when the overlapping information between hydrochemical identification indexes is not considered. The water sample data were processed by the PCA and PLS-DA methods, and the results showed that both methods can classify water samples from the three water sources (as shown in Figures 5 and 7). However, with the two principal components extracted by the PCA method, it was impossible to distinguish any one of the three water sources based on the first or second principal component alone. In the PLS-DA results, the limestone aquifer water samples were easily distinguished from the fourth aquifer and coal-bearing sandstone aquifer water samples based on the first principal component, and the fourth aquifer water samples were easily distinguished from the coal-bearing sandstone aquifer and limestone aquifer water samples based on the second principal component. This showed that PLS-DA has better data processing and analysis capabilities than PCA. The reason is that PLS-DA is a supervised discriminant analysis method: it adds grouping variables, extracts more of the information in the water sample data, strengthens the differences in water chemical composition between water sources, and makes up for the deficiency of the PCA method [28].
In addition, compared with the PCA method, PLS-DA can quantify the degree of difference between water sources caused by the characteristic water chemical compositions. The loading scatter plots of PCA and PLS-DA both showed that TDS, HCO₃⁻, Na⁺+K⁺, Cl⁻, and SO₄²⁻ were the factors with a greater impact on the discrimination results of the three water sources (as shown in Figures 6 and 9), but the degree of influence of each factor could not be accurately determined from these plots. With the VIP scores of the PLS-DA method, the characteristic water chemical components that cause the differences between water sources can be screened out. The analysis of the VIP scores identified two difference marker compositions among the seven water chemical compositions, namely HCO₃⁻ and TDS, indicating that HCO₃⁻ and TDS were the main marker compositions that distinguished the fourth aquifer water, the coal-bearing sandstone aquifer water, and the limestone aquifer water in the study area [49]. The remaining water chemical components, in decreasing order of influence on the water source identification results, were Cl⁻, Na⁺+K⁺, SO₄²⁻, Ca²⁺, and Mg²⁺. Quickly determining the marker ionic components of each aquifer not only helps to identify mine water inrush sources accurately and quickly but also supports further research on the formation and evolution of aquifers. However, because the complexity of hydrogeological conditions differs from mine to mine, the difference marker ions will vary. Therefore, more ionic components should be tested, and factors such as water temperature, isotopes, and trace elements should also be considered in future studies to refine the discriminant model, so that it can be better used to identify the source of water inrush in mines [26]. Furthermore, as a supervised model, the PLS-DA model is prone to overfitting, in which case the model distinguishes the training samples well but performs poorly when used to predict new sample sets. Therefore, the reliability of a supervised classification model needs to be verified [40].
Figure 10: VIP score plot of PLS-DA.
In this study, only seven water chemical compositions were tested and used as identification indexes, in combination with the PLS-DA method, to establish a mine water inrush source discrimination model. A good discrimination effect was achieved, with discrimination rates for the training and validation sets as high as 100%, which indicated that the PLS-DA discriminant model for mine water inrush sources performed well in identifying water samples. Permutation testing was used to judge the reliability of the model. The permutation test randomly shuffles the class label of each sample and then rebuilds the model and predicts again. The Q² of a reliable model should be significantly greater than the Q² obtained from the randomly shuffled data. The results of the permutation test showed that the model was not overfitted and was reliable [46], which indicated that the established water source recognition model was successful.

Conclusions
Based on the hydrogeological conditions of the study area, the water chemical compositions of water inrush samples from three aquifers were tested. The water samples were screened using the hierarchical clustering analysis method, and some unqualified samples were removed. The PCA and PLS-DA methods were used to analyze and process the remaining water sample data. On the basis of the PCA analysis, a PLS-DA discriminant model for mine water inrush sources was established. According to the results, the following conclusions were obtained:
(1) Hierarchical clustering analysis was used to screen the 45 original water samples and eliminate seven unqualified samples to reduce errors, so that the remaining 38 samples represented the water chemical compositions of the aquifers well. The 38 samples were used to establish the discriminant model, which avoided the influence of abnormal water samples.
(2) Correlation analysis was carried out on the water chemical compositions of the water samples from the three aquifers. The results showed strong correlations between some water chemical compositions, indicating that the hydrogeological information they reflect overlaps significantly. This overlap would cause information redundancy, which could lead to multicollinearity.
(3) The PCA and PLS-DA methods were used to analyze and process the remaining water sample data, and the results showed that both methods can distinguish water samples from different water sources; however, the classification effect of PLS-DA was better than that of PCA. The reason is that PLS-DA is a supervised discriminant analysis method: it adds grouping variables, extracts more of the information in the water sample data, strengthens the differences in water chemical composition between water sources, and makes up for the deficiency of the PCA method.
(4) The PLS-DA discriminant model for mine water inrush sources was established. The correct discrimination rate of the model was as high as 100%, and permutation tests showed that the model was not overfitted. External validation showed that the model had good stability and discrimination.
(5) PLS-DA can quantify the degree of difference between water sources caused by the characteristic water chemical compositions. VIP scores identified HCO₃⁻ and TDS as the most important difference marker compositions affecting the discrimination results of the three water source types, while Mg²⁺ had little effect in distinguishing them.
(6) The discriminant model established in this study combined the advantages of principal component analysis and multiple regression analysis and had a high discrimination accuracy. Thus, it can meet the needs of modern mine water inrush source identification and can be applied to other mines as well.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest.