Identification of Mine Water Inrush Source Based on PCA-FDA: Xiandewang Coal Mine Case

Key Laboratory of Karst Georesources and Environment, Ministry of Education, Guizhou University, Guiyang, Guizhou 550025, China
Key Laboratory of Karst Environment and Geological Disaster Prevention and Control of Guizhou Province, Guizhou University, Guiyang, Guizhou 550025, China
National Engineering Research Center of Coal Mine Water Hazard Controlling, China University of Mining and Technology (Beijing), Beijing 100083, China


Introduction
Coal is a primary energy source. Accidents of many kinds occur during coal mining, and water inrush is one of the factors behind serious coal mine accidents [1][2][3]. Many water inrush cases show that, when an inrush occurs, identifying its source promptly and accurately makes it possible to find the cause quickly and respond in time, which is important for preventing and controlling water inrush disasters in coal mines [4][5][6][7]. Groundwater chemistry, isotopes, water temperature, water level, and other indicators are commonly used to identify the water inrush source, and experience has shown that water chemistry analysis is among the more effective of these methods [8][9][10]. The method rests on the fact that groundwater from different aquifers has a different chemical composition. The components that can be used to distinguish the groundwater of different aquifers are called "standard components"; the most widely used are conventional components such as Na⁺, K⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻, alkalinity, acidity, hardness, TDS, and pH. Based on analyses of aquifer water chemistry, researchers have proposed various methods for water source identification. Representative examples are as follows: [11] used the fuzzy comprehensive evaluation method and the cluster analysis method to identify the water inrush sources of the Mindong No. 1 mine; [12] used the maximum likelihood method to identify the potential water sources of the Sanshandao Gold Mine based on hydrogeochemical and isotopic analyses; [13] used the distance discrimination method to identify mine inrush water sources and verified the results through the grey relational analysis method; and Liu et al. (2015) established a water source identification model based on BP neural network theory and used randomly selected water samples collected during mine excavation to predict their water sources.
These methods have enriched water inrush source identification technology, but they do not take into account the information superposition (overlap) between hydrochemical identification indicators, which causes problems such as low classification precision and long response time. To solve these problems, this paper introduces the principal component analysis (PCA) method into water inrush source identification: the hydrochemical indicator data of different water sources are refined, multiple correlated indicator variables are converted into a new set of independent variables by linear combination, and the effects of information superposition between indicators are eliminated so that the characteristics of different water sources can be described more effectively. On this basis, the Fisher discrimination analysis (FDA) method is combined with PCA to establish a water source identification model. The model was used to identify the water inrush source of a typical coal mine, and the identification results were good.

Methods of Mine Water Inrush Source Identification
2.1. Principal Component Analysis (PCA). PCA is a statistical method that converts a set of potentially correlated variables into a new set of linearly uncorrelated variables by means of an orthogonal transformation. The new variables are known as principal components, and together they preserve the information carried by the original variables. PCA-based data processing effectively removes the correlation in high-dimensional data, reduces the data dimension, and simplifies the data structure [14,15]. The mathematical model of PCA is as follows. The $p$ variables $(X_1, X_2, \cdots, X_p)$ of a raw data matrix $X$ form linear combinations denoted as $Y = AX$, namely,

$$Y_i = a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p, \quad i = 1, 2, \cdots, p, \tag{1}$$

where $a_{i1}^2 + a_{i2}^2 + \cdots + a_{ip}^2 = 1$; $Y_i$ is uncorrelated with $Y_j$ ($i \neq j$; $i, j = 1, 2, \cdots, p$); $Y_1$ has the maximum variance among all such linear combinations of $X_1, X_2, \cdots, X_p$; $Y_2$ has the maximum variance among all linear combinations of $X_1, X_2, \cdots, X_p$ that are uncorrelated with $Y_1$; $Y_p$ has the maximum variance among all linear combinations that are uncorrelated with $Y_1, Y_2, \cdots, Y_{p-1}$; and the sum of the variances of $Y_1, Y_2, \cdots, Y_p$ equals the sum of the variances of $X_1, X_2, \cdots, X_p$.
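The PCA model above can be illustrated with a short numerical sketch. It assumes NumPy; the function name `principal_components`, the synthetic matrix `X_demo`, and the 80% default threshold are illustrative choices rather than the authors' implementation. The sketch builds the conversion matrix from the unit eigenvectors of the covariance matrix and keeps the smallest number of components whose cumulative variance contribution reaches the threshold.

```python
import numpy as np

def principal_components(X, threshold=0.80):
    """X has shape (samples, p). Returns A, eigenvalues, scores Y, and m."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    cov = np.cov(Xc, rowvar=False)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-rank: lambda_1 >= ... >= lambda_p
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    A = eigvecs.T                            # row i of A is the i-th unit eigenvector
    Y = Xc @ A.T                             # Y = A X applied to every sample
    cum_rate = np.cumsum(eigvals) / eigvals.sum()
    # smallest m whose cumulative contribution rate reaches the threshold
    m = int(np.searchsorted(cum_rate, threshold) + 1)
    return A, eigvals, Y, m

# Synthetic, deliberately correlated data (hypothetical, not the paper's samples).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X_demo = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * rng.normal(size=50),
    base[:, 1],
    base[:, 1] - base[:, 0] + 0.1 * rng.normal(size=50),
])
A, eigvals, Y, m = principal_components(X_demo)
```

In this sketch each row of `A` has unit length, the columns of `Y` are mutually uncorrelated with variances equal to the eigenvalues, and the eigenvalues sum to the total variance of the original variables, which mirrors the properties listed for the model.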
The principal components are generally computed as follows:
(1) The original variable data are normalized, and the covariance matrix $\Sigma$ of all variables is calculated.
(2) The eigenvalues of the covariance matrix are ranked as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$, and the corresponding unit eigenvectors are $T_1, T_2, \cdots, T_p$. With the conversion matrix $A = T'$, row $i$ of $A$ is the $i$-th eigenvector $T_i$ of $\Sigma$, and the variance of the $i$-th principal component $Y_i$ is the $i$-th eigenvalue $\lambda_i$ of $\Sigma$.
(3) The variance contribution rate of the $k$-th principal component $Y_k$ is $\eta_k = \lambda_k / \sum_{i=1}^{p} \lambda_i$. If $m$ ($m < p$) principal components are adopted, the cumulative variance contribution rate of $Y_1, Y_2, \cdots, Y_m$ is $\xi_m = \sum_{k=1}^{m} \lambda_k / \sum_{k=1}^{p} \lambda_k$.
(4) The number of principal components is generally determined from the cumulative variance contribution rate. A cumulative rate of at least 80% usually indicates that the first $m$ principal components contain most of the information in the original samples.
2.2. Fisher Discrimination Analysis (FDA). FDA is a multivariate statistical analysis method that uses the characteristic values of a research object to identify its type. Its basic idea is as follows: multidimensional data are reduced in dimension through a projection so as to simplify the problem, and a discrimination function is determined on the principle of maximum between-class distance and minimum within-class distance [16][17][18][19]. A mathematical model of FDA is expressed as follows.
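It is assumed that there are $n$ populations $G_1, G_2, \cdots, G_n$, whose mean vectors and covariance matrices are $U^{(1)}, U^{(2)}, \cdots, U^{(n)}$ and $V^{(1)}, V^{(2)}, \cdots, V^{(n)}$, respectively. Suppose that a sample of size $m_i$ is taken from population $G_i$ ($i = 1, 2, \cdots, n$), that is,

$$X_a^{(i)} = \left(x_{a1}^{(i)}, x_{a2}^{(i)}, \cdots, x_{ap}^{(i)}\right)^T, \quad a = 1, 2, \cdots, m_i. \tag{2}$$

Then $U^T X_a^{(i)} = u_1 x_{a1}^{(i)} + u_2 x_{a2}^{(i)} + \cdots + u_p x_{ap}^{(i)}$ is the projection of sample $X_a^{(i)}$ on the axis defined by the projection vector $U = (u_1, u_2, \cdots, u_p)^T$ (not to be confused with the population mean vectors $U^{(i)}$), and the projected means can be denoted as follows:

$$U^T \bar{X}^{(i)} = \frac{1}{m_i} \sum_{a=1}^{m_i} U^T X_a^{(i)}, \qquad U^T \bar{X} = \frac{1}{m} \sum_{i=1}^{n} \sum_{a=1}^{m_i} U^T X_a^{(i)}, \quad m = \sum_{i=1}^{n} m_i, \tag{3}$$

where $\bar{X}^{(i)}$ and $\bar{X}$ are the mean values of the samples taken from $G_i$ and of the total sample, respectively. The intragroup deviation $e$ of the samples is

$$e = \sum_{i=1}^{n} \sum_{a=1}^{m_i} \left(U^T X_a^{(i)} - U^T \bar{X}^{(i)}\right)^2 = U^T \left(\sum_{i=1}^{n} S_i\right) U = U^T W U, \tag{4}$$

where $S_i = \sum_{a=1}^{m_i} \left(X_a^{(i)} - \bar{X}^{(i)}\right)\left(X_a^{(i)} - \bar{X}^{(i)}\right)^T$ is the scatter matrix of the $m_i$ samples taken from $G_i$, and $W = \sum_{i=1}^{n} S_i$ is the "total within-class scatter" matrix. The between-group deviation $b$ is

$$b = \sum_{i=1}^{n} m_i \left(U^T \bar{X}^{(i)} - U^T \bar{X}\right)^2 = U^T B U, \tag{5}$$

where $B = \sum_{i=1}^{n} m_i \left(\bar{X}^{(i)} - \bar{X}\right)\left(\bar{X}^{(i)} - \bar{X}\right)^T$ is the "between-class scatter" matrix of the sample. For the discrimination function to separate the groups of the total sample as well as possible, the ratio

$$\phi = \frac{U^T B U}{U^T W U} \tag{6}$$

should be maximized. Because $\phi$ does not change when $U$ is rescaled, it can be maximized subject to $U^T W U = 1$. If $M = U^T B U - \lambda\left(U^T W U - 1\right)$ and its partial derivative with respect to $U$ is set to zero, then

$$\frac{\partial M}{\partial U} = 2\left(B - \lambda W\right) U = 0. \tag{7}$$

In Equation (7), $\lambda$ is an eigenvalue. Through simplification (left-multiplying by $W^{-1}$, with $W$ assumed invertible), the following equation is acquired:

$$\left(W^{-1} B - \lambda I\right) U = 0. \tag{8}$$

In Equation (8), $U$ is the eigenvector corresponding to the maximum eigenvalue $\lambda$, and $I$ is the identity matrix; $\lambda$ equals the ratio $\phi$ of the between-group to the within-group sum of squared deviations along $U$. As can be observed from Equation (8), both the maximum eigenvalue and the eigenvector $U$ of $W^{-1} B$ can be obtained, thereby determining the discrimination function.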
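The eigenvalue problem for $W^{-1}B$ can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic three-group data; the helper names `scatter_matrices`, `fisher_directions`, and `phi` are hypothetical and are not taken from the paper. It builds the within-class and between-class scatter matrices, solves for the eigenvectors of $W^{-1}B$, and evaluates the ratio of between-group to within-group deviation along a direction.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class (W) and between-class (B) scatter matrices."""
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        mean_g = Xg.mean(axis=0)
        D = Xg - mean_g
        W += D.T @ D                          # scatter of group g around its mean
        d = (mean_g - grand_mean).reshape(-1, 1)
        B += len(Xg) * (d @ d.T)              # group mean around the overall mean
    return W, B

def fisher_directions(W, B):
    """Eigenvalues and eigenvectors of W^{-1} B, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    eigvals, eigvecs = eigvals.real, eigvecs.real   # drop round-off imaginary parts
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def phi(U, W, B):
    """Ratio of between-group to within-group deviation along direction U."""
    return float(U @ B @ U) / float(U @ W @ U)

# Synthetic data: three groups of ten samples in three dimensions (hypothetical).
rng = np.random.default_rng(0)
group_means = np.array([[0.0, 0.0, 0.0], [3.0, 1.0, 0.0], [1.0, 4.0, 2.0]])
X_demo = np.vstack([c + rng.normal(size=(10, 3)) for c in group_means])
y_demo = np.repeat([0, 1, 2], 10)

W, B = scatter_matrices(X_demo, y_demo)
lam, U = fisher_directions(W, B)
U1 = U[:, 0]                # first discriminant direction
scores = X_demo @ U1        # value of the first discriminant function per sample
```

In this sketch the largest eigenvalue `lam[0]` coincides with `phi(U1, W, B)`, and no other direction gives a larger ratio, which is the sense in which the leading eigenvector defines the first discrimination function.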

Procedures of Mine Water Inrush Source Identification Based on PCA-FDA
The procedure for using the PCA and FDA methods to identify the source of mine water inrush is shown in Figure S1. The discriminating process is as follows:
(1) Water sample data are normalized.
(2) A correlation matrix of the normalized data is calculated to analyze and identify the correlation between variables.
Figure 1. Being high in the west and low in the east, it is undulating in terrain and has maximum and minimum elevations of +339.6 m and +194.10 m, respectively. The region has a semiarid, warm temperate continental monsoon climate, and precipitation falls mainly from July to September each year. Over the past 10 years, annual precipitation has ranged between 351.5 mm and 800 mm, with an average of 507.74 mm. Its geographical coordinates are 114°11′15″~114°15′00″E and 36°48′45″~36°55′00″N.
3.1.2. Hydrogeology. According to hydrogeological prospecting data of the mining area, the aquifers that threaten safe mining include an Ordovician limestone karst fractured aquifer, a Permian sandstone fractured aquifer, and the Daqing and Yeqing Carboniferous limestone karst fractured aquifers. Because the rock strata differ in chemical composition, structure, lithological association, and fracture development, their water-bearing characteristics and water yield properties also differ significantly. Once a fault is encountered or the roof or floor strata are damaged during coal mining, groundwater from an aquifer is likely to burst into the mine and cause water inrush. For this reason, investigating mine water inrush source identification for the Xiandewang coal mine is beneficial for water inrush control there.

Indicators for Source Identification
Because each aquifer has a diverse hydrochemical composition, it is infeasible to adopt every chemical constituent of the water as an indicator for source identification. In routine groundwater testing at coal mines, a brief water quality analysis method is generally used. To be specific, the aquifers consist of an Ordovician limestone karst fractured aquifer (I), a Permian sandstone fractured aquifer (II), a Carboniferous Daqing limestone karst fractured aquifer (III), and a Carboniferous Yeqing limestone karst fractured aquifer (IV). The sample data are given in Table 1.
The data in Table 1 were first normalized: each normalized value equals the difference between the actual value and the minimum value divided by the difference between the maximum and minimum values. Table 2 presents the normalized data. The normalized data were then processed based on PCA, and the correlation coefficient matrix for the hydrochemical constituents of the water sources is shown in Table 3. Table 3 shows that the 8 hydrochemical constituents are clearly correlated with each other. For example, the correlation coefficient between Ca²⁺ and total hardness is 0.983, and that between Na⁺ + K⁺ and Cl⁻ is 0.919. The information carried by the sample indicators therefore overlaps significantly, which inevitably affects the accuracy of a mine water inrush source identification model built directly on these 8 indicators, so misjudgments may be made. The training sample data were therefore processed with the PCA mathematical model described above, which yields a cumulative contribution rate diagram of all indicators. Figure 2 shows that the first 4 principal components contain 97.37% of the information in the raw data and can thus effectively summarize it.
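The normalization rule and the correlation check described above can be sketched in a few lines. This is an illustrative example assuming NumPy; the array `raw` holds hypothetical values for three indicators and is not the data of Table 1, and `min_max_normalize` is a hypothetical helper name. It also assumes that no indicator is constant across samples, since the denominator would then be zero.

```python
import numpy as np

def min_max_normalize(data):
    """(value - column minimum) / (column maximum - column minimum)."""
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min)

# Hypothetical samples (rows) by indicators (columns); NOT the values of Table 1.
raw = np.array([
    [60.0, 250.0, 20.0],
    [85.0, 330.0, 35.0],
    [40.0, 180.0, 90.0],
    [120.0, 460.0, 15.0],
    [75.0, 300.0, 55.0],
])

normalized = min_max_normalize(raw)
# Correlation coefficient matrix between indicators (columns are variables).
corr = np.corrcoef(normalized, rowvar=False)
```

Min-max normalization rescales each indicator to the range from 0 to 1 without changing the correlation coefficients between indicators, so a strong correlation in the normalized data reflects genuine overlap in the raw indicators.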
In order to reconstruct the new characteristics of the original data, the eight indicators (A 1 , A 2 , A 3 , A 4 , A 5 , A 6 , A 7 , and A 8 ) that had a certain correlation were recombined into a new set of independent indicators (Y 1 , Y 2 , Y 3 , and Y 4 ) to replace the original indicators; according to the PCA calculation, a PCA matrix can be obtained, as shown in Table 4.
Each value in the principal component analysis matrix in Table 4

Geofluids
could be classified into 4 types according to the aquifer of origin. On the assumption that the within-group covariance matrices are equal, the coefficients of the discriminant functions are calculated according to formula (6) of the FDA principle described above. The coefficients are chosen to maximize the distance between the types and minimize the distance within each type. The following discriminant functions are obtained:
Central values of the 3 discrimination functions in the 4 groups are presented in Table 5. Taking the first discriminant function as an example, its central values for the type I water source (Ordovician limestone water), type II water source (Permian sandstone water), type III water source (Carboniferous Daqing limestone water), and type IV water source (Carboniferous Yeqing limestone water) are 3.164, -2.261, 1.406, and -3.248, respectively. The three discriminant functions give the coordinates of each water sample in each dimension. The distances between the water sample to be judged and the central values of the four water source groups are then compared, and the sample is assigned to the group whose center is nearest.
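The distance comparison can be illustrated with a minimal sketch, assuming NumPy. Only the central values of the first discriminant function are quoted in the text, so the example works in one dimension; the paper's model uses three functions, whose remaining central values are in Table 5 and are not reproduced here. The function name `nearest_group` and the sample scores are hypothetical.

```python
import numpy as np

# Central values of the first discriminant function, as quoted in the text.
# With all three functions, each entry would be a length-3 vector (see Table 5).
centers = {
    "I": np.array([3.164]),     # Ordovician limestone water
    "II": np.array([-2.261]),   # Permian sandstone water
    "III": np.array([1.406]),   # Carboniferous Daqing limestone water
    "IV": np.array([-3.248]),   # Carboniferous Yeqing limestone water
}

def nearest_group(score):
    """Return the group whose central value is closest to the sample's score."""
    score = np.atleast_1d(np.asarray(score, dtype=float))
    return min(centers, key=lambda g: np.linalg.norm(score - centers[g]))

# Hypothetical discriminant scores of four samples to be judged.
predicted = [nearest_group(s) for s in (3.0, -2.0, 1.0, -3.0)]
```

Because groups II and IV have close central values on the first function (-2.261 and -3.248), a single function separates them only weakly, which is one reason the remaining discriminant functions are needed in practice.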

Validity Check for Source Identification.
To check the validity of the PCA and FDA-based water source identification model, all training samples given in Table 1 were substituted into the established model one by one for reverse identification. The results are presented in Table 6. Water samples 4 and 11 were misidentified, giving an identification accuracy of 12/14 = 85.7%. With the conventional Fisher water source identification model, water samples 3, 4, and 12 were misidentified, giving an accuracy of only 11/14 = 78.6%. The comparison indicates that the PCA and FDA-based model is the more reliable of the two and better meets the source identification requirements of mine water inrush.
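The accuracy figures follow directly from the counts of misidentified training samples. A small sketch of the arithmetic is given below; the helper name `accuracy_percent` is hypothetical, and the sample numbers are those quoted above.

```python
def accuracy_percent(n_samples, misidentified):
    """Share of correctly identified samples, in percent."""
    return 100.0 * (n_samples - len(misidentified)) / n_samples

# Reverse identification of the 14 training samples.
pca_fda_train = accuracy_percent(14, [4, 11])    # 12 of 14 correct
fda_train = accuracy_percent(14, [3, 4, 12])     # 11 of 14 correct
gain_points = pca_fda_train - fda_train          # difference in percentage points

print(f"PCA-FDA: {pca_fda_train:.1f}%, FDA: {fda_train:.1f}%")
```

With only 14 training samples, one additional correct identification changes the accuracy by about 7 percentage points, so the gap between the two models on the training set corresponds to a single sample.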
To further check the accuracy of the model, the verified PCA and FDA-based water source identification model was used to identify 12 samples taken from the Xiandewang coal mine. The samples and the identification results are shown in Table 7. Except for water sample 4, which was misidentified, the identification results agree with the actual sources, giving an identification accuracy of 11/12 = 91.7%. Therefore, such a model can be used to

Conclusion
According to the characteristics of the mine water inrush sources, the PCA-FDA water source identification model and the conventional FDA-only model were each used to identify the water inrush source of 14 groups of training samples and 12 groups of samples to be judged. The identification accuracies were 85.7% (PCA-FDA) and 78.6% (FDA) for the training samples, and 91.7% (PCA-FDA) and 75% (FDA) for the samples to be judged. The results show that processing the data with the PCA method raises the identification accuracy of the mine water inrush source by 7.1 and 16.7 percentage points, respectively, compared with using the FDA method only.
In the process of water source identification, the eight indicators in the original data were reduced to four principal components. The PCA method projects high-dimensional data into a low-dimensional space, completing the dimensionality reduction of the data, and characterizes the water chemistry of each water source accurately with fewer, independent identification indicators. This greatly reduces the number of input factors when the identification model is built and eliminates the effects of information superposition between identification indicators. The FDA method gathers samples of the same kind together and separates samples of different kinds, thereby accomplishing the identification of the mine water inrush source. Combining the two methods yields an accurate water source identification model built on a minimum of characteristic information; it simplifies the data structure, shortens analysis time, and improves analysis precision, which makes it a more effective method for identifying water inrush sources.
In this case, a few misjudgments occurred when the water sample data were substituted back into the identification model, because the number of training samples was relatively small. To increase the prediction accuracy of the model, subsequent research should collect hydrochemical data in large quantities, establish a comprehensive database of mine hydrochemistry, and train the model further so that its identification accuracy improves.

Data Availability
All data included in this study are available upon request by contact with the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.