A Fast Independent Component Analysis Algorithm for Geochemical Anomaly Detection and Its Application to Soil Geochemistry Data Processing

A fast independent component analysis algorithm (FICAA) is introduced to process geochemical data for anomaly detection. In geochemical data processing, the geological significance of the separated geochemical elements must be explicit. Correlation coefficients are therefore used to overcome the indeterminacy in the order of the signals decomposed by the FICAA, so that each decomposed signal can be assigned to the correct element. The indeterminacy in the scaling of the decomposed signals is addressed with the cumulative frequency method (CFM). To classify surface geochemical samples into true anomalies and false anomalies, assays of the 1 : 10 000 soil geochemical data in the area of Dachaidan in the Qinghai province of China are processed. The CFM and FICAA are used to detect the anomalies of Cu and Au. The results demonstrate that the FICAA can demultiplex the mixed signals and achieve results similar to actual mineralization when 85%, 95%, and 98% are chosen as three levels of anomaly delineation. In contrast, the traditional CFM applied alone failed to produce realistic results and had no significant use as a prospecting indicator. It is shown that application of the FICAA to geochemical data processing is effective.


Introduction
An ore-forming system in the crust of the Earth is a highly nonlinear system, which commonly involves coupled processes between material deformation, pore-fluid flow, heat transfer, mass transport, and chemical reactions [1][2][3][4][5][6][7][8][9]. Because these processes can be described mathematically by a set of partial differential equations [1][2][3][10][11][12][13], it is important to obtain theoretical and numerical solutions using both mathematical and computational methods in the field of applied mathematics so that the distributions of mineral resources in the upper crust of the Earth can be better predicted. For this reason, an emerging discipline known as computational geoscience [14,15] has been established in recent years.
For quantitative prediction of mineral resources, geochemical data are an important information source. Researchers need the data to understand and simulate the dynamic mechanisms of ore-forming systems. Extracting prospecting information from geochemical element data is a central purpose of metallogenic prediction [16]. The geochemical element data are obtained from the field by sampling and analysis. Interference is introduced during this process, so suppressing the interference between geochemical elements is the key to extracting hidden anomaly information that may reflect the spatial distribution characteristics of the elements [17][18][19]. Common data analysis methods are usually based on low-order statistical properties and do not consider the higher-order statistical characteristics of the data [20,21]. Most researchers believe that, because samples do not obey the normal distribution, the existing methods cannot be directly applied to raw data analysis [22]. Moreover, due to the complexity of geological information, traditional data mining cannot reflect the spatial distribution characteristics of mineral resources because the dynamic mechanisms that control the formation of these mineral resources are completely neglected [16][17][18][19][20][21][22].
To better understand the geological information associated with geochemical data, many researchers study the spatial distribution characteristics of the elements and the physical and chemical properties of the exposed geologic body [23]. As a result, several quantitative geochemical methods, such as spatial statistics, the fractal technique, discriminant analysis, and fuzzy clustering [24][25][26][27][28], have been developed over the past few decades. These methods are associated with the density frequency of the element and are based on the perceptions of mineralization, which may lead to false anomalies that have nothing to do with mineralization [29]. In fact, geochemical exploration data processing methods should only consider data cardinalities [30]. For example, in the independent component analysis (ICA) method, higher-order statistical features of data are taken into account [31]. The method isolates the independent source signals that are implicit in the mixed signal under "blind" conditions. In previous studies, the ICA method was mainly used for mineral predictions and is still at a fledgling stage in geochemical applications [32].
Although the mathematical analysis and computational simulation methods associated with applied mathematics have been widely used to produce analytical solutions [1][2][3][33][34][35][36] and numerical results for understanding the dynamic mechanisms and processes of ore-forming systems [4][5][6][7][8][9], this study attempts to utilize the FICAA for detecting anomalies in geochemical data. For example, Zhao and his coworkers have successfully solved a set of partial differential equations for the convective pore-fluid problems that are closely related to hydrothermal ore-forming systems [1-3, 33, 34]. They also solved a different set of partial differential equations for the chemical dissolution front instability problems that are present in many ore-forming systems within the upper crust of the Earth [37][38][39][40][41][42][43][44][45]. According to the FICAA and the characteristics of the geochemical data, processing raw geochemical data with the FICAA can solve the sequence indeterminacy problem of separated results caused by the FICAA by calculating the correlation coefficient matrix of raw data and separating the independent components. In addition, the cumulative frequency method is used to solve the indeterminacy problem in scaling by direct comparison to the resultant anomalies. When a late prospecting trench validation is conducted, the data (after FICAA processing) can be used to better reflect the spatial distribution characteristics of the elements. This can provide useful geological information for understanding and simulating the dynamic mechanisms of ore-forming systems [14,15]. This paper is organized as follows. The principles of the FICAA and the correlation coefficient algorithm are introduced in Section 2. A brief introduction of the geological background of the area around Dachaidan in the Qinghai province of China is given in Section 3. The FICAA is applied to the 1 : 10 000 soil geochemical data of this area in Section 4.
The results are discussed and conclusions are stated in Sections 5 and 6, respectively.

Algorithm Principle
The ICA is a technique that, based on the hypothesis of statistical independence, analyzes data from the perspective of higher-order statistical correlation [46]. Starting from a set of random initial vectors, it approximates the independent source signals implicit in the mixed signals by "decomposing" the mixed signals. Each mixed signal is a linear combination of the original signals (Figure 1). The basic mathematical model of the ICA is as follows:

$$X = AS, \tag{1}$$

where $X$ denotes the observed mixed signals, the matrix $A$ represents the mixing matrix of the system, and the vector $S$ represents the unknown source signals, which are assumed to be statistically independent.
When the source signals $S$ and the mixing matrix $A$ are both unknown, only $X$ can be observed. The ICA algorithm is designed to determine a matrix $W$ such that

$$Y = WX, \tag{2}$$

where $Y$ is an optimal estimate of the source signals $S$. The linear solution of the ICA model can be obtained with (1) and (2). Consider

$$Y = WX = WAS, \tag{3}$$

where $WA = I$ is an $n \times n$ identity matrix.
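As a rough illustration of the mixing model and its linear solution, the following sketch builds two synthetic sources, mixes them with a known matrix, and recovers them with the ideal unmixing matrix. The sources, the matrix values, and all variable names are assumptions chosen for illustration only; in practice the mixing matrix is unknown and the unmixing matrix has to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sources: two independent, non-Gaussian signals (rows of S).
n_samples = 1000
S = np.vstack([
    rng.uniform(-1.0, 1.0, n_samples),  # sub-Gaussian source
    rng.laplace(0.0, 1.0, n_samples),   # super-Gaussian source
])

# Hypothetical mixing matrix A (unknown in a real survey).
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

X = A @ S             # observed mixed signals, X = AS
W = np.linalg.inv(A)  # ideal unmixing matrix, so that WA = I
Y = W @ X             # estimate of the sources, Y = WX = WAS

# Largest absolute difference between the estimate and the true sources.
recovery_error = float(np.max(np.abs(Y - S)))
```

Here the unmixing matrix is simply the inverse of the known mixing matrix, which is only possible in a synthetic setting; the ICA estimates it from the mixed signals alone.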
In the late 1980s, Jutten and Herault proposed the concept of the ICA [47]. In 1994, Comon extended principal component analysis, which is based on data processing and compression, to independent component analysis and proposed an ICA based on the minimization of mutual information [48]. In 2001, Hyvärinen et al. proposed a fixed-point algorithm for blind signal separation, also called the fast ICA (FICA). In 2012, Yu et al. applied the FICA to mineral prediction [32]. However, both the dynamic mechanisms and the processes of ore-forming systems were completely neglected in their studies [32,47,48]. This is the main shortcoming of the existing method, which treats geochemical data with statistical mathematics rather than simulating the dynamic mechanisms and the processes of ore-forming systems using applied mathematics [49,50].

2.2. The Data Requirements of the ICA Algorithm. First, the ICA requires that each source signal be a zero-mean random signal and that the source signals be statistically independent at any moment, although geochemical data are usually spatially correlated. Thus, the results obtained with the ICA may vary because of this spatial correlation. Second, it requires an equal number of source signals and mixed signals (namely, it requires that the relationship among geochemical elements be a linear combination). Third, it requires that, at most, one source signal obey the Gaussian distribution. In practice, the influence of "noise" is usually not considered.
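Two of these requirements can be checked numerically before running the ICA. The sketch below centers each signal to zero mean and uses the sample excess kurtosis as a crude indicator of how many signals look Gaussian. The synthetic data, the function names, and the 0.3 kurtosis threshold are illustrative assumptions, not values taken from this study.

```python
import numpy as np

def center(X):
    """Subtract the mean of each row (one row per signal)."""
    return X - X.mean(axis=1, keepdims=True)

def excess_kurtosis(X):
    """Sample excess kurtosis of each row; close to 0 for Gaussian data."""
    Xc = center(X)
    var = (Xc ** 2).mean(axis=1)
    return (Xc ** 4).mean(axis=1) / var ** 2 - 3.0

rng = np.random.default_rng(1)
n = 20000
# Hypothetical signals: one Gaussian, one uniform, one Laplace.
X = np.vstack([
    rng.normal(5.0, 2.0, n),
    rng.uniform(0.0, 1.0, n),
    rng.laplace(0.0, 1.0, n),
])

Xc = center(X)
kurt = excess_kurtosis(X)

# Assumed rule of thumb: |excess kurtosis| < 0.3 counts as "Gaussian-like".
n_gaussian_like = int(np.sum(np.abs(kurt) < 0.3))
requirement_met = n_gaussian_like <= 1
```

Excess kurtosis is only one possible indicator; a formal normality test gives a more defensible answer on real data.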

The FICAA.
The rationale of the FICAA is to determine a target function by maximizing negentropy and to obtain the optimal value of the target function by using Newton's iterative method.
Negentropy can be used to measure non-Gaussianity. The negentropy of a random variable $y$ is given by

$$N_g(y) = H(y_{\mathrm{gauss}}) - H(y), \tag{4}$$

where $H(\cdot)$ is the entropy function and $y_{\mathrm{gauss}}$ and $y$ are the Gaussian variable and the random variable, respectively, with the same mean and normalized variance. If negentropy is zero, then $y$ obeys the Gaussian distribution. If $y$ has a non-Gaussian distribution, negentropy must exceed zero.
Non-Gaussianity reaches its maximum when negentropy is maximized, and at that point the optimal estimate of the source signals is obtained. The approximate formula of negentropy can be written as follows:

$$N_g(y) \propto \left[E\{G(y)\} - E\{G(y_{\mathrm{gauss}})\}\right]^2, \tag{5}$$

where $G(\cdot)$ is an arbitrary non-quadratic function. After iterative trials, it was found that convergence is faster and more reliable when $G(y) = y^4/4$.
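The negentropy approximation can be evaluated directly for the quartic contrast function. In the sketch below the signal is standardized first, and the Gaussian expectation of the quartic function is replaced by its exact value of 3/4 for a standard normal variable. The synthetic samples and function names are assumptions for illustration.

```python
import numpy as np

def standardize(y):
    """Scale a 1-D sample to zero mean and unit variance."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

def negentropy_approx(y):
    """Squared difference between E{G(y)} and E{G(y_gauss)} for G(u) = u**4 / 4.

    For a standard Gaussian variable E{u**4} = 3, so E{G(y_gauss)} = 3/4.
    The result is proportional to negentropy, not equal to it.
    """
    z = standardize(y)
    return float((np.mean(z ** 4) / 4.0 - 0.75) ** 2)

rng = np.random.default_rng(2)
gauss_sample = rng.normal(size=50000)
lognormal_sample = rng.lognormal(mean=0.0, sigma=1.0, size=50000)

j_gauss = negentropy_approx(gauss_sample)          # close to zero
j_lognormal = negentropy_approx(lognormal_sample)  # clearly positive
```

A strongly skewed sample, such as the lognormal one used here, yields a much larger value than a Gaussian sample, which is the property the maximization relies on.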
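To maximize negentropy, the optima of $E\{G(w^{T}x)\}$ must be found, where $w$ is a weight vector (one row of $W$) and $x$ is the vector of observed mixed signals. According to the Kuhn-Tucker conditions, under the constraint $E\{(w^{T}x)^2\} = \|w\|^2 = 1$, the optima are obtained at points where the gradient of the Lagrangian is zero:

$$E\{x\,g(w^{T}x)\} - \beta w = 0, \tag{6}$$

where $\beta$ is a constant that can be easily evaluated as $\beta = E\{w_0^{T}x\,g(w_0^{T}x)\}$, $w_0$ is the value of $w$ at an optimum, and $g(\cdot)$ is the first-order derivative of $G(\cdot)$. Assuming the data are whitened (sphered), so that $E\{xx^{T}\} = I$, the left part of (6) can be written as $F(w)$, and we can obtain its Jacobian matrix $JF(w)$ as

$$JF(w) = E\{xx^{T}g'(w^{T}x)\} - \beta I. \tag{7}$$

To simplify the matrix inversion, we can reasonably approximate the first term of (7) as

$$E\{xx^{T}g'(w^{T}x)\} \approx E\{xx^{T}\}\,E\{g'(w^{T}x)\} = E\{g'(w^{T}x)\}\,I. \tag{8}$$

If $w_0$ is approximated by $w$ in $\beta$, then $JF(w)$ becomes a diagonal matrix. Thus, we can obtain the approximate Newton iterative formula as

$$w^{*} = w - \frac{E\{x\,g(w^{T}x)\} - \beta w}{E\{g'(w^{T}x)\} - \beta}, \tag{9}$$

where $w^{*}$ is the new value of $w$ and $\beta = E\{w^{T}x\,g(w^{T}x)\}$.
After simplification (multiplying both sides of (9) by $\beta - E\{g'(w^{T}x)\}$ and renormalizing $w^{*}$ to unit length), we can derive the iterative formula of the FICAA:

$$w^{*} = E\{x\,g(w^{T}x)\} - E\{g'(w^{T}x)\}\,w, \qquad w \leftarrow \frac{w^{*}}{\|w^{*}\|}. \tag{10}$$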
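The fixed-point iteration can be sketched in a few lines. The version below is a minimal illustration, not the implementation used in this study: it assumes the quartic contrast function (so that the derivative is the cube and the second derivative is three times the square), whitens the data by eigendecomposition, and extracts components one at a time with Gram-Schmidt deflation. The function names, the deflation scheme, the stopping rule, and the synthetic data are all assumptions.

```python
import numpy as np

def whiten(X):
    """Center the rows of X and transform them so that their covariance is I."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    d, E = np.linalg.eigh(cov)
    V = E @ np.diag(d ** -0.5) @ E.T
    return V @ Xc

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """One-unit fixed-point iteration with deflation, using g(u) = u**3."""
    Z = whiten(X)
    n = Z.shape[0]
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for k in range(n):
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z
            # w* = E{x g(w^T x)} - E{g'(w^T x)} w, with g' = 3 u**2
            w_new = (Z * y ** 3).mean(axis=1) - 3.0 * (y ** 2).mean() * w
            # Deflation: remove projections onto components already found.
            w_new -= W[:k].T @ (W[:k] @ w_new)
            w_new /= np.linalg.norm(w_new)
            # Converged when the direction stops changing (sign is ignored).
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        W[k] = w
    return W @ Z, W

# Hypothetical test signals and mixing matrix.
rng = np.random.default_rng(3)
S = np.vstack([rng.uniform(-1.0, 1.0, 5000), rng.laplace(0.0, 1.0, 5000)])
A = np.array([[1.0, 0.5], [0.3, 2.0]])
Y, W = fast_ica(A @ S)

# Absolute correlation between each true source (rows) and each component.
corr = np.abs(np.corrcoef(np.vstack([S, Y]))[:2, 2:])
```

The recovered components match the sources only up to order, sign, and scale, which is why the correlation is taken in absolute value.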

The Correlation Coefficient Matrix.
A limitation of the FICAA is that the order of the separated signals does not correspond one-to-one with the order of the source signals. In the absence of any other prior knowledge, this problem cannot be solved.
In this paper, we use the correlation coefficient matrix to solve it. Let $(x_1, x_2, x_3, \ldots, x_n)$ be one-dimensional random variables. If arbitrary $x_i$ and $x_j$ have the correlation coefficient $\rho_{ij}$ $(i, j = 1, 2, \ldots, n)$, then the $n \times n$ matrix with elements $\rho_{ij}$ is the correlation coefficient matrix $R$ of $(x_1, x_2, x_3, \ldots, x_n)$ [51][52][53]:

$$R = (\rho_{ij})_{n \times n}, \tag{11}$$

where

$$\rho_{ij} = \frac{\operatorname{cov}(x_i, x_j)}{\sqrt{\operatorname{Var}(x_i)}\,\sqrt{\operatorname{Var}(x_j)}}. \tag{12}$$

To assign each independent component to its source signal, the mixed signals and the separated independent components are taken together as new observed data, and their correlation coefficient matrix is calculated. A mixed signal and an independent component whose correlation coefficient has the largest absolute value are deemed to correspond to each other.
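The matching rule can be sketched as follows. The raw signals and separated components are stacked, the correlation coefficient matrix is computed, and each component is assigned to the raw signal with the largest absolute correlation in its column. The synthetic data, the element labels, and the function name are hypothetical and serve only to illustrate the rule.

```python
import numpy as np

def match_components(X, Y, names):
    """Assign each row of Y (a component) to the row of X it correlates with most.

    Returns a dict {component index: name} and the matrix of absolute
    cross-correlations (rows: raw signals, columns: components).
    """
    n = X.shape[0]
    R = np.corrcoef(np.vstack([X, Y]))  # full (2n x 2n) correlation matrix
    cross = np.abs(R[:n, n:])           # raw signals versus components
    best = cross.argmax(axis=0)         # column-wise maximum
    mapping = {j: names[i] for j, i in enumerate(best)}
    return mapping, cross

rng = np.random.default_rng(4)
names = ["Au", "Cu", "Pb"]              # hypothetical labels
X = rng.lognormal(size=(3, 500))        # hypothetical raw signals

# Hypothetical separated components: permuted, rescaled, one sign-flipped.
noise = 0.01 * rng.normal(size=(3, 500))
Y = np.vstack([-2.0 * X[2], 0.5 * X[0], 3.0 * X[1]]) + noise

mapping, cross = match_components(X, Y, names)
```

Because the absolute value is used, a component with flipped sign is still matched to its signal. If two components were to pick the same signal, a one-to-one assignment method would be needed instead of the simple column maximum.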

Geology of the Research Area
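The research area is situated in the Qinghai province of China, 700 km east of Xining and 35 km west of Dachaidan. The 1 : 10 000 geographical coordinates cover the eastern longitude range from 95°47′51″ to 95°58′18″ and the northern latitude range from 37°42′44″ to 37°45′14″. The study area is approximately 6 km². The strata known in the area are the Silurian (S), Permian (P), and Quaternary (Q4). Rock types are conglomerate, altered conglomerate, sandstone, quartz schist, altered quartz schist, arkose quartzite, lithic sandstone, slate, and phyllite. The mineralizations are silicification, ferritization, pyritization, malachitization, and chalcopyritization. The study area mainly consists of an Au mineralization belt and a Cu mineralization belt. The Au belt is mainly distributed in the Permian strata and the Cu belt is primarily found in the Silurian strata (Figure 2). The major minerals in this area are galena, malachite, and azurite.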

Application
The project involved setting up 34 soil geochemical profile lines and completing 6 km² of soil geochemical surveys. The sampling network is 100 m × 20 m. A total of 3307 soil geochemical samples were analyzed for six elements: Au, Cu, Pb, As, Sb, and Zn. In this study area, ten prospecting trenches were dug and 200 samples were collected and notched. Based on descriptive statistical analysis of the soil geochemical data, characteristic indexes were determined and are shown in Table 1 (Au is reported in ppb and the remaining elements are reported in ppm).
Existing geochemical data processing methods frequently rely on the assumption that geochemical data obey a normal distribution. However, the spatial distribution of geochemical element content is very complex, so existing methods have limitations. Because the FICAA requires that the contents of at most one element obey the normal distribution, the first step of the analysis examines the probability plots reported in Figure 3 to check whether the raw data obey a normal distribution. Figure 3 shows that Au, Pb, As, and Sb clearly do not obey the normal distribution, whereas Cu and Zn approximately obey a normal distribution.
Graphical tools can support hypothesis tests for normality but cannot replace them. As expected, the graphical results for Cu and Zn are inconclusive (Figure 3). Therefore, the Kolmogorov-Smirnov (K-S) normality test is conducted on the soil sample data. Table 2 shows that the normality hypothesis is rejected ($p < 0.05$) for all six elements. Thus, these data meet the basic requirement of the FICAA.
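A one-sample K-S normality check of this kind can be sketched with NumPy alone. The version below is an illustrative assumption rather than the procedure used for Table 2: it standardizes the sample with its own mean and standard deviation, computes the maximum distance between the empirical and normal cumulative distributions, and converts it to an asymptotic p-value with a common finite-sample correction. Because the parameters are estimated from the data, the p-value is conservative. The synthetic samples are hypothetical; only their size mirrors the 3307 samples of the survey.

```python
import math
import numpy as np

def ks_normal_test(x):
    """Return the K-S statistic D and an asymptotic p-value for normality."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)
    # Standard normal CDF evaluated at each standardized value.
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    d = float(max(d_plus, d_minus))
    # Kolmogorov series with a finite-sample correction (an assumption here).
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

rng = np.random.default_rng(5)
normal_like = rng.normal(10.0, 2.0, 3307)    # hypothetical element contents
skewed = rng.lognormal(0.0, 1.0, 3307)       # hypothetical skewed contents

d_norm, p_norm = ks_normal_test(normal_like)
d_skew, p_skew = ks_normal_test(skewed)
reject_skewed = p_skew < 0.05                # normality rejected at 5%
```

For a sample of this size, even moderate skewness produces a very small p-value, so the test rejects normality far more readily than a visual inspection of a probability plot would suggest.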
Based on the FICAA and the assumed statistical independence of the sources in the raw geochemical data, we treat the content values of the six elements as the mixed signals (the raw data). The oscillogram of the raw data is shown in Figure 4 and the oscillogram of the source signals after separation by the FICAA is shown in Figure 5. Because the FICAA output is only a copy or estimate of the source signals, the order and scaling of the source signals have changed. The changes are visible in the oscillogram of the separated results. We introduce the correlation coefficient to resolve the order.
As clearly demonstrated in Figure 5, the waveforms of the six elements after separation correspond to the waveforms of the raw data. Treating the survey data and independent components as new observation variables, we calculate the correlation coefficient between them. The correlation coefficient results in Table 3 show that the maximum absolute values of each column are 0.967, 0.546, 0.982, 0.985, 0.949, and 0.985, respectively. Thus, it can be concluded that the separated components 1, 2, 3, 4, 5, and 6 correspond to Pb, Sb, As, Au, Cu, and Zn, respectively. FICAA processing yields only a copy or estimate of the raw data and only reflects the general trend of the data. The cumulative frequency method is therefore employed to determine the distribution characteristics of the elements and to solve the problem of scaling indeterminacy. Selecting 85%, 95%, and 98% as the intrazone, mesozone, and external zone of the anomaly, the isograms of the raw data and separated data are depicted using the cumulative frequency method. Their three-level cumulative frequencies are given in Table 4. Because this study mainly analyzes the metallogenic elements Au and Cu, Figures 6 through 9 show the Au and Cu isograms of the raw data and of the data separated by the FICAA, respectively.
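The three-level delineation by cumulative frequency can be sketched as a percentile computation. In the sketch below the thresholds are the values at the 85%, 95%, and 98% cumulative frequencies, and each sample receives a level from 0 (below all thresholds) to 3 (at or above the 98% threshold). The synthetic contents and the function name are assumptions, and the sketch does not reproduce the values in Table 4.

```python
import numpy as np

def cumulative_frequency_levels(values, levels=(85.0, 95.0, 98.0)):
    """Return the threshold at each cumulative frequency and a level per sample.

    Level 0 means below every threshold; level k means at or above the
    k-th threshold but below the next one.
    """
    values = np.asarray(values, dtype=float)
    thresholds = np.percentile(values, levels)
    # Number of thresholds that each value reaches or exceeds.
    sample_levels = np.searchsorted(thresholds, values, side="right")
    return thresholds, sample_levels

rng = np.random.default_rng(6)
contents = rng.lognormal(mean=1.0, sigma=0.8, size=3307)  # hypothetical data

thresholds, sample_levels = cumulative_frequency_levels(contents)
fraction_anomalous = float(np.mean(sample_levels >= 1))   # about 15%
fraction_top = float(np.mean(sample_levels == 3))         # about 2%
```

Because the thresholds are defined by rank rather than by absolute value, the same function applies to raw contents and to separated components whose scale is arbitrary.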

Discussion
The anomaly analysis of Au and Cu and the prospecting trench work are carried out based on the anomaly isograms. The anomaly isograms show that Au is mainly distributed in the Permian while Cu mostly spreads over the Silurian. There are obvious differences between the isograms of Au and Cu that are delineated by the raw data and by the FICAA processed data. Test and analysis results are shown in Table 5. In trench TC04, the element content of Au is 0.1, where the raw data of Au show an obvious anomaly but the FICAA processed data do not. The FICAA processed result is consistent with the actual situation. In trenches TC01 and TC05, the element contents of Cu are 0.15 and 1.99. The raw data show no anomaly there, while the FICAA processed data reveal apparent anomalies that match the true conditions. Meanwhile, the other trenches reasonably show the distribution anomalies of Au and Cu. Thus, employing the FICAA to process the geochemical data can better reflect the spatial distribution characteristics of elements. This is because when the existing statistical data processing method is used, we must assume that the data obey the normal distribution or lognormal distribution. However, the actual geochemical data may not satisfy this assumption. Therefore, these existing methods have limitations. The FICAA can preprocess the geochemical data without prior knowledge of their distribution. This can eliminate the mutual interference between elements so that the data may provide a clearer direction for geochemical data processing.
The analysis results show that the anomaly isograms processed by the FICAA are more in line with the actual element distributions. Although we solve the problem of scaling indeterminacy, the anomalies still cannot be displayed directly by the processed data. Moreover, the FICAA should be verified in other geological areas. More importantly, the existing geochemical data processing methods with the statistical mathematics characteristic (such as the FICAA used in this study) cannot be used to simulate the dynamic mechanisms and processes of ore-forming systems, so they are invalid for predicting concealed ore deposits within the upper crust of the Earth. To solve this problem, applied mathematics methods have been used to establish an emerging discipline, known as computational geoscience, during most of the past two decades. As a result, computational simulation methods have become important tools not only for simulating the dynamic mechanisms and processes of ore-forming systems but also for predicting the potential locations of ore deposits within the upper crust of the Earth [14,15,49].

Conclusions
This study identifies soil geochemistry anomaly areas based on the presence or absence of mineralization in trench samples. Use of the correlation coefficient and the cumulative frequency can solve indeterminacy problems in both sequences and scaling. Compared with the results of applying the CFM to the raw data, the FICAA processed data better reflect the distribution characteristics of the geochemical elements. Because geological backgrounds differ between study areas, the geochemical exploration method should be applied on the basis of the actual situation.
The existing geochemical data processing methods with the statistical mathematics characteristic (such as the FICAA used in this study) cannot be used to simulate the dynamic mechanisms and processes of ore-forming systems; therefore, they are invalid for predicting concealed ore deposits within the upper crust of the Earth. Because computational geoscience methods with the applied mathematics characteristic can effectively simulate the dynamic mechanisms and processes of ore-forming systems, they should be used in future research to predict potential locations of ore deposits within the upper crust of the Earth.

Figure 2: Simplified geological map and location map of the study area with alterations.

Figure 7: The isogram of Au after processing by the FICAA.

Figure 9: The isogram of Cu after processing by the FICAA.

Table 1: Parameters of soil geochemical elements in the given area of Qinghai.

Table 2: Results of the Kolmogorov-Smirnov test on the variables.

Table 4: Zoning sequences delineated by the cumulative frequency.

Table 5: The analysis results of the trench samples.