Sedimentary Environment Analysis by Grain-Size Data Based on Mini Batch K-Means Algorithm

During the last several decades, researchers have made significant advances in the sedimentary environment interpretation of grain-size analysis, but these improvements have often depended on the subjective experience of the researcher and were usually combined with other methods. Currently, researchers are using a large number of data mining and knowledge discovery methods to explore the potential relationships in sediment grain-size analysis. In this paper, we apply bipartite graph theory to construct a Sample/Grain-Size network model and then construct a Sample network model projected from this bipartite network. Furthermore, we use the Mini Batch K-means algorithm with the most appropriate parameters (reassignment ratio ε = 0.025 and mini batch size = 25) to cluster the sediment samples. We use four representative evaluation indices to verify the precision of the clustering result. Simulation results demonstrate that this algorithm can divide the Sample network into three sedimentary categorical clusters: marine, fluvial, and lacustrine. Compared with the results of previous studies obtained from a variety of indices, the precision of the experimental results for the sediment grain-size categories reaches 0.92254367, which shows that this method of analyzing the sedimentary environment by grain size is extremely effective and accurate.


Introduction
Data mining, knowledge discovery, and machine learning algorithms have permeated research in various fields [1][2][3][4]. The complex network, as a significant method of data mining, gives top priority to discovering concealed information between things. Therefore, a great number of researchers from various fields, including mathematics, physics, biology, chemistry, and oceanology, have used complex networks to explore the potential relationships in data [5][6][7][8][9]. Complex networks have several characteristic properties: self-similarity, self-organization, scale-free structure, small-world structure, community structure (clusters), and node centrality. The community structure is one of the most important traits because it can objectively reflect the potential relationships between nodes. A community is a group of nodes within which the links are dense but which is only sparsely connected to other clusters [10,11].
Grain-size analysis is one of the basic tools for classifying sedimentary environments, and it can provide important clues to the provenance, transport history, and depositional conditions [12]. In general, the representative statistical parameters of grain-size analysis are the median, mode, mean, sorting parameter, skewness, and kurtosis [13]. During the last few decades, two methods of computing grain-size parameters were developed: the graphical method and the moment method [12]. Blott and Pye (2011) showed that these two analysis methods each have advantages and disadvantages in computing sediment grain-size samples with various parameters. As most sediments are polymodal, curve shape and statistical measures usually simply reflect the relative magnitude and separation of the component populations. A polymodal grain-size spectrum can be considered the superposition of several unimodal components [14]. Many works have shown that different grain-size distributions are related to specific transport and deposition processes [15]. Three kinds of functions are commonly used to fit the grain-size distribution: the Normal function, the Lognormal function, and the Weibull function [15]. Based on experimental results, Sun et al. [15] found that the Weibull function was appropriate for the mathematical description of the grain-size distribution of all kinds of sediments, while applying the Normal function to fluvial and lacustrine sediments was also acceptable. Although these methods, especially the Weibull function, performed well in fitting grain-size distributions, they often require the subjective experience of the researchers, and definite criteria for environmental determination have not been given. Based on the data of borehole Lz908, Yi et al. analyzed the evolution of the sedimentary environment.
Besides grain-size data, they also used magnetic susceptibility, tree pollen, radiocarbon dating, and optically stimulated luminescence (OSL) dating data [16,17]. Can the same conclusion be obtained by using only grain-size data, which are a relatively convenient and inexpensive index?
In this paper, we introduce complex networks into the data modeling of sediment grain-size data. Based on bipartite graph theory [18], we construct the Sample/Grain-Size bipartite weighted network model, which can objectively reflect the association relationships between sediment samples and grain sizes. By projection, we then construct the Sample network model from this bipartite network. After repeated tests with tens of representative clustering algorithms, we selected the Mini Batch K-means algorithm [19], an optimization that combines the K-means algorithm [20] with the classical batch algorithm [21], to split the Sample nodes into their categories and find the relationships between the sedimentary environment and grain size. After 400 tests, we found the most appropriate parameters of the Mini Batch K-means algorithm. Finally, we use four evaluation indices, AMI, NMI, completeness, and precision, to verify the accuracy and efficiency of the clustering divisions.

Evaluation Functions
In the research field of complex networks, researchers often use several representative performance evaluation indices, AMI, NMI, completeness, and precision, to verify the accuracy and efficiency of clustering divisions. It is universally acknowledged that the higher the value of an index, the better the result of the clustering division. Therefore, we also use these four evaluation indices to verify the clustering result of the sediment grain-size samples.
2.1. NMI and AMI. Normalized Mutual Information (NMI) [22,23] is an approach to measuring the information shared between two data distributions, based on information theory, in which entropy is defined as the amount of information contained in a distribution [24]. The mutual information (MI) between two label assignments $U$ and $V$ is

$$\mathrm{MI}(U, V) = \sum_{i}\sum_{j} P(i, j)\,\log\frac{P(i, j)}{P(i)\,P'(j)},$$

where $P(i, j) = |U_i \cap V_j|/N$ represents the probability that an object picked at random falls into both class $U_i$ and class $V_j$. The two label assignments $U$ and $V$ have the corresponding entropies $H(U)$ and $H(V)$, defined as

$$H(U) = -\sum_{i} P(i)\log P(i), \qquad H(V) = -\sum_{j} P'(j)\log P'(j),$$

where $P(i) = |U_i|/N$ is the probability that an object picked at random from $U$ falls into class $U_i$, and $P'(j) = |V_j|/N$ is defined analogously for $V$.

The NMI and the Adjusted Mutual Information (AMI) [25] are defined as

$$\mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)\,H(V)}}, \qquad \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}]}{\max\big(H(U), H(V)\big) - E[\mathrm{MI}]},$$

where $E[\mathrm{MI}]$ is the expected value of MI. The ranges of NMI and AMI are $[0, 1]$ and $[-1, 1]$, respectively.
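These definitions can be computed directly from label counts. Below is a minimal pure-Python sketch of the MI and NMI formulas (the $E[\mathrm{MI}]$ correction needed for AMI is omitted for brevity; the label values are illustrative, not the real Lz908 assignments):

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    # H(U) = -sum_i P(i) log P(i)
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_info(u, v):
    # MI(U, V) = sum_ij P(i, j) log( P(i, j) / (P(i) P'(j)) )
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((c / n) * log((c / n) / ((pu[i] / n) * (pv[j] / n)))
               for (i, j), c in puv.items())

def nmi(u, v):
    # NMI(U, V) = MI(U, V) / sqrt(H(U) H(V))
    hu, hv = entropy(u), entropy(v)
    if hu == 0.0 or hv == 0.0:
        return 0.0  # degenerate single-class labeling
    return mutual_info(u, v) / sqrt(hu * hv)

true_env = ["marine", "marine", "marine", "fluvial", "fluvial", "lacustrine"]
clusters = [0, 0, 0, 1, 1, 2]
score = nmi(true_env, clusters)  # equals 1.0 for a perfect clustering (up to label names)
```

A clustering that matches the reference partition exactly, whatever names its clusters carry, attains NMI = 1, while statistically independent labelings attain NMI = 0.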

2.2. Completeness. Based on the standard partition established by previous study results for the known grain-size samples, conditional entropy analysis is used to define some intuitive measures. A clustering result satisfies completeness if all nodes of a given class are assigned to the same cluster [26,27]. Completeness is formally given by

$$c = 1 - \frac{H(K \mid C)}{H(K)},$$

where $H(K)$ is the entropy of the cluster assignments and $H(K \mid C)$ is the conditional entropy of the cluster assignments given the classes.

2.3. Precision. Precision [28] ($P$) is the number of true positives ($T_p$) divided by the sum of the true positives and the number of false positives ($F_p$):

$$P = \frac{T_p}{T_p + F_p}.$$
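Both measures can likewise be computed from label counts. The sketch below implements the completeness formula above; for precision it adopts the common convention of matching each cluster to its majority class before counting true positives (the paper does not state its matching rule, so that choice is an assumption):

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def completeness(classes, clusters):
    # c = 1 - H(K|C)/H(K): equals 1 when every class sits inside one cluster
    n = len(classes)
    hk = entropy(clusters)
    if hk == 0.0:
        return 1.0  # a single cluster is trivially complete
    pc = Counter(classes)
    h_k_given_c = -sum((c / n) * log((c / n) / (pc[ci] / n))
                       for (ci, _), c in Counter(zip(classes, clusters)).items())
    return 1.0 - h_k_given_c / hk

def precision(classes, clusters):
    # Match each cluster to its majority class; precision = matched fraction
    by_cluster = {}
    for cls, clu in zip(classes, clusters):
        by_cluster.setdefault(clu, Counter())[cls] += 1
    tp = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return tp / len(classes)
```

Note the asymmetry: merging two classes into one cluster keeps completeness at 1 but lowers precision, whereas splitting a class across clusters lowers completeness.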

Dataset of Sediment/Grain-Size
The sediment samples of this study came from borehole Lz908 (37°09′N, 118°58′E), which is located in the southern Bohai Sea, China (Figure 1). The borehole was drilled to 101.3 m below the surface in 2007, and the recovery rate reached 75%. Existing research results show that this region has developed three transgressive layers since the late Pleistocene, and the thickness of the fluvial, lacustrine, and marine sediments reaches 2000-3000 m in this basin [16]. We extracted 2141 sediment samples for grain-size analysis from the borehole at 2 cm intervals. We measured the grain size, after a thorough pretreatment, at the First Institute of Oceanography, State Oceanic Administration, China. The measuring instrument was a Mastersizer 2000 laser particle analyzer produced by the UK company Malvern; the measurement range was 0.02-2000 μm, and the repeated measuring error was less than 3%. We calculated the Phi value of every sediment sample by using 51 sequences (Table 1), which represent the corresponding magnitudes of the various grain sizes. The data describe the percentage that each grain-size magnitude contributes to the total. Consequently, we constructed a dataset X as a 2141 × 51 matrix, where X_ij denotes the percentage composition of the jth grain size in the ith sample (Table 1).
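The dataset can be held as a simple compositional matrix. A sketch with synthetic percentages standing in for the real Lz908 measurements (which are not reproduced here):

```python
import numpy as np

# Synthetic stand-in for the Lz908 dataset: 2141 samples x 51 grain-size
# classes, where X[i, j] is the percentage of grain-size class j in sample i.
rng = np.random.default_rng(42)
raw = rng.random((2141, 51))
X = 100.0 * raw / raw.sum(axis=1, keepdims=True)  # rows sum to 100 percent
```

Because each row is a percentage composition, every row of X sums to 100; this invariant is a quick sanity check when loading real measurement files.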

Construction of Sample/Grain-Size Bipartite Network

In this paper, we construct the Sample/Grain-Size network based on bipartite graph theory [18], in which a graph is denoted as G = (V, E), where V is the node set and E is the edge set. In bipartite graph theory, V is divided into two disjoint subsets (A, B), where A is one class of nodes and B is the other; E denotes the association relationships between nodes in set A and nodes in set B. According to this theory, the construction process of the Sample/Grain-Size bipartite weighted network model is as follows.
In this process, one class, A, contains the sample nodes and the other class, B, the grain-size nodes. As shown in Figure 2, the sample node numbered Lz04-076 includes several grain-size nodes, among them the one with magnitude 7.25-7.00. If a sample includes a grain size, an edge exists between the sample node and the corresponding grain-size node. The weight of the edge denotes the amount of that grain size included in the sample. Based on this rule, we construct the final Sample/Grain-Size bipartite weighted network model (Figure 3).
In this bipartite network, we mark the grain-size nodes in green, corresponding to the 51 grain-size classes with different magnitudes, and the sample nodes in pink, corresponding to the 2141 samples. This model clearly reflects the association relationships between the sample nodes and the grain-size nodes.

We then construct a Sample network model by projecting the Sample/Grain-Size bipartite network model onto its sample nodes (Figure 4). The Sample network model has 2141 nodes and 44,198 edges; a node denotes a sample, and an edge indicates that the two samples contain a grain size of the same magnitude. The weight of an edge records how many grain-size magnitudes the two samples share.
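The projection step can be sketched in plain Python; the sample IDs and grain-size magnitudes below are illustrative stand-ins for the real Lz908 records:

```python
from itertools import combinations

# Toy bipartite data: each sample maps to the set of grain-size classes it contains.
sample_grains = {
    "Lz04-076": {"7.25-7.00", "7.00-6.75", "6.75-6.50"},
    "Lz04-077": {"7.25-7.00", "6.75-6.50"},
    "Lz04-078": {"5.00-4.75"},
}

def project_sample_network(bipartite):
    """Project the Sample/Grain-Size bipartite graph onto its sample nodes.

    Two samples are linked if they share at least one grain-size class;
    the edge weight counts how many classes they share."""
    edges = {}
    for a, b in combinations(sorted(bipartite), 2):
        shared = len(bipartite[a] & bipartite[b])
        if shared:
            edges[(a, b)] = shared
    return edges

edges = project_sample_network(sample_grains)
# Lz04-076 and Lz04-077 share two grain-size classes; Lz04-078 shares none.
```

Applied to all 2141 samples, the same pairwise-intersection rule yields the 44,198 weighted edges of the Sample network.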

Sediment Grain-Size Sample Analysis Based on Mini Batch K-Means

5.1. Idea of Sediment Grain-Size Data Analysis. In this paper, we cluster the Sample network model with the Mini Batch K-means algorithm. In each iteration, we randomly extract a mini batch of subsamples from the full sample set and update the cluster center of each mini batch sample by a convex combination. At the same time, we use per-center learning rates to speed up convergence. As the iterations proceed, we check the convergence condition: the algorithm converges when the clustering result no longer changes over successive iterations. In the end, the sample nodes are divided into several clusters.

5.2. Steps of Sediment Grain-Size Sample Data Analysis

Step 1. Randomly extract a mini batch M of b subsamples from the sediment sample dataset X with 2141 samples and 51 properties.

Step 2. Randomly select k samples as the initial clustering centers and save them in an array C of k clustering centers, which will change as the algorithm runs.

Step 3. Select a sample x from M and find the clustering center d_x nearest to x in Euclidean distance; save the result in an array d. The Euclidean distance is

$$f(C, x) = \min_{c \in C} \sqrt{\sum_{i=1}^{51} (x_i - c_i)^2},$$

where f(C, x) indicates the nearest Euclidean distance between the sample x and the central nodes in C, and x_i is the ith property of the sample x.

Step 4. Given the sample x and its center d_x, update the per-center counter v:

$$v_c \leftarrow v_c + 1.$$

Step 5. Get the real-time per-center learning rate η, which speeds up the convergence of the algorithm:

$$\eta \leftarrow \frac{1}{v_c}.$$

Step 6. Take the gradient step, moving the center toward x by a convex combination:

$$c \leftarrow (1 - \eta)\,c + \eta\,x.$$

Step 7. If M = ∅, all the samples in the mini batch have been assigned to a cluster; otherwise, return to Step 3.

Step 8. If the iteration count ≤ t, return to Step 1. The algorithm stops when the convergence condition is satisfied or the iteration count exceeds t.

Algorithms 1 and 2 show the pseudocode of the Mini Batch K-means algorithm for sediment sample data processing.
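The steps above can be sketched in Python with NumPy. This is a minimal illustration of Steps 1-8, not the authors' implementation: the convergence test is replaced by a fixed iteration budget t, and the reassignment-ratio mechanism of the full algorithm is omitted:

```python
import numpy as np

def mini_batch_kmeans(X, k=3, b=25, t=100, seed=0):
    """Minimal sketch of Steps 1-8 (after Sculley's Mini Batch K-means).

    X: (n_samples, n_features) array; returns the centers C and final labels."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # Step 2
    v = np.zeros(k)                                  # per-center counters
    for _ in range(t):                               # Step 8: at most t iterations
        M = X[rng.choice(len(X), size=b, replace=False)]            # Step 1
        for x in M:                                  # Steps 3-7: sweep the batch
            c = int(np.argmin(np.linalg.norm(C - x, axis=1)))       # Step 3
            v[c] += 1                                # Step 4: update counter
            eta = 1.0 / v[c]                         # Step 5: learning rate
            C[c] = (1.0 - eta) * C[c] + eta * x      # Step 6: convex combination
    # Final assignment of every sample to its nearest center
    labels = np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
    return C, labels

# Usage on synthetic 2-D data with three well-separated groups
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in (0.0, 5.0, 10.0)])
centers, labels = mini_batch_kmeans(data, k=3, b=10, t=50, seed=1)
```

The per-center learning rate η = 1/v_c makes each center the running mean of the samples ever assigned to it, which is what lets the mini batch variant approximate full K-means at a fraction of the cost.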

Simulations and Analysis
6.1. Multi-Index Analysis of Clustering Results. In this paper, we use the four indices AMI, NMI, completeness, and precision to verify the clustering result of the sediment samples. We vary the two classical parameters of the Mini Batch K-means algorithm: the mini batch size and the reassignment ratio. After 400 repeated tests, we obtain the results in Table 2.
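A parameter sweep of this kind can be reproduced with scikit-learn, whose MiniBatchKMeans estimator exposes the two parameters varied here as batch_size and reassignment_ratio. The sketch below uses synthetic, well-separated data in place of the real Lz908 matrix, so the grid values and scores are illustrative only:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import adjusted_mutual_info_score

# Synthetic stand-in for the 2141x51 sample matrix with known environments
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 51)) for m in (0.0, 3.0, 6.0)])
y_true = np.repeat([0, 1, 2], 50)

# Grid search over mini batch size and reassignment ratio, scored by AMI
best_params, best_score = None, -1.0
for batch in (25, 50, 100):
    for ratio in (0.01, 0.025, 0.05):
        labels = MiniBatchKMeans(n_clusters=3, batch_size=batch,
                                 reassignment_ratio=ratio,
                                 n_init=3, random_state=0).fit_predict(X)
        score = adjusted_mutual_info_score(y_true, labels)
        if score > best_score:
            best_params, best_score = (batch, ratio), score
```

In practice one would score each grid cell with all four indices and keep the heatmap of scores, which is exactly what Figures 5-8 summarize for the 400 runs.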

6.2. Heatmap Analysis of Clustering Results. The following heatmaps objectively reflect the accuracy and efficiency of the clustering division of the sediment sample data calculated by the Mini Batch K-means algorithm. In each figure, every square represents the score of one evaluation index for a particular combination of mini batch size and reassignment ratio. The colors in the rightmost color bar show the different scores; the score range of every index is [0, 1], and the shade of each square represents the size of the value.
As shown in Figures 5 and 6, the AMI reaches its maximum value of 0.40919072 when the reassignment ratio ε = 0.025 and the mini batch size = 25. The maximum value of NMI, 0.41485376, occurs under the same parameters.
Algorithm 2 (pseudocode of the Mini Batch K-means algorithm for sediment sample data processing). Input: the grain-size dataset X; the number of initial clusters k = 3; the iteration limit t; the mini batch size b. Output: the set of clustering labels C and the cluster label c of every sample. Every sample label is initialized as c ∈ C, and the per-center counter is initialized as v ← 0.

Through clustering analysis, we can assign these samples to their actual sediment clusters. Objectively, precision is the most significant of these four performance evaluation indices. The simulation results above show that the clustering result of the sediment grain-size samples calculated by the Mini Batch K-means algorithm with the appropriate parameters, ε = 0.025 and mini batch size = 25, has a high precision: 0.92254367. The other three indices also reach their maximum values: AMI = 0.40919072, NMI = 0.41485376, and completeness = 0.44747697.

6.3. Network Characteristic Analysis of Clustering Results and Comparison with Other Studies. We calculate the clustering result of the sediment grain-size samples by using the Mini Batch K-means algorithm with the most appropriate parameters, ε = 0.025 and mini batch size = 25. The simulation results are shown in Table 3 and Figure 9.
According to Table 3 and Figure 9, the Mini Batch K-means algorithm divides the Sample network model into three clusters. Yi et al. divided the sedimentary environment of Lz908 through a variety of indices in the representative manuscripts [12,16]. Compared with their results, the three clusters correspond to three sedimentary environments: marine, fluvial, and lacustrine. The green cluster contains the samples assigned to the marine sediment category, the orange cluster the fluvial samples, and the blue cluster the lacustrine samples. The clustering division of this network achieves a high precision, 0.92254367, when the parameters of the Mini Batch K-means algorithm are set to ε = 0.025 and mini batch size = 25. Furthermore, we find that most of the points that differ from previous studies are located at the junctions of different sediment types (Figure 10). These results show that this method of analyzing the sedimentary environment by grain size is extremely effective and accurate.

Conclusions
During the last several decades, researchers have made significant advances in the environmental interpretation of grain-size analyses, but definite criteria for environmental determination have not been given. Previous studies often relied on the subjective experience of the researcher, usually combined grain-size analysis with other methods, and rarely used grain size alone for sedimentary environment analysis. Recently, complex networks have been playing an increasingly significant role in data mining and knowledge discovery because they can reveal potential relationships and concealed information between things. In this paper, we use complex networks and bipartite graph theory to construct a Sample/Grain-Size network model and a Sample network model. Furthermore, we use the Mini Batch K-means algorithm to cluster the sediment grain-size samples.

We use the representative evaluation indices AMI, NMI, completeness, and precision to verify the precision of the clustering results for the sample division. Simulation results show that this algorithm can divide the Sample network into three clusters, marine, fluvial, and lacustrine, a division almost identical to that in the classical manuscripts. At the same time, the evaluation indices reach high values when we set the appropriate parameters ε = 0.025 and mini batch size = 25. The results also show that the clustering is efficient; for example, the fraction of samples given the same classification as by the traditional methods reaches 0.92254367, an excellent result obtained in a relatively convenient and inexpensive way.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.