Mining Seasonal Marine Microbial Pattern with Greedy Heuristic Clustering and Symmetrical Nonnegative Matrix Factorization

With the development of high-throughput and low-cost sequencing technology, a large number of marine microbial sequences have been generated. The association patterns between marine microbial species and environmental factors are hidden in this large volume of sequences. Mining these association patterns is beneficial for exploiting marine resources. However, very few marine microbial association patterns have been well investigated. The present study reports the development of a novel method called HC-sNMF to detect marine microbial association patterns. The results show that the four seasonal marine microbial association networks have the characteristics of complex networks, that the same environmental factor influences different species in the four seasons, and that, in the communities detected for the four seasons, the correlative relationships are stronger between OTUs (taxa) than between OTUs and environmental factors.


Introduction
The oceans cover approximately 139 million square miles, roughly 71% of the earth's surface. Marine microbes are an important component of the marine ecosystem. They provide the basis for the ocean's food webs and facilitate the flow of nitrogen, carbon, and energy in the ocean. Yet the specific ecological relationships among these taxa and environmental factors are largely unknown. This is partly due to the dilute, microscopic nature of the planktonic microbial community, which prevents direct observation of their interactions [1]. Although the technologies of microbial cultivation, gene chips, and metagenomics [2][3][4] can provide information on the potential ecological roles of microorganisms, they cannot describe the interactions among microbes and their environment.
With the development of high-throughput DNA sequencing technologies that yield a mass of reads of rRNA (16S rRNA/18S rRNA) and DNA, we can describe the compositions of microbial communities, their diversity, and how communities change across space, time, or experimental treatments based on these sequence data [5]. However, most current analytical approaches focus on the total number of taxa, the relative abundances of individual taxa, and the extent of phylogenetic or taxonomic overlap between communities or community categories [6][7][8]. In contrast, far less attention has been paid to using sequence data to explore the direct or indirect relationships among microbial taxa and environments. Some researchers have used network analysis to explore cooccurrence patterns in soil and ocean [9][10][11], but they only constructed the association networks to show the cooccurrence patterns and did not mine the networks further to find the pattern structures. Microbial association (or cooccurrence) patterns can offer new insight into the structure of complex microbial communities, revealing the niche spaces shared by community members and identifying habitat affinities or shared physiologies that could guide more focused experimental settings.
In this paper, we propose a novel method called HC-sNMF to detect the association community patterns and structures in the four seasonal marine networks. HC-sNMF provides new insights into the natural history of microbes by finding the relationships among microbes and environmental factors and by trying to determine how the microbial association patterns differ among seasons and which environmental factors might have the greatest influence on the varying diversity.

HC-sNMF Work Engine and Process.
The work engine and process of HC-sNMF consist of the following three parts: (i) OTU generation with the NbHClust algorithm, (ii) network construction with the mutual information algorithm, and (iii) community pattern detection with the symmetrical nonnegative matrix factorization method. Figure 1 is a flowchart showing the work process of HC-sNMF.

NbHClust Algorithm.
To address the OTU inflation caused by 454 sequencing errors, we proposed a heuristic clustering method based on neighbor seeds, namely, NbHClust. Based on the distribution of homopolymers, the idea of a neighbor sequence was introduced to generate neighbor seeds. Then, a heuristic clustering strategy was used to cluster the sequences based on neighbor seeds instead of a single seed. Finally, a constraint parameter based on cluster size was used to refine the clusters.
The pseudocode of NbHClust is as shown in Pseudocode 1.
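As a rough illustration of clustering against several neighbor seeds rather than one, the following Python sketch may help. It is not the NbHClust pseudocode. It assumes that two reads are "neighbors" when they are identical after collapsing homopolymer runs, that reads arrive ordered by seed priority, and that `difflib.SequenceMatcher.ratio()` is an acceptable stand-in for sequence identity. The names `collapse_homopolymers`, `identity`, and `greedy_neighbor_clustering` are hypothetical, and the size-based refinement step is omitted because the text does not specify it.

```python
from difflib import SequenceMatcher
from itertools import groupby


def collapse_homopolymers(seq):
    """Collapse each run of identical bases to one base (AACCC -> AC)."""
    return "".join(base for base, _ in groupby(seq))


def identity(a, b):
    """Stand-in similarity score in [0, 1]; not a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_neighbor_clustering(reads, threshold=0.97):
    """Greedy clustering in which each cluster is represented by several seeds.

    Assumption: `reads` is ordered so that earlier reads are preferred seeds.
    """
    unassigned = list(reads)
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        key = collapse_homopolymers(seed)
        # Neighbor seeds: reads that differ from the seed only in run lengths.
        seeds = [seed] + [r for r in unassigned if collapse_homopolymers(r) == key]
        members, rest = [], []
        for read in unassigned:
            # A read joins the cluster if it is close to ANY of the seeds.
            if any(identity(read, s) >= threshold for s in seeds):
                members.append(read)
            else:
                rest.append(read)
        clusters.append({"seeds": seeds, "members": [seed] + members})
        unassigned = rest
    return clusters


reads = ["AACCGGTT", "AACCCGGTT", "AACCCGGTA", "TTTTTTTTT"]
clusters = greedy_neighbor_clustering(reads, threshold=0.85)
```

In this toy input, the third read is below the threshold with respect to the first seed but above it with respect to the neighbor seed, so it joins the first cluster instead of starting a spurious one.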

Networks Construction.
In order to study the associations among different microbial species and environmental factors, we use vectors $X_i = [x_{i1}, x_{i2}, \ldots, x_{iS}]$ and $E_j = [e_{j1}, e_{j2}, \ldots, e_{jS}]$ to represent the $i$th OTU and the $j$th environmental factor in the four seasons, respectively, where $S$ is the number of samplings and $x_{is}$ is the $i$th OTU abundance value in the $s$th sampling; that is, $x_{is}$ equals the ratio of the number of sequences contained in the $i$th OTU to the total number of sequences contained in the $s$th sampling. To reduce the sequencing effort bias, the value was set to zero if < 5. To reduce spuriously high correlations between vectors, we also removed the OTU vectors that contain fewer than 3 nonzero elements. After this processing, we obtained 1,212 OTU vectors: 280 in spring, 254 in summer, 313 in fall, and 365 in winter.
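The filtering described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the input is a hypothetical table of raw sequence counts per OTU and sampling, and the threshold of 5 is applied to the raw count, which the text does not state explicitly. The function name `build_otu_vectors` and its parameters are illustrative only.

```python
def build_otu_vectors(counts, min_value=5, min_nonzero=3):
    """Turn raw counts into filtered relative-abundance vectors.

    counts: dict mapping an OTU id to a list of raw sequence counts,
            one entry per sampling (hypothetical input layout).
    min_value: entries whose raw count is below this are zeroed
               (assumption: the threshold refers to the raw count).
    min_nonzero: OTU vectors with fewer nonzero entries are dropped.
    """
    n_samples = len(next(iter(counts.values())))
    # Total number of sequences observed in each sampling.
    totals = [sum(row[s] for row in counts.values()) for s in range(n_samples)]
    vectors = {}
    for otu, row in counts.items():
        vec = [
            row[s] / totals[s] if row[s] >= min_value and totals[s] > 0 else 0.0
            for s in range(n_samples)
        ]
        if sum(v > 0 for v in vec) >= min_nonzero:
            vectors[otu] = vec
    return vectors


example_counts = {
    "otu_a": [50, 40, 30, 20],
    "otu_b": [4, 60, 3, 80],   # two entries fall below the threshold
    "otu_c": [46, 0, 67, 0],   # only two nonzero entries
}
otu_vectors = build_otu_vectors(example_counts)
```

Only `otu_a` survives in this toy input, because the other two vectors keep fewer than 3 nonzero elements after thresholding.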
Unlike Pearson correlation, mutual information (MI) can capture nonlinear dependencies and topology sparseness between variables. Here, we used MI [11] to compute the association between variables and to construct the seasonal marine microbial association networks. The MI computation can be summarized as follows.
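Suppose that $A_X$ is the value range of variable $X$ and the subinterval set $\{a_i\}$, $i = 1, 2, \ldots, M_X$, is a partition of $A_X$, satisfying $\bigcup_i \{a_i\} = A_X$ and $a_i \cap a_j = \emptyset$ if $i \neq j$; a partition $\{b_j\}$, $j = 1, 2, \ldots, M_Y$, of the value range $A_Y$ of variable $Y$ is defined in the same way. Define the following two delta functions:
$$\delta_i(x) = \begin{cases} 1, & x \in a_i, \\ 0, & \text{otherwise}, \end{cases} \qquad \delta_{ij}(x, y) = \begin{cases} 1, & x \in a_i \text{ and } y \in b_j, \\ 0, & \text{otherwise}. \end{cases}$$
The probability of $\{a_i\}$ according to the variable $X$ and the joint probability of $\{a_i, b_j\}$ according to variables $X$ and $Y$ are defined as
$$p(a_i) = \frac{1}{S} \sum_{s=1}^{S} \delta_i(x_s), \qquad p(a_i, b_j) = \frac{1}{S} \sum_{s=1}^{S} \delta_{ij}(x_s, y_s),$$
where $x_s$ and $y_s$ are the values of $X$ and $Y$ in the $s$th sampling and $S$ is the number of samplings. The entropy of $X$ and the joint entropy of $X$ and $Y$ are defined as
$$H(X) = -\sum_{i} p(a_i) \log p(a_i), \qquad H(X, Y) = -\sum_{i} \sum_{j} p(a_i, b_j) \log p(a_i, b_j).$$
So, we can calculate the mutual information between two variables $X$ and $Y$ according to the following formula:
$$MI(X, Y) = H(X) + H(Y) - H(X, Y).$$
The permutation test was used to calculate the statistical significance. We considered an OTU-OTU or OTU-environmental factor association to be robust if the P value ≤ 0.01, and an association between environmental factor vectors to be robust if the P value ≤ 0.05. In the end, we constructed the four marine microbial association networks (Figure 4) of the spring, summer, fall, and winter seasons. These networks are weighted and undirected networks in which the edge weight is the MI value of two variables (nodes).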
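A histogram-based MI estimate with a permutation test can be sketched in Python as follows. The text does not say how the value ranges are partitioned, so the equal-width binning, the default `n_bins=4`, the number of permutations, and all function names here are assumptions made for illustration.

```python
import math
import random


def bin_indices(values, n_bins):
    """Map each value to the index of its equal-width subinterval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (natural log) of a list of discrete labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mutual_information(x, y, n_bins=4):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) on binned data."""
    bx, by = bin_indices(x, n_bins), bin_indices(y, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


def permutation_p_value(x, y, n_perm=999, n_bins=4, seed=0):
    """Fraction of shuffles of y whose MI reaches the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(x, y, n_bins)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if mutual_information(x, y_perm, n_bins) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


x = [float(v) for v in range(20)]
mi_self = mutual_information(x, x)
p_self = permutation_p_value(x, x)
```

An edge would then be kept when the P value falls at or below the chosen cutoff (0.01 or 0.05 in the text), with the MI value as its weight.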

Symmetrical Nonnegative Matrix Factorization (s-NMF) Clustering Algorithm.
For a weighted and undirected graph $G(V, E)$ with $n$ nodes and $m$ links, we can describe it by a weighted adjacency matrix $A = [a_{ij}]_{n \times n}$, where $a_{ij} \geq 0$. Let $K = [k_{ij}]_{n \times n}$ be the feature matrix of graph $G$ calculated from $A$, where $k_{ij}$ represents the node-node similarity.
Suppose that the $n$ nodes can be grouped into $c$ overlapping cliques (or communities). Then, a clique-node similarity matrix $H = [h_{ri}]_{c \times n}$ was introduced to represent the similarity degree between node and clique.
$h_{ri}$ indicates the closeness degree between node $i$ and clique $r$. Here, $H$ is a nonnegative matrix, reflecting the relationship between node and clique. Because $\sum_{r=1}^{c} h_{ri} h_{rj}$ is an approximation of the similarity between node $i$ and node $j$, and $k_{ij}$ also represents the node-node similarity, we can use $k_{ij}$ to estimate $\sum_{r=1}^{c} h_{ri} h_{rj}$. Our task can now be summarized as computing the parameter $H$ so as to minimize the function $F(K, H)$:
$$F(K, H) = \left\| K - H^{T} H \right\|_F^2 = \sum_{i,j} \left[ \left(K - H^{T} H\right) \circ \left(K - H^{T} H\right) \right]_{ij},$$
where $P \circ Q$ is the Hadamard product (or element-by-element product) of matrices $P$ and $Q$. To solve this optimization problem, we introduce a symmetrical nonnegative matrix factorization (s-NMF) method, which is an improved method of nonnegative matrix factorization [12]. NMF can be described as a linear decomposition $Y \approx WH$, where $Y \in \mathbb{R}^{p \times q}$ is a positive matrix and $W \in \mathbb{R}^{p \times c}$ and $H \in \mathbb{R}^{c \times q}$ are nonnegative matrices. $W$ and $H$ are iteratively updated according to the following rules [13,14]:
$$W \leftarrow W \circ \frac{[Y H^{T}]}{[W H H^{T}]}, \qquad H \leftarrow H \circ \frac{[W^{T} Y]}{[W^{T} W H]},$$
where $[P]/[Q]$ is the Hadamard division (or element-by-element division) of matrices $P$ and $Q$.
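Supposing that $W = H^{T}$, with the feature matrix $K$ as the matrix to be factorized ($K \approx H^{T} H$, where $H$ is the clique-node similarity matrix), s-NMF can be seen as a constrained form of NMF. Thus, the iterative update rule of s-NMF can be described as follows:
$$H \leftarrow H \circ \frac{[H K]}{[H H^{T} H]}, \tag{8}$$
where $\circ$ and $[\cdot]/[\cdot]$ denote the element-by-element product and division. The optimal solution of s-NMF is a subset of the NMF solution set. The stable points of (8) can only fall into the set of NMF's stationary points which satisfy $W = H^{T}$, hence guaranteeing the convergence of s-NMF.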
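A small Python sketch of a symmetric multiplicative update is given below for illustration. It is not the authors' implementation: it uses the damped variant of the symmetric update (damping factor `beta = 0.5`), which is a swapped-in technique chosen here because the undamped rule obtained by direct substitution (`beta = 1.0` in this sketch) can oscillate in scale. The names `K` (feature matrix), `H` (clique-node matrix), `snmf`, and `reconstruction_error` are assumptions of this sketch.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def snmf(K, c, n_iter=300, beta=0.5, seed=0, eps=1e-12):
    """Approximate a symmetric nonnegative K (n x n) by H^T H, H being c x n.

    beta=0.5 is the damped multiplicative update (an assumption of this
    sketch); beta=1.0 gives the undamped rule H <- H * (HK) / (H H^T H).
    """
    rng = random.Random(seed)
    n = len(K)
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(c)]
    for _ in range(n_iter):
        num = matmul(H, K)                          # H K
        den = matmul(matmul(H, transpose(H)), H)    # H H^T H
        H = [
            [H[r][i] * (1.0 - beta + beta * num[r][i] / (den[r][i] + eps))
             for i in range(n)]
            for r in range(c)
        ]
    return H


def reconstruction_error(K, H):
    """Squared Frobenius norm of K - H^T H."""
    approx = matmul(transpose(H), H)
    n = len(K)
    return sum((K[i][j] - approx[i][j]) ** 2 for i in range(n) for j in range(n))


# Toy similarity matrix with two obvious blocks of nodes.
K = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
H_init = snmf(K, 2, n_iter=0)   # the random starting point
H_fit = snmf(K, 2)              # after the multiplicative updates
```

Comparing `reconstruction_error(K, H_init)` with `reconstruction_error(K, H_fit)` gives a quick check that the updates reduce the objective on a given input.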
By normalizing the columns of the clique-node similarity matrix $H$, we can obtain the fuzzy membership degree matrix $U = [u_{ri}]$. Then, the clique corresponding to the largest element of each column in $U$ is determined as the final membership clique of each node. That is, if $u_{ri}$ is the maximum in column $i$, node $i$ is assigned to clique $r$.
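The assignment step can be sketched in a few lines of Python. The text only says that the columns are normalized, so dividing each column by its sum is an assumption of this sketch, as are the names `assign_communities`, `H`, and `U`.

```python
def assign_communities(H):
    """Normalize each column of a clique-by-node matrix and pick the top clique.

    H: list of c rows (cliques), each with n nonnegative entries (nodes).
    Returns (U, labels): U has columns summing to 1 (assumed normalization),
    and labels[i] is the clique index with the largest membership for node i.
    """
    c, n = len(H), len(H[0])
    U = [[0.0] * n for _ in range(c)]
    labels = []
    for i in range(n):
        col_sum = sum(H[r][i] for r in range(c))
        for r in range(c):
            # An all-zero column gets a uniform membership.
            U[r][i] = H[r][i] / col_sum if col_sum > 0 else 1.0 / c
        labels.append(max(range(c), key=lambda r: U[r][i]))
    return U, labels


H_example = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.2, 0.9, 1.0],
]
U_example, labels_example = assign_communities(H_example)
```

Keeping `U` alongside the hard labels preserves the fuzzy memberships, which is useful when nodes may belong to overlapping cliques.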
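In order to determine the optimal number of communities $c$, we iteratively increase $c$ and choose the value which results in the highest modularity [15]:
$$Q = \frac{1}{2m} \sum_{i,j} \left( a_{ij} - \frac{d_i d_j}{2m} \right) \delta(c_i, c_j),$$
where $a_{ij}$ is the element of the adjacency matrix, $c_i$ is the community to which node $i$ is assigned, $\delta(c_i, c_j)$ equals 1 if $c_i = c_j$ and 0 otherwise, $d_i$ is the degree of node $i$, $m$ is the total number of edges in the network, and $d_i = \sum_{j=1}^{n} a_{ij}$.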
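The selection of the number of communities by modularity can be sketched as follows. The code uses the standard Newman modularity for an adjacency matrix, with node degree taken as the row sum; the helper `best_number_of_communities` and its input layout (a dictionary from a candidate number of communities to the node labels obtained for it) are hypothetical.

```python
def modularity(A, labels):
    """Newman modularity of a partition of an undirected (weighted) graph.

    A: symmetric adjacency matrix as a list of rows.
    labels: labels[i] is the community of node i.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    two_m = sum(degree)  # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


def best_number_of_communities(A, candidates):
    """Pick the candidate partition with the highest modularity.

    candidates: dict mapping a number of communities to its node labels.
    """
    return max(candidates, key=lambda c: modularity(A, candidates[c]))


# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A_example = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    A_example[u][v] = A_example[v][u] = 1.0

candidate_partitions = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}
best_c = best_number_of_communities(A_example, candidate_partitions)
```

For this toy graph the two-triangle partition has modularity 5/14, the single-community partition has modularity 0, and the three-community split scores lower than 5/14, so `best_c` is 2.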

Performance of NbHClust.
In order to evaluate the performance of NbHClust, we compared it with the commonly used heuristic clustering methods CDHIT [16], Uclust [17], and DNAClust [18] on the Clone43 dataset [19], which consists of 202,340 reads from a mixture of 43 plasmid clones spanning the V6 region of the 16S rRNA gene with an average length of 61 nt. Because the ground truth is lacking, that is, the species of origin of each read is unknown, we used the estimated number of OTUs to evaluate the clustering quality. Figure 2 shows the clustering results of the four methods. From Figure 2, we can see that, at the commonly used threshold of 97%, the smallest number of OTUs (∼260) was returned by NbHClust, followed by Uclust (∼1400) and CDHIT (∼1900). The largest number was returned by DNAClust (∼3700). These results show that NbHClust reduces the OTU inflation and comes much closer than the other methods to the expected number (i.e., 43).
The number of seasonal microbial OTUs generated with NbHClust at 97% sequence identity is displayed in Figure 3, which shows that there are seasonal variations in OTU number throughout a 6-year period, and there are also repeating patterns.

Topology Analysis of Four Seasonal Marine Microbial Association Networks.
In order to analyze the microbial diversity and the relationships among OTUs and environmental factors in the spring, summer, fall, and winter seasons, we constructed the four seasonal marine microbial association networks. In general, mutual information (MI) provides a natural generalization of correlation, since it measures nonlinear dependency (which is common in biology) and can deal with thousands of variables (nodes). Although conditional mutual information (CMI) can detect the joint dependence of a variable of interest (e.g., an OTU) on two or more variables, as well as other nonlinear interactions between two variables, its computational complexity is higher than that of MI for large-scale networks. Considering the number of OTUs and the computational time, we selected MI to construct the four seasonal marine microbial networks. The four seasonal marine microbial association networks obtained with the MI algorithm are shown in Figure 4. We also computed their topological parameters, including the average degree, average clustering coefficient, average power law degree, and modularity, and compared them with those of their corresponding random networks. The comparison results for the four seasonal networks and the random networks are summarized in Table 1.
From Table 1, we can see that there are some differences in the topological parameters among the spring, summer, fall, and winter seasonal microbial correlation networks. Compared with random networks, the four seasonal microbial correlation networks have a larger average clustering coefficient, average power law degree, and modularity, which indicates that the four seasonal microbial association networks have some characteristics of complex networks.
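One of these comparisons, the average clustering coefficient of a network against that of a random network with the same number of nodes and edges, can be sketched in Python as follows. The sketch treats the network as unweighted and draws the random network by sampling edges uniformly, both of which are assumptions; the text does not state how the random networks were generated. The function names are illustrative.

```python
import random
from itertools import combinations


def average_clustering(n, edges):
    """Average local clustering coefficient of an unweighted, undirected graph."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = 0.0
    for u in range(n):
        k = len(nbrs[u])
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs[u], 2) if b in nbrs[a])
        total += 2.0 * links / (k * (k - 1))
    return total / n


def random_network(n, m, seed=0):
    """Random graph with n nodes and exactly m distinct edges."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(n), 2))
    return rng.sample(all_pairs, m)


# Two triangles joined by one edge, compared with a size-matched random graph.
observed_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
cc_observed = average_clustering(6, observed_edges)
random_edges = random_network(6, len(observed_edges))
cc_random = average_clustering(6, random_edges)
```

In practice the random baseline would be averaged over many random networks rather than taken from a single draw.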

The Association Communities in Seasonal Microbial Networks Detected by s-NMF.
The four seasonal marine microbial association communities detected by s-NMF are shown in Figure 5. The results in Figure 5 show that the diversity of association community patterns in winter is greater than that in spring, summer, and fall, which indicates that the seasonal variability might have the greatest influence on the marine microbe diversity. We also find that some environmental factors are strongly associated with some microbes and that there are different association structures in the four seasons. According to the taxonomic annotation of the OTUs obtained with a number of different annotation strategies (e.g., GAST [6], BLAST against Greengenes [20], SILVA [21], and RDP [22]), we analyzed in detail, for every seasonal network, the OTU composition of the community that included the most environmental factors.
The M1 community in the spring microbial network is composed of 7 environmental factors (E1, E2, E4, E5, E6, E12, and E14) and 38 OTUs, of which 26 OTUs come [...]
The community structural analysis in the four seasonal microbial networks shows that a large fraction of the microbial associations at the class level occur among Alphaproteobacteria and Gammaproteobacteria; the community density of summer, fall, and spring is higher than that of winter; and the correlative relationships are stronger between OTUs (taxa) than between OTUs and environmental factors. This may indicate that biological rather than physical factors are more important in defining the fine-grain community structure.

Conclusions
Mining marine microbial association patterns and diversity is key to exploiting marine resources. Considering that marine microbes engage in symbiosis or competition and exhibit numerous significant intra- or interlineage associations, we used the NbHClust and s-NMF approaches to analyze the potential association patterns between marine microbes and environmental factors from 16S rRNA sequences. The results show that the four seasonal marine microbial association networks have the characteristics of complex networks and that the marine microbial association patterns are related to the seasonal variability; the association between microbes and environmental factors differs significantly across the four seasons, that is, the same environmental factor influences different species; and the correlative relationships are stronger between OTUs (taxa) than between OTUs and environmental factors. Although we cannot claim to have a comprehensive view of the associations within marine microbial communities, our analysis method is a feasible way of exploring the unseen patterns that emerge in this complex dataset, including nonrandom associations, deterministic processes at different taxonomic levels, and expected relationships between community members.