Evaluation of the Bacterial Diversity in the Human Tongue Coating Based on Genus-Specific Primers for 16S rRNA Sequencing

The characteristics of tongue coating are very important symbols for disease diagnosis in traditional Chinese medicine (TCM) theory. As a habitat of oral microbiota, bacteria on the tongue dorsum have been proved to be the cause of many oral diseases. The high-throughput next-generation sequencing (NGS) platforms have been widely applied in the analysis of bacterial 16S rRNA gene. We developed a methodology based on genus-specific multiprimer amplification and ligation-based sequencing for microbiota analysis. In order to validate the efficiency of the approach, we thoroughly analyzed six tongue coating samples from lung cancer patients with different TCM types, and more than 600 genera of bacteria were detected by this platform. The results showed that ligation-based parallel sequencing combined with enzyme digestion and multiamplification could expand the effective length of sequencing reads and could be applied in the microbiota analysis.


Introduction
The complex microbial flora living on or within the human body has long been proposed to contribute to the human health as well as disease [1][2][3][4][5][6][7][8] (Eckburg et al., 2005). Using culturing or unculturing methodology, over 25,000 bacterial phylotypes and over 700 prevalent taxa at the species level have been identified in the oral microbiome, colonizing in the oral cavity including teeth, gingival sulcus, tongue, cheeks, palates, and tonsils [9][10][11][12][13][14][15][16][17][18][19][20]  As a reservoir for oral microorganisms, food scraps, saliva, and shed epithelial cells, the human tongue has been investigated showing significant association with the microbial communities of the gut and diseases such as gastritis or halitosis [11,[21][22][23][24]. In traditional Chinese medicine (TCM), according to the color and thickness, the human tongue coating can be divided into a few types consisting of thin-white coating, thick-white coating, sticky-white coating, thin-yellow coating, and so forth [25]. The color and shape of tongue coating may reflect the composition of the bacteria colonizing the tongue dorsum. Several researches have considered the association between tongue coating microbiome and traditional tongue diagnosis [26][27][28][29]. Jiang et al. investigated 19 gastritis patients with a TCM Cold Syndrome or TCM Hot Syndrome tongue coating and 8 healthy controls by Illumina paired-end, double-barcode 16S rRNA V6 tag sequencing. Han et al. sequenced the V2-V4 region of 16S rRNA gene by pyrosequencing to investigate the tongue coating microbiome in patients with colorectal cancer and healthy controls. Their results indicated that the richness of the bacterial communities in the patients with thin tongue coating and healthy controls was higher than in the patients with thick tongue coating.
In the past few years, pyrosequencing of the 16S rRNA gene and sequencing-by-synthesis of metagenomics have been the widely applied technologies in the study of microbial communities. [30][31][32][33][34][35][36][37][38][39]. Pyrosequencing technology has the benefits of relatively long length of sequencing read and the drawbacks of high reagent cost and high error rates in homopolymer repeats. Metagenomic analysis requires deep sequencing data mining and large amounts of sequencing reads for gene assembly and annotation [24,27,40]. Ligationbased sequencing technology provides inherent error correction by two-base encoding, which makes the platform much more accurate [34]. Due to the short sequencing length, this methodology has a few restrictions in the microbiome diversity research.
In the present study, we develop an effective approach using multigroup amplification and massively parallel ligation-based sequencing technology to determine the bacterial diversity. In addition, we compare the tongue coating microbial diversities of different TCM tongue coating types using this method.

Samples.
Tongue coating samples were collected from two groups of lung cancer patients in Nanjing Chest Hospital. Each group represented a specific TCM tongue coating type: the thin-white type and the white-greasy type ( Figure 1). A total of 13 subjects were collected in the morning and all the volunteers had no breakfast before sampling. All subjects had no oral inflammation and had refrained from brushing teeth and drinking colored beverage for 16 hours before testing. The tongue coating samples were collected with sterilized cotton swabs, mixed with 1 ml of Ringer's solution, and stored in −20 ∘ C immediately until extracting the total microbial genome DNA. All the 13 samples were investigated by DGGE analysis (data not shown). Based on the DGGE results, the bacteria composition was significantly different between different tongue coating types. Six typical samples from two tongue coating types were chosen for thorough parallel sequencing, and the clinical parameters of subjects were shown in Table 1.

Multiple Group PCR Primers
Design. The multigroup PCR primers were designed based on Human Oral Microbiome Database (HOMD) (http://www.homd.org/). The 16S rRNA gene sequences of a genus containing 4 species at least in HOMD were picked out as a group and aligned by MEGA (v 4.0.2). Totally, 421 species belonging to 27 genera were selected. The conserved regions flanking the variable region can be used for the multigroup PCR primers design. A specific group primer consisted of 22 bp of sequences containing Eco57I recognition site and 20 bp of conserved region sequences ( Figure 2). Table 2 indicates all the 26 pairs of group primers generated.

DNA Extraction, Amplification, and Ligation Sequencing
Library Preparation. All the samples were centrifuged and resuspended in 1 ml of 10x TE buffer (pH 8). The suspension was mixed with 100 l of lysozyme (200 mg/ml) and incubated at 37 ∘ C for 1 h. Lysis solution containing 50 l of SDS (10%, v/v) and 20 l of Proteinase K (20 mg/ml) was added and the mixture was incubated at 55 ∘ C for 3 h. The total bacterial genome DNA was obtained by phenol-chloroform extraction and ethanol precipitation.
The PCR amplification was carried out separately by each of group primers for each sample with the same condition except for the annealing temperature. All the PCR reactions were performed in a final volume of 25 l containing 1x PCR buffer (Takara), 2.5 mM MgCl 2 , 200 M each dNTP, 2.5 U of Taq polymerase, 20 ng of template DNA, and 1 M each forward and reverse primer. The PCR mixtures were initially denatured at 95 ∘ C for 3 min, followed by 30 cycles: 45 s at 94 ∘ C, 45 s at ( m − 5) ∘ C, and 45 s at 72 ∘ C. The PCR amplicons were digested using the enzyme Eco57I (Fermentas, Burlington, Canada) according to manufacturer's instructions. The expected fragment DNA was acquired by agarose gel electrophoresis and purified by QIAEX II Gel Extraction Kit (Qiagen, Hilden, Germany). PCR amplicon mixture was prepared by pooling approximately equal amounts of recovered fragments from the same sample. After blunting the 3 protruding termini of mixtures by T4 DNA  polymerase, four sequencing libraries were constructed for the ligation-based sequencing system.

Massively Parallel Ligation Sequencing.
In order to maximize the sequencing capacity and simplify the workflow of sample preparation, all the samples were operated in a single sequencing run with barcodes added. Six barcodes were ligated to the 3 end of the library templates with T4 DNA ligase as unique adaptors (Table 1). SOLiD P1 adaptors (5 -CCACTACGCCTCCGCTTT-CCTCTCTATGGGCAGTCGGTGAT-3 ) were ligated at the 5 end of templates, followed by standard SOLiD library preparation protocol.

Construction of Reference Sequences.
Due to the characteristic of 2-base encoding in SOLiD sequencing, the sequencing reads are in color-space format "0, 1, 2, 3," which stands for the permutation of the adjacent bases. The brief data processing pipeline in our study was designed as Figure 2 shows, because the novel microbial analysis methods are inappropriate. A total of 1,049,433 unaligned 16S rDNA sequences of both bacteria and archaea were downloaded from Ribosomal Database Project (RDP) (http://rdp.cme.msu.edu) resource (release 10, update 13). Using in silico analysis of amplification by the designed primers and endonuclease digestion, 26 groups of fragments were selected. Repetitive sequences were moved to construct 26 groups of reference sequences (REF-DB). Taxonomies of REF-DB (TAXA-DB) were assigned using the original full-length 16S rRNA gene sequences by the RDP online classifier (http://rdp.cme.msu.edu/classifier/classifier.jsp) at 80% confidence cut-off. If the reference was yielded from more than one full-length 16S rRNA gene sequence, each of the original sequences was assigned separately. From the genus level, the taxonomy shared by three-fourths majority of the full length sequences was defined as the taxonomy of the reference. Otherwise, taxonomy of higher level was compared until the domain level.

Analysis of Sequencing Data.
Sequencing reads were split into six samples according to the 4-mer barcodes in P2 adaptor. In color space, all the samples were aligned separately to REF-DB by Corona Lite Program (v4.2, http://solidsoftwaretools.com) with up to 3 color-space mismatches. For unique reads (uniquely placed matches), a few steps of filtering were performed. Firstly, the matching position should be at the 5 end of the reference sequence (forward-strand matching) or reverse compliment (reversestrand matching). Meanwhile, the top four positions of reads were checked with one mismatch in color space to filter out nonspecific amplification or false endonucleasedigested products. Secondly, the abundance of reads matching "forward-strand" and "reverse-strand" references was   counted separately and the reads with less than 5 counts in both strands were discarded. Finally, Operational Taxonomic Units (OTUs) were identified according to matched sequences in the REF-DB and corresponding taxonomy information in TAXA-DB. For the nonunique reads that matched the reference in more than one location, all of the matched references were assigned at the phylum level according to TAXA-DB. A matched read was assigned to a specific taxonomy, while all the references it matched belonged to accordant phylum. Otherwise, the read was denoted as "undefined." Both the unique and nonunique matched reads were summarized by a set of Perl scripts.

Phylogenetic Analysis.
The original full-length sequences of matched OTUs were aligned using ClustalW, and a relaxed neighbor-joining tree was built by PHILIP 3.68. Weighted and unweighted UniFrac were run using the resulting tree and environment annotation of number of hits. PCA was performed on the resulting matric of distances between each pair of samples.

Prediction of Metagenome from 16S rRNA Gene Data.
The PICRUSt project aims to support prediction of the unobserved character states in a community of organisms from phylogenetic information about the organisms in that community. This program was used to predict metagenome abundance from 16S rRNA gene data.

Sequencing Performance of SOLiD Reads.
One-quarter of slide was used to perform the ligation-based sequencing, and 84,399,432 reads of 35-base length (2.75 gigabases in total) were captured after removing sequences of insufficient quality. Splitted by 4-mer barcodes, a total number of 75,115,494 reads (89%) were generated ( Table 3). The difference of sequencing throughput among six samples was probably due to the sample quality or the procedure of library preparation.

Taxa
Assignment of the Sequencing Reads. Since SOLiD system employs 2-base encoding and color-space strategy, a single color change is a measurement error, two adjacent color changes may result from a single nucleotide variation in base space, and three color-space mismatches might imply two adjacent variants compared to reference. We used up to three color changes in 35-base length as the alignment parameter, which implied at most two nucleotide mismatches. Consequently, the sequence similarity was more than 94%, which corresponded to the genus level classification. Aligned to REF-DB, matched and uniquely matched reads were summarized in Table 3.
To increase the accuracy of taxonomy assignment, two steps of validation were performed before OTUs definition as Methods described. Altogether, we discarded 33.75% of the uniquely matched reads, leaving 647,375 sequencing reads for taxonomy analysis. Compared to TAXA-DB, number of OTUs was identified, ranging from 409 to 669 for six samples, respectively. Most of the OTUs (more than 93%) could be assigned at the genus or lower level, while a very small amount of OTUs was just assigned at the phylum or higher level (Table 4).
In order to maximize the high-throughput usable data for deep analysis, both the uniquely and nonuniquely matched reads were classified at phylum level. The percent of assigned reads was ranging from 58.8% to 65.6%. Compared with the results based on unique reads, the proportion of phyla Firmicutes and Actinobacteria had dramatically increased (data not shown).

Shared Genera of the Same Tongue Coating Type Samples.
We compared the shared genera within the same tongue coating type samples. There were 117 genera shared between the thin-white tongue coating type samples. Of these genera, 58 were from Firmicutes, 37 genera were from Proteobacteria, and 12 genera belonged to Bacteroidetes. For the white-greasy tongue coating type samples, 96 genera were observed to be shared. 48 genera belonged to Firmicutes, 28 genera belonged to Proteobacteria, and 11 genera belonged to Bacteroidetes, respectively.

Comparison of Bacterial Diversity between Different
Tongue Coating Type Samples. There were 77 genera shared across the six samples ( Figure 5). 35 genera belonged to Firmicutes, 23 genera belonged to Proteobacteria, and 8 genera belonged to Bacteroidetes. We compared the genera observed in one tongue coating type but not the other. Figure 6 showed the genera that appeared in all the thin-white tongue coating type but existed in none white-greasy tongue coating type sample. A total of 12 genera were observed only in the thin-white tongue coating type samples: Pelomonas, Haemophilus, Thioalkalispira, and Zoogloea (Proteobacteria); Jeotgalibacillus, Granulicatella, Lachnobacterium, Peptostreptococcus, Anaeromusa, Parasporobacterium, and Sporobacterium (Firmicutes); and Prevotella (Bacteroidetes). Figure 7 showed the genera that appeared in all the white-greasy tongue coating type but did not exist in all the thinwhite tongue coating type sample. The white-greasy tongue coating type samples were specifically unique for 6 genera: Anaerostipes, Lactobacillus, Anaerobacter, Ruminococcaceae Incertae sedis, and Oribacterium (Firmicutes) and Acidithiobacillus (Proteobacteria). Analyzed by Student'stest, 19 genera performed significantly different between two tongue coating types (Table 5).
Analyzed by UniFrac software, the thin-white tongue coating type and white-greasy tongue coating type were observably different at the genus level as shown in principal component analysis (PCA) plot (Figure 8). The three thinwhite tongue coating subjects had relatively similar microbial diversity (Bonferroni-corrected value is 0.25), while three  white-greasy tongue coating subjects behaved clear deviation (Bonferroni-corrected value, all >0.5).

Metagenome Prediction.
PICRUSt was used to predict a microbial community metagenome based on 16S gene data. PICRUSt results were then analyzed using LEfSe to identify microbial functions that were significantly different in their relative abundance among groups. The top of differentially abundant bacterial functions were "oxidative phosphorylation," "ribosome," "amino sugar and nucleotide sugar metabolism," and "secretion system."

Discussion
In this study, we explored a method to detect the microbial diversity using 16S rRNA gene by high-throughput SOLiD sequencing system. The ligation-based system features 2-base encoding, which is a proprietary mechanism that interrogates each base twice. The 2-base encoding algorithm filtered raw errors after sequencing, providing built-in error correction. The output of sequencing reads is in the format of color space, which means that the output reads must align to color-space reference. Although the short length is the main drawback of SOLiD platform compared to other sequencing technologies, our method could extend the effective length of sequencing read to more than 100 bp. First, the sequencing length was extended to 51 bp, by adding 16 nucleotides digested by Eco57I. Second, concerning that the two strands of library DNA could be sequenced in 5 to 3 direction, the usable sequencing length was equivalent to 102 bp, which was close to the length of one hypervariable region of 16S rRNA gene. Meanwhile, the validation of first four sequenced nucleotides, which were conserved from the designed primers, could enhance the sequencing accuracy in the validation step.
To evaluate the effect of this methodology, six tongue coating samples from different TCM tongue coating types were investigated. In TCM theories, the tongue coating reflects the status of physiological and clinicopathological changes of inner parts of body. As a common type, thinwhite tongue coating is a symbol of good health. A whitegreasy tongue coating like powder indicates turbidity and external pathogenic heat. The abundant bacterial groups BioMed Research International  Using the 16S rRNA gene, the core function of tongue coating microbiome could be predicted by the PICRUSt software. Pathways encoding for carbohydrate metabolism and oxidative phosphorylation metabolism were detected.  The oral cavity is a major gateway to the human body. Microorganisms colonizing in the oral cavity have a significant probability of spreading to the stomach, lung, and intestinal tract. The metagenome prediction validated the role of tongue coating. In our results, except for some common genera in human body, a few environmental bacteria were observed as well.  One of the explanations is that the sequencing results are not precise. The other hypothesis is that these genera are still unknown bacteria, which have similar sequences with the environmental bacteria from database. The more bacteria are sequenced, the more affirmatory genera could be defined.
The composition of the microbial communities on the tongue coating varies between individuals. UniFrac principal coordinates analysis showed no apparent clustering of microbial communities between two types of tongue coating. This may be indicative of the fact that, despite the many different habitats on the human tongue, many bacterial species are shared among those habitats.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.