rRNA Analysis Based on Long-Read High-Throughput Sequencing Reveals a More Accurate Diagnostic for the Bacterial Infection of Ascites

Traditional pathogenic diagnosis presents defects such as a low positivity rate, inability to identify uncultured microorganisms, and time-consuming nature. Clinical metagenomics next-generation sequencing can be used to detect any pathogen, compensating for the shortcomings of traditional pathogenic diagnosis. We report third-generation long-read sequencing results and second-generation short-read sequencing results for ascitic fluid from a patient with liver ascites and compared the two types of sequencing results with the results of traditional clinical microbial culture. The distribution of pathogenic microbial species revealed by the two types of sequencing results was quite different, and the third-generation sequencing results were consistent with the results of traditional microbial culture, which can effectively guide subsequent treatment. Short reads, the lack of amplification, and enrichment to amplify signals from trace pathogens, and host background noise may be the reasons for the high error in the second-generation short-read sequencing results. Therefore, we propose that long-read-based rRNA analysis technology is superior to the short-read shotgun-based metagenomics method in the identification of pathogenic bacteria.


Introduction
Distinguishing and identifying the microorganisms that cause infection are very important for the treatment and survival of patients. The traditional pathogenic diagnosis method generally involves culture, nucleic acid amplification assays, and serological analysis [1] and presents shortcomings such as a low positivity rate, inability to identify uncultured microorganisms, and time-consuming nature. Moreover, the existing molecular tests cannot detect the characteristics of newly evolved genes in pathogens that spread among humans, animals, and the environment [2].
Clinical metagenomics is the application of modern genomics technology to directly study microbial communities in samples from patients without the need for the isolation and laboratory cultivation of individual species [3]. Next-generation sequencing (NGS), also known as high-throughput sequencing, allows the simultaneous and independent sequencing of thousands to billions of DNA fragments [4]. Clinical metagenomics next-generation sequencing (mNGS) is a method for the comprehensive analysis of microorganisms and host genetic material (DNA and RNA) in specimens from patients [3]. mNGS can be used to detect any pathogen and provides auxiliary genomic information required for evolutionary tracking, strain identification, and drug resistance prediction [4]. mNGS has been applied in clinical and public health fields. For example, NGS was used to detect fungal pathogens in at-risk patients for invasive fungal disease (IFD) from cell-free DNA (cfDNA) in plasma, so as to guide the treatment decision in pediatric patients at high risk for IFD [5]. Additionally, in the field of public health, researchers used whole-genome sequencing (WGS) to conduct the epidemiological analysis of an Acinetobacter baumannii infection that broke out in a British hospital in 2010 [6].
At present, mNGS can be performed with either secondgeneration sequencing technology or third-generation sequencing technology. The available second-generation short-read sequencing platforms mainly include the Illumina, SOLiD, Ion Torrent, and BGISEQ systems, while the third-generation long-read sequencing platforms mainly include the Nanopore and PacBio systems [7]. PacBio sequencing is a long-read sequencing platform that uses a single-molecule real-time (SMRT) chip as a synthetic sequencing vector [8]. PacBio has two sequencing modes: highly accurate long reads (HiFi reads) and continuous long read (CLR). HiFi is referred to circular consensus sequencing (CCS). CCS obtains consistent sequences from multiple transmissions of a single-template molecule and generates accurate reads from a single subread of noise [9]. The accuracy of CCS is more than 99% [10], but the error rate of a CLR is relatively high (around 15%) [10]. The sequence data generated by PacBio sequencing are helpful for determining the species and genus of pathogenic microorganisms involved in infections and do not require a traditional pathogenic microorganism culture step. Compared with the traditional detection technology for pathogenic microorganisms, this technology presents great advantages in the detection of microorganisms that are difficult to isolate and culture in clinical settings and some uncultured microorganisms. PacBio sequencing has the potential to identify pathogenic microorganisms in the clinical environment, which is helpful for making clinical treatment decisions.
Here, we describe results of traditional clinical microbial culture for an ascitic specimen from a patient with liver ascites. The genome of ascites was extracted, and the PacBio sequencing and second-generation Illumina sequencing was used to analyse the ascitic microbiome. We compare the two types of sequencing results with the results of traditional clinical microbial culture to assess the advantages and disadvantages of different sequencing strategies for specific specimen types. This study may contribute to clinical microbial diagnosis and treatment strategy formulation in the future.

Materials and Methods
2.1. Sample Collection and DNA Extraction. According to clinical standard operating procedures, the medical staff performed abdominal puncture to collect 3-5 ml of ascites and injected the fluid into a culture bottle (Zhengzhou Antu Biological Co., Ltd., China) for enrichment culture. After the culture was shown to be positive, it was transferred to Columbia blood agar plates (Zhengzhou Antu Biological Co., Ltd., China) for anaerobic culture at 35°C for 48 hours. Subsequently, MALDI-TOF-MS (VITEK MS, BioMerieux SA, BioMerieux Inc., France) [11] was used to identify the cultivated pathogenic microorganisms.
The sodium chloride, Tris-HCl, EDTA (STE) DNA extraction method [12] was used to extract the total DNA of the microorganisms in the ascitic samples. First, 1 ml of ascites was centrifuged at 10,000 × g for 1 min, after which the supernatant was discarded, and 1× STE buffer was added to resuspend the pellet. Then, an MP Biomedicals sample preparation instrument was used to homogenize the ascitic specimen with beads, and the DNA of the ascitic specimen was extracted according to the STE method. We name the sample as M173251, which was used for subsequent secondand third-generation sequencing.
2.2. Next-Generation Sequencing and Data Analysis. We chose the Illumina platform to carry out short-read NGS. A library was constructed using the Hieff NGS Fast Tagment DNA Library Prep Kit for Illumina (YEASEN, China), and the library was sequenced in an Illumina NovaSeq sequencer.
There is an existing mature commercial analysis process for Illumina sequencing [13]. High-quality sequencing data were generated by removing low-quality reads, adapter contamination, and duplicated reads using Trim Galore [14]. Host contamination present in the sample required the comparison with the host sequence to filter out possible sequences from the host [15][16][17]. Human sequence data were excluded and mapped to a human reference genome (hg19) using Bowtie2 software (parameter settings: -endto-end, -sensitive, -I 200, and -X 400). After removing human sequences, Kraken2 software (v2.0.7-beta, https:// www.ccb.jhu.edu/software/kraken2/) was used to compare and annotate the remaining sequences with the MiniKra-ken2_v2 database, to integrate the comparison results for the sample, and to obtain species abundance statistics. Starting from the relative abundance tables for different classification levels, the top 9 species with the relative abundance in the sample (group) were selected, the remaining species were set as others, and the corresponding species annotation results for the sample were plotted in relative abundance histograms at different classification levels.

Third-Generation Sequencing and Data Analysis.
We chose the PacBio platform to carry out the third-generation high-throughput sequencing analysis of full-length rRNA sequences based on long reads. According to the full-length 27F and 1492R primers for 16S rRNA, the specific primers 27F_0074 (5 ′ TGACAGTATCACAGTGAGRGTTTGATYNT GGCTCAG 3 ′ ) and 1492R_0074 (5 ′ CACTGTGATACTGT CATASGGHTACCTTGTTASGACTT 3 ′ ) with barcodes were synthesized (where the italicized underlined sequence is the barcode sequence). PCR amplification was carried out, and the products were purified, quantified, and homogenized to generate a sequencing library (SMRT Bell). The constructed library was inspected first, and the qualified library was sequenced with PacBio Sequel.
For data preprocessing, after exporting the PacBio offline data to circular consensus sequencing (CCS) files (the CCS sequence was obtained using the Smrtlink tool provided by PacBio), the following three main steps were performed: (1) CCS identification: Lima v1.7.0 software was used to 2 BioMed Research International identify CCS through barcodes and to obtain barcode CCS sequence data; (2) CCS length filtering: the barcode CCS was filtered to obtain the effective sequence; and (3) chimera removal: optimization-CCS sequence was obtained by using UCHIME v4.2 software to identify and remove the chimera sequence. Divide OTU, diversity analysis and difference analysis are information analysis. The PacBio sequencing results were clustered with Usearch software [18] at a 97% similarity level to obtain OTUs, and the OTUs were annotated using ribosomal database project (RDP) classifier based on the Silva (bacteria) and UNITE (fungi) taxonomic databases. The comparison of representative OTU sequences with a microbial reference database by an RDP classifier provides species classification information corresponding to each OTU, after which the sample community composition can be quantified at each level (phylum, class, order, family, genus, and species). QIIME software was used to generate species abundance tables at different taxonomic levels, and R language tools were then used to draw community structure diagrams at each taxonomic level for the sample.
According to the existing taxonomic database of microbial species provided by the National Center for Biotechnology Information (NCBI), the species abundance information obtained by sequencing was regressed to the phylogenetic tree of the database by using MEGAN [18] software. The MEGAN taxonomic tree diagram was then drawn to comprehensively elucidate the evolutionary relationships and differences in abundance among all microorganisms in the samples according to the entire classification system.

Results and Discussion
Hundreds of single colonies were obtained through clinical bacterial culture, and the morphology of the colonies was similar. A single colony was selected to identify the cultivated pathogenic microorganism via MALDI-TOF-MS, resulting in its identification as Clostridium difficile (C. difficile).
Illumina sequencing generated 15,636,287 reads, each of which was 150 bp in length, and 637,196 reads remained after filtering out the host sequences. We selected the top nine species in terms of abundance after removing the host sequences, merged the remaining species into a group of others, and drew the species distribution map for the sample at the species level (Figure 1(a)). With the exception of the others group (40%), Staphylococcus aureus (S. aureus) showed the highest proportion at close to 40%, followed by Klebsiella pneumoniae at 22%, Burkholderia pseudomallei at approximately 14%, and Aeromonas hydrophila at approximately 7%. Small contributions of Pasteurella multocida, Campylobacter jejuni, Mycoplasma mycoides, Polynucleobacter necessarius, and Whitefly endosymbionts were also found, whereas C. difficile was not.
A total of 9385 CCS sequences were obtained from Pac-Bio sequencing after identification based on barcodes. The number of sample sequences at each stage was processed according to the statistical data to evaluate data quality. The evaluation results for the sequencing data of the sample are shown in Table 1. We filtered out sequences other than those of 1 kb-1.8 kb in length and obtained 8702 valid CCS sequences, among which 8562 CCS sequences were used for subsequent analysis, which accounted for 91.23% of the CCS reads obtained by sequencing. After statistical quality control and filtering, the number of sequences of reads in the corresponding length range in the sample was plotted in an effective tag length distribution map (Figure 2), and most of the sequences were distributed between lengths of 1400 bp and 1500 bp.
The sequencing results identified 31 OTUs through clustering. We selected the top nine species according to abundance and merged the remaining species into a group of others, after which the species distribution map of the sample was drawn at the species level (Figure 1(b)). C. difficile showed the highest proportion, of approximately 73%, followed by S. aureus, at approximately 14%; Escherichia coli (E. coli), at approximately 2%; Achromobacter xylosoxidans (A. xylosoxidans), at approximately 1%; and the others group, accounting for approximately 6%; small contributions of Stenotrophomonas sp., Pseudomonas poae, Alcaligenes faecalis, and unclassified microbes were also found.
A MEGAN taxonomy dendrogram was drawn on the basis of the PacBio sequencing results. The MEGAN taxonomy dendrogram (Figure 3) shows the predominance of the Peptostreptococcaceae (to which C. difficile belongs) and Staphylococcus groups, which was consistent with the results of cluster analysis.
C. difficile is an obligate anaerobic Gram-positive bacillus that is ubiquitous in nature and is transmitted orally through feces [20]. C. difficile produces metabolically inactive endospores, making it is resistant to gastric acid and most kinds of antibiotics [21]. C. difficile can produce two toxins: enterotoxin A and cytotoxin B [22]. The pathogenicity of C. difficile is based on the fact that at least one of the two toxins acts as a glycosyltransferase to modify guanosine triphosphatases in intestinal epithelial cells and cause the disruption of the actin cytoskeleton [23]. Clostridioides difficile infection (CDI) is a symptomatic infection caused by the spore-forming bacterium C. difficile, and it is one of the common infections found in hospitals [24]. The symptoms of CDI include watery diarrhea, fever, loss of appetite, nausea, and abdominal pain, which may lead to colitis in severe cases [25]. The complications of CDI include dehydration, low blood pressure, electrolyte imbalance, and more serious problems such as bowel perforation, kidney failure, or even death [26]. CDI is a common cause of diarrhea [27]. Among all cases of antibiotic-related diarrhea, a total of 20%-30% are caused by CDI [28]. Because C. difficile is an obligate anaerobe, it is difficult to isolate and cultivate in a clinical setting. The current methods for the clinical diagnosis of CDI include nuclear acid amplification tests (NAAT), toxigenic C. difficile culture (TC), stool cytotoxicity assay (CTA), enzyme immunoassays (EIA) for the A1B toxin and glutamate dehydrogenase (GDH), and the C. difficile cytotoxin neutralization assay [29].
The medical history of the patient is complicated. He suffered from primary liver cancer, hepatitis B, and cirrhosis. He was hospitalized again for hepatic ascites and abdominal cavity infection. C. difficile was cultured from his ascitic fluid, and the third-generation sequencing results also 3 BioMed Research International indicated the occurrence of CDI. Mild to moderate CDIs are treated with metronidazole, and for severe infections, vancomycin is administered orally [30]. Although only C. difficile was cultured from the ascites of this patient, the abdominal infection was indicated to be a compound infection. The PacBio sequencing results also showed that in addition to C. difficile, the ascites contained S. aureus, E. coli, A. xylosoxidans, and other microorganisms. Infections caused by methicillin-susceptible S. aureus (MSSA) are usually treated with β-lactam antibiotics, such as cefazolin; however, methicillin-resistant S. aureus (MRSA) infections show resistance to penicillin, oxacillin, and other β-lactam antibi-otics, so it is necessary to use more effective drugs such as linezolid against Gram-positive bacteria [31]. A. xylosoxidans is a nonfermentative aerobic Gram-negative bacillus that is widely distributed in the environment and mainly causes health care-associated infections [32]. Most of the isolates of this species show in vitro sensitivity to carbapenems and piperacillin/tazobactam [33]. After the patient was administered metronidazole, cefoperazone, sulbactam, moxifloxacin, linezolid, and biapenem for anti-infection treatment, his symptoms improved significantly. Metronidazole is used to treat C. difficile, cefoperazone, sulbactam, biapenem, and moxifloxacin and exerts obvious antibacterial activity against E. coli and A. xylosoxidans, while linezolid is used to treat infections caused by MRSA.
To our knowledge, this is the first study to apply PacBio sequencing to explore the ascitic microbiome. In this study, clinical bacterial culture resulted in the cultivation of only C. difficile. Both the Illumina and PacBio sequencing results showed that the patient's ascites contained more than one pathogenic microorganism, but the species distribution indicated by the second-and third-generation sequencing results was significantly different. The second-generation sequencing results showed that S. aureus accounted for the highest   proportion (approximately 40%) of the sample in the species distribution map at the species level (Figure 1(a)), while C. difficile was not detected. In contrast, the third-generation sequencing results showed that C. difficile accounted for approximately 73% of the species distribution map at the species level, followed by S. aureus, accounting for approximately 14% (Figure 1(b)). The microbial nucleic acids of the sample are dominated by background human host sequences in Illumina sequencing, where the vast majority of reads (usually >99%) come from human hosts [4], while relatively few microbial reads are obtained. The rRNA long-read sequencing method used in this study targets and amplifies 16S rRNA gene sequences in the sample before sequencing, and there is no background noise from the human host. First, the 16S rRNA gene sequence in the sample is amplified to achieve signal amplification, which is more conducive to the detection of some less-abundant microorganisms. In contrast, library construction is performed directly without amplification in Illumina sequencing, which may mean that some low-abundance microorganisms are not detected. More importantly, sequences from different microbial species are preferentially targeted when the secondgeneration shotgun macrogene library is constructed. For example, the secondary structure of some microorganisms is complex, and the GC content may be high. PCR amplification is also required for the construction of the library, and the same amplification efficiency is not achieved for these sequences as for other microorganism sequences during library construction. Even under PCR-free secondgeneration sequencing library construction, due to the exis-tence of special structures, preferences will arise in steps such as the addition of adapters. This is not the case for full-length rRNA third-generation sequencing because rRNA is quite conserved. When building a library for third-generation sequencing, the universal rRNA sequencing primers that are used are also fixed and conserved. Most of the rRNA sequences of different bacteria are the same, and only the sequences of variable regions are different. Therefore, compared with second-generation sequencing, it is easier to obtain consistent amplification efficiency under this method. The result of PacBio sequencing is a full-length sequence of the 16S rRNA gene, while the reads of Illumina sequencing are only 150 bp in length, meaning that long DNA strands are interrupted. After sequencing is complete, splicing and assembly are performed, which may also cause errors. It is worth mentioning that to obtain results in the shortest amount of time, the reads used in the current commercial and clinical trials of mNGS for clinical microbiological detection are generally only 50-75 bp in length, which is much shorter than the 150 bp reads used in this study. All of these factors may limit the sensitivity of Illumina sequencing for pathogen detection, resulting in differences in the microbial species and contents identified between Illumina and PacBio sequencing results. PacBio sequencing does not require the interruption of the genome and can produce longer reads [9], while second-generation sequencing interrupts long DNA strands during library construction, and after sequencing, the reads need to be spliced and assembled. At present, the reads generated by mainstream second-generation sequencers are generally 50-150 bp in length [7]. Nanopore sequencers are another third-generation long-read technology that can identify DNA or RNA sequences by detecting the characteristics of current signals caused by charged biomolecules (such as DNA or RNA) through nanopores [34], among which the MinION platform is most commonly used. Compared with PacBio and Illumina sequencing, the accuracy of Nanopore sequencing is much lower, the overall error rate prior to correction is about 7% [35], and sequencing errors are mainly caused by the insertion and deletion of bases [36]. When there are fewer consecutive identical bases in a DNA sequence, the number of bases recognized by Nanopore sequencing may produce errors, which is one of the reasons for the low accuracy of this approach.
We compared Illumina NovaSeq sequencing, PacBio sequencing, and Nanopore sequencing ( Table 2). PacBio is superior to NovaSeq and Nanopore in terms of reading length, accuracy, and cost while sequencing time is in favor of ONT.
The research results obtained via the 16S rRNA PCR approach and NGS (Illumina HiSeq) for the ascitic microbiota have been compared, and it was found that wholegenome shotgun-based NGS on the Illumina platform is more suitable for describing the microbiome of ascitic fluid or other low-abundance bacterial DNA samples than the 16S rRNA PCR approach based on the Illumina platform [39]. Our experimental results showed that 16S rRNA sequencing based on the PacBio platform was more accurate than mNGS based on the Illumina platform; thus, the

Conclusions
Compared with third-generation long-read sequencing technology, second-generation metagenomic DNA sequencing technology produces shorter reads, and it is difficult to obtain information such as the full-length sequences of the drug resistance genes of pathogenic microorganisms via this approach; furthermore, it may introduce bias and shows low detection rates of some low-abundance intracellular bacteria and fungi with thick cell walls, long sequencing times and other defects [8]. Long-read PacBio sequencing exhibits a higher accuracy rate than second-generation sequencing. Although the sequencing cost per GB of data is higher than that of Illumina sequencing, the amount of data required for pathogenic microorganism sequencing is much lower, and the overall cost is also lower than that associated with Illumina sequencing. In addition, the amount of thirdgeneration full-length rRNA data obtained via the method used in this research is small. Hence, compared with second-generation short reads, the dependence of the subsequent analysis on high-performance computers is low, few calculations are required, and the analysis results can be obtained faster. In contrast, traditional bacterial culture takes a long time, and many pathogenic microorganisms are not easy to cultivate. For example, only C. difficile was cultivated from the compound infection of the patient examined in the present study. The limitation of this study is that the number of samples is too small and only for ascites samples. We have not studied on other specimens such as blood and alveolar lavewater. In the next further study, we will expand the number of clinical samples and apply the current technology to the clinic. As PacBio sequencing technology continues to mature, the cost of sequencing will decrease, and the accuracy will be further improved. PacBio  Figure 3: Phylogenetic diversity of the ascitic sample sequences calculated by MEGAN on the basis of PacBio sequencing. Each circle represents a taxon in the NCBI taxonomy and is labeled with its name and the number of reads that were either directly assigned to the taxon or indirectly assigned via one of its subtaxa. The size of the circles is logarithmically scaled to represent the number of reads assigned directly to the taxon [19].

Data Availability
The data used to support the findings of this study have been deposited in NCBI-SRA repository (SRR12600489 and SRR12600488).