Genome Survey and SSR Analysis of Camellia nitidissima Chi (Theaceae)

Camellia nitidissima Chi (CNC), a species of golden Camellia, is well known as “the queen of camellias.” It is an ornamental, medicinal, and edible plant grown in China. In this study, we conducted genome survey sequencing and simple sequence repeat (SSR) identification for CNC using the Illumina sequencing platform. The 21-mer analysis predicted a genome size of 2,778.82 Mb, with heterozygosity and repetition rates of 1.42% and 65.27%, respectively. The CNC genome sequences were assembled into 9,399,197 scaffolds covering ∼2,910 Mb, with an N50 of 869 bp. Its genomic characteristics were similar to those of Camellia oleifera. In addition, 1,940,616 SSRs were identified from the genome data, including mono- (61.85%), di- (28.71%), tri- (6.51%), tetra- (1.85%), penta- (0.57%), and hexanucleotide (0.51%) motifs. We believe these data will provide a useful foundation for the development of novel molecular markers for CNC as well as for further whole-genome sequencing of CNC.


Introduction
Camellia nitidissima Chi (CNC), a species of golden Camellia, is well known as "the queen of camellias" [1,2]. It is grown mainly in Guangxi Province, China, and has been introduced into Fujian Province. C. nitidissima is a well-known ornamental plant because of its golden yellow flowers [2], which contain several flavonoids and polyphenols [3]. In addition, C. nitidissima is a well-known medicinal and edible plant in China [4]. The leaves and flowers of CNC have antioxidant and antimicrobial activities [1,5-7] and are used as pancreatic lipase inhibitors [8] and as potential anticancer drugs for gastric and colon cancers [9,10].
Simple sequence repeats (SSRs), also known as microsatellites, are stretches of DNA consisting of tandemly repeated short units, 1-6 base pairs (bp) in length [11], which have been identified and characterized in the genus Camellia. In the last 15 years, several SSR markers have been developed from microRNA (miRNA), mRNA, genome, and chloroplast sequences to study genetic variation and population structure in different species of the genus Camellia, such as C. sinensis, C. osmanthus, C. vietnamensis, C. gauchowensis, C. huana, C. sasanqua, C. oleifera, C. japonica, and C. reticulata. In the last three years, SSR markers in the genus Camellia have become an active research topic, with at least 14 studies [28-41] covering genome-wide SSR markers, SSR identification in single resistance genes, gene families, and whole sets of transcription factors, and the development of SSR databases. For example, an SSR marker was used as a molecular marker to tag the blister blight disease-resistance trait of C. sinensis [29,35]. Similarly, 72 SSR loci were detected in 14 and 15 phospholipase D gene families of C. sinensis for marker-assisted selection of resistance genes [37]. In addition, 3,687 SSR loci from 2,776 transcription factor gene transcripts were identified for potential use in trait dissection [40]. TeaMiD, a database of simple sequence repeat markers of C. sinensis, includes 935,547 SSRs [41]. However, only 15 polymorphic microsatellite loci have been isolated and characterized from C. nitidissima [42]. Genome-wide SSR markers of C. nitidissima have not been identified because of a lack of genome sequences. Therefore, it is necessary to estimate the genome size and identify genome-wide SSRs in C. nitidissima using next-generation sequencing (NGS), which will be useful for further whole-genome sequencing and for assessing genetic diversity within and among populations.

Plant Materials.
CNC was obtained from Longyan City, Fujian Province, China. The leaf tissue was immediately collected from CNC, washed in sterile phosphate-buffered saline (PBS), frozen in liquid nitrogen, and stored at −80°C for further analysis.

DNA Extraction and Genome Sequencing.
The total DNA of CNC was isolated using the cetyltrimethylammonium bromide (CTAB) DNA extraction protocol [43,44]. The purity and concentration of the obtained gDNA were tested using a NanoPhotometer® spectrophotometer (Implen, CA, USA) and a Qubit® 2.0 fluorometer (Life Technologies, CA, USA), respectively [45]. Sequencing libraries for the quality-checked gDNA were generated using a TrueLib DNA Library Rapid Prep Kit for Illumina sequencing (Illumina, Inc., CA, USA) [45]. The libraries were subjected to size distribution analysis using an Agilent 2100 bioanalyzer (Agilent Technologies, Inc., CA, USA), followed by a real-time PCR quantitative test [45]. The successfully generated libraries were sequenced on an Illumina NovaSeq 6000 platform (Illumina, Inc., CA, USA), and 150-bp paired-end reads with an insert size of approximately 350 bp were generated [45].

DNA Data Cleaning and Genome Assessment.
The obtained raw reads were filtered to obtain clean reads using Trimmomatic version 0.36 (https://www.usadellab.org/cms/index.php?page=trimmomatic) [46]. The quality control (QC) standards for the DNA reads were as follows: (1) trimming adapter sequences; (2) trimming low-quality or N bases (below quality 3) at the start of each read; (3) trimming low-quality or N bases (below quality 3) at the end of each read; (4) scanning each read with a 4-base-wide sliding window and cutting when the average quality per base dropped below 15; and (5) removing reads with fewer than 51 bases.
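QC steps (2) to (5) above can be sketched in a few lines of Python operating on a list of per-base quality scores. This is a simplified illustration of the logic and not Trimmomatic's implementation: adapter trimming (step 1) is omitted, the function name trim_read is hypothetical, and Trimmomatic's handling of the final window differs in detail.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=51):
    """Return the (start, end) slice of a read to keep, or None if it is dropped.

    quals is a list of per-base Phred quality scores for one read.
    """
    start, end = 0, len(quals)

    # Step 2: drop bases below the quality threshold at the start of the read.
    while start < end and quals[start] < leading:
        start += 1

    # Step 3: drop bases below the quality threshold at the end of the read.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # Step 4: slide a fixed-width window along the read and cut at the first
    # window whose mean quality falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break

    # Step 5: discard reads that are too short after trimming.
    return (start, end) if end - start >= min_len else None
```

With the defaults shown, a 65-base read whose first two and last three bases have quality 2 and whose remaining bases have quality 30 is kept as the slice (2, 62), whereas a read that falls to quality 5 after 40 bases is cut at position 39 and then discarded for being shorter than 51 bases.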
To assess contamination from other species, 20,000 reads (10,000 from read 1 and 10,000 from read 2) were randomly selected from the high-quality clean reads and searched against the NCBI nonredundant nucleotide sequence (NT) database using blastn version 2.2.28 (https://blast.ncbi.nlm.nih.gov/Blast.cgi) [47,48], with an E-value threshold of 1 × 10^-5.
The high-quality clean reads from DNA sequencing were subjected to K-mer analysis using Jellyfish version 2.3.0 (https://genome.umd.edu/jellyfish.html) [49], counting only canonical K-mers (-C) with K-mer sizes (-m) of 19, 21, and 23. Genome size, heterozygosity ratio, read duplication ratio, and read error ratio were estimated using GenomeScope version 2.0 (https://qb.cshl.edu/genomescope/) [50] with R version 4.1.3. The repeat rate was estimated as the percentage of K-mers with a depth greater than 1.8 times the main peak depth over the total number of K-mers.
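The two ideas in this paragraph, canonical K-mer counting and the repeat rate derived from a K-mer depth histogram, can be illustrated with the sketch below. It is not a substitute for Jellyfish or GenomeScope, and the function names are hypothetical. One assumption is made explicit: "number of K-mers" is interpreted as K-mer occurrences (depth × number of distinct K-mers at that depth); the text does not state whether distinct or total K-mers were counted.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a K-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads, k):
    """Count canonical K-mers across reads, skipping K-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def repeat_rate(histogram, main_peak_depth, fold=1.8):
    """Percentage of K-mer occurrences with depth above fold x main peak depth.

    histogram maps depth -> number of distinct K-mers observed at that depth.
    Weighting by depth is an assumption of this sketch.
    """
    total = sum(depth * n for depth, n in histogram.items())
    repeated = sum(
        depth * n
        for depth, n in histogram.items()
        if depth > fold * main_peak_depth
    )
    return 100.0 * repeated / total
```

Counting canonical K-mers means that a K-mer and its reverse complement are tallied together, which is appropriate because sequencing reads originate from both DNA strands.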

Genome Assessment.
We estimated the CNC genome size using K-mer sizes of K = 19, 21, and 23 (Table 2). Following the recommended K-mer size of 21 [50], the CNC genome size and K-mer depth were 2,778,823,868 bp and 101, respectively (Figure 1). The error and duplication rates of the reads were 0.248% and 0.706%, respectively. The heterozygosity and repeat rates of the sequences were 1.42% and 65.27%, respectively. A heterozygous peak was observed at a K-mer frequency of 50, about half the main-peak depth of 101. Together with the estimated rates, this indicates that the CNC genome has high heterozygosity (heterozygosity rate ≥0.8%) and high repetition (repetition rate ≥50%).
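The classification used in this paragraph can be written out as a short check. The cutoffs (0.8% and 50%) and the input values are taken from the text; the function names are illustrative only.

```python
def classify_genome(het_rate_pct, repeat_rate_pct,
                    het_cutoff=0.8, repeat_cutoff=50.0):
    """Flag a genome as highly heterozygous and/or highly repetitive."""
    return {
        "high_heterozygosity": het_rate_pct >= het_cutoff,
        "high_repetition": repeat_rate_pct >= repeat_cutoff,
    }


def peak_ratio(het_peak_depth, main_peak_depth):
    """Ratio of the heterozygous peak depth to the main (homozygous) peak depth."""
    return het_peak_depth / main_peak_depth


# Values reported for CNC with the 21-mer analysis.
cnc_flags = classify_genome(1.42, 65.27)
cnc_ratio = round(peak_ratio(50, 101), 3)  # 0.495, close to the expected 0.5
```

Both flags are true for CNC, and the heterozygous peak sits at roughly half the depth of the main peak, as expected for K-mers that occur on only one of the two haplotypes.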

Genome Assembly and GC Content Analysis.
The clean reads were assembled into 9,994,482 contigs and 9,399,197 scaffolds using the SOAPdenovo software with a K-mer size of 51 (Table 3). The total lengths of the contigs and scaffolds were 2,844,296,380 and 2,910,885,755 bp, respectively. Among the significant peaks of the CNC contig distribution (Figure 2), the peak located at half the depth of the main peak was the heterozygous peak [44], which further supported the high heterozygosity of the CNC genome. Because of the high heterozygosity, the assembled haploid genome (about 2,910 Mb) was larger than predicted (about 2,779 Mb). The maximum lengths of the contigs and scaffolds were 73,907 bp and 88,303 bp, respectively. The N50 lengths of the contigs and scaffolds were 649 bp and 869 bp, respectively. The GC contents of the contigs and scaffolds were 36.00% and 34.00%, respectively. The GC content of the scaffolds was lower than that of the contigs owing to the presence of N bases. The GC depth analysis (Figure 3) indicated that the GC content of the windows was mostly concentrated in the range of 20-60%, without any apparent abnormalities or GC bias [44]. The GC depth distribution was divided into two layers, which indicated the high heterozygosity of the CNC genome.
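For readers unfamiliar with the assembly statistics reported here, the sketch below shows how N50 and GC content are commonly computed, and why N bases in scaffolds lower the apparent GC content. The function names are illustrative, and this is not the code used to produce Table 3.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def gc_content(seq, include_n=True):
    """GC percentage of a sequence.

    With include_n=True, gap bases (N) count toward the denominator, which
    lowers the GC percentage of gapped scaffolds relative to their contigs.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    if include_n:
        denominator = len(seq)
    else:
        denominator = sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / denominator if denominator else 0.0
```

For instance, gc_content("GCAT") is 50.0, while gc_content("GCATNNNN") is 25.0 when N bases are included in the denominator; the same effect, at a smaller scale, would explain the 36.00% versus 34.00% difference between contigs and scaffolds.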

Discussion
In the genus Camellia, the genomes of C. sinensis and C. oleifera have been sequenced and assembled [53,54]. The genome size of C. sinensis ranged from 3,062.62 Mb (C. sinensis var. assamica) to 3,113.46 Mb (C. sinensis isolate G240). The CNC genome size was close to that of C. oleifera, which was 2,889.51 Mb [54], but smaller than that of C. sinensis. The GC content of C. oleifera was 34.5189% [54], and the median GC content of C. sinensis was 38.5319% in the NCBI genome database. The GC content of CNC was close to that of C. oleifera but lower than that of C. sinensis. These results suggest that CNC is phylogenetically closer to C. oleifera than to C. sinensis, which is consistent with previous studies [55]. The genome assembly strategies used for other species in the genus Camellia, such as Illumina combined with PacBio (or Oxford Nanopore Technologies) and Hi-C-based assembly, can be applied to CNC, and genome assembly is expected to be about as difficult as that of C. oleifera but less difficult than that of C. sinensis. Genome size estimation using NGS becomes more difficult in cases of high heterozygosity and high duplication; the estimate can be further verified by measuring the constant value (C-value) using flow cytometry. SSR motifs containing A or T were more abundant than those containing C or G, and their characteristics and distributions were similar to those reported in previous studies on C. sinensis [41]. Further validation studies of SSR markers are needed for the CNC population.
Table abbreviations: Len, estimated total genome length; Uniq, unique portion of the genome (not repetitive); het, heterozygosity rate; Kcov, mean K-mer coverage for heterozygous bases; Err, error rate; and Dup, duplication rate.
In the current study, the whole genome of CNC was sequenced using NGS for the first time, which will play an important role in future whole-genome sequencing projects. Statistical analysis of the differences in the quantity and motifs of SSRs provided a foundation for the further construction of high-density genetic maps of CNC. Wild CNC is an endangered plant in China. Therefore, the CNC genome survey will have important ecological significance.
Figure note: In the figure, the peak with the highest distribution was the main peak. The heterozygosity of the genome was judged according to the peak at half the depth of the main peak.

Conclusions
In the present study, the genome size of CNC was estimated to be approximately 2,778.82 Mb using the 21-mer analysis, with heterozygosity and repetition rates of 1.42% and 65.27%, respectively. The results showed that the genomic characteristics of CNC were similar to those of C. oleifera. In total, 1,940,616 SSRs were identified in the genome data. We believe these results will provide meaningful data for further genomic studies and a useful basis for the development of novel molecular markers. State-of-the-art sequencing approaches, such as Illumina combined with PacBio HiFi and Hi-C-based assembly, should next be applied to obtain a chromosome-level genome assembly.

Data Availability
The following information was supplied regarding the deposition of DNA sequences: the raw data can be obtained from the Sequence Read Archive at NCBI under accession number SRR19315149. The associated BioProject and BioSample numbers are PRJNA839723 and SAMN28548419, respectively.

Conflicts of Interest
The authors declare that they have no conflicts of interest.