COVID-19 Whole-Genome Resequencing with Redundant Tiling PCR and Subtract-Based Amplicon Normalization Successfully Characterized SARS-CoV-2 Variants in Clinical Specimens

With an increasing number of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) sequences gathered worldwide, we recognize that deletion mutants and nucleotide substitutions that may affect whole-genome sequencing are accumulating. Here, we propose an additional strategy for tiling PCR for whole-genome resequencing, which can make the pipeline robust for mutations at the primer annealing site by a redundant amplicon scheme. We further demonstrated that subtracting overrepresented amplicons from the multiplex PCR products reduced the bias of the next-generation sequencing (NGS) library, resulting in decreasing required sequencing reads per sample. We applied this sequencing strategy to clinical specimens collected in Bangladesh. More than 80% out of the 304 samples were successfully sequenced. Less than 5% were ambiguous nucleotides, and several known variants were detected. With the additional strategies presented here, we believe that whole-genome resequencing of SARS-CoV-2 from clinical samples can be optimized.


Introduction
Whole-genome sequencing data is an essential resource for coronavirus disease 2019 (COVID-19) genome epidemiology. Currently, an increasing amount of sequence data is collected worldwide, with up to 8,941,854 records as of March 18, 2022 (https://www.covid19dataportal.org/). ese sequencing data were used to estimate the infection path or virus evolution [1] and how the virus was transmitted at the early phase of the outbreak to aid public health decisionmaking [2]. Furthermore, they have been used to track viral lineages circulating globally to detect the emergence of variants of concern with the increased reproduction rate, increased virulence, or decreased effectiveness of control measures such as vaccines (https://www.who.int/en/ activities/tracking-SARS-CoV-2-variants/). e Global Initiative on Sharing All Influenza Data (GISAID) submission tracker [3] shows that the coverage of the sequences is dependent on the region (Africa, 0.75%; Asia, 0.55%; EU, 2.63%; North America, 3.44%; Oceania, 2.15%; and South America, 0.37% (% sequence shared in GISAID with complete dates per total reported cases as of March 20, 2022)), possibly reflecting overall sequencing capacity (reviewed in [4]). us, reducing the cost and the required sequencing capacity for determining the whole-genome sequence will help decrease regional bias in sequence coverage. e current whole-genome sequencing paradigm relies on tiling PCR (ex. ARTIC network: nCoV-2019 sequencing protocol V.1, https://dx.doi.org/10.17504/protocols.io.bbmuik6w, which amplifies the whole SARS-CoV-2 genome with at least two sets of multiplex PCR. One of the challenges in tiling PCR-based whole-genome sequencing is amplification bias, especially in clinical specimens with low viral titers [5]. One approach for nonbiased amplification is to use tiling PCR with long amplicons [6,7]. However, the long amplicon strategy also suffers from low genome coverage due to low virus titers from collected samples [8]. Another strategy is to optimize primer concentration during multiplex PCR. Increasing the concentration of primers in poorly amplified regions partially solved this issue, as reported previously [8]. Another solution to this is redesigning the primers, which showed a promising increase in genome coverage and variant calling efficiency [9,10]. Another challenge is when viral genome mutations are present at primer annealing sites. Amplicon dropouts due to single nucleotide variations (SNVs) have been reported in the original Artic primer schemes for virus variants of concern, and primer schemes have been updated to address this problem (https://community.artic.network/t/SARS-CoV-2-version-4-scheme-release/312). Using this version, actual genome coverage has been improved [10]. Along with SNVs, amplicon dropouts due to the long deletion variants of virus genomes have been reported [11,12]. In such a case, the genome region covered only by the dropout amplicon will be lost from the sequence data, so additional experiments should be performed to complete the virus genome sequence.
In this study, two additional strategies which can be combined with the tiling PCR resequencing are proposed. e first strategy is to rescue the amplicon drops caused by the mutation on the primer annealing site by the redundant amplicon scheme. e second strategy is to increase the sequencing depth in the amplicon region with low amplification efficiency by the subtraction of overrepresented amplicons. To establish the proof of concept for these two strategies, we established a new tiling PCR primer scheme with redundancies covering every position of the whole genome by at least two amplicons, except for the 5′-and 3′ends. We also combined this primer scheme with the subtraction of overrepresented amplicons to reduce bias in the sequenced library. Finally, we validated this sequencing pipeline in clinical samples from Bangladesh and analyzed the viral variants in the cohort.

Clinical Specimens.
Nasopharyngeal specimens were collected from 304 patients for diagnosis, and the remaining isolated RNAs were used in this study. e RNA was purified using an MagMAX Viral/Pathogen Nucleic Acid Isolation Kit ( ermo Fisher) according to the manufacturer's instructions. First-strand cDNA was synthesized using a ProtoScript II first-strand cDNA synthesis kit (E6560S, NEB) using oligo-dT and random 9-mer primers. An in vitro amplified virus isolate (EPI_ISL_408667) (labeled NIID) was used as the control virus [13]. e RNA of the control virus was reverse-transcribed with a SuperScript IV system ( ermo Fisher) using oligo-dT and random 6-mer primers.

Redundant
Tiling PCR. Redundant tiling PCR was performed using three separate multiplex PCR runs per sample. e primer sets for each multiplex PCR group (#1, #2, and #3) were mixed at a total concentration of 10 μM, and the forward primer was mixed with a biotinylated forward primer with the same sequence at a specified concentration (Supplementary Table 1). For each reaction, 1.2 μL 10× diluted cDNA template, 1.44 μL primer mix, 5 μL Q5 hot start 2× master mix (NEB), and 2.36 μL nuclease-free water were mixed and amplified with the following thermocycling conditions: 98°C for 30 s, 35 cycles of 95°C for 15 s, 63°C for 5 min, and holding indefinitely at 4°C. e PCR products of the three multiplex PCR groups for each specimen were pooled and used for NGS library preparation.

Subtraction of Overamplified Amplicons.
Biotinylated PCR amplicons were subtracted from multiplex PCR products as follows. MyOne Streptavidin C1 magnetic beads (13 µL; ermo Fisher) were washed and resuspended in 13 μL of binding buffer (1 M NaCl, 5 mM Tris-HCl (pH 7.5), 0.5 mM EDTA) and mixed with 13 μL of pooled tiling PCR amplicons. For "without subtraction" control experiments, 13 μL of binding buffer without magnetic beads was mixed with 13 μL of pooled tiling PCR amplicon. After incubation for 1 h at room temperature, the unbound DNA in the supernatant was transferred to a new tube and cleaned with AMPure XP beads (Beckman Coulter) for further NGS library preparation.

Library Preparation and Sequencing: Nanopore
Sequencing. Pooled and purified tiling PCR products were used to prepare the Nanopore sequencing library using a ligation sequencing kit (Oxford Nanopore Tech: SQK-LSK110) according to the manufacturer's instructions. e libraries were sequenced using a Flongle Flow Cell (Oxford Nanopore Tech: FLO_FLG001). An amplicon pool for each sample was sequenced with an individual Flongle Flow Cell. Base-calling with Guppy version 4.4.2 (Oxford Nanopore Tech) using a high-accuracy model was performed to obtain FASTQ files.

Library Preparation and Sequencing: Illumina MiSeq.
Pooled and purified tiling PCR products were used for sequence library preparation, as described previously [11]. Briefly, after introducing the index sequence for each sample via PCR, all 304 samples were pooled, and pooled libraries of approximately 500 bp were purified via agarose gel electrophoresis. MiSeq 300 bp paired-end sequencing was performed using the MiSeq Reagent Kit V3 (Illumina).

Data
Analysis. Nanopore sequences were used to characterize the amplicon bias and read depth for each genome position. Analysis tools were used with default parameters unless otherwise mentioned. Raw reads were preprocessed with duplex_tools version 0.2.7 (Oxford Nanopore Tech) and Porechop version 0.2.4 (https://github. com/rrwick/Porechop) with an adapter threshold � 95 to remove duplex reads and trim adapter sequences. For quality filtering, seqkit [14] was used to select sequences with minimal average quality over 7 and lengths greater than 80 bp but less than 700 bp. Preprocessed sequences were aligned to a reference sequence (Wuhan-Hu-1 Gen-Bank ID: MN908947.3) using minimap2 version 2.22 [15]. After mapping, the primer sequences for amplification were trimmed using iVar version 1.3.1 with the following parameters: -q, 0; -m, 80; -s, 50 [16]. e resulting bam file was further filtered to remove ambiguous reads that could not be assigned to the unique amplicon regions. To calculate the sequencing depth at each genomic position, IGVtools [17] was used to pile up the nucleotides called at each position, and the A, T, G, C, N, and deletion counts were summed as the depth. Normalization was conducted to obtain the depth or count value per 100 kbp mapped reads.
For the Illumina sequencing data, the following pipeline was used for variant detection in the clinical sample, as previously described [11]. e primer sequences were trimmed using pTrimmer version 1.3.3 [18]. Sequences with low-quality scores were also trimmed using Trimmomatic version 0.36 with the parameters LEADING: 20, TRAILING: 20, SLIDINGWINDOW:4:20, and MINLEN:150 [19]. e qualified reads were aligned using minimap2 version 2.17 on the reference sequence. Sequence variants were called using GATK HaplotypeCaller version V4.2.0.0 [20] and then filtered using an in-house script if the population of the polymorphism was less than 0.8 and the number of aligned reads was less than three. e tentative genome sequence for each sample was reconstituted using GATK FastaAlterna-teReferenceMaker using the filtered VCF dataset. Long deletions and insertions were manually corrected using IGV [17]. To polish the tentative genome sequences, the qualified Illumina reads were aligned using minimap2, and variants were called using GATK HaplotypeCaller, as described above. Finally, low-reliability sequences were masked by Pool multiplex PCR groups #1~#3 + Subtract biotinylated amplicon  Supplementary Table 1. After the first cDNA strand was synthesized via reverse transcription, it was subjected to tiling PCR. Using prior knowledge, consistently overrepresented amplicons were marked with biotin using biotinylated primers (red crossed marks) and subtracted from the multiplex PCR products via streptavidindependent subtraction. e subtracted multiplex PCR products were then used for NGS library preparation.
N-base. Briefly, read depths were obtained using the SAMtools depth command version 1.12 [21]. Sequences supported by less than five coverages were excluded, while sequences with an alternative genotype and a population of more than 0.3 were also masked. e clades, SNVs, insertions, and deletions were annotated using the Nextclade web service [22]. e uniqueness of the specified SNVs was examined using a dataset provided by RCoV19 [23].

Downsampling.
To simulate decreased sequencing reads per sample, the FASTQ raw MiSeq sequencing reads were randomly sampled to give 50% or 25% of total reads in downsampled fastq files. Genome coverages for each sample were determined as described above in the Data Analysis section using those downsampled fastq files.

e Redundant Amplicon Scheme Is Robust on Mutations on the Primer Annealing Site, and Amplification Bias was Consistent among Samples.
e first concept of our robust whole-genome sequencing strategy is redundancy. We designed redundant primer schemes, and the resulting 147 amplicons enabled the coverage of every position on the SARS-CoV-2 whole genome with at least two amplicon sets, except the 5′-and 3′-ends (Figure 1, Supplementary Table 1). To determine amplification bias throughout the amplicons among different biological samples, we amplified the whole SARS-CoV-2 genome with four samples, comprising one virus isolate (NIID: EPI_ISL_408667) and three samples from patients with COVID-19 (EPI_ISL_11811634, EPI_-ISL_11811733, and EPI_ISL_11811834). Log-transformed normalized amplicon counts showed a Pearson correlation coefficient of >0.9 for all comparisons (Figure 2(a)). is suggests that the amplification bias, i.e., the overamplified and poorly amplified regions, is largely consistent among samples. e difference between the reverse transcription methods in NIID and other samples should be noted. e Pearson correlation coefficients of the normalized amplicon counts among three clinical samples (0.985, 0.965 and 0.974 for BGD1 vs. BGD2, BGD1 vs. BGD3, and BGD2 vs. BGD3, respectively) were greater than those between NIID and other samples (0.949, 9.961, and 0.941 for NIID vs. BGD1, BGD2, and BGD3, respectively) (Figure 2(a)). is suggests that reverse transcription methods could be one of the factors which affect amplification biases. When we analyzed the SNVs detected in the primer annealing sites for these four samples, one sample (BGD3) had a nucleotide substitution of C1059T within 5 bp from the 3′-end of primer 003R. e normalized amplicon counts for region_003 (from 630 bp to 1054 bp, without primers) in BGD3 was approximately seven to eleven-fold lower than the other samples without SNVs (898, 1419, 1397, and 126 for NIID, BGD1, BGD2, and BGD3, respectively). Redundant primer schemes, as shown in Figure 2(b), successfully rescued the reduction of sequencing depth in the region by the neighboring amplicons, resulting in only two-to three-fold reductions in the average depth of 1642,1923,1925, and 725 reads for NIID, BGD1, BGD2, and BGD3, respectively. is result supports the concept that redundant amplicon schemes can rescue the dropout amplicons derived from mutations at the primer annealing site.

Targeted Subtraction of Overamplified Regions Using Biotinylated Primers.
e second concept is based on the subtraction of overamplified amplicons (Figure 1). To reduce the overrepresented amplicons from the multiplex PCR products, we introduced a biotinylated primer for the target amplicons (Supplementary Table 1). e tiling PCR products containing biotinylated amplicons were mixed with streptavidin magnetic beads to subtract biotinylated amplicons. Biotinylated amplicons were successfully subtracted from the PCR products from all four samples, independent of the amplicon count (Figures 3(a) and 3(b)). is indicates that subtracting overrepresented amplicons (target amplicons) increased the relative presence of nontarget amplicons with low amplification efficiency.
is improvement was confirmed by an increase in the sequencing depth per 100 k normalized mapped reads in regions with low amplification efficiency (3000-3500 bp and 8000-8500 bp) (Figure 3(c)).

Application of the Proposed Resequencing Workflow Using
Clinical Specimens from Bangladesh. We validated our resequencing workflow using 304 clinical specimens collected in Bangladesh between May 2019 and April 2020. Multiplex sequencing for all samples was executed simultaneously using MiSeq, and 8,368,026 paired reads were obtained. Using this workflow, we successfully acquired 294 whole-genome sequences and deposited them in the GISAID database. e median length of undetermined nucleotides was 542 bp (i.e., 98.2% of the whole genome was successfully covered), including the 95 bp long-terminal regions that were not covered by amplicons (Figure 4). We simulated the effect of sequencing depth on undetermined nucleotides using the downsampling method in our reads.
is resulted in a median of 887 and 1577 bp of undetermined nucleotides (i.e., 97.0% and 94.7% coverage, respectively) using 50% and 25% of the reads, respectively (Figure 4), suggesting the potential applicability of increased multiplexing in one MiSeq run. We also validated the biases of each amplicon in the total sequencing library after the subtraction ( Figure 5). Some amplicons were difficult to amplify, such as region_014, 037, 084, 105, 119, and 125 (Supplementary Table 1); however, their lack of coverage was recovered by their proximal amplicons, resulting in compensation in the whole genome, as expected. e genotypes detected in Bangladesh from May 2020 to September 2020 were 20A, 20B, and 20F, which consisted of 25,197, and 1 sequence, respectively. ose from March 2021 and April 2021 were 20B, 20I (alpha), 20H (beta), and 21A (delta), which consisted of 10, 6, 53, and 2 sequences, respectively. We also called 1,020 SNVs, but none of them were novel.

Discussion
In this report, we proposed two additional strategies, "the redundancy in the amplicon schemes" and "the subtraction to reduce overrepresented amplicons" to the tiling PCR-  region_000  region_020  region_030  region_040  region_050  region_060  region_070  region_080  region_090  region_100  region_110  region_120  region_130  region_140 amplicon type normal biotinylated Log 10 normalized depth Amplicons from region_000 to region_147 based SARS-CoV-2 whole-genome sequencing pipeline. ese strategies were further validated using 304 clinical samples from patients with COVID-19 in Bangladesh.
e first strategy which we propose was introducing redundant amplicon schemes with three-group multiplex PCR. Primers designed for normal tiling PCR (e.g., Artic pipeline) have short overlaps with neighboring amplicons, enabling whole-genome coverage if all amplicons can be detected via NGS. In other words, a dropout amplicon because of the SNVs in the primer annealing sites resulted in the loss of sequence information in the region. is problem has been reported to the Artic primer scheme V3. Comparing the V3 and V4 schemes using actual clinical samples revealed that the improved primer scheme outperformed the detection of several important variants, including D950N and G142D in the spike protein [10]. Mutations continue to accumulate in the circulating SARS-CoV-2.
e genetic epidemiology of the SARS-CoV-2 virus sequences in the USA demonstrated a mutation rate of 6.7 × 10 −4 per site per year, similar to other RNA viruses [24]. It also has some hotspots under positive selection, especially in the S and N regions [24,25]. At this rate, mutations in the primer sequence are inevitable. For example, our tiling PCR scheme (total primer sequence length: 6489 bp) will have 4.3 nucleotide changes per year.
is suggests that the primer scheme should be repeatedly modified. Because of the redundant amplicon scheme in our primer design, even if one amplicon is dropped due to SNVs, the neighboring amplicon can cover the dropout. e second strategy which we propose was the subtraction of the overrepresented amplicons from the multiplex PCR products. e counts of the untargeted amplicons, which were amplified with a primer set using biotinylated primer rate � 0 (Supplemental Table 1), increased by approximately two-fold after introducing 50 biotinylated primers. Uneven genome coverage using the tiling PCR approach has been reported in several studies and has motivated researchers to propose new primer schemes to expand sequencing coverage [9,26]. Although these improved primer schemes outcompete the original Artic primer schemes, an inevitable amplification bias still exists. Our subtraction methods are theoretically applicable to any primer scheme. Together with the primer scheme modification due to emerging novel variants, improved primer schemes can be made more uneven using the subtraction method we introduced.
With these additional strategies combined with tiling PCR resequencing, we successfully obtained 294 wholegenome sequences from 304 Bangladeshi clinical samples from May 2019 to April 2020. ey were classified into clades 20A, 20B, 20F, 20B, 20I (alpha), 20H (beta), and 21A (delta). Moreover, their detection periods were in line with global trends. Clades 20A and 20B were dominant from May to September 2020; subsequently, the first wave in Bangladesh was replaced with 20I (alpha) and 20H (beta) from March to April 2021. None of the 1,020 SNVs detected were novel, but more efforts are needed to clarify whether they diversified simultaneously or had migrated from around the world.

Conclusion
Maher et al. recently proposed a prediction method for variants of concern using sufficiently dense genomic surveillance data [27]. With the additional strategies for tiling PCR resequencing presented here, we believe that wholegenome resequencing of SARS-CoV-2 from clinical samples can be optimized and cost reduction can be achieved. ese optimizations would lead to solving the regional biases in whole-genome sequence coverage out of the total number of reported cases. Furthermore, we believe that comprehensive whole-genome sequence information obtained worldwide will aid in the prediction of variants of concern.
Data Availability e genome sequences reported in this study are available from the GISAID and International Nucleotide Sequence Database Collaboration (INSDC) databases under accession numbers EPI_ISL_11811552 to EPI_ISL_11811841, EPI_-ISL_11811846 to EPI_ISL_11811849, and PRJNA831650.

Conflicts of Interest
e authors declare that they have no conflicts of interest.