Genome Resequencing of the Honeybee Apis mellifera jemenetica (Hymenoptera: Apidae): A Key Tool towards Characterization, Conservation, and Genomic Selection

We report the whole-genome sequence of the Arabian honeybee ( Apis mellifera jemenetica ). Seven A. m. jemenetica samples were sequenced representing three distinct subpopulations. Generated sequence reads were mapped to the reference honeybee Apis mellifera genome (Amel_HAv3.1). Data revealed genome-wide patterns of genetic variation which can be useful in the characterization and assessment of positive selection of the Arabian honeybee using diferent genetic markers. In total, 75.16Gb of clean bases were generated, and the GC content of samples ranged between 31.9 and 35.3%. Te efective reference genome size is 223,937,270 bp. Te mapping rate of samples varied from 88.97% to 96.19%, and the efective mapping depth was between 41.80 and 48.84X. Single-nucleotide polymorphisms (SNPs) among sequenced individuals ranged between 2379499 and 2396116 with respect to the reference A. mellifera genome (Amel_HAv3.1), and 2% of the SNPs were nonsynonymous. Genome Analysis Toolkit (GATK) detected 1097962–1109829 InDels and 10090–11962 structural variations (SV) from which 22.1 to 33.8% were in the form of deletions. Copy number variation (CNV) ranged between 550 and 2824, and 45–91% of them were downregulated. Tese variations among interbreeding individuals or groups of the same species may refect an adaptive environmental response and ftness among diferent subpopulations and can be very useful for subspecies characterization, conservation, and selection of the Arabian honeybee.


Introduction
Te honeybee Apis mellifera (Linnaeus, 1758) (Hymenoptera: Apidae) comprises 31 geographical subspecies distributed in Asia, Africa, and Europe [1].Based on molecular and/or morphometric analyses, these subspecies have been classifed into separate evolutionary lineages and sublineages: the O and Y lineages occurring in Western Asia, the M lineage in Eurasia, the C lineage in Europe, and the A lineage and Z sublineage in Africa [1,2].Te Arabian honeybee A. m. jemenetica is the native honeybee subspecies of the Arabian Peninsula and tropical Africa.It is the only A. mellifera subspecies that is naturally occurring in Africa and Asia, exhibiting distinctive traits associated with different geographical ranges [2].It is assumed that beekeeping in Saudi Arabia started since for more than 4000 years ago [3]. A. m. jemenetica has been kept in diverse climatic zones of the country and acquired unique morphological and behavioral aspects among other A. mellifera subspecies.For example, it is the smallest and the most adapted to extreme temperatures compared to other European or African A. mellifera subspecies [2,4,5].Each honeybee lineage or sublineage can include several A. mellifera subspecies.Based on morphometric characterization and mitochondrial DNA analysis, A. m. jemenetica of Saudi Arabia belongs to the Y lineage (previously called Z sublineage or O lineage) and is separated into 3 distinctive subpopulations, with signifcant morphological/molecular variations [6][7][8].Recently, a comprehensive phylogenetic analysis of the Arabian honeybee, A. m. jemenetica, based on complete mitochondrial DNA sequences (n = 14), revealed three distinct subclusters within Saudi Arabia [8], with very close evolutionary relationship with the Syrian honeybee, A. m. syriaca, the Egyptian honeybee A. m. lamarckii, and east African subspecies such as A. m. capensis and A. m. scutellata [8].Tese subpopulations may exhibit diferent adaptive traits [9].Nevertheless, huge importations of exotic honeybee subspecies may describe Saudi Arabia as the foremost package bee importer worldwide with about 1.3 million packages being imported in 2021 [10], which may impact the presence and characteristics of A. m. jemenetica and entails urgent conservation strategies of this honeybee subspecies.
Te completion of the honeybee genome (A. m. ligustica) [11] and subsequent releases and updates opens the door for genome-wise comparisons with other A. mellifera subspecies and ecotypes.From the 31 reported A. mellifera subspecies, only a few A. mellifera subspecies genomes were published (A. m. ligustica, A. m. carnica, A. m. mellifera, A. m. caucasica, and A. m. intermissa).Recently, by analyzing 251 genomes of 18 native honeybee subspecies, scientists suggested a West Asian origin of A. mellifera with at least three adaptive radiations given to African and European lineages; consequently, 145 genes with unique signatures of selection across all bee lineages were documented [12].Tese fndings increased the focus on the characterization and conservation of the West Asian honeybee subspecies such as A. m. jemenetica and A. m. syriaca.In this study, the whole-genome sequences of seven honeybee samples representing three distinct subpopulations of the Arabian honeybee A. m. jemenetica have been reported, which will enrich our understanding of the evolutionary relationship among A. mellifera subspecies and provide a wealth of knowledge on molecular characterization, conservation, and selection of A. m. jemenetica [13].From each apiary, one sample consisting of 15 bees was collected, 10 bees were characterized morphometrically using body dimensions (proboscis, femur, tibia, metatarsus, hind leg, tergite-3, and tergite-4 cubital index 1 and 2) and wing characters (wing angles: a4, b4, d7, c9, g18, j10, j16, k19, l13, n23, and o26 and 20 wing landmarks) [2,14] to ensure accurate subspecies afliation.After that, three adult workers were taken from each sample.Bee samples were preserved in ethanol (96%) and stored in the freezer.Genomic DNA was extracted from one honeybee and purifed using Qiagen extraction and purifcation kits (CAT: 96504; 28106, respectively: Qiagen, Hilden, Germany).Te genomic DNA was then sent to BGI (https://www.bgi.com) for sequencing.Te genomic DNA was frst subjected to ultrasound shearing, end repairing, and end adaptor ligation and fragment selection for amplifcation and library construction.

Sequencing Assessment and Raw Data Processing.
Figure 1 charts the steps of sequencer raw data processing, sequence alignment, and sequence variation assessments.Raw sequencing data produced from the DNBseq-G400 pine line were subjected to three-step fltration using SOAPnuke v1.5.6 software [15].Data processing started with adaptor trimming, and any reads with an adaptor mapping rate higher than 50% were removed.Ten, low-quality reads with more than 50% of low-quality bases (Q20 < 50%) were removed.Finally, contiguous reads with more than 2% N bases were removed.

Alignment.
Sequence reads were aligned to the reference genome using the BWA analysis tool (https://bio-bwa.sourceforge.net/bwa.shtml)[11].Te honeybee Amel_-HAv3.1 (https://www.ncbi.nlm.nih.gov/assembly/gcf_003254395.5)reference genome was used for alignment of sample sequence reads.Te reference genome size is 225,250,884 bp while the efective size is 223,937,270 bp (N base excluded), with GC content at 32.34%.Te mapping rate of samples varied from 32.96% to 99.47%, and the effective mapping depth ranged between 26.51X and 59.33X (according to the sample selected).Te short-sequence reads aligned against the long reference genome were created in SAM (Sequence Alignment/Map) format.Picard tools (v1.118) were used to sort the SAM fles by coordinate and convert them to BAM fles.Picard tools software (v1.118) (https://broadinstitute.github.io/picard/) was also used to mark duplicate reads.

Sequence Variation Assessment. GATK (version 3.7-0-gcfedb67:
https://www.broadinstitute.org/gatk/,Real-nTimes = 1) was used in single-nucleotide polymorphism (SNPs) and short insertions and deletions (InDels) calling.SNPs annotation started with a consensus sequence which consisted of all bases of the target genome.SNPs that existed in both genotypes (our genotype and the reference genomes) were screened, and after that, a highly reliable SNP dataset was generated.Nonsynonymous mutation and large-efect SNPs were also recorded.Annotation and statistics of selected SNPs and InDels BGI were performed using an inhouse Perl script and ReSeqTools (https://github.com/BGIshenzhen/Reseqtools).Te BreakDancer/CREST method (https://breakdancer.sourceforge.net[11]).Based on reference genome, the BreakDncer/CREST method (https// breakdancer.sourceforge.net(chen et al., 2009)) enabled us to produce a list of structural variations (SV) that were detected across the whole genome.Copy number variations (CNVs) were detected according to Zheng et al. [16] method.Using SOAP alignment results, the depth of each base was calculated and standardized by the mean depth of its chromosome to calculate the copy number variation according to the standard method [16].

Sequences and Alignment Assessment.
In total, 75.16 G of clean bases were generated.About 96% of the clean data have quality values higher than 20 (q20 ≥ 96%) and from 85.2 to 90.4% of the data have quality values higher than 30 (q30 ≥ 96%) (Table 1).For alignment, the Apis mellifera ligustica reference genome (assemblyAmel_HAv3.1:https:// www.ncbi.nlm.nih.gov/genome/48?genome_assembly_id=403979) was used.Te reference genome size is 225,250,884 bp, while the efective size is 223,937,270 bp (N base excluded), with GC content at 32.34%.Te mapping rate of samples varies from 88.97% to 96.19%, and the effective mapping depth is between 41.80 and 48.84X (Table 2).GC content of samples ranged between 31.9 and 35.3 (Table 1).

SNPs Detection and Annotation.
All seven samples represented morphologically one honeybee subspecies A. m. jemenetica and three distinct subpopulations.Te polymorphic loci in A. m. jemenetica samples revealed a high total number of SNPs ranging between 2379499 and 2396116 in relation to the reference A. mellifera genome (Amel_HAv3.1) and 35.1-36.3% of which were heterozygous.Nonsynonymous SNPs counted about 2% of the total number of SPNs and ranged between 48892 and 45559.Polymorphic sites of exons ranged between 529533 and 540719.A total number of 2041833-2050427 genes had at least one polymorphic site among diferent honeybee samples.Sample MD101 showed the highest number of nonsynonymous SNPs and SNPs on exons.Tese variations may have a great impact on the biological or behavioral traits of the Arabian honeybee A. m. jemenetica.Te total number and distribution of SNPs are listed in Table 3.

Insertions and Deletions (InDels). Genome Analysis
Toolkit (GATK) detected 1097962-1109829 insertions and deletions (InDels) among diferent A. m. jemenetica samples, which were distributed almost equally between insertions and deletions.Table 4 shows the distribution and annotation of InDels.Insertion and deletions at coding regions (CDS-InDels) were 0.94-0.99 of the total InDels in the genomes of diferent samples.About 0.86% of the InDels occurred within genes.InDel annotation can reveal more details about the evolutionary history among A. mellifera subspecies.Psyche: A Journal of Entomology

SV and CNV. Structural variation is a large sequence
(1 kb or even larger) that can include sequence duplication, inversion, and transposition or insertions and deletions, commonly referred to as copy number variants (CNVs).Tese CNVs often overlap with segmental duplications, and regions of DNA>1 kb exist more than once in the genome, copies of which are >90% identical [17].If present at >1% in a population, a CNV may be referred to as copy number polymorphism (CNP).Te highest number of structural variations (SVs) and copy number variations occurred in the samples MD101 and B31.Te copy number variation in the MD101 genome was about 4 times more compared with other A. m. jemenetica genomes.About 22.1 to 33.8 of the structural variation in the samples' sequences were in the form of deletion, and Table 5 lists the detailed distribution and annotation of the structural variations in all A. m. jemenetica samples.Copy number variations among individuals of the same species may refect environmental response, for example, sensory perception and immunity [18,19], and is considered an impotent genetic variation among populations.About 45-91% of copy number variations were downregulated.Table 6 lists the detailed distribution and annotation of the copy number variation in the honeybee A. mellifera.

Discussion
One of the striking features of A. mellifera is high recombination between chromosomes, which favors effective natural selection.Data presented here are the wholegenome resequencing of the native honeybee subspecies of Saudi Arabia A. m. jemenetica.Results demonstrated high genetic variation in SNPs, SV, InDels, and CVN among A. m. jemenetica samples and in comparison to the reference A. mellifera genome sequence.Tese genetic variations can be used in the molecular characterization of A. m. jemenetica and related subpopulations [20].Furthermore, adaptive structural and behavioral traits associated with the natural occurrence of A. m. jemenetica can be explored when data are subjected to further analysis [11].
Te very high CNV in two samples (MD101 and B31) can be related to nonallelic homologous recombination during meiotic recombination and may demonstrate direct interaction with diferent environments [19].Furthermore, high GC content in some genomes of the native honeybee of Saudi Arabia may indicate higher recombination rates [11] or higher genetic diversity [11].Single-nucleotide polymorphisms (SNPs) are variations caused by the changing of a single nucleotide (A, T, C, or G) in the genome.Te SNPs, including switch and reverse of single-nucleotide bases, are responsible for genome diversity between species and among individuals of the sample species.Analysis of SNPs among honeybee subspecies across genomes has been used in testing hypotheses concerning the ancestral origin of the western honeybee  [11] and molecular adaptive characteristics [11].Previous studies indicated that the GC-rich regions have more mutations [11], and this may be the case for the native honeybee populations in some regions of Saudi Arabia.InDels refers to insertion mutation, deletion mutation, or both, including events in the early stage of evolution, and might help in explaining levels of gene expression that disrupt the production of essential proteins [21].Copy number variation (CNV) is an important form of structural variation among individuals of the same species which may indicate long-term adaption to the unique environment of the Arabian Peninsula; CNVs of genes coding to vitellogenin or heat shock proteins or drought tolerance may unveil molecular aspects of adaptation of this honeybee subspecies compared to others.

Conclusion
Tis is the frst genome resequencing of the Arabian honeybee A. m. jemenetica.A honeybee that is characterized as unique in many aspects (size, heat tolerance, and natural distribution).Whereas the complete mitochondrial analysis of the Arabian honeybee from Saudi Arabia focuses on subspecies afliation [8], the wholegenome sequences focus on in-depth genetic variation among the Arabian honeybee populations within Saudi Arabia.Te outcome of this study can be used in the characterization, selection, and breeding of A. m. jemenetica.

Figure 1 :
Figure 1: Graph of standard bioinformatic raw data cleaning and analysis based on the family SOAP software (TE: transposable elements; QC: quality control; CNV: copy number variation; SV: structural variation).

Table 1 :
Quality evaluation of clean reads data of diferent Arabian honeybee A. m. jemenetica sequences representing three subpopulations.

Table 2 :
Alignment analysis of the clean read sequences of the Arabian honeybee A. m. jemenetica mapped to the reference A. m. jemenetica genome (Amel_HAv3.1).

Table 3 :
Genome-wide distribution and annotation of single-nucleotide polymorphisms (SNPs) in the Arabian honeybee samples.

Table 4 :
Genome-wide distribution and annotation of insertions and deletions (InDel) in the Arabian honeybee samples.

Table 5 :
Genome-wide distribution and annotation of structural variation (SV) in the Arabian honeybee samples.

Table 6 :
Genome-wide distribution and annotation of copy number variation (CNV) in the Arabian honeybee samples.