K-mer-Based Motif Analysis in Insect Species across Anopheles, Drosophila, and Glossina Genera and Its Application to Species Classification

Short k-mer sequences from DNA are both conserved and diverged across species owing to their functional significance in speciation, which enables their use in many species classification algorithms. In the present study, we developed a methodology to analyze the DNA k-mers of whole genome, 5′ UTR, intron, and 3′ UTR regions from 58 insect species belonging to three genera of Diptera that include Anopheles, Drosophila, and Glossina. We developed an improved algorithm to predict and score k-mers based on a scheme that normalizes k-mer scores in different genomic subregions. This algorithm takes advantage of the information content of the whole genome as opposed to other algorithms or studies that analyze only a small group of genes. Our algorithm uses k-mers of lengths 7–9 bp for the whole genome, 5′ and 3′ UTR regions as well as the intronic regions. Taxonomical relationships based on the whole-genome k-mer signatures showed that species of the three genera clustered together quite visibly. We also improved the scoring and filtering of these k-mers for accurate species identification. The whole-genome k-mer content correlation algorithm showed that species within a single genus correlated tightly with each other as compared to other genera. The genomes of two Aedes and one Culex species were also analyzed to demonstrate how newly sequenced species can be classified using the algorithm. Furthermore, working with several dozen species has enabled us to assign a whole-genome k-mer signature for each of the 58 Dipteran species by making all-to-all pairwise comparison of the k-mer content. These signatures were used to compare the similarity between species and to identify clusters of species displaying similar signatures.


Introduction
DNA k-mers are short recurring elements in the genomes of all living species. ese elements are both conserved and diverged across species owing to their functional significance, which enables these k-mer signatures ideal for species identification. Several recent studies have described the distribution of statistically significant k-mers in the genomes and several regulatory subregions (core, proximal, distal promoters, and 3′ and 5′ UTRs) in a small number of plant species as well as modern and archaic humans [1][2][3]. A k-mer is a type of short oligonucleotide of length k. K-mers can be part of core segments of transcription factor binding sites or regulatory elements that take part in protein binding and gene regulation in different subregions of the genome. e present version of the algorithm is an alignment-free k-mer sequence comparison method. Such methods involve statistical analysis and comparison of k-mers between the genomes of two species. ese methods vary in the statistical measures applied, such as the comparison of word frequency, incorporation of information theory, universal sequence maps, and the measurement of complexity [4]. e advantages of k-mer-based alignment-free methods over alignment-based phylogenetic algorithms are that they can process the data much faster and eliminate biases that could be induced by using a priori-defined guide trees when performing the alignment, and subjective selection of alignment scoring parameters, such as gap opening and extension [5,6].
Related methods include metagenomic algorithms that are capable of identifying bacterial taxonomic groups based on metagenomic sequence data. Such methods focus on taxon identification but not taxon comparison. Such an algorithm is PhyloPythia, which applies a multiclass support vector machine (SVM) using relative frequency profiles of short oligonucleotides to classify genome fragments as short as 1 Kbp into taxonomic ranks between genus and phylum with high specificity [7]. Another method, TACOA, uses abundance profiles to represent whole-genome sequences and uses a k-nearest-neighbor-classification-based method [8]. It correctly classified fragments larger than 800 bp between 39% and 76% at the genus or superkingdom level, respectively [8]. ese methods perform quite well at oligonucleotide lengths as low as 4 bp. Yet another algorithm, RAIphy, calculates the log odds ratio between the observed and expected occurrence of each k-mer based on Markov assumptions for the k-mer probabilities [9]. is algorithm assigns a genomic fragment to a specific taxon based on a comparison between the two [9]. e possible weakness of these metagenomic methods is that they likely only use very short fragments as compared to the whole genome and thereby quite possibly skew their oligonucleotide frequency profiles. However, an analysis of the whole genome gives a certain and complete picture of these profiles.
Compared to our previous work in this area [1][2][3], the current improved algorithm scores k-mer significance based on a normalized scale of − 1 to +1, which is used to calculate k-mer signatures so as not only to predict statistically significant and biologically relevant k-mers, but also to make the genomes of two given species comparable based on their k-mer signatures. erefore, the goal of this study is to further develop a k-mer prediction method, which can be used to predict biologically significant k-mers and then to use these k-mers in species comparison and clustering. e present method is novel in that it measures the Pearson correlation coefficient values of the normalized k-mer relevance scores (not just the k-mer's frequencies) between the whole genomes of a number of species and assigns them to clusters. While the underlying algorithm is similar, certain changes were made to differentiate statistically significant over-and underrepresented k-mers (see Materials and Methods). Furthermore, the k-mer prediction algorithm is now applied to a wider range of animal species as compared to plant species in the previous studies. is is because different genera of species provide sufficient diversity for cross-comparison, and the whole-genome sequences for these species were also available.
In this study, the whole-genome sequences of 22 Anopheles species, 30 Drosophila species, and six Glossina species from the NCBI database were downloaded (58 in total), analyzed, and compared with each other. We also included the whole-genome sequences of Apis mellifera and Caenorhabditis briggsae as outliers. Aside from the previously mentioned 58 species, the whole genomes of two Aedes and one Culex species were also downloaded and compared with the three Dipteran genera. ese species were used as outlier species in order to measure how the genome of such unrelated species measures up to the species in the three genera under study. With a larger number of species involved, more general inferences can be made about k-mer content, as well as inferences about the phylogenetic aspects of two genera in relation to each other. ese analyses have become possible because the whole-genome sequences of 110 fly species are available now, thus facilitating comparative studies with regard to gene content, genetic mechanisms, and genome structure [10].
Anopheles is a genus of mosquitoes, belonging to the family Culicidae and suborder Nematocera. Anopheles has 485 species, 100 of which can transmit malaria via the genus Plasmodium, and 41of the 100 species cause human malaria. 14 of the 22 Anopheles in this study are among those species, which cause malaria [11]. ey are global in distribution and are studied mainly because of their epidemiological importance, having caused around 200 million deaths in 2013 [12]. e taxonomy of the subfamily Anophelinae is unstable. For example, a supposedly sister genus, Bironella, was classified by different groups either to be outside or within Anopheles. Morphological traits and DNA sequence data were studied to address the relationships between Anopheles, Bironella, and Chagasia, but were not able to produce stable results [13]. erefore, the genome k-mer analysis of Anopheles species is a timely task. e genus Drosophila (pomace, vinegar, or wine flies) [14] includes various multiple subgenera and clades and are widely distributed in the northern hemisphere. According to rockmorton [15], the faunal disjunction of Drosophila species between the Old and New World occurred in five lineages (the Scaptodrosophilan, Sophophoran, virilisrepleta, immigrans-tripunctata, and the Hirtodrosophilan) [16]. e fruit fly species Drosophila melanogaster is probably the most well-known and widely studied insect in the world, due to its easy culturing, high reproductive rate and generation time, and small body size.
On a genomic level, Anopheles and Drosophila have several marked genomic differences. In general, anophelines have greater intron loss compared to drosophilids. ey also have more genes, which are the result of gene fission and fusion events, affecting an average of 10.1% of all genes in the genomes of the 10 species with the most contiguous genome assemblies. Furthermore, codon usage is more uniform in anopheline genomes than in drosophilids [17].
Species of the genus Glossina (family Glossinidae, suborder Brachycera) or tsetse flies are characterized by difficult culturing, long generation times, and low reproductive rates. ese fly species are studied due to their medical and economic importance of parasitism and their role as vectors of trypanosomes [18]. e genus consists of three subgenera, Austenina, Nemorhina, and Glossina, represented by the species, G. fusca, G. palpalis, and G. morsitans. eir 22 species were previously classified within five species complexes [19]. Extent tsetse flies are distributed in sub-Saharan Africa as well as the Saudi Arabian Peninsula [20].
With a novel method in hand, the goal of this study is to predict biologically important k-mers of lengths 7-9 bp in species identification by statistically examining all lexicographically possible k-mers for the 58 Anopheles, Drosophila, and Glossina species. K-mers, which are 8 bp long, correspond to the typical length of DNA which is recognized by transcription factors. erefore, these k-mers are the length of typical core transcription factor binding sites [21,22]. We also allowed for a ±1 bp wobble; this is why we chose a range of 7-9 bp. We achieve this by scoring the k-mers of the 58 species' motifome (defined as all lexicographically possible k-mers of a given length in the genome) based on their whole-genome sequences. Furthermore, with such a wholegenome k-mer signature (WGKS) available for each species (that is, the available scores of all k-mers in the genome of a given species), it will be possible to do an all-versus-all comparison for all the study species. is way, we could assign the species into different species clusters based on high correlations of the WGKS among members of the same cluster.

Sequence Data.
e whole-genome sequences were downloaded from the NCBI Genome database (https://www. ncbi.nlm.nih.gov/genome/) for 22 Anopheles species, 30 Drosophila species, and six Glossina species. A genomic summary of these species including the names of the wholegenome sequences, the number of chromosomes/contigs/ scaffolds present in their individual genomes, the size of their genomes, as well as the A/C/G/T % of their genomes is in Supplementary File 1. Two Anopheles species, A. farauti and gambiae, have two separate genomes in the database. One of their genomes was broken up into a large number of shorter fragments. erefore, the genome with a smaller number of contigs was selected. e genome size and the ACGT% have been plotted for each of the 58 species in Supplemental Figures 1 and 2, respectively. e 5′and 3′ UTR sequences for seven Drosophila species and the intron sequences for 12 Drosophila species were downloaded from the FlyBase database (ftp://ftp.flybase.net/ genomes/). e 5′ and 3′ UTR sequence sets for Anopheles gambiae were also downloaded from the UTR database (http://utrdb.ba.itb.cnr.it/home/download) [23] as a comparison with the Drosophila species. Summary statistics for these species for 5′ and 3′ UTR regions as well as introns can also be seen in Supplementary File 1. e 28 mitochondrial genomes were downloaded from the NCBI database: https://www.ncbi.nlm.nih.gov/genome/ browse#!/organelles. e genomes were aligned with the CLUSTALW2 software and trimmed so as to make the alignment less variable at the ends.

K-mer Scoring Algorithm.
e original k-mer scoring algorithm was described in the study of Lichtenberg et al. [21] and Cserhati et al. [1]. e algorithm is briefly described below; however, more details on the mathematical background for scoring the significance of a given k-mer can be found in the original publications. A flowchart showing the individual steps of the algorithm with inputs and outputs is shown in Figure 1.
e adapted algorithm used in the analysis is an enumeration algorithm, which counts the total occurrence of all possible k-mers of a given length k bp. A k-mer is viewed, for example, as the core section of a transcription factor binding site (TFBS) that different kinds of regulatory factors bind to, but it could also be a k-mer with any other kind of functional relevance. e k-mer sequence corresponds to a DNA surface that can specifically bind a regulatory protein. K-mers of lengths 7-9 bp were analyzed in this study (heptamers, octamers, and nonamers). For any length k, there are 4 k combinatorically possible k-mers, all making up the so-called motifome, as mentioned in the introduction. e longer the k-mer, the more specific its sequence is and the more well-defined its binding surface gets. e observed occurrence O of each k-mer was calculated for each possible k-mer.
For each genome, the background base pair distribution was also calculated in percent (A/C/G/T%). With this information, the probability of any given k-mer can be calculated based on Markov assumptions: where p i is the percentage occurrence of the base at position i in the k-mer. ese probabilities are multiplied together to get the expected probability of the given k-mer. e expected occurrence E of a given k-mer is equal to the length of the genome multiplied by the probability of the k-mer: In the previous works, a scoring algorithm was used to measure how much the actual occurrence O of a k-mer deviated from the expected occurrence E: where S k-mer is the calculated score and O and E are the observed and expected occurrences of a given k-mer, respectively. e purpose of this score is to filter out meaningless k-mers that could simply occur by chance. If the expected and observed occurrences should be about the same, the score should be close to 0. However, if the observed occurrence of the k-mer is much greater than the expected occurrence, then the score is close to 1. If the expected occurrence E is much greater than O, then the score goes to infinity. In this current study, equation (3) was modified to differentiate overrepresented and underrepresented k-mers.
e new scoring equation is as follows: With this setup, there are three possible cases: O � E : S k− mer ≈ 0, randomly occurring k − mer.
is way, all possible k-mers 7-9 bp long were scored for all 22 Anopheles, all 30 Drosophila, and all six Glossina species, as well as the two outlier species, A. mellifera and C. briggsae. e input at this first stage of the algorithm is the wholegenome sequences of all species in the study. e wholegenome sequence is used to calculate the expected and observed occurrences of all 4 k k-mers. e output is the WGKS for all species, a two-column list including the k-mers and their scores. e list of k-mers and their occurrence and score values are all available in the Supplemental material online. e python script that performs the analysis is publicly available on github at https://github.com/csmatyi/motif_analysis for interested users.

Calculation of Correlation between Any Two Species Based on K-mer Scores and Heatmap.
e input of this stage of the algorithm is the WGKSs for all species in the study from the previous step. e output is a symmetric matrix of Pearson correlation coefficients (CC) for all species pairs showing how well their WGKSs correlate with one another. e Pearson correlation coefficient between any two species was calculated based on their k-mer scores for any given k-mer length (here 7-9 bp). For any two species under consideration, all possible k-mers of length k were sorted lexicographically from A k to T k (k � 7-9). If any k-mer was missing from either species, then it was omitted. e correlation coefficient was calculated based on the scores for each k-mer present in both species. ese correlation coefficient values were depicted in a heatmap, one for each k-mer length from 7 to 9 bp.
We have three genera (Drosophila, Anopheles, and Glossina) in this study. e group is defined as all species in a specific genus and the nongroup is defined as all the remaining species. To compare the statistical significance of the CC values between any two species within a group vs. all CC values of all correlations between any one species in a group and any one species in a nongroup, we performed the Welch's t-test (unequal variance) for each comparison.

Creation of Plots.
In the last step of the algorithm, the symmetric CC matrix is transformed into a heatmap to depict the relationship between all species in the study. Barplots, boxplots, and heatmaps were generated using the barplot, boxplot, and heatmap functions in R, version 3.4.3. Phylogenetic trees for the three insect genera were created in R using the library phangorn, using the commands upgma.
e CC values for octamers were subtracted from 1 to get distance values, which were then used in the upgma command. Venn diagrams were created with the online software at http://bioinformatics.psb.ugent.be/webtools/Venn/.

Phylogenetic
Trees. Phylogenetic trees were created for all three insect genera using the phangorn library in R. e distance metric was 1− CC for all species pairs. Trees were created using the UPGMA, WPGMA, and NJ methods, using the upgma, wpgma, and nj commands.

Taxonomical Comparisons.
e classification of Drosophila species was matched to data (Genus/Subgenus/ Group/Complex) in the TaxoDros Figure 1: Flowchart depicting the algorithm. First, the whole-genome sequences or subgenomic region of interest for all species are analyzed, and the WGKS is produced. is is a list of all possible k-mers together with their normalized score values. ese WGKSs are compared in an all-versus-all manner, using the Pearson correlation coefficient. is produces a CC matrix, which is then visualized in a heatmap, depicting species relationships.

Matching Biologically Relevant Genome K-mers against Position Weight Matrixes in the JASPAR Database.
Position weight matrixes (PWM) for 140 transcription factor binding sites (TFBS) in D. melanogaster were downloaded from the JASPAR website [25] at http://jaspar.genereg.net. For all 30 Drosophila species, all biologically relevant candidate octamer k-mers were matched against all of 140 of these PWMs in a sliding window-like manner (since not all octamers were as long as these PWMs). A cutoff of 80% sequence similarity was used to call a match between a given k-mer and a JASPAR k-mer (which is the default used by the JASPAR database). Each JASPAR database hit is listed next to each k-mer in Supplementary File 3.
Putative biologically relevant genome k-mers were determined for a given species by calculating the mean score and standard deviation for each species and using the mean ± 2SD value as a cutoff. All k-mers with a score value above this limit were predicted as biologically relevant. is is because in the normal distribution, 5% of all values lie above 1.96 z-score limit. is cutoff was also used in the k-mer prediction study of modern and archaic humans [3].

Whole-Genome Sequence Analysis.
For each species, the whole-genome motifome was enumerated and scored for k-mers of lengths 7-9 bp. en, the whole-genome k-mer content was compared in an all-versus-all pairwise fashion, to determine correlation coefficients of 1,953 comparisons in total (including comparisons between the same species). ese values are available in Supplementary File 2 for k-mers of lengths 7-9 bp. e CC value represents how similar the WGKSs are between two species. Similar WGKSs of two species in turn reflect how similar the genomes are between these two species. Obviously, a more similar pair of species will contain more similar distribution of k-mers throughout their genomes and thus have a higher CC value. is is because on a macroscopic level, the genomes of similar species have not had enough time to diverge and accumulate too many mutational differences. Conversely, the distribution of k-mers in the genomes of dissimilar species is different, and thus, their WGKSs are also different. erefore, they also contain k-mers (for example, transcription factor binding sites) with different functions. For example, in a study of D. melanogaster, D. simulans, D. erecta, and D. yakuba, 5% of functional Zeste transcription factor binding sites were gained and/or lost compared to the other lineages [26]. e CC matrixes for the 63 species are depicted in the heatmaps in Figure 2 based on octamers. In the heatmap, a lighter, yellower color denotes a higher CC value, closer to 1, denoting species whose WGKS is similar to one another.
Darker, redder colors denote CC values closer to 0, denoting species pairs with an unrelated WGKS. What is very clear is that the three genera, Anopheles, Drosophila, and Glossina clearly separate from one another quite well and also from the two outliers.
3.1.1. Drosophila. Within the Drosophila cluster, a smaller subgroup can be seen including eight species: D. albomicans, D. americana, D. arizonae, D. grimshawi, D. mojavensis, D. nasuta, D. navajoa, and D. virilis. ese species represent a separate monophylogenetic group within the genus Drosophila and correspond to the subgenus Drosophila. All of the other species belong to the subgenus Sophophora. Within Sophophora, four species can be seen which themselves form a small, compact group: D. miranda, D. obscura, D. persimilis, and D. pseudoobscura. ese four species belong to the obscura species group within Sophophora. e phylogenetic tree for the genus Drosophila can be seen in Figure 3(a). Trees using the UPGMA, WPGMA, and NJ methods were drawn as described in the Materials and Methods section.
e Drosophila and Sophophora separate well from one another.
e outlier species D. ananassae, busckii, and willistoni also separate well from all of the other species.
is algorithm was used not only for measuring species similarity based on correlation of k-mer content, but also for predicting biologically relevant genome k-mers in all three Dipteran genera, as described in the Materials and Methods section. A list of all putative biologically relevant octamers is provided in Supplementary File 3. A summary of these predicted k-mers can be seen in Table 1.
It was found that shorter k-mers are more conserved because it is harder to conserve longer stretches of DNA. However, the shorter the k-mer, the less possible number of k-mers can be studied. Shortening k-mers loses information and precision because longer k-mers increase the k-mer signature, making the calculation of the CC value more precise. For octamers, the mean CC value was 0.857 (std. dev. 0.07). A p value of unequal variance of 2.3 × 10 − 247 was calculated for CC values within Drosophila and CC values between Drosophila and non-Drosophila species. A Cohen's d-value of 3.18 (CI of 3.03-3.32, 95% confidence level) was calculated, which is very high.

Drosophila Species with High Repeat Content in eir
Genomes. Another species of Drosophila (D. busckii) is seemingly misplaced on the heatmap, between Anopheles and Glossina, away from the other Drosophila species. is species has the lowest average CC value compared to all other Drosophila species (0.753, octamers, Supplementary File 2). However, when the CC value of D. busckii was compared to that of the six members of the genus Glossina, the average CC value was 0.699 (looking at octamers). When comparing CC values between D. busckii and Glossina versus D. busckii and all other Drosophila, the p value was 0.006. When comparing D. busckii only with the eight members of this small monophyletic group within Drosophila, an average CC of 0.891 can be calculated, with a p value of 4.7 × 10 − 9 , when comparing CC values between D. busckii and Glossina versus D. busckii and these eight Drosophila species. e TaxoDros Database also classifies D. busckii in its own separate species group (the busckii species group, which is part of the Dorsilopha subgenus). It is not exactly certain why D. busckii clusters the way it does. Zhou and Bachtrog [27] have observed that 60% of the neo-Y-linked genes have become nonfunctional in D. busckii. erefore, it is possible that due to this, the regulatory motifs in their promoter regions have also undergone differential mutations, thereby altering the k-mer content of this species.
D. ananassae, a species belonging to the Drosophila subgenus, shows lower resemblance to the members of the Sophophora subgenus. is can be seen well in Figure 2. D. ananassae is the species with the next lowest average CC value (0.800, octamers, Supplementary File 2) to all other Drosophila species. is could be due to the fact that its genome has the highest percent content of repetitive elements (24.93%), followed by D. willistoni (15.57%), also with the fifth lowest average CC value among the drosophilids (0.832, octamers, Supplementary File 2). A high repetitive element content in a species' genome means that the observed occurrence of many k-mers will be increased, thereby skewing the score for that specific k-mer. is in turn will also decrease the CC value between the given species and  D. mojavensis is another species that clusters well with other species from the subgenus Drosophila, but still had the third lowest CC value (0.823, octamers, Supplementary File 2). 592 octamer k-mers for this species without any mismatches were selected from the Tandem Repeats Database (TRDB) [24]. ese k-mers were filtered if they had a score less than 0.333. According to equation (4)    and Methods section, this corresponded to a k-mer which occurred twice as many times as its expected occurrence and therefore serves as a good cutoff CC value to gauge functional biological relevance. 245 of these 592 k-mers (41.4%) from the WGKS of D. mojavensis had a score higher than or equal to 0.333. D. mojavensis had 86 abundant, high-scoring reverse compliment k-mers (see filtering criteria in the previous paragraph). Five other Drosophila species had at least half as many (43) such specific k-mers, including D. busckii, which as seen before had the lowest mean CC value with all other Drosophila species (Supplementary File 4). D. mojavensis and D. busckii have a CC value of 0.88 (octamer level), which is above both the mean and the median within Drosophila. is also indicates that the high repeat k-mer content of this species may be skewing its CC values with other Drosophila species as well.

Glossina.
e Anopheles and Glossina clusters are much more compact than Drosophila. e mean CC between all six Glossina species was 0.978 with a std. dev. of 0.02 (looking at octamers, see Table 2), whereas the average CC between Glossina and non-Glossina species was 0.761 with a std. dev. of 0.143 (Table 2). e p value is 1.5 × 10 − 18 comparing within-Glossina CC values versus CC values between Glossina and non-Glossina species. A Cohen's dvalue of 8.48 (CI of 7.79-9.18, 95% confidence level) was calculated, which is very high. Supplementary Figures 1 and 2 also show that both the genome size (315-380 Mbp) and the ACGT% are also relatively invariable compared to the other two Dipteran families. is might be due to the relatively small number of species examined and also the close relationship of the six species examined. On the heatmap, G. brevipalpis is correctly classified into its own group, corresponding to the subgenus Austenina. G. morsitans morsitans and G. pallipides on the heatmap correctly cluster together and belong to the subgenus Glossina. G. fuscipes and G. palpalis gambiensis also cluster together on the heatmap as part of the subgenus Nemorhina. One species, G. austeni, however clusters together with the palpalis group, whereas according to NCBI taxonomy it belongs to the subgenus Glossina. ese species relationships are also mirrored in the phylogenetic tree in Figure 3(b). All three phylogenetic algorithms produce the same species relationships as described previously.

Anopheles.
e mean CC value calculated for octamers between Anopheles species was 0.948 (std. dev. 0.023). A p value of 0.0 was calculated for CC values within Drosophila and CC values between Anopheles and non-Anopheles species (meaning that the p value was too low that the neglog value cannot be displayed). A Cohen's d-value of 3.18 (CI of 5.03-5.47, 95% confidence level) was calculated, which is very high.
Hao et al. [28] performed a phylogenetic analysis based on 13 conserved mitochondrial protein-coding genes from 50 mosquito species. Based on their phylogenetic tree, the species from the Anopheles cluster as well as the Aedes and C. quinquefasciatus clustered similarly in the Hao et al.'s study and also in the present study. For example, A. darlingi is located on a separate major branch in the Hao study, and in the present study, this species as well as A. albimanus grouped together within the Anopheles cluster, within the subgenus Nyssorhynchus (within the genus Anopheles, subfamily Anophelinae). ese two species had a CC value of 0.989 (looking at octamers, see Supplementary File 2), whereas the average CC value between these two species and the rest of the Anopheles cluster is 0.955 (see Supplementary File 2). In both the heatmaps and Figure 6 of Hao et al. [28], A. arabiensis, gambiae, melus, and merus cluster together, corresponding to the gambiae species complex of the subgenus Cellia of the genus Anopheles. In the heatmaps, A. farauti and koliensis cluster together, and in the phylogenetic tree of Hao et al.'s study, these two species also cluster on the same major branch. Also, the species A. cracens and dirus cluster together closely in both the heatmap and the phylogenetic tree. In the Hao et al. study [28] and also on the heatmap, the species A. sinensis and atroparvus also cluster together. ese two species are members of the Anopheles subgenus of the genus Anopheles.
In another study by Freitas et al. [29], cytochrome oxidase subunits I and II (COI and COII) as well as the 5.8 S ribosomal subunit were analyzed to study the phylogenetic relationships between 47 Anopheles species. In their study as well as the present one, A. farauti, koliensis and punctulatus all clustered together, which are part of the Anopheles punctulatus group, which are major malaria vectors in the Southwest Pacific [30]. A. arabiensis, gambiae, melus, and merus also clustered closely in both studies, just as they did in the Hao et al.'s study [28]. However, whereas in the heatmaps A. dirus and stephensi clustered together, they were located on separate branches of the phylogenetic trees in both the Hao and the Freitas studies.
is difference might be due to both the Hao and Freitas studies having analyzed only the mitochondrial genome as opposed to the whole-genome studies in this paper. Nevertheless, these close clusterings between all three studies are remarkable in that very similar results were derived from analyzing a handful of mitochondrial proteins as well as from a global sequence analysis of the whole genome. e phylogenetic tree for Anopheles can be seen in Figure 3(c). e UPGMA and WPGMA trees look similar to one another, whereas the NJ tree looks somewhat different.

Species Relationships Based on Alignment of Mitochondrial DNA.
e mitochondrial whole-genome sequence for 29 species (19 Anopheles, 6 Drosophila, 2 Aedes, 1 Culex, and 1 Apis) was downloaded from the NCBI database. ese sequences were aligned, and then, the percent identity was calculated for each possible pairwise species pair. ese identity values are depicted in Figure 4 and are also available in Supplementary File 1. Figure 4 depicts species from the genera Anopheles and Drosophila segregating into two well-defined groups. e p value for Anopheles is 6.2 × 10 − 81 , whereas for Drosophila it is 8.9 × 10 − 9 . e two Aedes species group together, along with Culex quinquefasciatus. e outlier species, Apis mellifera, groups well away from all of the other species.
Within the genus Drosophila, only D. albomicans belongs to the subgenus Sophophora, whereas the other five species belong to the subgenus Drosophila, supporting previous results coming from the analysis of the WGKS.
Within the genus Anopheles, four species, A. arabiensis, gambiae, melas, and merus are very similar according to both their mtDNA, which reinforces the previous results from the analysis of the WGKS. Another group of species which cluster tightly together are A. farauti, punctulatus, cracens, and dirus. ese four species also cluster tightly on Figure 2. Five other species, A. culicifacies, epiroticus, funestus, minimus, and stephensi. ese species do not cluster together in Figure 2. is difference could simply be due to the fact that the k-mer profiles of the 28 species in question here reflect CC values were calculated for the genera Anopheles, Drosophila, and Glossina as well as between these three genera and between two outliers, Apis mellifera and Caenorhabditis elegans, and these two genera. For each combination, the minimum, mean, median, maximum CC values were calculated as well as the standard deviation and the number of species comparisons.
the k-mer distribution of the mtDNA only, and not that of the whole entire genome.

Classification of New Species Based on WGKS.
Since the taxonomy of many insect groups is in flux, it was interesting to see how several species from different genera were classified according to this algorithm. e WGKS of two Aedes species, A. aegypti and A. albopictus, and of Culex quinquefasciatus (all three being mosquito species) were analyzed and compared to species of Anopheles, to see if they form a separate group, or if they possibly form a monophyletic group together with Anopheles. Whole-genome sequences for species in the genera Bironella or Chagasia, the two closest genera to Anopheles in the subfamily Anophelinae were not available at NCBI. In Figure 2, all three species separate from the genus Anopheles. e two Aedes species have an average CC of 0.651 with Anopheles, whereas they have a CC of 0.847 between themselves (when looking at octamers, see Supplementary File 2). When comparing the CC values between Aedes and Anopheles to the CC values within the genus Anopheles itself (Supplementary File 2), a p value of 9.1 × 10 − 54 can be calculated. us, it can be concluded that Aedes form a group separate from Anopheles. When comparing C. quinquefasciatus to Anopheles, the mean CC value is 0.706. is is significantly different than the mean CC of Anopheles species among themselves (0.948, see Table 2, also looking at octamers). e p value between these two sets of CC values is 5.7 × 10 − 23 . erefore, it can be inferred that C. quinquefasciatus is also separate from the genus Anopheles. is shows that the present method is useful in classifying as of yet unknown organisms for which only the whole-genome sequence is available. e utility of comparing WGKS is greater than methods which analyze only groups of genes, which make up only a fraction of the entire genome sequence. Phylogenies based on different genes often conflict with each other [6].

Divergence and Similarities between Genera.
In order to measure the divergence of the two genera from each other, boxplots were created comparing the range of CC values within the genera Anopheles, Drosophila, and Glossina as well as between the three genera themselves, as well as between C. briggsae and the three insect genera individually, and also between A. mellifera and the three insect genera individually. is was done for k-mers of size 7-9 bp, and the boxplots can be seen in Figure 5 (octamers) and Supplementary Figures 4(a) and 4(b) (heptamers and nonamers). e minimum, median, average, and maximum CC value for each of the seven comparisons as well as their standard deviations can be seen in Table 2.
e mean CC values within the three genera are much higher than for all other comparisons (e.g., 0.955 within Anopheles and 0.869 within Drosophila for heptamers, Table 2), whether they are between the two genera Anopheles and Drosophila or between either one of the two outlier species and either one of these two genera.
is trend is consistent for all k-mer lengths. e minimum, mean, median, and maximum CC values decrease with increasing motif length, but this is due to the fact that as the motif length increases, the number of possible k-mers also increases proportionally, and therefore, CC values also tend to decrease. ese tendencies all illustrate the clear genomic content differences between the genera Drosophila, Anopheles, and Glossina.
It was also interesting to see which nonrepetitive (i.e., k-mers which do not consist of dimer or trimer repeats) genome k-mers were the most common between Anopheles, Drosophila, and Glossina for k-mers of lengths 7-9 bp. For this, all k-mers with a score of at least 0.5 (such k-mers occur three times more frequently than expected) and which occurred in at least half of all species in a given genus were selected (at least 11 Anopheles species and at least 15 Drosophila species, but at least 5 species in Glossina). ese k-mers are listed in Supplementary File 5 for lengths 7-9 bp. Common k-mers between all three genera are also listed in Supplementary File 5 and are visualized in Figure 6 (octamers) and Supplementary  Figures 5(a) and 5(b) (heptamers and nonamers).

Analysis of 5′ and 3′ UTRs.
Besides the whole genome, k-mer analysis was done for 5′ and 3′ UTRs for seven Drosophila species (D. ananassae, erecta, grimshawi, melanogaster, mojavensis, pseudoobscura, and simulans) and also Anopheles gambiae as an outlier species which was compared to these Drosophila species. Besides the WGKS, a species' 5PKS, 3PKS, and also IKS (5′ prime k-mer signature, 3′ k-mer signature, and intron k-mer signature) can also be defined. Sequence statistics for the selected Drosophila species and A. gambiae are available in Supplementary File 1. However, since 5′ and 3′ UTR sequences were not available for many species besides Drosophila, we could only do a restricted analysis, instead of analyzing species relationships on a heatmap as in Figure 2.
Figures 7(a) and 7(b) depict the CC ranges in boxplots for both within the genus Drosophila and between A. gambiae and the genus Drosophila for k-mer lengths 7-9 bp for 5′ and 3′ UTRs, respectively. Both figures show that the CC range for comparisons between A. gambiae and Drosophila is much lower than that for within Drosophila itself.
is difference between the two genera is more pronounced in 3′ UTRs as compared to 5′ UTRs. e CC values are present in a matrix for both 5′ and 3′ UTRs in Supplementary Files 6 and 7, respectively.
Summary statistics for CC values within Drosophila and between A. gambiae and Drosophila can be seen in Table 3.
e p values for 5′ and 3′ UTRs for k-mer lengths 7-9 bp are all statistically significant at the 5% level. is reflects that the same kind of genetic difference between the two genera is also present in the 5′ and 3′ UTR regions. Figure 8    motifs of lengths 7-9 bp. is is reflective of the lower overall CC range for 3′ UTR k-mers than 5′ UTR k-mers seen earlier (Figures 7(a) and 7(b)). e number of common k-mers increases in a roughly proportionate manner as the length of the k-mer increases, due to increasing k-mer space (e.g., there are more possible nonamers than octamers). ese common k-mers are listed in Supplementary Files 6 and 7 for 5′ and 3′ UTRs.

Analysis of Introns.
e intron regions of twelve Drosophila species (ananassae, erecta, grimshawi, melanogaster, mojavensis, persimilis, pseudoobscura, sechellia, simulans, virilis, willistoni, and yakuba) were analyzed in a way similar to the whole genome as well as the 5′ and 3′ UTR regions. Intron sequences were available for only Drosophila; therefore, we could not perform any species comparisons between this genus and Anopheles or Glossina. Figure 9 depicts the range of CC values for k-mer lengths 7-9 bp for these twelve species. Summary statistics for all k-mer lengths are available in Table 3. e number of common k-mers to all twelve species is depicted in Figure 8 (37, 344, and 1890 for motif lengths 7-9 bp) and is listed in Supplementary File 8, where the CC matrix is also available for k-mer lengths 7-9 bp.   As with the 5′ and 3′ UTR regions, the number of common k-mers also increases with increasing k-mer length, from 7 to 9 bp. e number of common intron k-mers is also less than the number of common 3′ UTR k-mers, which in turn is less than the number of common 5′ UTR k-mers (for heptamers and octamers), but not for nonamers (see Figure 8). is indicates that, for these two k-mer lengths, as the size of the sequence regions decreases, the number of common k-mers increases.

Conclusion
e motif prediction algorithm presented in previous works has been refined, expanded, and applied to a lot larger selection of species, allowing broader inferences to be made from the analysis. Furthermore, by defining the WGKS of yet unknown species, they can be classified into existing taxonomical categories. is algorithm is one more tool with which to characterize and classify new species, as in the case of A. aegypti and albopictus and C. quinquefasciatus. e WGKS, but also motif signatures from other subgenomic regions, can be useful in separating species into individual genera, sharply separated from one another. We believe that this algorithm can be put to use to not only predict biologically relevant whole-genome and subgenomic motifs, but also cluster species into taxonomic groups based on similarities and differences among their motif signatures.
is algorithm has only been used to analyze insect species, but could also be applied to compare species from other phyla.

Data Availability
Processed data and results are available in the supplementary files. Raw genomic data will be made available upon request.

Conflicts of Interest
e authors declare no conflicts of interests.

Authors' Contributions
MC designed the entire analysis, performed all of the calculations, created all of the figures, and wrote the manuscript. CG and PX contributed to the conception of this work and made essential improvements to the manuscript.

Supplementary Materials
Supplemental Figure 1: genome size for all 58 studied species. e size of the genome of each species is given in Mbp. Anopheles species colored in blue, Drosophila species in red, and Glossina species in green. Supplemental Figure 2: ACGT% content for all 58 studied species. e ACGT% for all 58 species is given for all species, adding up to one in a stacked barplot. Supplemental Figure 3(a): heatmap depicting species relationships between the 63 species included in the analysis based on the whole-genome k-mer signature for heptamers. Supplemental Figure 3 Anopheles, 15 Drosophila, and 5 Glossina species. Each included heptamer had a minimum score of 0.5. Supplemental Figure 5(b): common nonrepetitive (nondimer and nontrimer) nonamer content between 11 Anopheles, 15 Drosophila and 5 Glossina species. Each included nonamer had a minimum score of 0.5. Supplemental File 1: statistics of whole genome, 5′ and 3′ UTR, and intron sequences for the studied species. e species, file name, number of contigs, genome/subgenomic region size, and ACGT% are provided for each species. e pairwise sequence identity for all species pairs is included for the mitochondrial genome comparisons. Supplemental File 2: Pearson correlation matrix for whole-genome k-mer signatures.
e Pearson correlation matrix between all pairs of the studied species is provided for k-mers of lengths 7-9 bp. Supplemental File 3: predicted biologically relevant whole-genome k-mers (octamers). Biologically relevant octamers were predicted by the k-mer prediction algorithm for the 22 Anopheles, 30 Drosophila, and six Glossina species. For the Drosophila species, all predicted octamers were matched against 140 Drosophila PWMs from the JASPAR database with a cutoff of 0.8. Supplemental File 4: high-scoring repetitive motif content of three Drosophila species. High-scoring and highoccurring octamer palindrome k-mers are listed for Drosophila ananassae, mojavensis, and willistoni. ese k-mers were matched with the k-mers from other Drosophila species for comparison. Nonamers from D. mojavensis were also matched with motifs from the TRDB. Supplemental File 5: nonrepetitive frequent k-mers from the three genera. Nonrepetitive (nondimer/trimer repeats) were found in a majority of the species from the three genera for k-mers of lengths 7-9 bp. K-mers common to all three genera were also noted. Supplemental File 6: Pearson correlation matrix for 5′ UTR k-mer signatures.
e Pearson correlation matrix between all pairs of species involving A. gambiae and seven Drosophila species is provided for k-mers of lengths 7-9 bp in the 5′ UTR regions. Common heptamers, octamers, and nonamers are also provided. Supplemental File 7: Pearson correlation matrix for 3′ UTR k-mer signatures. e Pearson correlation matrix between all pairs of species involving A. gambiae and seven Drosophila species is provided for k-mers of lengths 7-9 bp in the 3′ UTR regions. Common heptamers, octamers, and nonamers are also provided. Supplemental File 8: Pearson correlation matrix for intron k-mer signatures. e Pearson correlation matrix between all pairs of species involving 12 Drosophila species is provided for k-mers of lengths 7-9 bp in the intron regions. Common heptamers, octamers, and nonamers are also provided. (Supplementary Materials)