Genetic Structural Differentiation Analyses of Intercontinental Populations and Ancestry Inference of the Chinese Hui Group Based on a Novel Developed Autosomal AIM-InDel Genotyping System

In the present study, we investigated the genetic polymorphisms of 39 ancestry informative marker-insertion/deletion (AIM-InDel) loci in the Chinese Hui group using a previously self-developed panel, further clarified the genetic relationships between the Hui group and reference populations, and assessed the ancestry inference efficiency of the AIM-InDel panel based on worldwide population data from 1000 Genomes Phase 3. The locus-specific informativeness (In) values, pairwise fixation index (Fst) values, multidimensional scaling analysis, and cross-validation success ratios showed that the novel panel could clearly reveal the genetic structural differentiation among the East Asian, European, African, and South Asian populations. In addition, biogeographical ancestry inference for the Chinese Hui group was conducted at both the individual and population levels by principal component analysis and STRUCTURE analysis; the results revealed that the Hui group was of East Asian origin, with an East Asian component of approximately 88.87%. Furthermore, population genetic analyses among the Hui group and reference populations were performed based on an insertion allele frequency heat map, population pairwise Fst values, and a phylogenetic tree, and the results indicated that the Hui group was genetically closer to East Asian populations, especially the two Chinese Han populations (CHS and CHB).


Introduction
In recent years, ancestry inference has uncovered important information and provided a new perspective in biomedical fields such as anthropological research, forensic genetics, and genetic epidemiology [1][2][3]. In particular, ancestry inference based on ancestry informative genetic markers can also help to correct for population stratification [4][5][6]. Genetic variation describes the genotypic differences between individuals or populations at the genomic DNA level, which result from genetic mutation in connection with genetic drift, natural selection, and other factors. The accumulation of genetic differences among populations, especially intercontinental populations, is the basis of individual ancestry inference. In forensic genetics, ancestry inference can provide a valuable clue in a criminal case when the traditional genetic markers for individual identification fail to indicate a suspect. Currently, some challenges in ancestry inference research remain to be solved, such as estimating the genetic variation within or between populations and clarifying the admixture proportions in an individual of mixed origin [7].
In the past decade, several single nucleotide polymorphism (SNP) panels were developed for ancestry inference applications based on the capillary electrophoresis (CE) platform, as well as massively parallel sequencing (MPS) technology [8,9]. SNPs show the advantages of favorable stability, widespread distribution, and relatively polymorphic allele frequency patterns in different populations [10], but several limitations still exist in ancestry informative marker-single nucleotide polymorphism (AIM-SNP) analysis; for example, SNP genotyping is a relatively complicated process and demands a high-quality research platform [11]. As for mitochondrial DNA and Y chromosome genetic markers, although they carry highly informative ancestral information on the maternal and paternal lineages, respectively, they usually do not undergo recombination, and their variations reflect only the maternal or paternal genetic characteristics. Besides, the databases for these two kinds of genetic markers are limited, which may sometimes lead to deviations in population genetic analysis.
InDel has been proposed as a new kind of genetic marker that combines the advantages of both short tandem repeats (STRs) and SNPs, i.e., extensive distribution, short amplicon size, and low mutation rate; besides, its length polymorphism makes it easy to genotype on the CE platform by fragment size differences [12,13]. Another advantage of the InDel marker is the simple genotyping workflow, which can reduce the risk of DNA contamination and greatly shorten the genotyping time. Compared with the AIM-SNP typing method based on SNaPshot technology, labeling the InDel primers with multicolor fluorescent dyes and combining them with the CE platform has the advantage of being easy to popularize and apply in primary-level forensic DNA laboratories [11]. Although MPS technology has provided a very effective genotyping method to simultaneously detect hundreds of genetic markers [14], unified standards are still required before MPS technology can become a routine method in forensic application. Hence, developing small-scale ancestry informative marker sets for a universally applicable CE analysis system is still needed. In consideration of the advantages of the InDel marker, a 39-locus autosomal AIM-InDel panel was developed in our previous study [15]. In the present study, the effectiveness evaluation of this panel was extended to populations from five intercontinental regions (Africa, Europe, South Asia, East Asia, and America).
China is a multiethnic country with 56 ethnic groups, and the Hui group is one of the largest ethnic minorities, living in many regions of China such as Ningxia, Gansu, Qinghai, Xinjiang, Henan, Anhui, Liaoning, Heilongjiang, and Shaanxi. Few previous studies have examined the genetic polymorphisms of different genetic markers in the Hui group, so the Hui group was chosen as the research object in this study. Genetic evidence from ancestry inference markers such as SNPs indicated that the Hui group had closer genetic relationships with East Asian populations [16]. However, the ancestral composition of the Hui group is still unclear. The present study is aimed at exploring the Hui group's genetic background and revealing its ancestral components based on this self-developed 39 AIM-InDel panel.

Sample Collections and Population Data Filtration.
In this study, 509 adults of the Hui group living in the Xinjiang Uyghur Autonomous Region were enrolled. All volunteers gave written informed consent and were healthy, unrelated individuals randomly selected from the local Hui group. Sample collection was conducted in accordance with the human and ethical research principles of Southern Medical University and Xi'an Jiaotong University Health Science Center.
Besides, the reference population data were obtained from 1000 Genomes Phase 3 [17], and the detailed information on the 26 reference populations (a total of 2504 individuals) from five intercontinental regions (Africa, Europe, East Asia, South Asia, and America) is shown in Supplementary Table 1.

2.2. Sample Genotyping Using the 39 AIM-InDel Panel.
In the present study, 509 DNA samples were prepared and amplified using the novel 39 AIM-InDel directed amplification kit without a DNA extraction step. PCR amplification was conducted on the GeneAmp PCR System 9700 (Applied Biosystems, Foster City, USA) in a total reaction volume of 25 μl, and all reagent dosages and PCR conditions followed the previous study [15]. The AIM-InDel PCR products were separated and detected on the CE platform using the ABI 3500xL Genetic Analyzer (Applied Biosystems, Foster City, USA). Genotyping of the 39 AIM-InDel loci was performed with GeneMapper ID-X software version 1.5 (Applied Biosystems, Foster City, USA). To ensure the accuracy of the AIM-InDel genotyping results, a negative control, a positive control (9947A), and an allelic ladder were included in the experiments.

Multiple Statistical Analyses.
The allele frequencies, forensic parameters, and P values for Hardy-Weinberg equilibrium (HWE) tests of the 39 AIM-InDel loci in the Hui group were calculated by the STRAF online program (version 1.0.5) [18]. Because the rs3034941 locus was excluded due to the lack of population genotype data in 1000 Genomes Phase 3, the raw genotype data of the remaining 38 AIM-InDel loci for the 2504 individuals from 26 worldwide populations were obtained. The pairwise Fst values between the five intercontinental populations, with the populations from the same continent treated as a whole, were assessed using Arlequin software (version 3.5) on the basis of the 38 InDel loci. The success ratio of population origin estimation with cross-validation, the population-specific divergence (PSD) values, and the principal component analysis (PCA) of the same 38 AIM-InDel loci among the different populations were obtained with the online Snipper software (version 2.5) (http://mathgene.usc.es/snipper/analysispopfile2_new.html), and the informativeness (In) values, also called Rosenberg's In values, were calculated by multiplying the PSD values by 0.693, i.e., converting the natural log to log (2) [19,20]. The multidimensional scaling (MDS) analysis [21] was conducted with SPSS software (version 20.0). Population genetic structure analysis among the Hui group and reference populations was performed with STRUCTURE software (version 2.3.4) with a burn-in period of 10,000 iterations followed by 10,000 MCMC repetitions [22]. Besides, the optimal K value was determined by the online Harvester program (http://taylor0.biology.ucla.edu/structureHarvester/). The bar plots based on the results of the STRUCTURE analysis were drawn with DISTRUCT software (version 1.1) [23]. The pairwise Fst values based on the same 38 InDel loci among 22 worldwide populations (American populations excluded) and the Hui group were assessed using Genepop software (version 4.0). The pairwise DA distances of the above populations were calculated by DISPAN software, and the phylogenetic tree was constructed using MEGA software version 7.0 on the basis of the population pairwise DA distances. The box plot based on Rosenberg's In values, the heat maps (one insertion allele frequency heat map and two Fst heat maps), and the scatter diagram of the MDS analysis were drawn with R software (version 3.4.4).
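As an illustration of the forensic parameters named above, the following minimal Python sketch computes them for a single biallelic InDel locus from genotype counts. It is not the STRAF implementation: the function name and the input counts are hypothetical, the formulas are the commonly used textbook definitions, and the expected heterozygosity is computed without a small-sample correction, which is an assumption.

```python
import math


def forensic_parameters(n_ii, n_id, n_dd):
    """Forensic parameters for one biallelic InDel locus.

    n_ii, n_id, n_dd are hypothetical counts of insertion homozygotes,
    heterozygotes, and deletion homozygotes.
    """
    n = n_ii + n_id + n_dd
    # Allele frequencies by gene counting.
    p = (2 * n_ii + n_id) / (2 * n)  # insertion allele
    q = 1.0 - p                      # deletion allele
    # Observed genotype frequencies.
    genotype_freqs = [n_ii / n, n_id / n, n_dd / n]

    h_obs = n_id / n                  # observed heterozygosity
    h_exp = 2 * p * q                 # expected heterozygosity (no correction; assumption)
    mp = sum(g * g for g in genotype_freqs)  # matching probability
    pd = 1.0 - mp                     # power of discrimination
    pic = 1.0 - (p * p + q * q) - 2 * p * p * q * q  # polymorphic information content
    homo = 1.0 - h_obs                # observed homozygosity
    pe = h_obs ** 2 * (1.0 - 2.0 * h_obs * homo ** 2)  # power of exclusion
    tpi = math.inf if homo == 0 else 1.0 / (2.0 * homo)  # typical paternity index
    return {
        "ins_freq": p, "h_obs": h_obs, "h_exp": h_exp,
        "mp": mp, "pd": pd, "pic": pic, "pe": pe, "tpi": tpi,
    }
```

By construction, the matching probability and the power of discrimination of a locus sum to 1, which agrees with the paired ranges reported for the Hui group.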

Results
3.1. Ancestral Information Inference Synthetic Evaluation of the Novel AIM-InDel Panel.
In the present study, the overall ancestry inference efficiency and forensic practicability of this novel panel were evaluated using the population genotype data of the same 38 AIM-InDel loci in the 2504 worldwide individuals from 1000 Genomes Phase 3; the evaluation involved the pairwise Fst values, the locus-specific Rosenberg's In values, the cross-validation success ratios, and the MDS analysis.
The PSD values of all the AIM-InDel loci were calculated by the online software Snipper and then converted to the more widely used Rosenberg's In values [19,20]. As shown in Figure 1, the box plot of In values at the 38 AIM-InDel loci showed different distributions among the five intercontinental populations from 1000 Genomes Phase 3; the essential information on all 39 AIM-InDel loci and the Rosenberg's In values of the 38 AIM-InDel loci in the five intercontinental populations are shown in Supplementary Table 2. In the box plot, eight AIM-InDel loci (rs10538061, rs146391383, rs16432, rs3044252, rs36038238, rs3831885, rs4647655, and rs5788637) showed higher In values (>0.1) in East Asians, and eight AIM-InDel loci (rs10569275, rs3029066, rs3216799, rs34477782, rs34921138, rs3840222, rs5891435, and rs5896844) could be regarded as African-informative markers with higher In values (>0.1) in Africans. As for Europeans, seven loci (rs11273905, rs147090496, rs3047538, rs34477782, rs35434967, rs57406754, and rs5891435) showed relatively higher In values (>0.06), which contributed greatly to differentiating the European populations from the other intercontinental populations. In this panel, the locus-specific In values of South Asians and Americans were relatively lower than those of the other three intercontinental populations.
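To make the informativeness measure concrete, the sketch below computes the general multi-population form of Rosenberg's In for one locus from per-population allele frequencies, using natural logarithms. It is an illustration under stated assumptions rather than the Snipper calculation: Snipper reports population-specific divergence values, whereas this function compares all supplied populations at once, and the function names and example frequencies are hypothetical. The helper `psd_to_in` simply mirrors the rule quoted in the text (multiplication by 0.693, which is ln 2).

```python
import math


def rosenberg_in(freqs_by_pop):
    """General Rosenberg informativeness (In) for one locus.

    freqs_by_pop: list with one allele-frequency list per population,
    e.g. [[p_ins, p_del], ...]; each inner list should sum to 1.
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K populations.
        p_mean = sum(pop[j] for pop in freqs_by_pop) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for pop in freqs_by_pop:
            if pop[j] > 0:  # the term 0 * log(0) is taken as 0
                total += pop[j] / k * math.log(pop[j])
    return total


def psd_to_in(psd):
    """Hypothetical helper applying the conversion factor from the text (ln 2, about 0.693)."""
    return psd * math.log(2)
```

Under this definition, a locus with identical frequencies in all populations has In = 0, and a biallelic locus fixed for different alleles in two populations reaches the two-population maximum of ln 2.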
The success ratios of population origin estimation with cross-validation for this panel were calculated using the Snipper software, and the results are shown in Table 1. Three of the five intercontinental populations had ancestry assignment success ratios over 90%, i.e., 98.49% (Africans), 91.25% (Europeans), and 99.80% (East Asians), while the South Asian and American populations showed relatively lower ratios of 84.67% and 61.96%, respectively.
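The idea behind a cross-validated success ratio can be sketched as follows. The internals of Snipper are not described in the text, so this is only an assumed, simplified analogue: a leave-one-out scheme in which each individual is removed from its own population, allele frequencies are re-estimated with add-one smoothing, and the individual is assigned to the population with the highest genotype likelihood under Hardy-Weinberg proportions. All names and the genotype coding (0, 1, or 2 insertion alleles per locus) are hypothetical.

```python
import math


def genotype_loglik(genotype, p):
    """Log-likelihood of a 0/1/2 insertion-allele count given insertion frequency p."""
    q = 1.0 - p
    if genotype == 2:
        return 2 * math.log(p)
    if genotype == 1:
        return math.log(2 * p * q)
    return 2 * math.log(q)


def ins_freqs(samples, n_loci):
    """Smoothed insertion allele frequencies (add-one pseudocount per allele)."""
    n = len(samples)
    return [(sum(s[locus] for s in samples) + 1) / (2 * n + 2)
            for locus in range(n_loci)]


def loo_success_ratios(data):
    """data: dict mapping population name -> list of genotype vectors."""
    n_loci = len(next(iter(data.values()))[0])
    ratios = {}
    for true_pop, samples in data.items():
        correct = 0
        for i, individual in enumerate(samples):
            best_pop, best_ll = None, -math.inf
            for pop, members in data.items():
                # Leave the tested individual out of its own population.
                train = members[:i] + members[i + 1:] if pop == true_pop else members
                freqs = ins_freqs(train, n_loci)
                ll = sum(genotype_loglik(g, p) for g, p in zip(individual, freqs))
                if ll > best_ll:
                    best_pop, best_ll = pop, ll
            if best_pop == true_pop:
                correct += 1
        ratios[true_pop] = correct / len(samples)
    return ratios
```

In such a scheme, populations whose allele frequencies lie between those of other groups, as would be expected for admixed populations, tend to receive lower success ratios.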
The MDS analysis of the five intercontinental populations was conducted at the population level with SPSS software, and the result is shown in Figure 3. The multivariable relationships of the 26 reference populations were represented in a two-dimensional scatter plot; each dot represented one population, and different colors represented different intercontinental populations. As for the discriminating power of this panel, the African, South Asian, East Asian, and European populations each formed a distinct cluster. Populations from the same continent gathered together within these four clusters and were separated from the other three, whereas the four American populations were scattered around the South Asian cluster.

Ancestry Inference of the Hui Group Performed by a Set of AIM-InDel Loci.
The allelic frequencies and forensic parameters of the 39 AIM-InDel loci in the Hui group are shown in Table 2. HWE tests were also conducted for the 39 loci, and no locus showed a significant deviation after Bonferroni correction. The insertion allele frequencies ranged from 0.0285 (rs5896844) to 0.9293 (rs146391383), with a mean value of 0.5196. The matching probability, power of discrimination, polymorphic information content, power of exclusion, typical paternity index, observed heterozygosity, and expected heterozygosity of the 39 AIM-InDel loci ranged from 0.3539 (rs11273905) to 0.8992 (rs5896844), 0.1008 (rs5896844) to 0.6461 (rs11273905), 0.0538 (rs5896844) to 0.3748 (rs5788207), 0.0022 (rs5896844) to 0.2272 (rs5788207), 0.5258 (rs5896844) to 1.0923 (rs5788207), 0.0491 (rs5896844) to 0.5422 (rs5788207), and 0.0554 (rs5896844) to 0.5002 (rs5788207), with mean values of 0.5017, 0.4983, 0.2850, 0.0967, 0.7864, 0.3405, and 0.3564, respectively.
In order to explore the ancestry components of the Hui group, population genetic structure analysis was conducted with STRUCTURE software based on the 26 reference populations. First, bar plots were generated from the raw genotype data of the total 3013 individual samples at K = 2-7, of which only K = 3-5 are shown. In Figure 4(a), when K = 3, the African populations were mostly pink, the European populations were almost entirely blue, and the East Asian populations were purple, whereas the American and South Asian populations showed a mixture of blue and purple. The Hui group was consistent with the East Asian populations, being mostly purple. When K = 4 and 5, the ancestry components of the Hui group remained consistent with those of the East Asian populations, while the American and South Asian populations could be distinguished from each other to a certain extent.
The optimum K value was chosen based on both biogeographical considerations and the delta K result calculated by the online Harvester program on the basis of the same 38 InDel loci in the total 27 populations from five intercontinental regions, and K was finally set to 3. As shown in Supplementary Table 3, when K = 3, the ancestral components of the Hui group were 0.8887 for cluster 1, 0.0786 for cluster 2, and 0.0327 for cluster 3, which were very similar to those of the East Asian populations. The present study further assumed Europeans, East Asians, and Africans as the three main ancestral origins to explore the ancestry proportions of unknown individuals and populations. As shown in Figure 4(b), at the population level, the Hui group showed a relatively high East Asian ancestry proportion (88.87%).
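The delta K statistic mentioned above is usually computed as the absolute second-order difference of the mean log-likelihood L(K) across replicate STRUCTURE runs, divided by the standard deviation of L(K). The following sketch shows that calculation on made-up numbers; the replicate log-likelihoods are hypothetical (the number of replicates per K is not stated in the text), and the function is an illustration, not the Harvester code.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Delta K per K from replicate log-likelihoods.

    lnp_runs: dict mapping K -> list of ln P(D|K) values from replicate runs.
    Delta K is defined only for K values with both neighbours (K-1, K+1).
    """
    means = {k: mean(values) for k, values in lnp_runs.items()}
    result = {}
    for k in sorted(lnp_runs):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_runs[k])
            if sd > 0:  # skip K values whose replicates are identical
                second_diff = means[k + 1] - 2 * means[k] + means[k - 1]
                result[k] = abs(second_diff) / sd
    return result


# Hypothetical replicate log-likelihoods, for illustration only.
example_runs = {
    2: [-1000.0, -1002.0, -998.0],
    3: [-900.0, -901.0, -899.0],
    4: [-890.0, -892.0, -888.0],
    5: [-885.0, -886.0, -884.0],
}
example_delta_k = delta_k(example_runs)
best_k = max(example_delta_k, key=example_delta_k.get)
```

In this toy input the likelihood rises sharply from K = 2 to K = 3 and then flattens, so delta K peaks at K = 3. As in the study, such a peak is normally weighed together with biogeographical knowledge rather than used alone.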
A PCA scatter plot of the total 3013 individuals from 27 populations in five continents was generated at the individual level by the online software Snipper based on the raw genotype data of the same 38 AIM-InDel loci. As shown in Figure 5, the Hui individuals were labeled in dark blue, and the individuals from the five intercontinental populations were marked in five different colors according to their continents. All the individuals except Americans fell into four main clusters, and almost all Americans were scattered between the European, East Asian, and South Asian clusters. As for the studied Hui group, almost all of the Hui individuals fell within the East Asian cluster, whereas a few overlapped with the American and South Asian clusters.

Population Genetic Analyses of the Hui Group and Other Reference Populations via Multiple Methods.
The population genetic analyses were conducted among the Hui group, 22 reference populations from 1000 Genomes Phase 3 (American populations were excluded), and the reference Xinjiang Uyghur (XJU) group from our previous study [15]. The insertion allele frequencies of the same 38 AIM-InDel loci were compared among the 22 populations from four intercontinental regions (African, European, East Asian, and South Asian), the XJU group, and the Hui group. As shown in Figure 6, the heat map displayed the insertion allele frequency distributions of the 38 AIM-InDel loci in different colors, showing not only the genetic relationships of the total 24 populations but also the clustering of the 38 AIM-InDel loci. In the heat map, the rs3028822, rs3044252, rs16432, and rs3045215 loci exhibited distinctly lower insertion allele frequencies, while the rs10538061, rs146391383, rs3840222, and rs34921138 loci showed relatively higher insertion allele frequencies in East Asian populations. The rs5896844, rs3842715, rs3831885, rs10534050, rs3029066, and rs10569275 loci showed relatively higher insertion allele frequencies, whereas the rs2307783, rs2307840, rs3840222, and rs34921138 loci showed lower insertion allele frequencies in African populations. The insertion allele frequencies of the rs3044252, rs5788637, rs3835409, and rs36038238 loci ranged from 0.400 to 0.600 in South Asian populations. European populations showed relatively lower insertion allele frequencies at six loci (rs35434967, rs34477782, rs3047538, rs3033760, rs10538061, and rs3840794) but higher insertion allele frequency [...]
As shown in Figure 7, there were three main branches in the phylogenetic tree, namely the African, European, and Asian branches, and the populations from different continents were marked in different colors.
As for Asian populations, the main branch could be further divided into two subbranches comprising the East Asian and South Asian populations. The studied Hui group was located in the East Asian subbranch. The length of each branch represented the genetic distance between populations. The Hui group had closer genetic relationships with the East Asian populations; conversely, the largest genetic distances were found between the Hui group and the seven African populations.
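The DA distance on which the tree is built can be illustrated with a short sketch. For biallelic loci, Nei's DA between two populations is one minus the average, over loci, of the summed square roots of the products of matching allele frequencies. The function below applies that definition to insertion allele frequencies; it is an illustration rather than the DISPAN implementation, and the function name and example frequencies are hypothetical.

```python
import math


def nei_da(ins_x, ins_y):
    """Nei's DA distance between two populations for biallelic loci.

    ins_x, ins_y: insertion allele frequencies per locus (same locus order).
    """
    if len(ins_x) != len(ins_y) or not ins_x:
        raise ValueError("frequency lists must be non-empty and of equal length")
    total = 0.0
    for px, py in zip(ins_x, ins_y):
        # Sum over the two alleles (insertion and deletion) of sqrt(x * y).
        total += math.sqrt(px * py) + math.sqrt((1 - px) * (1 - py))
    return 1.0 - total / len(ins_x)


# Hypothetical frequencies at three loci for two populations.
distance = nei_da([0.10, 0.50, 0.90], [0.20, 0.50, 0.60])
```

The distance is 0 for populations with identical frequencies and 1 when every locus is fixed for different alleles, so longer branches in a DA-based tree correspond to more divergent allele frequency profiles.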
In this study, the population pairwise Fst genetic distances were calculated among the 24 populations using the Genepop software, and a heat map of the pairwise Fst values is shown in Figure 8.
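For readers unfamiliar with Fst, the following sketch computes a simple heterozygosity-based pairwise Fst, (HT - HS) / HT, pooled over biallelic loci as a ratio of sums. This is a deliberately simplified stand-in and not the estimator implemented in Genepop or Arlequin, which work from genotype data and account for sample sizes; the function name and the equal weighting of the two populations are assumptions.

```python
def pairwise_fst(ins_x, ins_y):
    """Simple pairwise Fst from insertion allele frequencies.

    Uses Fst = (HT - HS) / HT, where HT is the expected heterozygosity of
    the pooled population and HS the mean within-population expected
    heterozygosity, summed over loci before taking the ratio.
    """
    numerator = 0.0
    denominator = 0.0
    for px, py in zip(ins_x, ins_y):
        p_mean = (px + py) / 2.0
        h_total = 2.0 * p_mean * (1.0 - p_mean)
        h_within = (2.0 * px * (1.0 - px) + 2.0 * py * (1.0 - py)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    # Loci monomorphic in both populations carry no information.
    return numerator / denominator if denominator > 0 else 0.0
```

Values near 0 indicate nearly identical allele frequencies, and values approaching 1 indicate populations fixed for different alleles, which is the scale on which a pairwise Fst heat map is read.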

Discussion
This study chose the Xinjiang Hui group as the research object; the Hui group is one of the largest ethnic minorities in China and is spread across several provinces. Xinjiang was an important region along the historic Silk Road, and the Hui group has been documented as being descended from Silk Road travelers [24,25]. Exploring the genetic background and migration history of the Hui group is helpful for understanding the complex population history of Xinjiang. In recent years, ancestry inference has commonly been used to correct for the effect of population stratification in genome-wide association studies and has also been applied to forensic anthropological research. Especially in forensic genetics, it is still necessary to investigate population genetic diversity, further clarify population structure and background, and explore the biogeographic ancestry of the individual who left biological material at a crime scene. Ancestry inference research helps to narrow the scope of a criminal investigation and provides valuable directional clues for case investigation. Most of the previously published panels for forensic ancestry inference have provided important information to some extent, but they still have some shortcomings; for example, genotyping with some AIM panels is a relatively complex process or requires a specific or expensive detection platform, which makes the panels difficult to popularize and apply widely in primary-level forensic DNA laboratories. Compared with ancestry informative SNPs, mitochondrial DNA, and Y chromosome genetic markers, the novel AIM-InDel panel previously established by our group has the advantages of a simple typing process, multiplex amplification on the capillary electrophoresis platform, and high ancestry inference efficiency.
Figure 6: A heat map of insertion allele frequencies at the same 38 AIM-InDel loci among the Hui group and the reference populations drawn by R software.
The ancestry of the Hui group was analyzed at both the individual and population levels. The ancestral components of the Hui group were inferred by STRUCTURE software based on an admixture ancestry model, and the results revealed that the Hui group had a relatively high East Asian ancestry proportion (88.87%). PCA can be applied to summarize complex genetic data with several principal components [26], and the PCA results also confirmed that the Hui group was of East Asian ancestry. Fst values are commonly regarded as a measure of population differentiation [27], and a phylogenetic tree is a branching diagram showing evolutionary relationships based on similarities and differences in genetic characteristics [28]. The results of the phylogenetic tree and the population pairwise Fst values further supported the above results.
It has been pointed out that small-scale panels with highly informative ancestry markers can achieve the same ancestry inference efficiency as systems with a large number of loci [29,30]. Therefore, our group independently developed the 39 AIM-InDel system in previous research and evaluated its ancestry inference efficiency in three main intercontinental populations (East Asian, European, and African). In this study, we extended the evaluation to five intercontinental populations. First, the pairwise Fst values among the five intercontinental populations and the Rosenberg's In values in the five intercontinental populations were calculated for the same 38 AIM-InDel loci, and the results showed that most of the loci in the AIM-InDel system had high discrimination ability in the four intercontinental populations other than the American populations. In addition, the MDS analysis and the cross-validation success ratios verified that the novel panel gave satisfactory results in the population stratification of four intercontinental populations, i.e., the African, East Asian, European, and South Asian populations. Although the American populations showed a relatively lower cross-validation success ratio (61.96%) and lower pairwise Fst values, this might be due to the genetic background or structure of the reference American populations themselves, rather than the AIM-InDel loci we chose. Previous studies indicated that American populations have mixed and complex ancestral origins due to extensive gene exchange and population migration [31,32]. Therefore, the lower In values of some loci and the lower ancestry inference efficiency of the AIM-InDel panel in the American populations were largely due to the mixed ancestral origins of the American populations.
In general, the 39 AIM-InDel panel developed by our group is an effective, practical, and easy-to-operate tool that can be successfully used for ancestry inference among the five intercontinental populations, albeit with relatively lower efficiency in American populations. It is also well suited to current forensic DNA laboratories.
Besides, many genetic studies based on different genetic markers such as STRs, Y chromosome haplogroups, and HLA-DRB1 also revealed that the Hui group had closer genetic relationships with East Asian populations [33][34][35]. As for ancestry inference, the present result was relatively similar to the finding of He et al. [16], who reported a 96.34% East Asian ancestry component in the Hui group based on AIM-SNPs. The current research is still limited; in order to comprehensively reveal the population genetic relationships in Xinjiang, more groups in this region and more molecular genetic markers should be studied in the future.

Conclusion
In this study, we assessed the ancestry inference efficiency of a self-developed 39 AIM-InDel panel and explored the ancestral components of the Hui group. Multiple statistical analyses were conducted to assess the efficiency of this novel AIM-InDel panel and to validate its ancestry inference. The panel showed satisfactory discrimination among four intercontinental populations and could be applied in forensic genetic analysis, anthropological research, and genetic epidemiology. The results of ancestry inference and population genetic analyses revealed that the Hui group had a relatively high East Asian ancestry proportion (88.87%) and was genetically closer to East Asian populations (especially the CHS and CHB populations). To further reveal the genetic background and migration history of the Chinese Hui group in different regions, more reference populations need to be included in future studies.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.