Gaussian Fuzzy Number for STR-DNA Similarity Calculation Involving Familial and Tribal Relationships

We compute locus similarity by measuring the fuzzy intersection between an individual locus and a reference locus, and from it we compute CODIS STR-DNA similarity. The fuzzy intersection makes the CODIS STR-DNA similarity calculation more robust to the imprecision caused by noise from the PCR machine. We also propose the Gaussian fuzzy number (GFN) and the shifted convoluted Gaussian fuzzy number (SCGFN) to represent each locus value, as improvements on the triangular fuzzy number (TFN) used in previous research. Compared with the TFN, the GFN represents the uncertainty of locus information more realistically because the distribution is assumed to be Gaussian. The original GFN is then convoluted with the distribution of the locus information of a certain ethnic group to produce the SCGFN, which represents ethnic information better than the original GFN. Experiments were done for the following cases: people with family relationships, people of the same tribe, and certain tribal populations. Analysis of variance (ANOVA) shows a difference in similarity between SCGFN, GFN, and TFN at a confidence level of 95%. The Tukey method shows that SCGFN yields a higher similarity, which means it performs better than the GFN and TFN methods. The proposed method enables a CODIS STR-DNA similarity calculation that is more robust to noise and performs better in cases involving familial and tribal relationships.


Introduction
Genetics is the study of genes, genetic variation, and heredity in living organisms. Population genetics is a part of evolutionary biology and a subfield of genetics that deals with genetic differences within and between populations [1]. Variations in traits among human populations represent genetic differences that can be inherited from generation to generation. Population genetics studies genetic variation in a population by examining and modeling changes in the frequency of genes and alleles in populations over time and space [2]. It gives us the opportunity to step back and observe patterns of genetic change over time. Comparing populations with one another can reveal how external factors trigger the evolution of a trait and can help map the variants associated with various traits within a population. Population genetics thus offers another way of looking at DNA, one that can generate insights with the potential to benefit everyone. Many of the genes found in a population are polymorphic; that is, they occur in a number of different alleles. Mathematical models are used to investigate and predict the occurrence of specific alleles or combinations of alleles in the population; the focus is on comparing groups, populations, or species, not individuals.
A population is a group of individuals with the same characteristics (species) that live in the same place and are able to reproduce with each other; evolution also works through populations [3]. Geneticists, on the other hand, view the population as a means or container for the exchange of the alleles owned by its individual members. The dynamic frequency of alleles in a population is of major concern in the study of population genetics.
Advances in Bioinformatics
DNA regions with short repeat units (usually 2-6 base pairs in length) are called Short Tandem Repeats (STRs). STRs are found surrounding the chromosomal centromere. STRs have proven to have several benefits that make them especially suitable for human identification [4]. STRs have become popular DNA markers because they are easily amplified by polymerase chain reaction (PCR) without the problem of differential amplification; that is, the PCR products for STRs are generally similar in amount, which makes analysis easier. An individual inherits one copy of an STR from each parent, and the two copies may or may not have similar repeat sizes. The number of repeats in STR markers can be highly variable among individuals, which makes these STRs effective for human identification purposes [5].
Beginning in 1996, the FBI Laboratory launched a nationwide forensic science effort to establish core STR loci for inclusion within the national database known as CODIS (Combined DNA Index System). The 13 CODIS loci are CSF1PO, FGA, TH01, TPOX, VWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, and D21S11. These loci are nationally and internationally recognized as the standard for human identification. The STR-DNA markers used in this research were 15 loci: the CODIS loci together with two additional loci, D19S433 and D2S1338, which provide a more extensive and powerful STR testing battery if required [6,7].
A person's DNA profile can be compared with another person's DNA profile to measure how similar they are. DNA profiles play an important role in solving problems related to paternity and to relationships with other family members [8,9]. The method is legally accepted for proving the validity of kinship or family ties, for identifying the unknown bodies of victims of war or natural disaster, and for studying human populations [10,11].
In previous research by M. R. Widyanto et al. [12][13][14], the triangular fuzzy number was quite sufficient for measuring the similarity between two alleles, but the statistical information on the ethnicity of the two profiles was lost. To overcome this problem, this research employs novel methods for measuring similarity between tribes that give better results than the previous method. We propose the Gaussian fuzzy number (GFN) and the shifted convoluted Gaussian fuzzy number (SCGFN) to represent each locus value, as improvements on the triangular fuzzy number (TFN) used in previous research.
The proposed research method is described in Section 2. Experimental results for the three methods are shown in Section 3. The statistical analyses and comparison tests are summarized in Section 4.

Proposed Research Method
2.1. Fuzzy Sets. Fuzzy sets are held as a basis for the theory of possibility. A fuzzy set $A$ in $X$ is formally defined as follows [15]:
$$A = \{(x, \mu_A(x)) \mid x \in X\},$$
where $X$ is the universe of discourse and $\mu_A(x) \in [0, 1]$ is the membership degree of $x$ in $A$. When fuzzy set theory was presented, researchers considered decision-making as one of the most attractive application fields of that theory [16].

Measurement of Similarity Values of Two-Individual STR-DNA.
The STR-DNA similarity of two individuals (Figure 1) is calculated by comparing, at each locus, the STR-DNA value of allele 1 of the individual with the STR-DNA value of allele 1 of the reference, and the STR-DNA value of allele 2 of the individual with the STR-DNA value of allele 2 of the reference. We then find the value of the intersection point of the two alleles for each locus and calculate the average of the similarity values over the loci.

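Two-Individual Matching: Evidence versus Reference with TFN Similarity.
A triangular fuzzy number $\tilde{A} = (a_1, a_2, a_3)$ has the membership function
$$\mu_{\tilde{A}}(x) = \begin{cases} \dfrac{x - a_1}{a_2 - a_1}, & a_1 \le x \le a_2, \\ \dfrac{a_3 - x}{a_3 - a_2}, & a_2 \le x \le a_3, \\ 0, & \text{otherwise}, \end{cases}$$
where $0 \le a_1 \le a_2 \le a_3 \le 1$, $a_1$ and $a_3$ stand for the lower and upper values of the support of $\tilde{A}$, respectively, and $a_2$ stands for the modal value. The value of every DNA locus is represented by a triangular fuzzy number whose fuzziness is set to 0.4 through experiments and whose center is the value of the corresponding locus. The similarity value between an allele of the DNA profile evidence and an allele of the DNA profile reference is given by
$$t = \frac{a_3 - a_1}{0.4},$$
where the value of the first allele is smaller than the value of the second allele; $t$ is the intersection of the two alleles, $a_2$ is the STR-DNA value of the first allele, $a_3$ is $a_2 + 0.2$, and $a_1$ is the STR-DNA value of the second allele minus 0.2.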
If the calculation gives a negative value of t, then t is set to zero, because a negative value means that the two STR values do not intersect; the values of t therefore stay in the interval [0, 1].
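The intersection rule for two triangular fuzzy numbers can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `tfn_similarity` and the example allele values are hypothetical, and the sketch assumes symmetric triangles whose feet lie 0.2 away from the allele value (a fuzziness of 0.4).

```python
def tfn_similarity(allele_a: float, allele_b: float, fuzziness: float = 0.4) -> float:
    """Height of the intersection of two symmetric triangular fuzzy numbers.

    Assumption: each allele value is the peak of a triangle whose feet lie
    fuzziness / 2 away from the peak (0.2 for the 0.4 setting in the text).
    """
    half = fuzziness / 2.0
    low, high = sorted((allele_a, allele_b))
    a3 = low + half    # right foot of the triangle with the smaller allele value
    a1 = high - half   # left foot of the triangle with the larger allele value
    t = (a3 - a1) / fuzziness  # height at which the two slopes cross
    return max(0.0, t)         # a negative height means no intersection


# Hypothetical allele values, for illustration only.
same = tfn_similarity(15.0, 15.0)   # identical alleles: full overlap
close = tfn_similarity(15.0, 15.2)  # partial overlap
far = tfn_similarity(15.0, 16.0)    # triangles do not touch
```

Sorting the two values first means the caller does not have to decide which allele is the smaller one.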
Figure 2 shows an example of the geometric calculation of the intersection point of the individual and the reference.
The similarity between two DNA profiles is then calculated as the average of the similarity over all loci, that is, the arithmetic mean, which is expressed as
$$S_i(\mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum_{j=1}^{n} t_j,$$
where $S_i$ is the value of similarity between the DNA profile individual and the DNA profile reference of the $i$th individual, $t_j$ is the similarity value at the $j$th locus, $\mathbf{x}$ is a vector of the DNA profile individual, and $\mathbf{y}$ is a vector of the DNA profile reference. The vectors $\mathbf{x}$ and $\mathbf{y}$ are $n$-dimensional ($n \in \mathbb{N}$) vectors consisting of the values of the 15 loci, without amelogenin, as used by the Federal Bureau of Investigation (FBI).
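The averaging step can be illustrated with a short Python sketch. The names `profile_similarity` and `tfn_similarity`, the three-locus profiles, and the choice to average the two allele comparisons within a locus are assumptions made for illustration; real profiles in this work have 15 loci.

```python
from statistics import mean


def tfn_similarity(x, y, fuzziness=0.4):
    """Intersection height of two symmetric triangular fuzzy numbers, clipped at 0."""
    return max(0.0, 1.0 - abs(x - y) / fuzziness)


def profile_similarity(individual, reference, allele_similarity=tfn_similarity):
    """Arithmetic mean of the per-locus similarities of two STR-DNA profiles.

    Each profile is a list of (allele_1, allele_2) pairs in the same locus order.
    Allele 1 is compared with allele 1 and allele 2 with allele 2; averaging the
    two results inside a locus is an assumption of this sketch.
    """
    if len(individual) != len(reference):
        raise ValueError("profiles must cover the same loci")
    per_locus = [
        (allele_similarity(i1, r1) + allele_similarity(i2, r2)) / 2.0
        for (i1, i2), (r1, r2) in zip(individual, reference)
    ]
    return mean(per_locus)


# Hypothetical three-locus profiles, for illustration only.
evidence = [(15.0, 17.0), (8.0, 9.3), (11.0, 12.0)]
reference = [(15.0, 17.0), (8.0, 9.0), (11.0, 14.0)]
score = profile_similarity(evidence, reference)
```

Passing the allele similarity as a parameter lets the same averaging code be reused with a Gaussian similarity function.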

Two-Individual Matching: Evidence versus Reference with GFN Similarity.
To improve the capability of locus matching, we propose GFN (Gaussian fuzzy number) similarity.
Compared with the traditional triangular fuzzy number (TFN), the GFN represents the uncertainty of locus information more realistically because the distribution is assumed to be Gaussian. The Gaussian fuzzy function transforms the original values into a normal distribution. The midpoint of the normal distribution defines the ideal definition for the set and is assigned a membership of 1; the remaining input values decrease in membership as they move away from the midpoint in both the positive and negative directions, until they are too far from the ideal definition, are definitely not in the set, and are therefore assigned zeros. The fuzzy Gaussian function is given below [18]:
$$\mu(x) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right).$$
A Gaussian membership function is completely determined by two parameters, $c$ and $\sigma$: $c$ represents the membership center (the peak of the curve) and $\sigma$ determines the width of the membership function. The intersection of two Gaussian functions is as follows [19]:
$$x^{*} = \frac{c_1 \sigma_2 + c_2 \sigma_1}{\sigma_1 + \sigma_2}, \qquad \mu(x^{*}) = \exp\left(-\frac{(c_2 - c_1)^2}{2(\sigma_1 + \sigma_2)^2}\right),$$
where $x^{*}$ is the crossing point between the two centers, $\mu(x^{*})$ is the membership degree at that point, $c_1$ is the value of STR-DNA from the individual, $c_2$ is the value of STR-DNA from the reference, $\sigma_1$ is the sigma value of the individual, and $\sigma_2$ is the sigma value of the reference.
The accompanying figure shows an example of the Gaussian intersection point of the individual and the reference.
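A minimal Python sketch of the Gaussian membership function and of the crossing point of two such functions follows. It assumes both functions have height 1 and takes the crossing point that lies between the two centers; the function names and the example values are hypothetical.

```python
import math


def gaussian_membership(x, center, sigma):
    """Gaussian membership degree of x for a fuzzy number with the given center and width."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def gfn_intersection(c1, sigma1, c2, sigma2):
    """Crossing point between the two centers and the membership degree at that point.

    Between the centers the two curves meet where the distances to each center,
    measured in units of the respective sigma, are equal.
    """
    x_star = (c1 * sigma2 + c2 * sigma1) / (sigma1 + sigma2)
    height = math.exp(-((c2 - c1) ** 2) / (2.0 * (sigma1 + sigma2) ** 2))
    return x_star, height


# Hypothetical values: individual allele 15 and reference allele 16, both with sigma 1.
x_star, similarity = gfn_intersection(15.0, 1.0, 16.0, 1.0)
```

Unlike the triangular case, the Gaussian similarity never drops to exactly zero, so distant alleles still receive a small positive value.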

Two-Individual Matching: Evidence versus Reference with SCGFN Similarity.
To improve the capability of locus matching in which ethnic information is involved, we propose SCGFN (shifted convoluted Gaussian fuzzy number) similarity. The original Gaussian fuzzy number (GFN) is convoluted with the distribution of the locus information of a certain ethnic group. The new SCGFN therefore represents ethnic information better than the original GFN.
The convolution function is a multiplication of the individual fuzzy Gaussian locus function and the fuzzy Gaussian approximation of the population locus. The fuzzy Gaussian approximation of the population locus is defined by its mean, which is the STR-DNA value with the highest population density, and by the standard deviation of the particular population. The fuzzy Gaussian of the individual locus is defined by its mean, which is the STR-DNA value of the individual locus, and by its standard deviation, which is the mean value minus 2.
Convolution is a mathematical operation on two functions ($f$ and $g$) that produces a third function, typically viewed as a modified version of one of the original functions; it gives the integral of the pointwise multiplication of the two functions as a function of the amount by which one of the original functions is translated [20]:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau.$$
The function $f$ is obtained from the individual and the function $g$ is derived from the reference; both are Gaussian fuzzy numbers whose height is $a = 1$. The mean and standard deviation of the result are then calculated, and the convoluted fuzzy number replaces the original fuzzy number as the representation value of the individual locus. To strengthen the tribal relationship in the individual similarity calculation, the convoluted Gaussian fuzzy number is then shifted toward the reference fuzzy number of the tribal population. The new shifted convoluted fuzzy number is called the shifted convoluted Gaussian fuzzy number (SCGFN). This SCGFN is the new representation value of the individual locus.
The mean value of the SCGFN is obtained by comparing the mean value of the individual fuzzy Gaussian ($c_f$) with the mean value of the fuzzy Gaussian approximation of the locus population ($c_g$); the convoluted fuzzy number is then shifted, by the SCGFN shifting algorithm, toward the fuzzy Gaussian approximation of the locus population of the given tribe. The standard deviation of the SCGFN is the sum of the standard deviation of the individual fuzzy Gaussian number ($\sigma_f$) and the standard deviation of the fuzzy Gaussian approximation of the population locus of the given tribe ($\sigma_g$):
$$\sigma_{\mathrm{SCGFN}} = \sigma_f + \sigma_g.$$
2.6. Measuring Tribal Relative Value from a DNA Profile. In general, the tribal inference system finds the tribal similarity by calculating the average value of the intersection points between the individual and the tribal population in the database over the 15 loci. The tribe with the highest similarity value to the individual profile is selected as the ethnic estimate of the profile. The workflow can be seen in Figure 4.
Tribal matching with the triangular fuzzy number is obtained from the intersection of the individual triangular fuzzy number and the triangular fuzzy population approximation. For each tribal population, the mean intersection value of the individual triangular fuzzy number and the triangular fuzzy population approximation is calculated over the loci by using formula (3), and the individual is assigned to the tribal population with the greatest mean value. The triangular fuzzy population approximation is also obtained by using formula (3), where a2 is the STR-DNA value with the highest population density, a3 is a2 + 0.2, and a1 is a2 - 0.2. An example is shown in Table 1.
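The tribal inference workflow with triangular fuzzy numbers can be sketched as follows. This is a simplified illustration under stated assumptions: each locus is reduced to a single value, the tribe labels and reference values are invented, and the names `infer_tribe` and `tfn_similarity` are hypothetical.

```python
from statistics import mean


def tfn_similarity(x, y, fuzziness=0.4):
    """Intersection height of two symmetric triangular fuzzy numbers, clipped at 0."""
    return max(0.0, 1.0 - abs(x - y) / fuzziness)


def infer_tribe(individual, tribe_references, similarity=tfn_similarity):
    """Pick the tribe whose population approximation is most similar on average.

    individual: one STR-DNA value per locus (a simplification of this sketch).
    tribe_references: maps a tribe label to the highest-density STR-DNA value
    of that tribe at each locus, in the same locus order.
    """
    scores = {
        tribe: mean(similarity(x, r) for x, r in zip(individual, reference))
        for tribe, reference in tribe_references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical three-locus data, for illustration only.
individual = [15.0, 9.0, 12.0]
references = {"A": [15.0, 9.1, 12.0], "B": [16.0, 9.0, 12.3]}
best_tribe, tribe_scores = infer_tribe(individual, references)
```

Returning the full score dictionary alongside the winner makes it easy to see how close the runner-up tribes are.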

Tribal Matching with GFN.
Tribal matching with Gaussian fuzzy similarity is obtained from the intersection of the individual fuzzy Gaussian with the fuzzy Gaussian population approximation. The individual Gaussian fuzzy number is obtained by using formula (7), where the center $c$ is the individual STR-DNA value and the width is $\sigma = 1$. The population Gaussian fuzzy number is obtained by using formula (7), where the center $c$ is obtained from the distribution of the number of individuals and $\sigma$ is the standard deviation of that distribution. The standard deviation is calculated from the STR-DNA values and the number of individuals of a locus population.
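A minimal sketch of tribal matching with Gaussian fuzzy numbers at a single locus is given below. Several choices here are assumptions rather than details confirmed by the text: the population center is taken as the arithmetic mean of the allele values, the population standard deviation divides by the number of individuals (`statistics.pstdev`), and all names and data are hypothetical.

```python
import math
from statistics import mean, pstdev


def population_gfn(allele_values):
    """Return (center, sigma) of a Gaussian approximation of a locus population.

    Assumption: center is the mean and sigma the population standard deviation.
    """
    return mean(allele_values), pstdev(allele_values)


def gfn_similarity(c1, sigma1, c2, sigma2):
    """Membership degree where two height-1 Gaussians cross between their centers."""
    return math.exp(-((c2 - c1) ** 2) / (2.0 * (sigma1 + sigma2) ** 2))


def match_tribe(individual_value, populations, individual_sigma=1.0):
    """Score one individual allele against each tribe and return the best tribe."""
    scores = {}
    for tribe, values in populations.items():
        center, sigma = population_gfn(values)
        scores[tribe] = gfn_similarity(individual_value, individual_sigma, center, sigma)
    return max(scores, key=scores.get), scores


# Hypothetical allele values at one locus for two tribes.
populations = {"A": [14.0, 15.0, 15.0, 16.0], "B": [17.0, 18.0, 18.0, 19.0]}
best, scores = match_tribe(15.0, populations)
```

In a full system the per-locus scores would be averaged over the 15 loci before the tribes are ranked.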

Experimental Results
The DNA profile data entered into the database system are PCR-based DNA identification profiles consisting of 15 loci, excluding amelogenin, with two alleles for each locus. The DNA profiles used as input to the tribal inference system are Indonesian DNA profiles, with a total of 240 DNA data comprising Java (A), Malay (B), Mentawai (C), and Toraja.

People Who Have Family Relationships.
The DNA profiles were tested against both biological parents, father and mother, as references. From the experiment, the average similarity of an individual with a family reference is obtained: SCGFN 87%, TFN 39%, and GFN 74%. Table 2 shows ten instances of the similarity of an individual with a reference who is the mother or the father.

People Who Belong to the Same Tribe.
From the experiment, the average similarity of two individuals from the same tribe is obtained: SCGFN 89.6%, TFN 21%, and GFN 65.14%. Table 3 shows ten instances of the similarity of two individuals from the same tribe.

Certain Tribal Populations.
The population data consist of four tribes, where the number of people in tribe A is 80, the number of people in tribe B is 100, the number of people in tribe C is 20, and the number of people in tribe D is 40. Figure 6 shows an example of the tribal population similarity calculation for the individual identity JT19. From Figure 6 it can be seen that SCGFN, GFN, and TFN all identify the tribe of JT11 as tribe A. Table 4 shows the similarity values with a certain tribal population for SCGFN, TFN, and GFN. From 80 experiments on tribe A, the average similarity to the tribal population was 79% with fuzzy convolution (SCGFN), 45% with fuzzy triangular (TFN), and 71% with fuzzy Gaussian (GFN). The averages for population B over 100 experiments were 81% with fuzzy convolution, 46% with fuzzy triangular, and 76% with fuzzy Gaussian. The averages for population C over 40 experiments were 80% with fuzzy convolution, 45% with fuzzy triangular, and 73% with fuzzy Gaussian. The averages for population D over 20 experiments were 84% with fuzzy convolution, 64% with fuzzy triangular, and 79% with fuzzy Gaussian.

Individuals Who Have Family Relationships.
To find out whether the three methods differ, we used ANOVA and comparative tests. In the one-way ANOVA test there is only one independent variable, which in this case is the individuals who have family relationships. The summary of the analysis of variance is given in Algorithm 1; the P value is ≤ 0.001. At the level of significance α = 0.05, p < α, which means H0 is rejected, so it can be concluded that there is a difference between the three methods. To determine which method is better than the others, the methods are further tested with the Tukey method. In Algorithm 2 it can be seen that (i) TFN < SCGFN, because the interval does not contain zero and its center is negative; (ii) GFN < SCGFN, because the interval does not contain zero and its center is negative; (iii) GFN > TFN, because the interval does not contain zero and its center is positive.
Then it can be concluded that TFN < GFN < SCGFN. From the results of the similarity test with the three methods and the boxplot in Figure 7, it can be seen that the SCGFN method used in this research has a higher similarity value than the GFN and TFN methods.

Individuals Who Belong to the Same Tribe.
The analysis of variance gave a P value ≤ 0.001. At the level of significance α = 0.05, or confidence level 95%, p < α, which means H0 is rejected, so it can be concluded that there is a difference between the three methods. To determine which method is better than the others, the methods are further tested with the Tukey method.
From Algorithm 4 it can be concluded that TFN < GFN < SCGFN. From the results of the similarity test with the three methods and the boxplot in Figure 8, it can be seen that the SCGFN method used in this research has a higher similarity value than the GFN and TFN methods.

Certain Tribal Populations.
Testing was done with 720 data, that is, 240 data for each of the 3 methods. The summary of the analysis of variance for A, B, C, and D is given in Algorithm 5; the P value is ≤ 0.001. At the level of significance α = 0.05, p < α, which means H0 is rejected, so it can be concluded that there is a difference between the three methods.
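The one-way ANOVA used here compares the variation between the three methods with the variation inside each method. The sketch below computes only the F statistic and its degrees of freedom in plain Python; turning F into a P value additionally needs the F distribution, which a statistics package can supply. The function name and the similarity scores are hypothetical and are not the data of this study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up similarity scores for three methods, for illustration only.
scgfn = [0.85, 0.88, 0.90]
tfn = [0.35, 0.40, 0.42]
gfn = [0.70, 0.74, 0.78]
f_stat, df_b, df_w = one_way_anova_f([scgfn, tfn, gfn])
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom corresponds to a small P value and therefore to rejecting H0.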
The experiment was conducted with 3 methods, where SCGFN is method 1, TFN is method 2, and GFN is method 3. To determine which method is better than the others, each tribe is tested further.
In Algorithm 6 the comparison of the three methods for tribal population A can be explained:
(1) TFN < SCGFN because the interval does not contain zero and its center is negative.
So it can be concluded that TFN < GFN < SCGFN, which means SCGFN is better than GFN and GFN is better than TFN.
In Algorithm 7 the comparison of the three methods for tribal population B can be explained:
(1) TFN < SCGFN because the interval does not contain zero and its center is negative.
(2) GFN = SCGFN because the interval contains zero.
So it can be concluded that TFN < (GFN = SCGFN), which means that the SCGFN and GFN methods are equal and better than TFN.
In Algorithm 8 the comparison of the three methods for tribal population C can be explained:
(1) TFN < SCGFN because the interval does not contain zero and its center is negative.
(2) GFN < SCGFN because the interval does not contain zero and its center is negative.
(3) GFN > TFN because the interval does not contain zero and its center is positive.
So it can be concluded that TFN < GFN < SCGFN, which means SCGFN is better than GFN and GFN is better than TFN.
In Algorithm 9 the comparison of the three methods for tribal population D can be explained:
(1) TFN < SCGFN because the interval does not contain zero and its center is negative.
(2) GFN = SCGFN because the interval contains zero.
So it can be concluded that TFN < (GFN = SCGFN), which means that the SCGFN and GFN methods are equal and better than TFN.
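The reading of the Tukey output used in these comparisons can be written as a small decision rule. The sketch assumes each interval is a confidence interval for the difference "first method minus second method"; the function name and the interval bounds are hypothetical.

```python
def compare_from_interval(lower, upper, first, second):
    """Interpret a Tukey confidence interval for mean(first) - mean(second).

    An interval entirely below zero means first < second, an interval entirely
    above zero means first > second, and an interval containing zero means the
    two methods are not distinguishable at the chosen confidence level.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if upper < 0:
        return f"{first} < {second}"
    if lower > 0:
        return f"{first} > {second}"
    return f"{first} = {second}"


# Made-up interval bounds, for illustration only.
r1 = compare_from_interval(-0.55, -0.41, "TFN", "SCGFN")
r2 = compare_from_interval(-0.06, 0.02, "GFN", "SCGFN")
r3 = compare_from_interval(0.25, 0.39, "GFN", "TFN")
```

Chaining the pairwise results, for example TFN < SCGFN, GFN < SCGFN, and GFN > TFN, gives the overall ordering TFN < GFN < SCGFN.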

Conclusions
In this research, experiments were conducted to find a better method for obtaining higher individual similarity values and stronger tribal properties. To improve the capability of locus matching, SCGFN and GFN have been proposed; both measure the fuzzy number similarity between two alleles. Experiments were done for the following cases: people with family relationships, people of the same tribe, and certain tribal populations. In these three cases, ANOVA shows a difference in similarity between SCGFN, GFN, and TFN at a confidence level of 95%. In the case of people with family relationships and the case of people of the same tribe, the Tukey method shows that SCGFN yields a higher similarity, which means it is better than the GFN and TFN methods. In the case of certain tribal populations, the Tukey method shows that in population A and population C SCGFN is better than GFN and TFN, whereas in population B and population D SCGFN is equal to GFN and better than TFN. The proposed method enables a CODIS STR-DNA similarity calculation that is more robust to noise and performs better in cases involving familial and tribal relationships.

Data Availability
The STR-DNA data used to support the findings of this study are available from the corresponding author upon request.