The Y-chromosome short tandem repeat (Y-STR) data presented here were collected mainly for benchmarking the performance of clustering methods. There are six Y-STR dataset items, divided into two categories: Y-STR surname data and Y-haplogroup data. The Y-STR data are categorical, but they differ from other categorical data in that they contain many similar and almost similar objects. This characteristic has caused problems for existing clustering algorithms.
Y-chromosome short tandem repeats (Y-STRs) are the tandem repeats on the Y-chromosome. The value recorded for a Y-STR marker is the number of times an STR motif repeats, and it is often called the allele value of the marker. Most marker names begin with the prefix D, which stands for DNA, followed by Y for Y-chromosome and S for a single copy sequence, and then by the location on the Y-chromosome, known as the locus. This nomenclature is based on an international standards body, the HUGO Gene Nomenclature Committee (HUGO;
The Y-STR data are now being actively adapted as a remarkable method in genetic genealogy and anthropology studies such as in Hart [
The Y-STR data have been applied and used in clustering Y-surname and Y-haplogroup applications. Initial benchmarking results of clustering Y-STR data have been reported (see, e.g., [
Clustering accuracy scores of each dataset item.
| Algorithm | Dataset item 1 | Dataset item 2 | Dataset item 3 | Dataset item 4 | Dataset item 5 | Dataset item 6 |
|---|---|---|---|---|---|---|
| | 0.70 | 0.79 | 0.84 | 0.84 | 0.74 | 0.62 |
| | 0.79 | 0.83 | 0.87 | 0.78 | 0.87 | 0.72 |
| | 0.65 | 0.75 | 0.83 | 0.87 | 0.56 | 0.54 |
| | 0.67 | 0.81 | 0.85 | 0.77 | 0.80 | 0.64 |
| | 0.56 | 0.82 | 0.83 | 0.79 | 0.81 | 0.70 |
| Fuzzy | 0.56 | 0.74 | 0.74 | 0.97 | 0.76 | 0.66 |
| | 0.80 | 0.90 | | 1.00 | 0.97 | 0.84 |
| New Fuzzy | 0.71 | 0.84 | 0.77 | 1.00 | 0.77 | 0.69 |
| | | | 0.96 | | | |
Clustering accuracy scores of the six Y-STR items.
| Algorithm | | Mean | Standard deviation | 95% confidence interval of the mean, lower bound | 95% confidence interval of the mean, upper bound | Min | Max |
|---|---|---|---|---|---|---|---|
| | 600 | 0.76 | 0.13 | 0.75 | 0.77 | 0.45 | 1.00 |
| | 600 | 0.81 | 0.11 | 0.80 | 0.82 | 0.56 | 1.00 |
| | 600 | 0.70 | 0.17 | 0.69 | 0.71 | 0.38 | 1.00 |
| | 600 | 0.76 | 0.13 | 0.75 | 0.77 | 0.38 | 1.00 |
| | 600 | 0.75 | 0.14 | 0.74 | 0.76 | 0.45 | 1.00 |
| Fuzzy | 600 | 0.74 | 0.16 | 0.73 | 0.75 | 0.32 | 1.00 |
| | 600 | 0.91 | 0.09 | 0.91 | 0.92 | 0.59 | 1.00 |
| New Fuzzy | 600 | 0.80 | 0.13 | 0.79 | 0.81 | 0.44 | 1.00 |
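The confidence bounds in the table above can be checked from the mean and standard deviation. The following is a minimal sketch, assuming that the value 600 in each row is the number of accuracy scores and that a normal approximation (z = 1.96) is adequate; the function name is illustrative only and the original computation may have differed.

```python
import math

def mean_confidence_interval(mean, std_dev, n, z=1.96):
    """Return (lower, upper) bounds of a normal-approximation CI for the mean."""
    half_width = z * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width

# First row of the table: mean 0.76, standard deviation 0.13.
# ASSUMPTION: 600 is the number of accuracy scores behind that row.
lower, upper = mean_confidence_interval(0.76, 0.13, 600)
print(f"{lower:.2f} {upper:.2f}")  # prints "0.75 0.77"
```

Under these assumptions the first row reproduces the reported bounds of 0.75 and 0.77. Rows whose mean was rounded before reporting may differ in the last digit.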
Thus, the objective of this paper is to describe in detail the six Y-STR dataset items used in the previously reported clustering benchmarks. Earlier reports gave only a summary of the six items and did not describe methodological aspects such as data acquisition, filtration, distribution, and similarity. Detailed descriptions of these Y-STR data are important for future reference and for further benchmarks in relevant applications.
The Y-STR data are secondary data. They were compiled from the raw results of DNA genealogical testing reported in various Y-DNA projects. Most of these results can be accessed publicly through a genealogical portal, or database, called WorldFamilies.net (see
However, most of the results were not restricted to a specific number of DNA testing markers. The reported results were therefore not uniform, because there is no standard for the number of markers chosen by participants. The companies that provide DNA testing services usually offer tests ranging from a minimum of 6 markers to a maximum of 67 markers. Participants who wish to establish their familial relatedness more stringently may choose up to 67 markers; others require only a few.
There are two groups of data representations: the Y-STR data for Y-haplogroup representation and the Y-STR data for Y-surname representation. Three dataset items were established for each group. For the purpose of clustering analysis, each datum was given a prefix attached to its original kit number. For the Y-surname data, the prefix is a letter denoting the surname or group; for example, a datum belonging to the Donald family receives the prefix D, giving a label such as D-15868. For the haplogroup data, the prefix is the haplogroup, as in A-23456, which represents haplogroup A. These naming conventions preserve the original references in case questions arise in future. They were also used when analyzing the clustering accuracy results in the misclassification matrix during the experimental analysis. The misclassification matrix is a method proposed by Huang [
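The role of the prefix in the accuracy analysis can be illustrated with a short sketch. It assumes that the class of each object is the text before the first hyphen of its label and that accuracy is scored by crediting each cluster with its most frequent class; this majority rule is an assumption made for illustration and is not necessarily the exact procedure used in the reported experiments. The kit numbers and cluster assignments below are hypothetical.

```python
from collections import Counter, defaultdict

def true_class(label):
    """Extract the class prefix from a label such as 'D-15868' or 'A-23456'."""
    return label.split("-", 1)[0]

def misclassification_matrix(labels, cluster_ids):
    """Count objects for every (cluster, true class) pair."""
    matrix = defaultdict(Counter)
    for label, cluster in zip(labels, cluster_ids):
        matrix[cluster][true_class(label)] += 1
    return matrix

def clustering_accuracy(matrix, n_objects):
    """ASSUMED scoring rule: sum the largest class count of each cluster, divide by n."""
    return sum(max(row.values()) for row in matrix.values()) / n_objects

# Hypothetical labels and cluster assignments, for illustration only.
labels = ["D-15868", "D-10001", "F-20002", "F-20003", "M-30004"]
clusters = [0, 0, 1, 0, 1]

matrix = misclassification_matrix(labels, clusters)
accuracy = clustering_accuracy(matrix, len(labels))
print(dict(matrix[0]), dict(matrix[1]), accuracy)
```

In this toy case cluster 0 holds two D objects and one F object, cluster 1 holds one F and one M, and the score is 3/5 = 0.6.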
The Y-STR data are treated as categorical rather than numerical data, even though the allele values are numeric. This is because the distance between two Y-STR objects is measured by comparing, marker by marker, the allele (attribute) values of the objects and of their modal haplotype. The total number of mismatches is the genetic distance between the two Y-STR objects. In fact, an initial experimental result showed that the Y-STR data were better treated as categorical objects than as numerical objects [
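The mismatch count described above can be sketched as a simple matching dissimilarity: two objects are compared marker by marker, and each position at which the allele values differ adds one to the distance, regardless of how far apart the two values are. The function name and the allele values below are hypothetical and serve only as an illustration.

```python
def genetic_distance(a, b):
    """Count the marker positions at which two haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical allele values over five markers.
x = (13, 24, 14, 11, 11)
y = (13, 23, 14, 10, 11)
z = (13, 20, 14, 10, 11)

print(genetic_distance(x, y))  # prints 2
print(genetic_distance(x, z))  # prints 2
```

Both comparisons give a distance of 2 even though the second allele of z is four repeats away from that of x. This is the sense in which the allele values are handled as categories rather than as magnitudes.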
The Y-STR data were filtered to the 25 markers of the Y-DNA 25-marker test. The chosen markers were DYS393, DYS390, DYS19 (394), DYS391, DYS385a, DYS385b, DYS426, DYS388, DYS439, DYS389I, DYS392, DYS389II, DYS458, DYS459a, DYS459b, DYS455, DYS454, DYS447, DYS437, DYS448, DYS449, DYS464a, DYS464b, DYS464c, and DYS464d. The 25 markers were chosen for the following reasons. They are considered good enough for ruling out a genetic connection between two people. According to Fitzpatrick [ The 25-marker test is a moderate choice taken by many participants, so its results were the most widely available for establishing the dataset.
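The filtering step can be sketched as follows, assuming each raw record is a mapping from marker name to allele value. The record layout, the helper names, and the sample values are hypothetical; the fourth DYS464 copy is taken to be DYS464d, as in the marker table, and DYS19 stands for DYS19 (394).

```python
# Marker names as listed in the text.
PANEL_25 = [
    "DYS393", "DYS390", "DYS19", "DYS391", "DYS385a", "DYS385b", "DYS426",
    "DYS388", "DYS439", "DYS389I", "DYS392", "DYS389II", "DYS458", "DYS459a",
    "DYS459b", "DYS455", "DYS454", "DYS447", "DYS437", "DYS448", "DYS449",
    "DYS464a", "DYS464b", "DYS464c", "DYS464d",
]

def to_panel(record, panel=PANEL_25):
    """Return the allele values in panel order, or None if any marker is missing."""
    if any(marker not in record for marker in panel):
        return None
    return tuple(record[marker] for marker in panel)

def filter_records(records, panel=PANEL_25):
    """Keep only records that report every marker in the panel."""
    rows = (to_panel(record, panel) for record in records)
    return [row for row in rows if row is not None]

# Hypothetical records: one reports 26 markers, the other only 12.
full = {marker: 10 for marker in PANEL_25}
full["DYS460"] = 11  # a marker outside the panel is ignored
short = {marker: 10 for marker in PANEL_25[:12]}

kept = filter_records([full, short])
print(len(kept), len(kept[0]))  # prints "1 25"
```

Records tested on more than 25 markers are truncated to the panel, and records tested on fewer are dropped, so that every retained object has the same 25 attributes.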
Table
The detailed description of the 25 Y-STR markers.
Marker name | Repeat motif | Allele range | Mutation rate | Note |
---|---|---|---|---|
DYS393 | AGAT | 9–17 | 0.00076 | DYS393 is also known as DYS395 |
DYS390 | (TCTA) (TCTG) | 17–28 | 0.00311 | — |
DYS394 | TAGA | 10–19 | 0.00151 | DYS394 is also known as DYS19 |
DYS391 | TCTA | 6–14 | 0.00265 | — |
DYS385a | GAAA | 7–28 | 0.00226 | — |
DYS426 | GTT | 10–12 | 0.00009 | — |
DYS388 | ATT | 10–16 | 0.00022 | — |
DYS439 | AGAT | 9–14 | 0.00477 | DYS439 is also known as Y-GATA-A4 |
DYS389i | (TCTG) | 9–17 | 0.00186, | DYS389 is a multicopy marker which includes DYS389i and DYS389ii. DYS389ii refers to the total length of DYS389 |
DYS392 | TAT | 6–17 | 0.00052 | — |
DYS458 | GAAA | 13–20 | 0.00814 | — |
DYS459a | TAAA | 7–10 | 0.00132 | This is a multicopy marker which includes DYS459a and DYS459b |
DYS455 | AAAT | 8–12 | 0.00016 | — |
DYS454 | AAAT | 10–12 | 0.00016 | — |
DYS447 | TAAWA | 22–29 | 0.00264 | — |
DYS437 | TCTA | 13–17 | 0.00099 | — |
DYS448 | AGAGAT | 17–24 | 0.00135 | — |
DYS449 | TTTC | 26–36 | 0.00838 | — |
DYS464a | CCTT | 9–20 | 0.00566 | DYS464 is a multicopy palindromic marker. Men typically have four copies, known as DYS464a, DYS464b, DYS464c, and DYS464d. There can be fewer than four copies, or more, such as DYS464e and DYS464f. |
In the case of Y-surname, the data were filtered to retain only the members of the main group of each family by comparing their allele values to the modal haplotype. The final data were therefore limited to objects with 0 to 5 mismatches. This is because the fewer the mismatches for a given number of markers, the more likely it is that two people share a common ancestor [
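The surname filter can be sketched as follows, assuming that the modal haplotype is the most frequent allele value at each marker within a family group and that an object is retained when it differs from the modal haplotype at no more than five markers. The helper names and the eight-marker haplotypes are hypothetical and shortened for readability.

```python
from collections import Counter

def modal_haplotype(haplotypes):
    """Most frequent allele value at each marker position."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*haplotypes))

def mismatches(a, b):
    """Number of marker positions at which two haplotypes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def filter_main_group(haplotypes, max_mismatches=5):
    """Keep haplotypes within max_mismatches of the group's modal haplotype."""
    modal = modal_haplotype(haplotypes)
    return [h for h in haplotypes if mismatches(h, modal) <= max_mismatches]

# Hypothetical family group over eight markers.
h1 = (13, 24, 14, 11, 11, 14, 12, 12)
h2 = (13, 24, 14, 11, 11, 14, 12, 12)
h3 = (13, 23, 14, 11, 11, 14, 12, 12)  # one mismatch from the modal haplotype
h4 = (14, 22, 15, 10, 12, 15, 11, 13)  # differs at every marker

group = [h1, h2, h3, h4]
kept = filter_main_group(group)
print(modal_haplotype(group) == h1, len(kept))  # prints "True 3"
```

Here h4 is excluded because it lies eight mismatches from the modal haplotype, while the other three objects form the main group.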
The first, second, and third dataset items represent category 1, the Y-STR data for haplogroup applications, whereas the fourth, fifth, and sixth dataset items represent category 2, the Y-STR data for Y-surname applications.
Table
The summary of the distributions of the dataset items.
Category | Dataset item | Number of objects | Number of classes | Distribution of objects |
---|---|---|---|---|
1 | 1 | 751 | 5 | E (24), G (20), L (200), J (32), and R (475) |
1 | 2 | 267 | 4 | L (92), J (6), N (141), and R (28) |
1 | 3 | 263 | 3 | G (37), Group N (68), and Group T (158) |
2 | 4 | 236 | 4 | D (112), F (64), M (42), and W (18) |
2 | 5 | 112 | 8 | G2 (30), G4 (8), G5 (10), G8 (18), G10 (17), G16 (10), G17 (12), and G29 (7) |
2 | 6 | 112 | 14 | G2 (9), G10 (17), G15 (6), G18 (6), G20 (7), G23 (8), G26 (8), G28 (8), G34 (7), G44 (6), G35 (7), G46 (7), G49 (10), and G91 (6) |
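As a consistency check on the distribution table above, the class counts of each dataset item should add up to its number of objects, and the number of listed classes should match the stated number of classes. A minimal sketch, with the figures transcribed from the table and an illustrative helper name:

```python
# (number of objects, number of classes, class counts) per dataset item,
# transcribed from the distribution table.
DATASETS = {
    1: (751, 5, {"E": 24, "G": 20, "L": 200, "J": 32, "R": 475}),
    2: (267, 4, {"L": 92, "J": 6, "N": 141, "R": 28}),
    3: (263, 3, {"G": 37, "N": 68, "T": 158}),
    4: (236, 4, {"D": 112, "F": 64, "M": 42, "W": 18}),
    5: (112, 8, {"G2": 30, "G4": 8, "G5": 10, "G8": 18,
                 "G10": 17, "G16": 10, "G17": 12, "G29": 7}),
    6: (112, 14, {"G2": 9, "G10": 17, "G15": 6, "G18": 6, "G20": 7,
                  "G23": 8, "G26": 8, "G28": 8, "G34": 7, "G44": 6,
                  "G35": 7, "G46": 7, "G49": 10, "G91": 6}),
}

def inconsistent_items(datasets):
    """Return the dataset items whose totals or class counts do not match."""
    return [
        item
        for item, (n_objects, n_classes, counts) in datasets.items()
        if sum(counts.values()) != n_objects or len(counts) != n_classes
    ]

print(inconsistent_items(DATASETS))  # prints []
```

All six items pass the check, so the reported totals and class counts agree with the listed distributions.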
Besides the distribution of the objects, the main difference between the two categories of Y-STR data is that the haplogroup data comprise objects with a lower degree of similarity (quite distant from each other), whereas the Y-STR surname data comprise objects with a higher degree of similarity (similar or almost similar to each other). For further comparison, Tables
The genetic distance of Dataset Item 1.
Class | Min | Max | Average | Range |
---|---|---|---|---|
E | 2 | 12 | 7.9 | 10 |
G | 2 | 11 | 5.7 | 9 |
L | 6 | 18 | 8.3 | 12 |
J | 15 | 22 | 18.6 | 7 |
R | 5 | 16 | 12.0 | 11 |
The genetic distance of Dataset Item 2.
Class | Min | Max | Average | Range |
---|---|---|---|---|
L | 0 | 10 | 4.4 | 10 |
J | 1 | 17 | 7.5 | 16 |
N | 0 | 19 | 5.3 | 19 |
R | 6 | 13 | 9.5 | 7 |
The genetic distance of Dataset Item 3.
Class | Min | Max | Average | Range |
---|---|---|---|---|
G | 3 | 14 | 8.4 | 11 |
N | 0 | 18 | 6.3 | 17 |
T | 1 | 16 | 8.3 | 16 |
The genetic distance of Dataset Item 4.
Class | Min | Max | Average | Range |
---|---|---|---|---|
D | 0 | 6 | 1.4 | 6 |
F | 0 | 3 | 0.9 | 3 |
M | 0 | 4 | 1.6 | 4 |
W | 0 | 5 | 2.1 | 5 |
The genetic distance of Dataset Item 5.
Class | Min | Max | Average | Range |
---|---|---|---|---|
G2 | 0 | 3 | 0.8 | 3 |
G4 | 0 | 1 | 0.5 | 1 |
G5 | 0 | 2 | 0.9 | 2 |
G8 | 0 | 5 | 1.8 | 5 |
G10 | 0 | 1 | 0.2 | 1 |
G16 | 0 | 2 | 0.8 | 2 |
G17 | 0 | 2 | 0.4 | 2 |
G29 | 0 | 2 | 0.4 | 2 |
The genetic distance of Dataset Item 6.
Class | Min | Max | Average | Range |
---|---|---|---|---|
G2 | 0 | 2 | 0.6 | 2 |
G10 | 0 | 5 | 0.9 | 5 |
G15 | 1 | 6 | 2.7 | 5 |
G18 | 1 | 5 | 3.2 | 4 |
G20 | 0 | 4 | 1.7 | 4 |
G23 | 0 | 2 | 0.6 | 2 |
G26 | 0 | 2 | 0.8 | 2 |
G28 | 1 | 6 | 2.5 | 5 |
G34 | 0 | 2 | 0.6 | 2 |
G35 | 0 | 2 | 1.0 | 2 |
G44 | 0 | 1 | 0.2 | 1 |
G46 | 0 | 3 | 1.3 | 3 |
G49 | 0 | 9 | 3.8 | 9 |
G91 | 2 | 5 | 3.0 | 3 |
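The per-class statistics in the genetic distance tables (minimum, maximum, average, and range) could be computed along the following lines. The sketch assumes that they summarize the pairwise mismatch counts between objects of the same class; whether the reported figures were derived pairwise or against the modal haplotype is not stated, so this is an assumption. The helper names and the haplotypes are hypothetical.

```python
from itertools import combinations
from statistics import mean

def mismatches(a, b):
    """Number of marker positions at which two haplotypes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def distance_summary(haplotypes):
    """Summarize pairwise genetic distances within one class (needs two or more objects)."""
    distances = [mismatches(a, b) for a, b in combinations(haplotypes, 2)]
    low, high = min(distances), max(distances)
    return {
        "min": low,
        "max": high,
        "average": round(mean(distances), 1),
        "range": high - low,
    }

# Hypothetical class of three objects over five markers.
members = [
    (13, 24, 14, 11, 11),
    (13, 24, 14, 11, 11),
    (13, 23, 14, 11, 11),
]
print(distance_summary(members))
# prints {'min': 0, 'max': 1, 'average': 0.7, 'range': 1}
```

The range is taken as the maximum minus the minimum distance, which matches most rows of the tables.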
Tables
The average distance of Dataset Item 2 is 4.4–9.5 as shown in Table
The average distance of Dataset Item 3 is 6.3–8.4 as shown in Table
In the case of Y-STR surname dataset items, the average distance of Dataset Item 4 is 0.9–2.1 as shown in Table
In Dataset Item 5, the average distance is 0.2–1.8 as shown in Table
In Dataset Item 6, the average distance is 0.2–3.8 as shown in Table
In addition, the range values also indicate that the objects in the Y-STR surname dataset items have a higher degree of similarity to each other. The range value of Dataset Item 4 is 3–6 (Table
The dataset associated with this Dataset Paper consists of 6 items which are described as follows.
The Y-STR data are distinctive in that they contain many objects that are similar or almost similar to each other. This makes them different from other common categorical datasets such as Soybean, Zoo, and Credit. In addition, this is considered the first effort to document Y-STR datasets so that their use is not limited to clustering applications. The availability of the data will benefit researchers who wish to use them in other methods or applications.
The dataset associated with this Dataset Paper is dedicated to the public domain using the
The authors declare that they have no competing interests.
The authors would like to extend their gratitude to many contributors toward the completion of this paper including Engineer Azizian Mohd Sapawi and their research assistants: Syahrul, Azhari, Kamal, Hasmarina, Nurin, Soleha, Mastura, Fadzila, Suhaida, and Shukriah.