First Y-Short Tandem Repeat Categorical Dataset for Clustering Applications

The Y-chromosome short tandem repeat (Y-STR) data presented here were collected mainly to benchmark the performance of clustering methods. There are six Y-STR dataset items, divided into two categories: Y-STR surname data and Y-haplogroup data. The Y-STR data are categorical and differ from other categorical data in that they consist of many similar and almost similar objects. This characteristic has caused problems for existing clustering algorithms when they are applied to the Y-STR data.


Introduction
Y-chromosome short tandem repeats (Y-STRs) are the tandem repeats on the Y-chromosome. The Y-STR value represents the number of times an STR motif repeats and is often called the allele value of the marker. Most marker names begin with the prefix D, which stands for DNA, followed by Y, which stands for Y-chromosome, and S, which stands for a single-copy sequence, and then by the location on the Y-chromosome, often known as the locus. This nomenclature is based on an international standards body called the Human Gene Nomenclature Committee (HUGO; http://www.hugo-international.org/). For example, if the allele value of the DYS391 marker is eight, the STR consists of eight consecutive copies of the repeat motif.
The Y-STR data are now actively used in genetic genealogy and anthropology studies, for example, by Hart [1], Smolenyak and Turner [2], Pomery [3], Sykes [4], Shawker [5], Fitzpatrick [6], and Fitzpatrick and Yeiser [7]. In genetic genealogy, the method is used to trace similar groups in Y-surname projects in support of traditional genealogical study. More broadly, in anthropological studies, the method is used to establish groups of males, often called haplogroups, across geographical areas throughout the world. Haplogroups are studied with reference to mitochondrial DNA and Y-chromosomes [1]. As a consequence, a reputable reference known as the modal haplotype, used for defining groups of males all over the world, has been made available (see http://www.isogg.org/ for details). The modal haplotype is actually a haplotype diversity where the degree of relatedness has become spread out.
The Y-STR data have been used in Y-surname and Y-haplogroup clustering applications. Initial benchmarking results of clustering Y-STR data have been reported (see, e.g., [8-12]). The Y-STR data and their clustering results have also been published in the Journal of Genetic Genealogy, a journal of the genetic genealogy community [13]. A more comprehensive benchmark, involving six Y-STR dataset items and eight existing partitional algorithms, has also been reported [14]. Its outcomes indicate that the Y-STR data are quite different from other categorical data, being characterized by many similar and almost similar objects. This property has caused the existing clustering algorithms to produce poor clustering results (see the detailed problems of clustering Y-STR data in [15]). As a result, we have recently proposed a new algorithm called k-Approximate Modal Haplotype (k-AMH) for clustering the six Y-STR dataset items [15]. With these Y-STR dataset items as a benchmark, the k-AMH algorithm was shown to be an efficient clustering algorithm for partitioning Y-STR data. Tables 1 and 2 show the clustering results, comparing the k-AMH algorithm with the other eight clustering algorithms, as reported in [15].
The objective of this paper is therefore to give detailed insight into the six Y-STR dataset items used in the previous benchmarking results of clustering applications. The previous reports were limited to a summary of the six Y-STR dataset items only. No further description of the methodological aspects, for example, data acquisition, filtration, distribution, and similarities, has been reported. Detailed descriptions of these Y-STR data are important for future reference and for further benchmarks in any relevant application.

Methodology
The Y-STR data are secondary data. They were derived from the raw results of DNA genealogical testing reported in various Y-DNA projects. Most of the DNA genealogical testing results can be accessed publicly through a genealogical portal and database called WorldFamilies.net (see http://www.worldfamilies.net/). The data were retrieved from the respective websites in April 2010.
The results were reported in the form of spreadsheets and grouped by surname or haplogroup. The sheets were commonly arranged in columns that began with the Kit Number, Paternal Ancestor Name, and Haplogroup, followed by the marker columns. Normally, up to 67 markers are tested, so the sheets provided 67 marker columns. When fewer markers had been tested, the remaining columns were left empty. Within each marker column, the allele values were given as numbers.
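As a rough illustration of this layout, the following Python sketch reads a sheet of this shape from CSV text and flags records that lack some markers. The sample rows, the ancestor name, the allele values, and the helper names `read_sheet` and `has_all_markers` are assumptions made for illustration; the actual project sheets may be organized differently.

```python
import csv
import io

# Hypothetical excerpt of a result sheet: identifying columns first,
# then one column per marker. Empty cells mean the marker was not tested.
SAMPLE_SHEET = """Kit Number,Paternal Ancestor Name,Haplogroup,DYS393,DYS390,DYS19,DYS391
15868,John Donald,R,13,24,14,11
23456,Unknown,A,13,23,,
"""

ID_COLUMNS = ("Kit Number", "Paternal Ancestor Name", "Haplogroup")


def read_sheet(text):
    """Return one record per row, with allele values as ints and untested markers as None."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        alleles = {
            marker: (int(value) if value.strip() else None)
            for marker, value in row.items()
            if marker not in ID_COLUMNS
        }
        records.append(
            {"kit": row["Kit Number"], "haplogroup": row["Haplogroup"], "alleles": alleles}
        )
    return records


def has_all_markers(record, markers):
    """True when every requested marker carries an allele value."""
    return all(record["alleles"].get(marker) is not None for marker in markers)


records = read_sheet(SAMPLE_SHEET)
wanted = ["DYS393", "DYS390", "DYS19", "DYS391"]
complete = [r["kit"] for r in records if has_all_markers(r, wanted)]
print(complete)
```

Representing an untested marker as `None` rather than 0 keeps it distinct from any real allele value when the records are filtered later.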
Most of the results, however, were not restricted to any specific number of DNA testing markers. The reported results were therefore not uniform, because there is no standard for the number of markers chosen by participants: the companies that provide DNA testing services usually offer tests ranging from a minimum of 6 markers to a maximum of 67 markers. Participants who wish to establish their familial relatedness more stringently may choose up to 67 markers, whereas others require only a few.
There are two groups of data representations: the Y-STR data for Y-haplogroup representation and the Y-STR data for Y-surname representation. Three dataset items were established to represent each group. For the purpose of clustering analysis, each datum was given a prefix attached to the original kit number. For the Y-surname data, the prefix is normally a letter denoting the surname or group of the participant. For example, if the datum belongs to a family with the Donald surname, the prefix D is attached to the kit number, as in D-15868. For the haplogroup data, the haplogroup name is used as the prefix, as in A-23456, which represents haplogroup A. These naming conventions were used to preserve the original references in case any questions arise in future. They were also used when analyzing the clustering accuracy results in the misclassification matrix during the experimental analysis. The misclassification matrix is a method proposed by Huang [16] for calculating clustering accuracy scores.
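The naming convention can be sketched in Python as follows. The helper names and the sample cluster assignments are hypothetical, and `cross_tabulate` only counts objects per pair of true label and cluster; it is not a full implementation of the accuracy calculation in [16].

```python
from collections import Counter


def extend_kit_number(label, kit_number):
    """Attach a class label (surname initial or haplogroup) to a kit number."""
    return f"{label}-{kit_number}"


def split_kit_number(extended):
    """Recover (label, original kit number) from an extended kit number."""
    label, _, kit_number = extended.partition("-")
    return label, kit_number


def cross_tabulate(assignments):
    """Count objects per (true label, cluster id) pair.

    `assignments` maps extended kit numbers to the cluster id produced
    by some clustering algorithm.
    """
    return Counter(
        (split_kit_number(kit)[0], cluster) for kit, cluster in assignments.items()
    )


# Hypothetical clustering output for four objects.
assignments = {"D-15868": 0, "D-20001": 0, "A-23456": 1, "A-30002": 0}
table = cross_tabulate(assignments)
print(table[("D", 0)], table[("A", 1)], table[("A", 0)])
```

Because the true class travels with the kit number, the cross-tabulation needs no separate label file.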
The Y-STR data are treated as categorical data rather than numerical data, even though the allele values are numeric. This is because the distance between two Y-STR objects is measured by comparing each allele (attribute) value of the Y-STR objects with that of their modal haplotype. The total number of mismatches is thus the measure of the genetic distance between two Y-STR objects. In fact, an initial experimental result showed that the Y-STR data were better treated as categorical objects than as numerical objects [8,13]. The dissimilarity measure between a Y-STR object $x$ and the modal haplotype $h$ can be formalized as

$$d(x, h) = \sum_{j=1}^{m} \delta(x_j, h_j) \tag{1a}$$

subject to

$$\delta(x_j, h_j) = \begin{cases} 0, & x_j = h_j, \\ 1, & x_j \neq h_j, \end{cases} \tag{1b}$$

where $m$ is the number of markers.
The Y-STR data were filtered to a common set of 25 markers according to the Y-DNA 25-marker test. The chosen markers were DYS393, DYS390, DYS19 (394), DYS391, DYS385a, DYS385b, DYS426, DYS388, DYS439, DYS389I, DYS392, DYS389II, DYS458, DYS459a, DYS459b, DYS455, DYS454, DYS447, DYS437, DYS448, DYS449, DYS464a, DYS464b, DYS464c, and DYS464d. The justifications for choosing 25 markers are as follows.
(i) The 25 markers are considered sufficient for determining whether a genetic connection exists between two people. According to Fitzpatrick [6], 12 markers (the Y-DNA 12 test) are already sufficient to determine who does or does not have a relationship to the core group of a family.
(ii) The 25-marker test is a moderate option chosen by many participants. Results at this level were therefore the most widely available for establishing such a dataset.
Table 3 shows the detailed description of the 25 markers.
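The mismatch-based dissimilarity described above can be sketched in Python as follows. The function name and the five-marker toy values are assumptions for illustration and are not taken from the dataset.

```python
def mismatch_distance(obj, modal):
    """Count the markers at which an object differs from the modal haplotype."""
    if len(obj) != len(modal):
        raise ValueError("object and modal haplotype must cover the same markers")
    return sum(1 for x_j, h_j in zip(obj, modal) if x_j != h_j)


# Toy example over five markers (illustrative values only).
modal = (13, 24, 14, 11, 12)
near = (13, 24, 15, 11, 12)  # differs at one marker, by one repeat
far = (13, 24, 19, 11, 12)   # differs at the same marker, by five repeats

# Both objects are one mismatch away: the size of the numeric gap is ignored.
print(mismatch_distance(near, modal), mismatch_distance(far, modal))
```

The two toy objects show why the data behave as categorical: a numerical distance would separate `near` from `far`, whereas the mismatch count treats them alike.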
In the case of the Y-surname data, the data were filtered to obtain just the members of the main group of the family by comparing their allele values with the modal haplotype. The final data were therefore limited to objects with 0 to 5 mismatches. This is because the fewer the mismatches for a given number of markers, the more likely it is that two people share a common ancestor [7]; that is, the two people are closely related. Note that the original DNA genealogical testing results also included objects with more than 5 mismatches. For the haplogroup data, only the objects that had been confirmed by SNP analysis were chosen. In the result sheets, the data confirmed by SNP analysis were marked in green. As a result of the filtration, the final data were much smaller than the original data.
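The surname filtration step can be sketched as below, assuming each object is a tuple of allele values aligned with the modal haplotype of its family group. The function names, the eight-marker toy haplotypes, and the kit numbers are hypothetical.

```python
def count_mismatches(obj, modal):
    """Number of markers at which the object differs from the modal haplotype."""
    return sum(1 for x_j, h_j in zip(obj, modal) if x_j != h_j)


def filter_core_members(objects, modal, max_mismatches=5):
    """Keep only objects within `max_mismatches` of the modal haplotype."""
    return {
        kit: alleles
        for kit, alleles in objects.items()
        if count_mismatches(alleles, modal) <= max_mismatches
    }


# Toy family group over eight markers (illustrative values only).
modal = (13, 24, 14, 11, 11, 14, 12, 12)
objects = {
    "D-1": (13, 24, 14, 11, 11, 14, 12, 12),  # 0 mismatches
    "D-2": (13, 25, 14, 11, 11, 14, 12, 12),  # 1 mismatch
    "D-3": (14, 25, 15, 10, 12, 15, 12, 12),  # 6 mismatches, dropped
}

kept = filter_core_members(objects, modal)
print(sorted(kept))
```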
The first, second, and third dataset items represent category 1, the Y-STR data for haplogroup applications, whereas the fourth, fifth, and sixth dataset items represent category 2, the Y-STR data for Y-surname applications.
Table 4 shows the distribution of each Y-STR dataset item. The largest dataset item, Dataset Item 1, contains 751 objects. The smallest dataset items, Dataset Items 5 and 6, contain 112 objects each. The largest number of classes is 14 and the smallest is three. The distribution of the objects is indicated by the values in parentheses. The distributions for the haplogroup dataset items are noticeably unbalanced, as a result of the filtration process discussed above. However, this is a form of data reduction: the reduced data are much smaller in volume yet closely maintain the integrity of the original data, as suggested by Han and Kamber [17]. The unbalanced distributions can be seen in Dataset Items 1, 2, and 3. For example, in Dataset Item 1, class R consists of 475 objects, about 63% of the total. Class N of Dataset Item 2 consists of 141 objects, about 53% of the total. This item also contains the smallest class, Group J, with 6 objects (about 2% of the total). In Dataset Item 3, class T consists of 158 objects, about 60% of the total. The Y-surname dataset items, in contrast, are much more balanced in terms of object distribution among the classes, because the Y-surname data are usually represented by groups of family relatedness. The detailed characteristics and object distributions of each dataset item are shown in Table 4.
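The class proportions quoted above can be checked with a few lines of Python. The class counts are those given in the dataset descriptions of this paper; the dictionary layout and the function name are assumptions for illustration.

```python
# Class counts of the three haplogroup dataset items, as listed in the
# dataset descriptions of this paper.
ITEMS = {
    "Dataset Item 1": {"E": 24, "G": 20, "L": 200, "J": 32, "R": 475},
    "Dataset Item 2": {"L": 92, "J": 6, "N": 141, "R": 28},
    "Dataset Item 3": {"G": 37, "N": 68, "T": 158},
}


def class_shares(counts):
    """Fraction of all objects that falls into each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in ITEMS.items():
    shares = class_shares(counts)
    largest = max(shares, key=shares.get)
    print(f"{name}: n={sum(counts.values())}, largest class {largest} = {shares[largest]:.0%}")
```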
Besides the distribution of the objects, the main difference between the two kinds of Y-STR data is that the haplogroup data are characterized by objects with a lower degree of similarity (quite distant from each other), whereas the Y-STR surname data comprise objects with a higher degree of similarity (similar or almost similar to each other). For further comparison, Tables 5-10 provide the minimum, maximum, average, and range of the genetic distances. The genetic distances were calculated from the mismatches between the Y-STR objects of each dataset item and their modal haplotypes, as formalized in (1a) and (1b). Note that the modal haplotypes here are the modes established from the respective classes.
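A minimal sketch of how such a per-class summary might be computed is given below. It assumes that the modal haplotype is the most frequent allele at each marker within a class and that the range is the maximum distance minus the minimum distance; both are assumptions made for illustration, as are the function names and toy values.

```python
from collections import Counter
from statistics import mean


def modal_haplotype(haplotypes):
    """Most frequent allele at each marker position.

    Ties are resolved in favor of the allele encountered first.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*haplotypes))


def mismatch_distance(obj, modal):
    """Number of markers at which the object differs from the modal haplotype."""
    return sum(1 for x_j, h_j in zip(obj, modal) if x_j != h_j)


def distance_summary(haplotypes):
    """Minimum, maximum, average, and range of distances to the class mode."""
    modal = modal_haplotype(haplotypes)
    distances = [mismatch_distance(h, modal) for h in haplotypes]
    return {
        "min": min(distances),
        "max": max(distances),
        "average": mean(distances),
        "range": max(distances) - min(distances),
    }


# Toy class of four objects over four markers (illustrative values only).
group = [
    (13, 24, 14, 11),
    (13, 24, 14, 11),
    (13, 25, 14, 10),
    (14, 24, 14, 11),
]
print(modal_haplotype(group))
print(distance_summary(group))
```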
Tables 5, 6, and 7 show the genetic distances of the Y-STR haplogroup data. The average distance of Dataset Item 1 is 7.9-18.6, as shown in Table 5. The objects of this item are considered to have a lower degree of similarity among themselves. The average distance of Dataset Item 3 is 6.3-8.4, as shown in Table 7. The objects of this item are also considered to have a lower degree of similarity among themselves. The low degree of similarity in the Y-STR haplogroup dataset items indicates that their objects are considerably distant from each other.
In the case of the Y-STR surname dataset items, the average distance of Dataset Item 4 is 0.9-2.1, as shown in Table 8. In Dataset Item 5, the average distance is 0.2-1.8, as shown in Table 9. In Dataset Item 6, the average distance is 0.2-3.8, as shown in Table 10. The objects of all three items are considered to have a higher degree of similarity among themselves. The higher degree of similarity of the Y-STR surname dataset items, as compared with the haplogroup dataset items, indicates that the objects in the Y-surname dataset items are similar or almost similar to each other. The range values indicate the same. The range value of Dataset Item 4 is 3-6 (Table 8); of Dataset Item 5, 1-5 (Table 9); and of Dataset Item 6, 1-9 (Table 10). These values are clearly lower than the range values of the Y-STR haplogroup dataset items: the range value of Dataset Item 1 is 7-12 (Table 5); of Dataset Item 2, 7-19 (Table 6); and of Dataset Item 3, 11-17 (Table 7).

Dataset Description
The dataset associated with this Dataset Paper consists of 6 items, which are described as follows.

Dataset Item 1 (Table). This table consists of 751 objects of Y-STR haplogroup data belonging to the Ireland Y-DNA Project (http://www.familytreedna.com/public/IrelandHeritage/). After filtration, the table is composed of only five haplogroups: E (24), G (20), L (200), J (32), and R (475). The raw data comprise approximately 3419 records divided into 29 groups. The values in parentheses indicate the number of objects belonging to each group. The objects in this table have a lower degree of similarity among themselves; that is, they are considerably distant from each other. In the table, the first column is the Kit Number, followed by the 25 markers. The Kit Number is the extended Kit Number, which combines a prefix of the haplogroup name, a dash, and the original Kit Number.

Dataset Item 2 (Table). This table consists of 267 objects of Y-STR haplogroup data obtained from the Finland DNA Project (http://www.familytreedna.com/public/Finland). After filtration, the table is composed of only four haplogroups: L (92), J (6), N (141), and R (28). The raw data comprise approximately 906 records divided into 7 groups. The values in parentheses indicate the number of objects belonging to each group. The objects in this table have a lower degree of similarity among themselves; that is, they are considerably distant from each other. In the table, the first column is the Kit Number, followed by the 25 markers. The Kit Number is the extended Kit Number, which combines a prefix of the haplogroup name, a dash, and the original Kit Number.

Dataset Item (Table). This table consists of 112 objects belonging to the Brown Surname project (http://brownsociety.org/). After filtration, the data are composed of only 14 family groups: Group 2 (9), Group 10 (17), Group 15 (6), Group 18 (6), Group 20 (7), Group 23 (8), Group 26 (8), Group 28 (8), Group 34 (7), Group 44 (6), Group 35 (7), Group 46 (7), Group 49 (10), and Group 91 (6). The raw data comprise approximately 543 records taken from 126 groups. The values in parentheses indicate the number of objects belonging to each group. The objects in this table have a higher degree of similarity among themselves; that is, they are similar or almost similar to each other. In the table, the first column is the Kit Number, followed by the 25 markers. The Kit Number is the extended Kit Number, which combines a prefix of the surname, a dash, and the original Kit Number.

Concluding Remarks
The Y-STR data are distinctive. They are characterized by many objects that are similar or almost similar to each other. This property makes them different from other common categorical datasets such as Soybean, Zoo, and Credit. In addition, this is considered the first effort to document Y-STR datasets so that their use is not limited to clustering applications only. The availability of the data will benefit researchers who wish to use them in other methods or applications.

Table 1: Clustering accuracy scores of each dataset item.

Table 2: Clustering accuracy scores of the six Y-STR items.

Table 3: The detailed description of the 25 Y-STR markers.

Table 4: The summary of the distributions of the dataset items.

Table 5: The genetic distance of Dataset Item 1.

Table 6: The genetic distance of Dataset Item 2.

Table 7: The genetic distance of Dataset Item 3.

Table 8: The genetic distance of Dataset Item 4.

Table 9: The genetic distance of Dataset Item 5.

Table 10: The genetic distance of Dataset Item 6.

Dataset Item 3 (Table). This table consists of 263 objects obtained from the Y-haplogroup project (http://www.worldfamilies.net/yhapprojects). After filtration, the final table is composed of only three haplogroups: Group G (37), Group N (68), and Group T (158). The raw data comprise approximately 516 records taken from haplogroups G, N, and T. The values in parentheses indicate the number of objects belonging to each group. The objects in this table have a lower degree of similarity among themselves; that is, they are considerably distant from each other. In the table, the first column is the Kit Number, followed by the 25 markers. The Kit Number is the extended Kit Number, which combines a prefix of the haplogroup name, a dash, and the original Kit Number.

Dataset Item 4 (Table). This table consists of 236 objects combining four surnames: the Donald surname (112), the Flannery surname (64), the Mumma surname (42), and the William surname (18). The Donald surname data were obtained from Clan Donald's DNA Projects (http://dnaproject.clan-donald-usa.org/); the raw data comprise approximately 896 records. The Flannery surname data were obtained from the Flannery Clan Y-DNA project (http://www.flanneryclan.ie/); the raw data comprise approximately 896 records. The Mumma surname data were obtained from the Mumma-Moomaw Project (http://www.mumma.org/); the raw data comprise approximately 78 records. The William surname data were obtained from the Williams DNA Project (http://williams.genealogy.fm/); the raw data comprise approximately 626 records taken from 94 groups. The values in parentheses indicate the number of objects belonging to each group. The objects in this table have a higher degree of similarity among themselves; that is, they are similar or almost similar to each other. In the table, the first column is the Kit Number, followed by the 25 markers. The Kit Number is the extended Kit Number, which combines a prefix of the surname, a dash, and the original Kit Number.

Dataset Item (Table). This table consists of 112 objects belonging to the Phillips DNA project (http://www.phillipsdnaproject.com/). After filtration, the final data are composed of only 8 family groups: Group 2 (30), Group 4 (8), Group 5 (10), Group 8 (18), Group 10 (17), Group 16 (10), Group 17 (12), and Group 29 (7). The raw data comprise approximately 341 records taken from 64 groups. The values in parentheses indicate the number of objects belonging to each group. The objects in this table have a higher degree of similarity among themselves; that is, they are similar or almost similar to each other. In the table, the first column is the Kit Number, followed by the 25 markers. The Kit Number is the extended Kit Number, which combines a prefix of the surname, a dash, and the original Kit Number.