Efficient Haplotype Block Partitioning and Tag SNP Selection Algorithms under Various Constraints

Patterns of linkage disequilibrium plays a central role in genome-wide association studies aimed at identifying genetic variation responsible for common human diseases. These patterns in human chromosomes show a block-like structure, and regions of high linkage disequilibrium are called haplotype blocks. A small subset of SNPs, called tag SNPs, is sufficient to capture the haplotype patterns in each haplotype block. Previously developed algorithms completely partition a haplotype sample into blocks while attempting to minimize the number of tag SNPs. However, when resource limitations prevent genotyping all the tag SNPs, it is desirable to restrict their number. We propose two dynamic programming algorithms, incorporating many diversity evaluation functions, for haplotype block partitioning using a limited number of tag SNPs. We use the proposed algorithms to partition the chromosome 21 haplotype data. When the sample is fully partitioned into blocks by our algorithms, the 2,266 blocks and 3,260 tag SNPs are fewer than those identified by previous studies. We also demonstrate that our algorithms find the optimal solution by exploiting the nonmonotonic property of a common haplotype-evaluation function.


Introduction
Single-nucleotide polymorphisms (SNPs) play important roles in disease association studies owing to their high abundance in the human genome, low mutation rate, and amenability to high-throughput genotyping. A small subset of SNPs directly influences the quality or quantity of the gene product and increases the risks of certain diseases or of severe side effects of drugs. Alleles of SNPs that are close together tend to be inherited together. A haplotype refers to a set of SNPs found to be statistically associated on a single chromosome. Haplotypes defined by common SNPs have important uses in identifying disease association and human traits [1][2][3][4][5].
Genome-wide association studies based on linkage disequilibrium (LD) offer a promising approach to detect genetic variation responsible for common human diseases. The patterns of linkage disequilibrium (LD) observed in human chromosome show a block-like structure [1,6,7], such that the entire chromosome can be partitioned into high-LD regions interspersed with low-LD regions. The high-LD regions are called haplotype blocks and the low-LD regions are referred to as recombination hotspots. A few common haplotypes account for most of the variation from person to person in haplotype blocks. Furthermore, each haplotype block, comprising large regions of low diversity, can be characterized with a small number of SNPs, which are referred to as tag SNPs [8]. Identification of tag SNPs is aimed at tagging candidate genes which can capture the most information in the haplotype blocks. The new technologies allow to genotype rarer variants than before [9]; therefore, there are more and more genotyping data needed to be analyzed, and the structure of haplotype blocks will be more complicated. Despite great progress in current genotyping developments which allow intensive genotyping at cheap prices, the concept of tag SNP selection is more and more significant due to exploded genotyping data. Most tag SNP selection strategies are based on haplotype blocks and have 2 BioMed Research International the aim of identifying a minimal subset of SNPs able to tag the most common haplotypes [7,10].
Several methods have been used to identify haplotypeblock structures, including LD-based [6,11], recombinationbased [12,13], information-complexity-based [14][15][16], and diversity-based [7,17,18] methods. The result of block partitioning and the meaning of a haplotype block may differ according to the assessment criteria. The diversity-based test methods can be classified into two categories: those that divide strings of SNPs into blocks on the basis of the decay of LD across block boundaries and those that delineate blocks on the basis of some haplotype-diversity measure within the blocks. Patil et al. [7] defined a haplotype block as a region in which a fraction of a percent or more of all the observed haplotypes are represented at least times or at a given threshold in the sample. They applied the optimization criteria outlined by Zhang et al. [10,19] and described a general algorithm that defines block boundaries in a way that minimizes the number of SNPs that are required to identify all the haplotypes in a region. They identified a total of 4,563 tag SNPs and 4,135 blocks defining the haplotype structure of human chromosome 21 [7]. In each block they required that at least 80% of the haplotypes must be represented more than once in the block.
In this paper, we propose two dynamic programming algorithms, incorporating several haplotype diversity evaluation functions, for haplotype block partitioning with constraints on diversity and the number of tag SNPs. Part of data related to this paper have been open to public domain http://www.cs.pu.edu.tw/∼yawlin/ (web-site page of a coauthor), these data are also published at local conference [20].

Diversity Functions
Several operational definitions have been used to identify haplotype-block structures [6,7,[11][12][13][14][15][16][17][18] and result in different results of block partitioning and meanings of a haplotype block, depending on the objective function. Haplotype blocks are genome regions with high LD, so that distinct haplotype patterns within the block are few and the diversity of the block is low. In terms of diversity functions, the block selection problem can be viewed as finding a segmentation of a given haplotype matrix such that the diversities of chosen blocks satisfy a certain value constraint. We use the following definitions.

Definition 1 (haplotype block diversity). Given an interval
[ , ] of a haplotype matrix A and a diversity function, : [ , ] → ( , ) ∈ R is an evaluation function measuring the diversity of the submatrix A( ; ).

Given an
× haplotype matrix , a block ( , ) of matrix is viewed as haplotype strings; they are partitioned into groups by merging identical haplotype strings into the same group. The probability of each haplotype pattern is defined accordingly, such that ∑ = 1. Li [21] proposes a diversity function defined by (1) Note that ( ) is the probability that two haplotype strings chosen at random from are different from each other. Other measurements of diversity can be obtained by choosing different diversity functions; for example, to measure information complexity one can choose the information entropy (negative-log) function [14][15][16] as follows: Patil et al. [7] and Zhang et al. [10,18] define a haplotype block as a region where at least 80% of observed haplotypes within a block must be common haplotypes. Using the same definition of common haplotype, the coverage of common haplotypes of the block can be formulated as a form of diversity as follows: where denotes unambiguous haplotypes, denotes common haplotypes, and denotes singleton haplotypes. In other words, Patil's method requires that ( ) ≤ 20%.
Some methods [6,22,23] presented a definition of a haplotype block based on the LD measure ; however, there is no consensus definition as yet. Zhang and Jin [22] defined a haplotype block as a region in which no pairwise | | values are lower than a threshold . Let denote a haplotype interval [ , ]. We define the diversity as the complement of minimal | | of . By this definition, is a haplotype block if its diversity is lower than 1 − . Consider Zhang et al. [22] also propose a definition for haplotype block; they require a proportion at least of SNP pairs having strong LD (pairwise | | greater than a threshold) in each block. We use this definition to redefine the function as the proportion of SNP pairs without strong LD. ( , ) denotes the number of SNP pairs without strong LD in the interval [ , ]. The diversity function is as follows: Diversity measurement usually reflects recombination events that occurred during evolution. Generally, haplotype blocks with low diversity indicate conserved regions of the genome.
Definition 2 (monotonic diversity). A diversity function is said to be monotonic if for any haplotype block (interval) = [ , ] of haplotype matrix A, it follows that ( , ) ≤ ( , ) whenever [ , ] ⊂ [ , ]; that is, the diversity of any subinterval of is no larger than the diversity of .
The diversity functions (1) and (2) are monotonic. However, the evaluation function for common haplotypes BioMed Research International 3 21907 21902 21900 n n n n n n n n a a c g g t g a n n n n n n n n n n n n g n g a g g g g g n n n n n n n n n n n g g g g g t c g g g g g g t c g n n n n g n n n n n n n n n n n g n g n n c c n g g n g g c c n a g n t a c c g a g c t a c c g n a n g g t g a n n n n n n n n n g n n n n n n n n n n g n n n g g g n g t c g a a c g g t n g proposed by Patil et al. [7] does not satisfy the monotonic property when the haplotype sample has missing data. For example, Figure 1 shows a small portion of human chromosome 21 haplotype sample provided in [7], where denotes missing data. In the sample, the common-haplotype coverage of interval [21900, 21907] is 9/10, which is greater than 80%.  distinguish uniquely a certain percentage of all the haplotypes in a region. The greedy algorithm [7] identified a total of 4,563 tag SNPs and a total of 4,135 blocks to define the haplotype structure of human chromosome 21. In each block, they required that at least 80% of haplotypes are represented more than once in the block. In addition, Zhang et al. [10] used a dynamic programming approach to reduce the numbers of blocks and tag SNPs to 2,575 and 3,582, respectively. Both of the algorithms [7,10] fully partition the haplotype sample into blocks with the objective of minimizing the tag SNPs. However, when the resources are limited, investigators and biologists may be unable to genotype all the tag SNPs and instead must restrict the number of tag SNPs to be identified by the algorithms. In this paper, we propose two dynamic programming algorithms for the haplotype-block partitioning problem.
Problem 3 (longest--blocks). Given a haplotype matrix A and a diversity upper limit D, we wish to find disjoint blocks whose diversity is less then D such that the total length is maximized. That is, output the set = { 1 , 2 , . . . , }, with ( ) ≤ for each ∈ , such that | 1 | + | 2 | + ⋅ ⋅ ⋅ + | | is maximized. Here | | denote the length of block .
Assuming that the given diversity function is monotonic and the given haplotype matrix is preprocessed for finding the indices of the farthest site, called good partner site, indices from current site, the longest--block problem can be solved in ( ) space and ( ) time. The good partner of locus refers to the left farthest locus from , such that [ , ] is a haplotype block whose diversity is less then the upper limit constraint. The idea of left good partner is shown in Figure 2.
Problem 4 (longest-blocks--tags). Given a haplotype matrix A and a diversity upper limit D, we wish to find a list of disjoint blocks whose total tag SNP number is less than such that the total length is maximized. That is, output the set = { 1 , 2 , . . . , | | } such that (for all ∈ ) ( ( ) ≤ ) and ∑ tag( ) ≤ ; tag( ) denote the number of tag SNPs required for block , so that | 1 | + | 2 | + ⋅ ⋅ ⋅ + | | | | is maximized. Here | | denote the length of block .
Assuming that all of the feasible blocks and tag SNPs required for each block have been preprocessed, the longestblocks--tags problem can be solved in ( ) time, where denotes the total number of feasible blocks. For the same sample used, based on the same criteria adopted by [23], our algorithm identifies a total of 2,266 blocks, which can be tagged by 3,260 tag SNPs. The number of blocks and tag  SNPs we identified are 45.2% and 28.6% fewer than those identified by [7]. These results are also better than those by Zhang's method with respect to the number of tag SNPs used and the total block numbers.
The definition of the haplotype-block diversity evaluation function ( ) we use in this paper is equal to the ratio of singleton haplotypes to unambiguous haplotypes in the blocks. It is also equal to 1 minus the ratio of common haplotypes to unambiguous haplotypes; in other words, 80% of common-haplotype coverage is equal to 20% (or 0.2) of haplotype diversity by the definition presented in [7]. That is, we require the diversity of each block to be ≤ 0.2. Here we propose two linear-space algorithms for these two problems.

Method
We propose two dynamic programming algorithms to partition haplotype blocks with constraints on diversity and number of tag SNPs. The proposed algorithms are able to find the longest segmentation into blocks such that the diversity of each block is less than an upper limit and the total number of tag SNPs required for these blocks does not exceed a specified number , and they are time-efficient and linear-space algorithms for solving Problems 3 and 4. In the first algorithm, the longest segmentation consisting of feasible blocks can be found in ( ) time and linear space after the preprocessing of the leftmost site [ ] (good partner site) and the rightmost site [ ] for each SNP marker . After partitioning blocks, we select tag SNPs in each block. Using this method, we can partition a haplotype into a minimum number of blocks with a modest number of tag SNPs. In the second algorithm, the longest segmentation covered by tag SNPs can be found in ( ) time after the preprocessing of left good partners [ ] for each marker and tag SNPs required for each feasible block. Using this method, we can partition a haplotype into a minimum number of blocks with a minimum number of tag SNPs.

Tag SNP Selection Algorithm.
Our algorithms begin with the preprocessing of the farthest site (good partner) for each SNP marker. According to the haplotype block definition defined by [7], at least 80% of unambiguous haplotypes must be represented more than once. Using the same criteria as in [7], for each block we want to minimize the number of SNPs that distinguish uniquely at least 80% of the unambiguous haplotypes in the block. Those SNPs can be thought of as a signature of the haplotype-block partition.
In general, the number of tag SNPs required increases as the length of the haplotype block increases. But an exception is shown in Figure 3. The block consisting of 3 SNPs needs 3 tag SNPs to distinguish each haplotype uniquely, but the block consisting of 4 SNPs needs only 2 tag SNPs (column 2 and column 4).
The problem of finding the minimum number of tag SNPs within a block that uniquely distinguishes all the haplotypes is known as the MINIMUM TEST SET problem and has been proven to be NP complete [24]. Thus, there is no polynomialtime algorithm that guarantees to find the optimal solution for any input, though approximate, greedy algorithms have been proposed [25][26][27]. In order to find the optimal solution, we adopt a brute-force method to find tag SNPs within a block. Our strategy for selecting the tag SNPs in haplotype blocks is as follows. First, the common haplotypes are grouped into distinct patterns by merging the compatible haplotypes in each block. After the missing data are assigned in each group, we determine the smallest number of tag SNPs required based on the smallest number of haplotype groups needing to be distinguished such that haplotypes in these groups contain at least 80% of the unambiguous haplotypes in the block. Finally, we select a locus set consisting of the minimum number of SNPs in the haplotypes such that at least 80% of the unambiguous haplotypes can be uniquely distinguished. An exhaustive search can be used very efficiently, given that the number of tag SNPs needed for each block is usually modest. The exhaustive-search algorithm shown in Algorithm 1 enumerates the -combinations in lexicographic order to generate the next candidate tag SNP set until each pattern can be uniquely distinguished.

A Linear-Space Algorithm for Haplotype Block Partitioning.
Patil et al. [7] studied the global haplotype structure on chromosome 21 and identified 20 haplotypes for 24,047 SNPs (MAF ≥ 0.1) spanning over about 32.4 Mbps. By the sample, they applied a greedy algorithm to partition the haplotype into blocks of limited haplotype diversity. Using the same criteria as in Patil et al., Zhang et al. [18,23,28] provided a dynamic programming algorithm to partition the same sample totally into 2.575 blocks and identify a total of 3,582 tag SNPs that are 37.7% and 21.5% smaller, respectively, then those identified by Patil  ⊳ Find the number of tag SNPs required for haplotype block such that percentage of unambiguous haplotypes in can be distinguished uniquely. Input: A percentage and haplotype block with unambiguous haplotype pattern = { 1 , 2 , . . . , }; the haplotypes number in each pattern is 1 , 2 , . . . , . Output: The tag SNPs required for haplotype block .
⊳ 's are listed in decreasing order of the number of haplotype strings.
⊳ is the number of percentage of unambiguous haplotypes.
⊳ is the minimum number of tag SNPs required.
⊳ ℎ is the total haplotype strings that can be distinguished. (9) ← ⊳ generate the next -combination in lexicographic order. (10) while [ ] = − + do ⊳ is the length of haplotype string.
1 and with the constraint of using at most tag SNPs will have two cases, either the site is included in the last block of or not. If site is not included in the last block of , it will find between sites 1 and − 1; otherwise there will exist a site , 1 ≤ ≤ , such that [ , ] is the last block of . In the latter case, the tag SNPs required for block [ , ] is tag( , ), so it can find other blocks which are covered by other − tag( , ) tag SNPs between site 1 and site − 1.
Assuming a monotonic diversity function, the recurrence relation is   Finding a segmentation that consists of feasible blocks and maximum total length can be completed using dynamic programming based on the recurrence relation. However, it is difficult to retrieve the intervals in linear space. To solve this problem, we can use a concept similar to that of [29]. We find a cut point * to divide SNP sites into two parts, 1 and 2 , and then there are ⌊ /2⌋ blocks in 1 and ⌈ /2⌉ blocks in 2 and 2 = − 1 . We now have the following recursion relation. While = 1, the boundaries of the block can be found by scanning the leftmost marker array and appending the longest feasible block in [ , ] to a global data structure. The algorithm is shown in Algorithm 2.
In this algorithm, we use six global data structures involving arrays , , , ,C, and -list. and are used to store the good partner points [ ] and [ ] that have been calculated in preprocessing. -list is used to store the boundaries of blocks. Arrays and are used to store the results of the (⌊ /2⌋, , ) and (⌈ /2⌉, + 1, ). During the computation of the (⌊ /2⌋, , ) and the (⌈ /2⌉, +1, ), we use array , replacing a × table to store temporary results that will be used to calculate further results. The size of each of arrays , , , , and is . The size of -list is , × in the general case, so that the space used by the algorithm is ( ).
The time complexity of the algorithm is ( ) as shown in the following by induction. Let ( , ) denote the time needed for Lis( , 1, ). Assume that ( ; ) ≤ 2 for all < , < . According to the algorithm, we have By induction, Letting 2 = 3 1 , the above inequality is satisfied, so that we can prove the time complexity of the algorithm to be ( ). Although we assume that the block diversity evaluation function we used here is monotonic, we can modify the algorithm slightly such that it can be applied to nonmonotonic blocks. In the case of nonmonotonic blocks, for each SNP , we use to denote the set of all such that [ , ] is a feasible haplotype block. Let = = ∑ =1 | |, where is the average number of | | for each marker . It can be shown that the modified algorithm uses ( ) time and ( ) space.

A Linear Space Algorithm for Haplotype Block Partitioning with Limited Number of Tag SNPs.
Using a similar concept as in [31], we find a cut point * to divide SNP sites into two parts, 1 and 2 , and use * tag SNPs for 1 and the other − * tag SNPs for 2 such that the total size of blocks covered by * tags in 1 and − * tags in 2 is maximized. We obtain the following recurrence relation: ( , , ) = ( , * , * ) + ( * + 1, , − * ) .
The idea behind the recurrence relation is illustrated in Figure 5. Note that in order to maximize the total size of blocks tagged by * SNPs and − * ¡sup/¿SNPs, we are unable to assign half of to * directly, because in some cases the ⌊ /2⌋th and the (⌊ /2⌋ + 1)th SNP will be used to tag the same block that is a member of the longest segmentation. If we use the first to the ⌊ /2⌋th SNPs to tag the blocks in 1 and use the (⌊ /2⌋ + 1)th to the th SNPs to tag the blocks in 2 , we will not obtain the longest segmentation for SNPs. In the general case there will be many pairs of * and * solutions that satisfy our requirement. For time efficiency, we want to make * and * approximate half of and as nearly as possible. Let 0 denote the maximum number of tag SNPs required among all feasible blocks. In order to find the appropriate value of * and * , we can examine * in 0 continuous possible values, ⌊ /2⌋−⌊ 0 /2⌋ ≤ * ≤ ⌊ /2⌋+⌊ 0 /2⌋ and examine * in all SNPs loci for each selection of * . Since 0 is small in the general case, we can find * and * quickly. After finding appropriate values of * and * , we can execute the steps recursively to partition the original problem to two subproblems repeatedly. Until ≤ 0 , we simply use the dynamic programming algorithm to solve each subproblem. The algorithm traces back to output the boundaries of each block. The algorithm is shown in Algorithm 3.

, is a constant. Given a haplotype matrix , a diversity upper limit , and a number of tag SNPs t, find a segmentation S consisting of k feasible blocks such that (for all ) ( ( ) ≤ )
and ∑ tag ( ) ≤ , so that maximizing the total length of can be done in ( ) time and using linear space after the preprocessing of , , and tag( , ) , ∈ , for each SNP .
In the algorithm, named HBPTS, we use five global data structures involving arrays , , , , and . Arrays and are used to store the good partner points and for each SNP , and array is used to store the tag SNPs required for each feasible block. The data in arrays , , and were calculated in preprocessing, and the size of each array is , the number of all feasible blocks. In addition, we use a twodimensional array for computing the ( , , 0 ⋅ ⋅ ⋅ ⌊ /2⌋ + ⌊ 0 /2⌋) and for computing the ( + 1, , 0 ⋅ ⋅ ⋅ ⌊ /2⌋ + ⌊ 0 /2⌋). Note that the computation of ( , , ) will compare the values of ( , − 1; − tag( , )), ∈ , and ( , − 1, ). Therefore, if 0 denotes the maximum tag( , ), the maximum number of tag SNPs required among all feasible blocks, we need to store at most the values of (. . . , . . . , ( − 0 ) ⋅ ⋅ ⋅ ( −1)) and ( , − 1, ) while computing the value of ( , , ). In our experience, the 0 will be equal to 8 at most, as seen, for example, in the haplotype data of Patil et al. [7]. Thus the space of two dimensional arrays and is 0 × , so the space complexity for the algorithm is ( + 0 ). Since 0 is generally a constant and > in most practical cases, we can prove that the space used by the algorithm is ( + ). The flowchart of HBPTS is shown in Figure 6. ⊳ and are used to store the good partner pointers and which have been preprocessed dependent on diversity constraints , is used to store the tag SNPs required for each feasible blocks, two dimensional array and are global temporary working storages.
for ← down to + 1 do (12) Compute ( 0 , 0 ) ≤ 2 0 0 for all 0 < , 0 < . According to the algorithm, we have By induction, Letting ≥ 5 0 + 10, the above inequality will be satisfied, so that we can prove the time complexity of the algorithm to be ( ).

Experiments
We applied our dynamic programming algorithms, which find the longest segmentation covered by a specific number of tag SNPs, to the haplotype data for chromosome 21 provided by Patil et al. [7]. The data contain 20 haplotype samples and each contains 24,047 SNPs spanning 32.4 Mb of chromosome 21. The minor-allele frequency at each marker locus is at least 10%. Using the proposed algorithms with the same criteria as in [7] with ≥ 80% coverage of common haplotypes in blocks, 3,260 tag SNPs and 2,266 haplotype blocks are identified. In contrast, 4,563 tag SNPs and 4,135 blocks are identified in [7], and 3,582 tag SNPs and 2,575 blocks are identified in [10]. The proposed algorithms reduce the number of tag SNPs and blocks by 28.6% and 45.2% compared to [7]. We also demonstrate that the results shown in [10] are not optimal. Table 1 shows a comparison of properties of haplotype blocks found by [7,10] and our algorithms with 80% coverage of common haplotypes. The proposed algorithms discover 736 blocks containing more than 10 SNPs per block. Blocks with >10 SNPs account for 32.5% of all blocks. The average number of SNPs for all of the blocks is 10.6. The largest block contains 128 common SNPs, which is longer than the largest block (containing 114 SNPs) identified by [7] and the same as identified by [10]. Tables 2 and 3 show more experimental results. According to these results, we can partition 38.6% of the genome region into blocks that require no tag SNPs. This is because most of these blocks contain only a few common SNPs, and 80% of the unambiguous haplotypes have the same haplotype pattern (are compatible) in these blocks. We term these SNP loci as uninformative markers because they are the same among most (80%) of the population. These data also show that as the covered genome region increases, we need to add more and more tag SNPs to capture the haplotype information of the blocks, and the number of zerotagged blocks becomes fewer. Although the average length of non-zero-tagged blocks becomes shorter as the covered chromosome region increases, the average length of all blocks becomes longer. Figure 7(a) shows the percentage of tag SNPs identified by the proposed algorithms when blocks cover a specified percent of the genome region. According to our experimental results, when blocks cover 70% of the genome region, the proposed algorithm required only 19.1% (about 623) of the tag SNPs to capture most of the haplotype information. This also indicates that the proposed method discovers that only a few tag SNPs are needed to capture most of the genome-region information. Figure 7(b) shows that the percentage of covered genome region increases while the tag SNPs identified by the proposed methods increase by 5%. Note that as the number of tag SNPs increases, the marginal percentage of genome region covered decreases. This indicates that, as the genome region    One of the main reasons is that their algorithms presume that the common-haplotypes evaluation function satisfies the monotonic property. However, when the haplotype sample has missing data, the diversity function does not satisfy the monotonic property. For example, Table 5 shows the analysis results of Zhang's and our algorithms on the same haplotype sample; this sample has just 69 SNPs, which is a small part of Patil's haplotype data [7]; the contig number is NT 001035. Zhang's and our methods find 2 blocks, but our method needs only 3 tag SNPs rather than the 4 found by Zhang's method. These cases demonstrate that Zhang's algorithm fails to find the optimal solution owing to the nonmonotonic property of the common-haplotype evaluation function.
We also apply our algorithm on biological data set from chromosome 20 from HapMap data bulk (http://www .hapmap.org), respectively. The data set contains 120 individuals that include 71,539 SNPs from the Yoruba in Ibadan, Nigeria (abbreviated YRI). We select the first 30 individuals and the first 5,000 SNPs for the input sample of the algorithm. Using the diversity function (3) with ≥ 80% coverage of common haplotypes in blocks, the total number of blocks and tag SNPs identified by the algorithm is 293 and 1,047. The haplotype sample also being applied to Zhang's method with minimum number of tag SNPs implemented on HapBlock [23], using the same criteria, 344 blocks and 1,184 tag SNPs are obtained by Zhang's algorithm. Table 4 shows a comparison of properties of haplotype blocks identified by Zhang's and our algorithms with 80% coverage of common haplotypes. Our algorithm discovers 168 blocks containing more than 10 SNPs per block. Blocks with >10 SNPs account for 57.3% of all blocks. The average number of SNPs for all of the blocks is 17.1. The largest block contains 92 common SNPs, which is longer than the largest block  (containing 79 SNPs) identified by Zhang's algorithm. Note that the haplotype sample has no missing data, and using the data set, the diversity function (3) will satisfy the monotonic property.

Conclusion
In this paper, we examine several haplotype diversity evaluation functions. By use of appropriate diversity functions, the block selection problem can be viewed as finding a segmentation of a given haplotype matrix such that the diversities of chosen blocks satisfy a given value constraint. Tag SNPs can capture most of the haplotype diversity in the blocks and thereby can potentially capture most of the information for association between a trait and the SNP marker loci.
Instead of genotyping all of the SNPs on the chromosome, one may wish to use only the genotype information of tag SNPs. We can infer the haplotype features of most populations by genotyping only a few SNPs. Thus, identifying tag SNPs can dramatically reduce the time and effort needed for genotyping, without loss of much haplotype information. We present two dynamic programming algorithms for haplotype-block partitioning such that total block length is maximized and the total tag SNPs required are minimized. We also show in Theorem 6 that finding the longest -block segmentation with diversity constraints can be done in ( ) time and ( ) space. In Theorem 7, we show that finding a maximum segmentation with limited tag SNPs number can be done in ( ) time; furthermore, we reduce the space complexity into ( + ). We point out that these efficiency results of our algorithms can be applied in many different definitions of diversity functions, provided that we can precompute the boundaries of all feasible blocks and tag SNPs required for these blocks.
We also show that the experimental results discovered by our methods are superior to those by Zhang's algorithm. We demonstrate that owing to the nonmonotonic property of the common-haplotype evaluation function, Zhang's algorithm will not find an optimal solution when the haplotype samples have missing data.