HMEC: A Heuristic Algorithm for Individual Haplotyping with Minimum Error Correction

Haplotype is a pattern of single nucleotide polymorphisms (SNPs) on a single chromosome. Constructing a pair of haplotypes from aligned and overlapping but intermixed and erroneous fragments of the chromosomal sequences is a nontrivial problem. Minimum error correction approach aims to minimize the number of errors to be corrected so that the pair of haplotypes can be constructed through consensus of the fragments. We give a heuristic algorithm (HMEC) that searches through alternative solutions using a gain measure and stops whenever no better solution can be achieved. Time complexity of each iteration is O(m 3 k) for an m × k SNP matrix where m and k are the number of fragments (number of rows) and number of SNP sites (number of columns), respectively, in an SNP matrix. Alternative gain measure is also given to reduce running time. We have compared our algorithm with other methods in terms of accuracy and running time on both simulated and real data, and our extensive experimental results indicate the superiority of our algorithm over others.


Introduction
A single DNA molecule is a long chain of nucleotides (base pairs). ere are four such nucleotides which are represented by the set of symbols { . It is generally accepted that genomes of two humans are almost 99% identical at DNA level. However, at certain speci�c sites, variation is observed across the human population which is commonly known as "single nucleotide polymorphism" and abbreviated as "SNP" [1]. e nucleotide involved in a SNP site is called allele. If a SNP site can have only two nucleotides, it is called biallelic. If it can have more than two alleles it is called a multiallelic SNP [2]. From now on, we will consider the simplest case where only bi-allelic SNPs occur in a speci�c pair of DNA.
e single nucleotide polymorphism (SNP) is believed to be the most widespread form of genetic variation [3]. e sequence of all SNPs in a given chromosome is called haplotype. Haplotyping an individual deals with determining a pair of haplotypes, one for each copy of a given chromosome. A chromosome is a complicated structure of a DNA molecule bound by proteins. is pair of haplotypes completely de�ne the SNP �ngerprints of an individual for a speci�c pair of chromosomes. Given the two sequences of bases, haplotyping is straight forward and just needs to iterate through both the sequences and remove all the common alleles from them. But haplotyping becomes difficult when we want to construct haplotypes from sequencing data for higher reliability. Sequencing data for a genome does not contain the complete sequences of bases for a speci�c chromosome, rather it provides a set of fragments of arbitrary length for the whole genome. erefore, the actual problem of haplotyping is to �nd two haplotypes from the set of overlapping fragments of both the chromosomes, where fragments might contain errors and it is not known which copy of the chromosome a particular fragment belongs to.
e problem of haplotyping has been studied extensively. e general minimum error correction (MEC) problem was proved to be NP-hard [4]. It was also proved to be NP-hard even if the SNP matrix is gapless using a reduction from the MAX-CUT problem [1]. A method based on genetic algorithm has been proposed to solve this problem [5]. Several heuristic methods have also been proposed to �nd haplotypes efficiently. HapCUT [6] and ReFHap [7] are two of the most accurate algorithms in this regard.
In this paper, we give a heuristic algorithm for individual haplotyping based on minimum error correction. e complexity of each iteration is 3 for an SNP matrix of dimension . e algorithm is inspired from the famous Fiduccia and Mattheyses (FM) algorithm for bipartitioning a hypergraph minimizing the cut size [8].
Extensive simulations indicate that HMEC outperforms the genetic algorithms of Wang et al. [5] in terms of both reconstruction rate and running time, and it has better (in most cases) or comparable accuracy and signi�cantly smaller running time than that of HapCUT [6], which is the most accurate heuristic algorithm available. We also compared HMEC with some other algorithms such as SpeedHap [9], FastHare [10], MLF [11], 2 distance MEC [12], and SHR-3 [13] using the HapMap-based instance generator and comparison framework [14,15]. e rest of the paper is organized as follows. In Section 2, we present some de�nitions and preliminary ideas. In Section 3, we present our algorithm for individual haplotyping. We describe the data structure and complexity of our algorithm in Section 4. We report on an extensive performance study evaluating HMEC with other available techniques in Section 5. Finally, we conclude in Section 6 by suggesting some future research directions. An earlier version of this paper was accepted for presentation at BMEI 2008 [16].

Preliminaries
In this section, we give some de�nitions and preliminary ideas.
Let be the set of bi-allelic SNP sites. Let be the set of fragments produced from two copies of the chromosome. Each fragment contains information of nonzero number of SNPs in . Because the SNPs are bi-allelic, let the two possible alleles for each SNP site be 0 and 1, where they can be any two elements of the set { . Since all the nucleotides are the same at the sites other than SNP sites, we can remove these extraneous sites from all the fragments and consider the fragments as the sequences of the SNP sites only. us each fragment is a string of symbols {0 1 − of length where "−" denotes an undetermined SNP which we call a hole. All the fragments can be arranged in an matrix , where row is a fragment from and column is a SNP from . is matrix is called the SNP matrix as follows e consecutive sequence of "−"s that lies between two nonhole symbols is called a gap. A gapless SNP matrix is the one that has no gap in any of the fragments. In (1), the �rst, second, and third rows have no gaps while each of the fourth and sixth rows has one gap.
A SNP matrix 1 2 ⟩ can be viewed as an ordered set of fragments where a fragment where is de�ned as In (1), the distance between the second and the third fragment is two, as they differ in the seventh and ninth SNP sites (columns).
Two fragments and are said to be con�icting if > 0. Let 1 2 be a partition of , where 1 and 2 are two sets of fragments taken from so that 1 ⋃ 2 and 1 ⋂ 2 [5]. In Figure 1(b), an arbitrary partition corresponding to the SNP matrix of Figure  1  such that for any two fragments , {1 2 , and are non-con�icting, that is, 0. Such a partition is called an error-free partition. e partition in the Figure 1   For an error-free SNP matrix, a haplotype , {1 2 is constructed from its corresponding fragment class using the following formula: 1 if at least one fragment in has a 1 in th SNP 0 if at least one fragment in has a 0 in th SNP − if all the fragments in skips th SNP (4) where is called the de�ning class of haplotype , and , where {1 2 and 1 , denotes the th element of the haplotype . We now describe the general minimum error correction problem. If a matrix is not error-free, there will be no error-free partition . For such a matrix , there will be at least one con�icting pair of fragments in each of the classes for all possible partitions. erefore it is impossible to construct a haplotype that is non-con�icting with all the fragments in its de�ning class of fragments. If we are given a partition 1 , 2 ) and two haplotypes 1 and 2 constructed from then the number of errors ) that needs to be corrected can be calculated by the following formula: e MEC problem asks to �nd a partition that minimizes the error function ) over all such partitions of an SNP matrix .

A Heuristic Algorithm
In this section, we give our heuristic algorithm based on minimum error correction which we call HMEC.
Construction of a haplotype from an erroneous class requires correction of SNP values, that is, alleles, in the fragments. We want to minimize the number of error corrections. erefore, for each SNP site, the haplotype should take the allele that is present in the majority of the fragments. Let 0 ) be the number of fragments of a collection that have 0 in the th SNP. Similarly, 1 ) de�nes the number of 1s [5]. erefore, to minimize the number of errors ) for a speci�c partition , the haplotype should be constructed according to the following methodology: where 1, 2 and = 1, 2, , . In Figure 1(c), two haplotypes 1 and 2 , associated with the partition in Figure 1(b), are constructed in this way.
We will use a heuristic search to �nd the best partition. is algorithm starts with a current partition = , ) and iteratively searches a better partition. In each iteration, the algorithm performs a sequence of transfer of fragments from their present collection to the other one so that the partition becomes less erroneous. e transfer of a fragment from one collection to the other can increase or decrease the error function ). Let the partition before transferring a fragment be and the partition resulted is . We de�ne the gain of the transfer as Gain ) = ) − ). Figure 2 demonstrates an example calculation of the gain measure. Let = ⟩, = 1, 2, , be an ordering of all the fragments in a partition in such a way that fragment will precede fragment if all the fragments before in have already been transferred to form an intermediate partition and Gain ) ≥ Gain ) over . Hence, 1 = at the start of each iteration. We also de�ne the cumulative gain of a fragment ordering up to the th fragment as Gain , ) = =1 Gain ). Here Gain ) = )− 1 ). e maximum cumulative gain, Gain ) is de�ned as In Section 4, we shall describe these terms with an example.
In each iteration, the algorithm �nds the current ordering of and transfers only those fragments of that can achieve the Gain ) and the fragment that is the last to be transferred is referred as max . us the algorithm moves from one partition to another reducing the error function by an amount of Gain ). Please see Algorithm 1 for a basic description of HMEC. e algorithm continues as long as Gain ) > 0 and stops whenever Gain ) ≤ 0.

Data Structures and Complexity
is section deals with the data structures and the complexity of our algorithm. Here we also suggest some approximation to improve the performance of our algorithm. First, to �nd in each iteration, the algorithm repeatedly transfers the fragment that is not transferred previously in this iteration and has maximum gain over all such fragments. To accomplish this, we use a locking mechanism. At the beginning of each iteration, all the fragments are set free. e free fragment with maximum gain is found out and tentatively transferred to the other collection. Aer the transfer, the fragment is locked at the new collection. is tentative transfer creates the �rst intermediate partition Gain is the maximum Gain and max is the fragment corresponding to Gain in the log table. Aer �nishing all such tentative transfers, becomes an unde�ned partition. To change to the desired "current" partition of the next iteration, the algorithm checks the log to �nd the Gain and max , and rollback the transfer of all the fragments that were transferred aer max . When the rollback completes, becomes ready for the next iteration.
While tentatively transferring a free fragment, the algorithm needs to �nd the fragment with maximum gain among the free fragments (which are not yet transferred). is requires calculating gains for each of them. To calculate the Gain = − for a fragment, we need to calculate two error values of two different partitions: the present intermediate partition and the next partition which will be resulted if is transferred. Each of these error function requires calculation of two new haplotypes from their corresponding collections (see Figure 2). Although and the haplotypes of can be found from the previous transfer, calculation of requires construction of haplotypes of . Since, the difference between and is only one transfer, we can introduce differential calculation of haplotypes , 2 of next partition from the haplotypes of , 2 of present partition. For this purpose, the algorithm stores and 0 values of the present partition. Aer a transfer these values will either remain same or be incremented or decremented by 1. at is why it is now possible to construct , 2 in time. To compute from the haplotypes requires time. us running time to compute the as well as to compute Gain is . For each intermediate partition , = , we need to compute Gain measures for − unlocked fragments to �nd the maximum one. e transfer of this fragments requires updating of and 0 , 2 and = 2 . Hence, it also needs time to run. Finally, there will be such transfer in each iteration and maximum rollbacks. us each iteration will require 3 running time. We now give an example illustrating a single iteration of our algorithm. Figure 3 demonstrates an example iteration of HMEC. We consider that the current partition = is the partition given in Figure 1(b) for the SNP matrix of Figure 1 For example, the fragment with the maximum gain in 2 is fragment 6 which has gain two. Aer each transfer, the transferred fragment is shown locked by a circle. Here, the ordering of the fragments is ⟨2 6 5 4 3⟩ which is also the order of locking of the fragments. is order will be stored along with the Gains in Gain (  the log table. Figure 4 demonstrates the resulting log table of the illustrated iteration. All the tentative transfers aer max have to be rolled back so that the 3 becomes the next . We now give an approximate gain measure to make our algorithm faster. For large SNP matrix, 3 running time is critical to the performance of the algorithm. We can use an approximation in the calculation of the Gain by using only the fragment and not using the 1 other fragments. e approximate gain should be AppxGain = , , .
Here is the haplotype of 's present collection of partition , and is the haplotype of 's next collection of partition . is function ignores the effect of fragments other than on Gain , but reduces the run time of gain calculation to . erefore, the total run time of each iteration will be 2 .

Comparison with GMEC.
In this section, we compare the performance of our algorithm with the genetic algorithm, which we call GMEC, described in [5].
We �rst sample the original haplotype pair into many fragments with different coverage and error rates. Each fragment works as a distinct sample of the same specimen. Here coverage rate indicates the percentage of the total columns of the SNP matrix that have been sampled out. e remaining slots are gaps. We then introduce some speci�c amount of error into these samples. e simulation was controlled in several ways. We varied the error rate while number of fragments and coverage rate were kept constant. Also, coverage was varied while number of fragments and error rate were kept constant.
Notice that the way we introduced error and controlled the coverage rate is not necessarily same as that of [5]. erefore, the reconstruction rates of the branch and bound algorithm described in [5], which is an exact algorithm, should not be compared with those of HMEC.

Experiment on Angiotensin-Converting Enzyme (ACE).
Angiotensin-converting enzyme catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II, which controls �uid-electrolyte balance and systematic blood pressure. Because it has a key function in the reninangiotensin system, many association studies have been performed with DCP1 (encode angiotensin-converting anzyme) [20]. Rieder et al. completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes [17].
We take six pairs of haplotypes to perform the simulation. We generate 50 fragments from each of these haplotype pairs with varying coverage and error rate. We perform the simulation for three different coverage rates (25%, 50%, and 75%). For each of these coverage rates, we perform our simulation for different error rates such as 5%, 10%, 15%, 20%, 25%, 30%, and 50%. In every case, we compare our algorithm with GMEC. Figure 5 illustrates the comparison that bears the clear testimony to the superiority of our algorithm. For most instances, the reconstruction rate achieved by our algorithm is 100% or greater than 98%. Only for a few cases with very high error rate and low coverage value, the reconstruction rate falls below 95%. We also perform the experiment for different coverage rates while keeping the error rate constant. Figure 6 illustrates the performance of these two algorithms for various coverage values. Here also, our algorithm clearly outperforms the genetic algorithm except for very low coverage value (which is unrealistic).  [18,20]. We performed the experiment exactly in the same way that we did for angiotensin-converting enzyme. Figure 7 demonstrates the results. Experimental results suggest that HMEC is much better than the genetic algorithm. Again, for most cases, the reconstruction rate achieved by our algorithm is 100% or greater than 98%. For every instance, our algorithm exhibits better performance than that of GMEC.

Experiment on Simulated Data.
We used simulated data for further evaluation of HMEC. One of the very important advantages of our algorithm is that it takes very short time to reconstruct the haplotypes. Our algorithm is much faster than the GMEC. Table 1 shows the running time of HMEC and GMEC. Here, we perform the simulation by varying the length (length denotes the number of SNP sites in the haplotype pair) of the haplotypes while �xed the value of the coverage rate and the error rate at 50%. Since haplotypes with such varying lengths are not available, we rely on the simulated data. Clearly, HMEC is much faster than GMEC. For example, while HMEC can reconstruct a haplotype with 936 sites in a fraction of a second, GMEC takes 72 seconds.

Comparison with HapCUT.
HapCUT [6] is one of the most accurate heuristic algorithms for individual haplotyping. HapCUT uses a random initial haplotype con�guration and builds a graph. It computes max-cut on the graph to �nd the position to �ip and iterates until no improvement in MEC score is achieved. We have performed extensive experiments to compare the performance of HMEC and Hap-CUT. Experimental results suggest that although HapCUT is reliable, its running time is too large to be a realistic choice for whole genome haplotyping. Our algorithm computes the haplotypes signi�cantly faster than HapCUT without losing accuracy.
We used the �ltered HuRef data from Levy et al. [19] to evaluate the performance. We generated several test data sets varying the coverage and error rate, and tested the performance of HMEC and HapCUT on these data sets. e results are shown in Tables 2, 3 Experimental results indicate that the reconstruction rates of both HapCUT and HMEC are reasonably good. For very low coverage, reconstruction rate of HapCUT is slightly better than HMEC. However, as the coverage rate increases, HMEC begins to outperform HapCUT. Notice that HapCUT is only better than HMEC for very low (and thus unrealistic) coverage values (5% and 10% coverage). However, for higher coverage, HMEC consistently performs better than HapCUT. Furthermore, HapCUT is signi�cantly slower than HMEC. For an instance, with 35% coverage and 40% error rate, HapCUT takes 215.5 seconds where HMEC takes only a fraction of a second (see Table 5). erefore, although generally HapCUT provides reliable reconstruction rate, on large dataset, it is an unrealistic choice due to its time consuming operations. On the other hand, HMEC provides the high accuracy with much less amount of time.
We also used some sample data from ReHap project [14,15]. We created the allele matrix from the ReHap error matrix and fed the matrix to both HMEC and HapCUT algorithms. HMEC consistently produces higher reconstruction rate than HapCUT (see Table 6). Also, the running time of HMEC is clearly much better than HapCUT.

Comparison Using ReHap.
We also compared HMEC with some other well-known algorithms such as SpeedHap [9], FastHare [10], MLF [11], 2 distance MEC [12], and SHR-3 [13] using HapMap-based instance generator and comparison framework [14,15]. e results are shown in Table 7 and Table 8 that suggest that no single method clearly outperforms the others in all cases. However, reconstruction rates achieved by HMEC are the highest or very close to the highest. is bears a clear testimony to its suitability as a practical tool for individual haplotyping.

Conclusion
In this paper, we present a heuristic algorithm (HMEC) based on minimum error correction that computes highly accurate haplotypes signi�cantly faster than the known algorithms for haplotyping. e algorithm is inspired from the famous Fiduccia and Mattheyses (FM) algorithm for bipartitioning a hyper graph minimizing the cut size [8]. We report on an extensive performance study evaluating our approach with other available techniques using both real and simulated datasets. Comprehensive performance study shows that our algorithm outperforms (in most cases) or matches the accuracy of other well-known methods, but runs in a fraction of the time needed for other techniques. High accuracy and very fast running time make our technique suitable for genomewide scale data.
e accuracy of the algorithm can be improved by incorporating some prior knowledge. For example, small groups of fragments that are declared to be in the same haplotype can be identi�ed. Probabilistic methods like expectation maximization (EM) also deserve some consideration over such optimization problems. In the near future, we intend to consider these issues to make further improvements.