To decode a long genome sequence, shotgun sequencing is the state-of-the-art technique. It requires correctly ordering a very large number, sometimes millions, of short, partially readable strings (fragments). Arranging those fragments in the correct sequence is known as fragment assembly, which is an NP-hard problem. Methods in current use incur enormous computational cost. In this work, we show how our modified genetic algorithm (GA) can solve this problem efficiently. In the proposed GA, the length of the chromosome, which determines the volume of the search space, is reduced with advancing generations, thereby improving search efficiency. We also introduce a greedy mutation, which swaps nearby fragments using heuristics, to improve the fitness of chromosomes. We compared our results with Parsons’ algorithm, which is also based on GA. We used fragments with partial reads on both sides, mimicking fragments in a real genome assembly process, whereas in Parsons’ work the base-pair array of the whole fragment is known. Even so, we obtained much better results, and we succeeded in reconstructing contigs covering 100% of the genome sequences.
The study of bioinformatics is one of the most vibrant areas of research, and its important applications are growing exponentially. A good starting point is the introductory book by Neil and Pavel [
Genome is the complete genetic sequence made from an alphabet of four elements, Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). The letters A, T, C, G represent molecules called nucleotides or bases. In a living cell, it appears in a double helix structure [
Genome (DNA) sequences are enormously long, ranging from a few thousand nucleotides for small viruses to more than 3 giga nucleotides for humans. Genomes like that of wheat (
The DNA sequence, responsible for producing different proteins, is at the root of the functioning of a living organism. Decoding the genome sequence is thus the first step toward understanding the function as well as malfunction of living things, for medical, agricultural, and many other research areas. Investigation of the human genome could reveal the causes of inherited diseases and lead to the development of medical treatments for various illnesses. Genome analysis is useful for breeding improved crops. Moreover, it is key information for investigations in evolutionary biology. It is hoped that the platypus genome, which was decoded in 2008, will provide a valuable resource for in-depth comparative analysis of mammals [
Sanger sequencing [
Most of the existing fragment assembly systems read the fragment base-sequence by Sanger technique and reconstruct the original genome sequence with their proprietary assembling algorithms. Many assembling algorithms were proposed, the important ones being TIGR assembler [
The present work is based on a genetic algorithm (GA), modified to be efficient for this kind of array assembly problem. During the last ten years, a few works have reported using genetic algorithms or similar techniques like
Several deterministic algorithms based on graph theory, as well as greedy heuristic algorithms, have been proposed. But they are computationally very demanding and need a large-scale parallel processing environment, which is very costly. Only a few such installations are available worldwide, and they are owned by large research facilities. Yet the need for genome sequencing is felt more and more strongly at small medical research centers, drug development centers, agricultural research centers, and so forth. To help their research progress, we need an efficient fragment assembly algorithm that can run on an inexpensive computational platform. Moreover, on many occasions one needs only a partial sequence, or only needs to know whether a particular sequence is present in the genome, not the whole genome sequence.
The main motivation of this work is to find an efficient fragment assembling algorithm that could run on desktops, yet be able to find nearly correct draft sequences.
In the proposed method, fragment matching, contig formation, and scaffolding all are embedded in one process. Moreover if the researcher needs to know/confirm only certain gene sequence, information of which is available in the draft sequence (in the contig pool defined in Section
The rest of the paper is organized as follows. In Section
In this section, we explain fragment assembling based on Sanger sequencing. To decode a long DNA sequence one needs to clone it to a few copies, split it up into fragments, read the individual fragments, and then assemble them in correct sequence to reconstruct the target DNA. This process is called shotgun sequencing and is the basis of all sequencing strategies. In 2000 Myers et al. successfully sequenced the fruit fly
WGSS was used in the determination of draft human genome in 2001 by Celera Genomics [
Our target is similar, to create the sequence based on WGSS, doing the assembling part using GA.
The whole process of WGSS is divided into two steps—one is the biological part of cloning, fragmenting, and reading. The other one is the computational part of assembling the fragments.
The basic shotgun procedure starts with a number of copies of DNA whose sequence is cut into a large number of random fragments of different lengths. Fragments that are too large or too small are discarded. Of the remaining fragments, that is, those used for assembling, the length of short ones is about 2 kbp and of the long ones is about 10 kbp [
Formation of contigs.
Starting with a fair number of clones, the total base-pair reads of fragments are several times the number of bases in the original genome. Commonly, a term
To sequence the original DNA, we first identify overlapping sections by comparing the already read base sequences at both ends of the fragments, as shown in Figure
Two
Scaffolding.
Celera assembler [
Our proposed genetic algorithm technique is specialized for fragment assembly and similar problems. The main contribution here is the Chromosome Reduction Step (CRed), which reduces the length of the GA chromosome as generations progress. As the chromosome length shrinks, so does the search space, and the search becomes more and more efficient. The other contribution is the Chromosome Refinement Step (CRef), which is a greedy mutation that improves the correctness of the solution by local rearrangement of genes. We were able to combine the overlap phase (contig formation) and scaffolding through the way we defined the structure of the GA chromosome and CRed. The details are explained in the following sections.
Genes of our GA chromosome are genome fragments, where one fragment is one gene. In a GA chromosome gene, there is the information of two
The fragments generated by shotgun sequencing method are labeled in serial numbers, 1 to
Chromosomes of GA for fragments assembly.
Algorithm including CRed and CRef.
The goal of the search is to bring closer the fragments generated from the same region of the original chromosome. Fitness of a GA chromosome increases as adjacent genes match in their
We use roulette-wheel selection and elitist preservation. Roulette-wheel selection tends to converge on a local maximum when a few chromosomes have much better fitness than the rest. On the other hand, the selection is fairer because the chance of being selected is proportional to each individual’s fitness. Other selection methods, like ranking and tournament selection, exert less selection pressure and are therefore less prone to early convergence. In our case the variance of fitness among chromosomes is low, owing to our chromosome design and to the CRed operation (at the end of a CRed operation, every chromosome’s fitness resets to a low, narrow range), so the risk of premature convergence is small. In fact, in a preparatory experiment we verified that roulette-wheel selection is more efficient for the proposed algorithm.
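The selection scheme described above can be sketched as follows. This is only a minimal illustration, assuming non-negative fitness values; the function name `select_next_generation` and the parameter `n_elite` are hypothetical and do not come from the original implementation.

```python
import random


def select_next_generation(population, fitness, n_elite=1, rng=random):
    """Copy the n_elite best chromosomes unchanged, then fill the remaining
    slots by roulette-wheel (fitness-proportional) sampling with replacement."""
    scores = [fitness(chrom) for chrom in population]
    # Rank chromosome indices from best to worst fitness.
    ranked = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    elites = [list(population[i]) for i in ranked[:n_elite]]

    n_rest = len(population) - n_elite
    if sum(scores) > 0:
        # random.choices draws with probability proportional to the weights.
        drawn = rng.choices(population, weights=scores, k=n_rest)
    else:
        # Degenerate case (all fitness zero): fall back to uniform sampling.
        drawn = rng.choices(population, k=n_rest)
    return elites + [list(chrom) for chrom in drawn]
```

Because the draw is proportional to fitness, a population with a narrow fitness range is sampled almost uniformly, which is consistent with the observation that premature convergence is less of a concern here.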
We do not allow multiple copies of the same fragment in our GA chromosome. To ensure that, we used order-based crossover (OX) and swap, often used in solving TSP [
Mutation is done by simple swapping: two genes in a chromosome are selected at random and exchanged. In swap mutation, it is also possible to swap a whole subset that contains the selected gene, which avoids breaking a subset that has already formed. In this work, however, we used simple single-pair gene swapping.
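The two permutation-preserving operators can be illustrated with a short sketch. The crossover below is one simplified order-crossover variant (copy a slice from one parent, fill the remaining positions left to right in the other parent's order); the exact variant used in the experiments is not specified here, so treat the details and the function names as assumptions.

```python
import random


def order_crossover(parent_a, parent_b, rng=random):
    """Simplified order crossover: the child is always a valid permutation,
    so no fragment appears twice."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n + 1), 2))  # slice [i, j) is non-empty
    child = [None] * n
    child[i:j] = parent_a[i:j]
    kept = set(parent_a[i:j])
    # Genes not copied from parent_a, in the order they occur in parent_b.
    fill = [gene for gene in parent_b if gene not in kept]
    free_positions = [k for k in range(n) if not (i <= k < j)]
    for k, gene in zip(free_positions, fill):
        child[k] = gene
    return child


def swap_mutation(chromosome, rng=random):
    """Exchange two randomly chosen genes; the input list is not modified."""
    mutated = list(chromosome)
    i, j = rng.sample(range(len(mutated)), 2)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Both operators only rearrange fragment labels, which is what guarantees that each fragment still appears exactly once in every offspring.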
Through generations, chromosomes bring individual fragments with long matched
In the filtering stage, we search for contigs already formed in the GA chromosome. The search is performed on the elite chromosome. If a contig longer than a certain threshold has formed, all fragments contained in that contig are removed from all chromosomes. This shortens the chromosomes.
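A minimal sketch of the filtering idea follows. It assumes a caller-supplied predicate `overlaps(f, g)` that says whether two adjacent fragments match, and a threshold `min_len` counted in fragments; both names, and the function names, are hypothetical.

```python
def find_runs(elite, overlaps, min_len):
    """Return maximal runs of consecutive genes in the elite chromosome
    whose neighbours overlap, keeping only runs of at least min_len genes."""
    runs, start = [], 0
    for k in range(1, len(elite) + 1):
        # A run ends at the chromosome end or where two neighbours do not overlap.
        if k == len(elite) or not overlaps(elite[k - 1], elite[k]):
            if k - start >= min_len:
                runs.append(elite[start:k])
            start = k
    return runs


def filter_population(population, elite, overlaps, min_len):
    """Extract long runs from the elite and delete their fragments from
    every chromosome, so all chromosomes shrink by the same set of genes."""
    runs = find_runs(elite, overlaps, min_len)
    extracted = {gene for run in runs for gene in run}
    reduced = [[g for g in chrom if g not in extracted] for chrom in population]
    return runs, reduced
```

Removing the same fragments from every chromosome keeps all chromosomes the same length, so crossover between them remains well defined after the reduction.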
Further detail is as follows. Here,
First,
Marked fragments are combined based on their overlaps. The resulting contigs are stored in a separate database that we call the “contig pool”, which is indexed in the Contig Data Table (CDT). This stage is called the combining stage. If other contigs are already in the contig pool, the newly formed contig is compared with them and combined with them to form longer contigs whenever possible. The contig pool and CDT are updated accordingly.
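The combining stage can be sketched as repeated merging on exact end overlaps. This is an illustration under simplifying assumptions: contigs are plain strings, matches are exact, strand orientation is ignored, and the helper names `overlap_len` and `add_to_pool` and the parameter `min_overlap` are hypothetical.

```python
def overlap_len(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`;
    returns 0 if that length is below min_overlap."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def add_to_pool(pool, new_contig, min_overlap=3):
    """Merge new_contig with pooled contigs for as long as an end overlap
    exists, then store the result. The pool list is updated in place."""
    merged = new_contig
    changed = True
    while changed:
        changed = False
        for i, contig in enumerate(pool):
            n = overlap_len(merged, contig, min_overlap)
            if n:
                merged = merged + contig[n:]      # contig extends the right end
            else:
                n = overlap_len(contig, merged, min_overlap)
                if not n:
                    continue
                merged = contig + merged[n:]      # contig extends the left end
            del pool[i]
            changed = True
            break  # restart the scan, since the pool has changed
    pool.append(merged)
    return pool
```

For example, adding `"TTGACCCA"` to a pool holding `"ACGTTG"` and `"CCATAG"` joins all three into one longer contig, the case where a new contig bridges two existing ones.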
As mentioned, the length of subset to be extracted, in terms of number of fragments, is initialized to
The first time, when
After filtering stage, combining stage is executed. When a new contig is added to the contig pool, we try to combine it with the existing contigs, if possible, to make longer contigs. Once a longer contig is formed, further genes (genome fragments) could be shed off from the chromosomes the way it is done in the filtering stage.
In the filtering stage of CRed, the fragments in the substring extracted from the chromosome may join one end of an existing contig, or they may join two contigs, one on each side, to form a very long contig. After the filtering operation, the information about the contigs is updated in the Contig Data Table (CDT). Information about their relationship, if any, obtained from
As the contigs become longer and chromosomes shorter, GA runs more efficiently. After every combining stage, the user could check whether the available results are good enough (long enough) for her/his purpose. If not, the genetic search continues.
Instead of depending on genetic search alone, we add a step to facilitate proper sequencing more efficiently by manual greedy swapping. This is a simple and fast heuristic that we named CRef.
CRef improves the quality of solution by rearranging the sequence of fragments in a GA chromosome to correspond to the base sequence in the target genome. When two fragments A and B are sequentially positioned in a chromosome due to high-degree of overlap, the following overlap patterns, as shown in Figure
Matching pattern of two fragments. Pattern 1: tail-part of fragment A overlaps with the beginning of fragment B. Pattern 2: tail-parts of the two fragments overlap. Pattern 3: beginning of the two fragments overlap. Pattern 4: beginning of fragment A overlaps with end of fragment B. Pattern 5: both beginning and end of the two fragments overlap.
If two fragments have overlap of pattern 4, it is obvious that their sequential order is wrong in the chromosome. We swap the positions of these two fragments. With this, the positions of fragments in GA chromosome are arranged to correspond to their positions in the original genome as shown in Figure
Chromosome Refinement (CRef) Step.
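The pattern-4 correction can be illustrated with a small sketch. It assumes each fragment is available as a plain string in a dictionary `reads` keyed by fragment label, uses exact matching, and handles only patterns 1 and 4; the names `end_overlap`, `refine_adjacent`, and `min_overlap` are hypothetical.

```python
def end_overlap(left, right, min_overlap=3):
    """Longest suffix of `left` equal to a prefix of `right` (0 if too short)."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def refine_adjacent(chromosome, reads, min_overlap=3):
    """Single left-to-right pass: swap neighbours A, B when the beginning of A
    overlaps the end of B (pattern 4) more strongly than the tail of A
    overlaps the beginning of B (pattern 1)."""
    order = list(chromosome)
    for k in range(len(order) - 1):
        a, b = reads[order[k]], reads[order[k + 1]]
        forward = end_overlap(a, b, min_overlap)   # pattern 1: order is right
        backward = end_overlap(b, a, min_overlap)  # pattern 4: order is reversed
        if backward > forward:
            order[k], order[k + 1] = order[k + 1], order[k]
    return order
```

With reads `{1: "ACGTAC", 2: "TACGGA", 3: "GGATTC"}`, the chromosome `[2, 1, 3]` is rearranged to `[1, 2, 3]`, which matches the order of the reads in the underlying sequence.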
This concept could be extended by expanding the scope of fragment comparison, beyond that of adjacent fragments only. We set a numeric parameter
A detailed explanation of the CRef operation is as follows. Here,
First, we set the values of
The computational cost is low, largely because CRef is neither executed on all chromosomes nor executed at every generation. It is limited to a few high-fitness chromosomes (set by the parameter
With these two steps of CRed and CRef, both the efficiency and quality of result of our genetic search greatly improved.
In this section, we describe the details of our experimental setup, discuss the results, and compare them with the most frequently cited GA-based assembly method, proposed by Parsons et al. [
We used two real genome sequence data sets, also frequently used by other researchers, to test the effectiveness of our algorithm. They are available in the NCBI database [
Experimental genome data.
POBF | AMCG |
---|---|
10089 bp | 20100 bp |
Accession no.: M15421 | Accession no.: J02459 |
Human apolipoprotein B-100 mRNA, complete cds. | Bacteriophage lambda, complete genome (initial 40%) |
Number of fragments: about 500 | Number of fragments: about 1000 |
We implemented Parsons’ GA-based algorithm and compared its results with those of the proposed algorithm under the same experimental conditions. The basic differences between our GA and Parsons’ GA are shown in Table
The differences between proposed GA and Parsons' GA.
Item | Our GA | Parsons' GA |
---|---|---|
Gene of GA chromosome | Fragment with 2 |
|
Fitness function | Equation ( | |
Crossover | Order-based crossover | |
Mutation | swap mutation + greedy mutation | Swap mutation |
Heuristic part | CRed and CRef | None |
Scaffolding | possible | Not possible |
In the experiment described in Parsons’ paper, they used the same POBF and AMCG data. But the fragment lengths were different, and the whole fragment was readable. In our case, the
The population size is set at 100 chromosomes, which are generated by the technique explained in Section
We defined
In another preparatory experiment we examined how to set the proper value of the parameter
We ran 20 trials with different sets of fragments, with POBF and AMCG genome data. In our proposed technique, the number and length of contigs were checked in the “contig pool” (Section
Results about contigs.
Metric | POBF: Proposed GA (average) | POBF: Proposed GA (best) | POBF: Parsons' GA (average) | POBF: Parsons' GA (best) | AMCG: Proposed GA (average) | AMCG: Proposed GA (best) | AMCG: Parsons' GA (average) | AMCG: Parsons' GA (best) |
---|---|---|---|---|---|---|---|---|
Number of contigs | 27.6 | 1 | 19.8 | 9 | 38.3 | 8 | 35.1 | 11 |
Length of contig (bp) | 317.3 | 10089 | 342.6 | 1008 | 349.4 | 2795 | 316.4 | 1341 |
Reconstruction ratio (%) | 86.8 | 100 (2) | 67.2 | 74.3 | 66.5 | 72.1 | 55.2 | 61.1 |
Error | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
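The reconstruction ratio is computed from the total length of all contigs in the contig pool. As a quick arithmetic check on the averaged values in the table, the sketch below assumes that the ratio is the total contig length divided by the genome length, and approximates the total contig length by the average number of contigs times the average contig length. Both are assumptions made for illustration, and the function name `reconstruction_ratio` is hypothetical.

```python
def reconstruction_ratio(total_contig_length, genome_length):
    """Reconstruction ratio in percent, assuming it is defined as
    total contig length divided by genome length."""
    return 100.0 * total_contig_length / genome_length


POBF_BP, AMCG_BP = 10089, 20100  # genome lengths from the data table

# (average number of contigs, average contig length in bp, genome length, reported ratio)
cases = {
    "POBF, proposed GA": (27.6, 317.3, POBF_BP, 86.8),
    "POBF, Parsons' GA": (19.8, 342.6, POBF_BP, 67.2),
    "AMCG, proposed GA": (38.3, 349.4, AMCG_BP, 66.5),
    "AMCG, Parsons' GA": (35.1, 316.4, AMCG_BP, 55.2),
}

for name, (count, length, genome, reported) in cases.items():
    estimate = reconstruction_ratio(count * length, genome)
    print(f"{name}: estimated {estimate:.2f}%, reported {reported}%")
```

Under these assumptions, the estimates agree with the reported average ratios to within about 0.1 percentage point, which suggests that the contig counts, contig lengths, and ratios in the table are mutually consistent.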
With our proposed GA, we could reconstruct the complete original genome of POBF twice. Though Parsons’ GA could obtain the complete genome in their paper [
Because fragmentation is done randomly, though the overall coverage was 4X, some part of the genome may be covered only once by the fragment
The average length of contigs using POBF data in our proposed GA is slightly lower than that of Parsons’ GA. However, the scaffolds generated by our proposed technique, as shown in Table
Results about scaffolds.
Metric | POBF (average) | POBF (best) | AMCG (average) | AMCG (best) |
---|---|---|---|---|
Number of scaffolds | 4.1 | 1 | 12.4 | 6 |
Length of scaffold (bp) | 2135.9 | 10089 | 1079.1 | 3888 |
The improvement of the reconstruction ratio versus execution time is shown in Figures
The improvement of the reconstruction ratio: POBF dataset.
The improvement of the reconstruction ratio: AMCG dataset.
We calculated the reconstruction ratio from the total length of all contigs in the contig pool. It was 0% in the beginning until CRed started its operation. Because we set a threshold value for substring length to be taken out from chromosome and transferred to contig pool, improvement of reconstruction ratio was stagnant periodically. When the stagnation continued over a certain period of time, CRed parameter
Even at the end of the predefined length of execution time (40 hours and 100 hours), the results were improving. By increasing
We proposed a genetic-algorithm-based approach to assemble DNA fragments and reconstruct the genome sequence. Our GA chromosomes differ from those of previous approaches. We also added two modifications, the Chromosome Reduction Step (CRed) and the Chromosome Refinement Step (CRef), to improve the efficiency of GA optimization for fragment assembly. In experiments with actual genome data, we recovered 100% of the POBF genome sequence. We compared our proposed algorithm with Parsons’ algorithm and showed that the proposed algorithm delivers better results.
We used a coverage of only 4X instead of more practical
Though the CRed step is proposed for the fragment assembly problem, it is applicable to similar problems such as clustering, path search, and other combinatorial optimization problems.