The rapid advancement of next-generation sequencing technologies has made it possible for sequencing laboratories to routinely produce enormous numbers of reads from DNA samples. In particular, the number of reads is now in the range of hundreds of millions. Hence, the current challenges include efficient processing of this data, which may reach a couple of hundred GB. To this end, the

The

Designing lightweight implementations of de Bruijn graphs has been the focus of attention in recent times. For example, minimum-information de Bruijn graphs, pioneered by [

Pell et al. [

Recently, Chikhi and Rizk [9] presented a memory-efficient Bloom filter representation of de Bruijn graphs; however, it takes more than 10 hours to complete the critical false-positive calculation. To summarize, in addition to its RAM usage, their approach requires the total free hard disk space to be used over and over again; this ultimately makes the runtime prohibitively high. Another limitation of this approach is that it cannot handle the situation when the

According to the present state of the art, memory-efficient Bloom filter representations of de Bruijn graphs have two critical issues, namely, high running time and the cost of false-positive computation. On the other hand, traditional approaches that do not have these issues require much more memory.

In this paper, we make an effort to alleviate these problems. In particular, we present a new algorithm based on

HaVec introduces a novel graph construction approach that has all three desired properties: it is error free, its running time is low, and it is memory efficient.

It introduces the idea of using a hash table along with an auxiliary vector data structure to store the

It constructs a graph representation that generates no false positives. As a result, only true neighbours are followed when traversing the whole graph.

We note that some preliminary results of this research work were presented at the 17th International Conference on Computer and Information Technology (ICCIT 2014) [

Let us consider the genome assembly process when a de Bruijn graph is used. Because of their high memory requirement, traditional graph representation approaches do not scale well. This is especially true for large graphs having millions of nodes and edges. A Bloom filter can offer a memory-efficient alternative. In this scheme, edges are not stored explicitly; rather, a presence bit is used for every node. The procedure is well known and briefly described below for completeness. For each node in the graph, a hash value is produced, which along with the table size yields an index into the table. The most popular and simplest method to produce this index is to divide the hash value by the table size and take the remainder. Now, if the node is present, the bit at the corresponding index is set to 1. Similarly, to check the presence (absence) of a node in the graph, we do the same calculation and simply check whether the bit at the corresponding index is 1 (0). At this point, recall that a Bloom filter may produce false positives. Hence, if the corresponding bit is 0, then the node is definitely absent; otherwise, the node is possibly present.
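The presence-bit scheme described above can be sketched as follows. This is a minimal illustration with names of our own choosing (not HaVec's code), using a deterministic toy hash of 2 bits per nucleotide:

```python
# Minimal sketch of a presence-bit (Bloom-filter-style) table; names are ours.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_hash(kmer):
    # deterministic toy hash: 2 bits per nucleotide
    h = 0
    for c in kmer:
        h = (h << 2) | ENC[c]
    return h

class PresenceTable:
    def __init__(self, table_size):
        self.table_size = table_size
        self.bits = bytearray((table_size + 7) // 8)  # one bit per index

    def _index(self, node):
        # the remainder method: divide the hash value by the table size
        return kmer_hash(node) % self.table_size

    def add(self, node):
        i = self._index(node)
        self.bits[i // 8] |= 1 << (i % 8)

    def maybe_present(self, node):
        # bit 0 -> definitely absent; bit 1 -> possibly present
        i = self._index(node)
        return bool((self.bits[i // 8] >> (i % 8)) & 1)

table = PresenceTable(1 << 20)
table.add("GGCAA")
assert table.maybe_present("GGCAA")       # a stored node is always found
assert not table.maybe_present("TTTTT")   # this node happens not to collide
```

Note that `maybe_present` can only say "definitely absent" or "possibly present", which is exactly the false-positive behaviour discussed below.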

Now the question is: how can we compute the edges? Again, the procedure is simple. Recall that a node corresponds to a

Now, the problem with using the Bloom filter to represent the graph lies in the possibility that more than one node may generate the same index: when divided by the table size, the hash values of more than one node may produce the same remainder. So, there is a chance for a false edge to be created in the graph if a neighbour node is generated falsely, that is, if the corresponding bit is set because a different node produced the same remainder. This is why we may have false positives when using a Bloom filter.

If the false positives are eliminated, then the Bloom filter will undoubtedly be one of the best candidates (if not the best) to represent a de Bruijn graph. Note that an increase in the table size of a Bloom filter surely decreases the false positive rate; however, the rate will never become zero. In this paper, we present a crucial observation to tackle this issue: even if the same remainder is produced from more than one node following the abovementioned division operation (i.e.,
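The observation can be made concrete with a small example; the table size and the two hash values below are arbitrary choices of ours:

```python
# Two distinct hash values that collide on the index (same remainder) must
# have different quotients, so (index, quotient) identifies the hash uniquely.
table_size = 11
h1, h2 = 24, 57                                # arbitrary example values
i1, q1 = h1 % table_size, h1 // table_size     # index 2, quotient 2
i2, q2 = h2 % table_size, h2 // table_size     # index 2, quotient 5
assert i1 == i2 and q1 != q2                   # same slot, different quotients
# storing the quotient in the slot lets us reconstruct the full hash value:
assert q1 * table_size + i1 == h1
assert q2 * table_size + i2 == h2
```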

Our approach is quite simple and described below. We use a total of

As has been mentioned above, HaVec does not maintain an explicit graph structure; rather, it uses the

HaVec uses hashing for faster access. In the hash table, for each index, HaVec uses 40 bits, that is, 5 bytes of memory as will be evident shortly (please see also Table

Because we are working on DNA sequences, each node (i.e.,

It can have no neighbours (we have only one possibility).

Or it can have only one neighbour (we have 4 possibilities).

Or it can have only 2 neighbours (we have 6 possibilities).

Or it can have 3 neighbours (we have 4 possibilities).

Or it can have all 4 neighbours (we have only one possibility).

Hence, HaVec employs 4 bits for this purpose, where a particular bit corresponds to a particular nucleotide.

HaVec uses 3 bits to keep track of the hash functions thereby accommodating a maximum of 8 hash functions (in this setting).

The quotient value therefore can be stored in the remaining 33 bits.

HaVec’s usage of 5 bytes.

Information | Number of bits
---|---
Outgoing neighbours | 4
Hash function number | 3
Quotient value | 33
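A possible packing of one 40-bit entry along the lines of the table above; the field order within the 5 bytes is our assumption, not necessarily HaVec's actual layout:

```python
# Hypothetical layout of one 5-byte (40-bit) entry: 4 bits of
# outgoing-neighbour flags (one per nucleotide), 3 bits for the hash-function
# number, 33 bits for the quotient.  The field order is our guess.

def pack_entry(neighbours, func_id, quotient):
    assert 0 <= neighbours < 16 and 0 <= func_id < 8 and 0 <= quotient < 1 << 33
    return neighbours | (func_id << 4) | (quotient << 7)

def unpack_entry(entry):
    return entry & 0xF, (entry >> 4) & 0x7, entry >> 7

entry = pack_entry(0b0101, 3, 123_456_789)
assert entry < 1 << 40                          # fits in 5 bytes
assert unpack_entry(entry) == (0b0101, 3, 123_456_789)
```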

HaVec employs an auxiliary vector data structure as shown in Figure

Three levels of vectors used in our approach.

In order to represent the hash value of a k-mer with a 33-bit quotient, the minimum hash table size must be 2^{(2k-33)}. This is because the quotient value is computed by dividing the hash value by the table size. We illustrate this with the help of an example. Suppose that the value of k is 32; then the maximum hash value is 2^{64}–1. Now, the minimum hash table size of 2^{64-33}, or 2^{31}, implies that the maximum quotient value can be 2^{33}–1, requiring 33 bits of storage. Clearly, the minimum hash table size is dependent on the value of k; for k = 25, for example, it is 2^{50-33}, or 2^{17}.

At this point, a brief discussion on the relation between the memory requirement and the quotient size is in order. We illustrate this using another example. Consider the case when we have 20 bits for the quotient value. Then for k = 32, the minimum table size would be 2^{44} (2^{64-20}). This will clearly affect the total memory requirement adversely. In fact, if we reduce the number of quotient bits by one, the minimum table size doubles; on the other hand, increasing it results in fewer hash table entries. Naturally, fewer entries in the hash table force more use of the auxiliary vector structures, thereby increasing the running time. As it turns out, keeping 33 bits for the quotient value strikes the right compromise: the memory requirement and the running time remain at an acceptable level, and we can handle up to 32-mers.
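The arithmetic in the two examples above can be captured in a small helper (a sketch under the stated assumption of 2k-bit hash values, i.e., 2 bits per nucleotide):

```python
# With q quotient bits and 2k-bit hash values (2 bits per nucleotide),
# the minimum hash table size is 2^(2k - q).
def min_table_size(k, quotient_bits=33):
    return 2 ** max(2 * k - quotient_bits, 0)

assert min_table_size(32) == 2 ** 31        # k = 32: 2^(64-33)
assert min_table_size(25) == 2 ** 17        # k = 25: 2^(50-33)
assert min_table_size(32, 20) == 2 ** 44    # fewer quotient bits, bigger table
```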

In our implementation, the hash values are 64-bit unsigned integers, so a direct-addressed table would need 2^{64} entries. But it is not practical to have a hash table of that size. So, we use a much smaller hash table and then apply multiple hash functions in order to reduce the probability of collision, filling as much of the hash table as possible. Hence, it is mandatory to keep track of which hash function has been used for which

Now, it seems that if we increase the number of hash functions, we can populate the hash table more efficiently. But there is a cost for storing the index of the hash function used for a particular k-mer: with n bits reserved for this purpose, we can use at most 2^{n} hash functions (n = 3 in our setting).

To understand the whole process, here we explain how HaVec works with the help of an illustrative example. In this example, we assume that the values of

The de Bruijn graph for the

For the sake of ease of explanation, let us assume that the hash table size is 11. Suppose the two hash functions we have are

Reads are broken into

5-mer | Hash values | Comments
---|---|---
GGCAA | 57 |
GCAAT | 27 |
CAATT | 24, 36 |
AATTG | 52 |
ATTGT | 36, 27 | Put in vector
TTGTG | 22 |
TGTGT | 34, 30 | Put in vector
GTGTG | 49, 47 | Put in vector
TGTGT | 34, 30 | Found in vector; update
GTGTC | 38, 25 | Put in vector
TGTCG | 56 |
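The example can be checked directly: with the division method, index = h mod 11 and quotient = h div 11. For instance, GGCAA (hash value 57) lands at index 2 with quotient 5, which matches the entry 5–1 (quotient 5, hash function 1) at index 2 of the hash table below. A quick verification for the 5-mers stored via their first hash value:

```python
# index = h % 11 and quotient = h // 11 for the 5-mers stored via their
# first hash value, matching the hash table snapshot (quotient-function pairs).
TABLE_SIZE = 11
placed = [
    ("GGCAA", 57, 2, 5),   # index 2 holds "5-1"
    ("GCAAT", 27, 5, 2),   # index 5 holds "2-1"
    ("AATTG", 52, 8, 4),   # index 8 holds "4-1"
    ("TTGTG", 22, 0, 2),   # index 0 holds "2-1"
    ("TGTCG", 56, 1, 5),   # index 1 holds "5-1"
]
for kmer, h, index, quotient in placed:
    assert h % TABLE_SIZE == index and h // TABLE_SIZE == quotient
# CAATT collided at index 2 with its first hash value (24) and used its
# second value 36 instead: index 3, quotient 3 -> the "3-2" entry at index 3.
assert 24 % TABLE_SIZE == 2 and 36 % TABLE_SIZE == 3 and 36 // TABLE_SIZE == 3
```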

Hash table information. We have a total of 11 indices. At each index, the corresponding

Index | Information
---|---
0 | 2–1–
1 | 5–1
2 | 5–1–
3 | 3–2–
4 | 0–0
5 | 2–1–
6 | 0–0
7 | 0–0
8 | 4–1–
9 | 0–0
10 | 0–0

Auxiliary vector data structures. All collided

We consider the first 5-mer, namely,

Now, let us focus on the proceedings related to

For the next 5-mer, namely,

The handling of

The next

The last

put the quotient into memBlock[index]

put whichHashFunc into memBlock[index]

put nextNucleotide into memBlock[index]

put nextNucleotide into memBlock[index]

and hashvalue does not match

create tempVect with

add tempVect to mapPointer5Byte[firstLevelVectorIndex]

create a tempkmerInfo and put nextNucleotide in it

has already this kmer

update nextNucleotide

put quotient, whichHashFunc in tempkmerInfo

add the newly updated tempkmerInfo to mapPointer5Byte[firstLevelVectorIndex]

[secondLevelVectorIndex].vect

continue
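The fragments above can be assembled into a compact sketch of the insertion logic. The overall flow (try each hash function in turn, update the slot on a quotient/function match, fall back to the auxiliary vector on repeated collisions) follows the text, but the names and the slot layout here are our reconstruction, not HaVec's actual code:

```python
# Sketch of the insertion flow suggested by the pseudocode above; names and
# the (quotient, function-id, neighbour-bits) slot layout are our own.

def insert(table, vector, hash_funcs, table_size, kmer, next_nucleotide):
    """Try each hash function in turn; fall back to the auxiliary vector."""
    for func_id, h in enumerate(hash_funcs):
        value = h(kmer)
        index, quotient = value % table_size, value // table_size
        slot = table[index]
        if slot is None:                          # empty slot: claim it
            table[index] = (quotient, func_id, next_nucleotide)
            return "table"
        if slot[:2] == (quotient, func_id):       # same k-mer: update bits
            table[index] = (quotient, func_id, slot[2] | next_nucleotide)
            return "table"
        # occupied by a different k-mer: try the next hash function
    vector[kmer] = vector.get(kmer, 0) | next_nucleotide
    return "vector"

# demo: the same k-mer inserted twice updates its neighbour bits in place
table, vector = [None] * 11, {}
funcs = [lambda s: sum(s.encode())]               # toy hash function
insert(table, vector, funcs, 11, "GGCAA", 0b0001)
insert(table, vector, funcs, 11, "GGCAA", 0b0100)
assert table[339 % 11] == (339 // 11, 0, 0b0101)  # sum(b"GGCAA") == 339
```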

In genome assembly,

Notably, the issue of a cutoff value has become less significant in recent times than it was a few years ago. This is because of the rapid advancement of sequencing technologies, which now produce very high-quality reads with much greater accuracy. This motivated us not to keep provisions for a cutoff value in our original design. However, HaVec can easily accommodate cutoff values simply by using an additional byte. This allows us to support cutoff values between 1 and 255. When we process the input file, we can easily update the count information of a
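The cutoff extension can be sketched as a saturating one-byte counter per k-mer (names are ours):

```python
# One extra byte per k-mer stores a saturating occurrence count in [1, 255];
# this sketch shows counting and applying a cutoff afterwards.
def bump(count):
    return min(count + 1, 255)   # saturate at the byte's maximum

counts = {}
for kmer in ["ATTGT", "ATTGT", "TGTGT"]:
    counts[kmer] = bump(counts.get(kmer, 0))
assert counts == {"ATTGT": 2, "TGTGT": 1}

cutoff = 2                       # k-mers seen fewer times are ignored
kept = {k for k, c in counts.items() if c >= cutoff}
assert kept == {"ATTGT"}
```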

To evaluate the performance of HaVec, we have conducted extensive experiments. We have run our experiments on a server with an Intel® Xeon® CPU E5-4617 @ 2.90 GHz having 12 cores and a total RAM of 64 GB. Note that the scope of this research was to implement HaVec as a single-threaded application, and hence we have used only one core of the server for our experiments. We do plan to release a multithreaded version of HaVec in the near future.

Table

Description of datasets.

Serial number | File name | File size in MB
---|---|---
1 | 50 m.fa | 030.42
2 | Ecoli_MG1655_s_6_1_bfast.fasta | 242.19
3 | Ecoli_MG1655_s_6_2_bfast.fasta | 1718.37
4 | Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 1599.86
5 | Human1_95G_CASAVA1.8a2_NCBI37_18Jan_chr19.sorted.fasta | 2393.17
6 | NA19240_GAIIx_100_chr21.fasta | 1854.83
7 | dataset_1_7GB.fa | 1677.57
8 | dataset_1_9GB.fa | 1944.92

We first designed an experiment with the goal of understanding and analyzing the relations among different parameters of HaVec. This experiment is done on the input file 50 m.fa assuming

The relation between the number of

The graph shows the relation between the number of

The relation between the hash table index and the total memory. Optimum memory use can be achieved when the hash table size is 1.25 to 1.5 times the number of unique

The relation between the hash table size and the runtime. Generally, runtime for the 6-byte implementation is slightly higher than that of the 5-byte implementation.

Next, we investigate the relation between the number of

Figure

Finally, the curve in Figure

We have further conducted extensive experiments considering all the files listed in Table

Results for HaVec.

File name | k | Hash table size | Runtime, 6 bytes per k-mer (s) | Runtime, 5 bytes per k-mer (s) | Memory, 6 bytes per k-mer (MB) | Memory, 5 bytes per k-mer (MB) | Unique k-mers in hash table | Unique k-mers in vector | Total unique k-mers
---|---|---|---|---|---|---|---|---|---
50 m.fa | 27 | 3,558,218,093 | 2261.75 | 2163 | 21445.9392 | 18052.608 | 2,358,010,004 | 14,135,390 | 2,372,145,394
50 m.fa | 32 | 3,283,745,651 | 2183 | 2038.5 | 19228.8768 | 16648.4992 | 2,176,128,060 | 13,035,689 | 2,189,163,749
Ecoli_MG1655_s_6_1_bfast.fasta | 27 | 20,115,587 | 70.25 | 70 | 144.896 | 125.7472 | 13,330,622 | 79,768 | 13,410,390
Ecoli_MG1655_s_6_1_bfast.fasta | 32 | 20,693,341 | 69.375 | 68.75 | 148.3776 | 128.6144 | 13,713,606 | 81,955 | 13,795,561
Ecoli_MG1655_s_6_2_bfast.fasta | 27 | 196,614,919 | 488.5 | 486.5 | 1205.8624 | 1018.368 | 130,293,923 | 782,686 | 131,076,609
Ecoli_MG1655_s_6_2_bfast.fasta | 32 | 200,937,899 | 480.75 | 476.5 | 1231.872 | 1040.2816 | 133,158,739 | 799,832 | 133,958,571
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr19.sorted.fasta | 27 | 313,251,713 | 601.5 | 593.25 | 1909.0432 | 1610.24 | 207,579,166 | 1,255,278 | 208,834,444
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr19.sorted.fasta | 32 | 334,345,241 | 611.625 | 592.75 | 2035.6096 | 1716.736 | 221,565,835 | 1,330,984 | 222,896,819
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 27 | 199,165,411 | 371.5 | 370.625 | 1221.4272 | 1031.4752 | 131,981,455 | 795,486 | 132,776,941
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 32 | 207,852,223 | 374.75 | 374.5 | 1273.4464 | 1075.2 | 137,741,484 | 826,661 | 138,568,145
NA19240_GAIIx_100_chr21.fasta | 27 | 163,949,171 | 508.625 | 513.5 | 1009.5616 | 853.1968 | 108,643,167 | 656,266 | 109,299,433
NA19240_GAIIx_100_chr21.fasta | 32 | 170,662,721 | 549 | 504.625 | 1016.3712 | 886.9888 | 113,094,644 | 680,508 | 113,775,152
dataset_1_7GB.fa | 27 | 199,165,411 | 395.25 | 368.125 | 1221.4272 | 1031.4752 | 131,981,455 | 795,486 | 132,776,941
dataset_1_7GB.fa | 32 | 207,852,223 | 373.75 | 367 | 1273.4464 | 1075.2 | 137,741,484 | 826,661 | 138,568,145
dataset_1_9GB.fa | 27 | 163,949,171 | 516 | 507.125 | 1009.5616 | 853.1968 | 108,643,167 | 656,266 | 109,299,433
dataset_1_9GB.fa | 32 | 170,662,721 | 511.375 | 507.625 | 1049.8048 | 886.9888 | 113,094,644 | 680,508 | 113,775,152

We have conducted a number of experiments to compare the performance of HaVec with state-of-the-art methods. In particular, we have compared HaVec with Velvet [

This command runs minia for

It should be mentioned here that in our experiments, minia has produced approximately 5% fewer unique

For minia, we have run the experiments for both values of

Results for minia.

File name | k | Min abundance | Estimated genome size | Runtime (AVG) | Runtime (SD) | Memory total (AVG) | Memory total (SD)
---|---|---|---|---|---|---|---
50 m.fa | 27 | 1 | 2,189,163,749 | 37637.66666 | 1064.9044 | 6589.0 | 0
50 m.fa | 32 | 1 | 2,189,163,749 | 33418.33333 | 477.8162 | 6604.0 | 0
Ecoli_MG1655_s_6_1_bfast.fasta | 27 | 1 | 13,795,561 | 100 | 1.0000 | 251.0 | 0
Ecoli_MG1655_s_6_1_bfast.fasta | 32 | 1 | 13,795,561 | 100.66667 | 1.5275 | 251.0 | 0
Ecoli_MG1655_s_6_2_bfast.fasta | 27 | 1 | 133,958,572 | 1787 | 35.7911 | 1813.0 | 0
Ecoli_MG1655_s_6_2_bfast.fasta | 32 | 1 | 133,958,572 | 1753 | 13.0767 | 1814.0 | 0
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr19.sorted.fasta | 27 | 1 | 222,896,820 | 2229.66667 | 21.9393 | 2551.0 | 0
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr19.sorted.fasta | 32 | 1 | 222,896,820 | 2075 | 7.5498 | 2553.0 | 0
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 27 | 1 | 138,568,143 | 1083.33333 | 8.5049 | 1697.0 | 0
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 32 | 1 | 138,568,143 | 1116.33333 | 15.1767 | 1698.0 | 0
NA19240_GAIIx_100_chr21.fasta | 27 | 1 | 113,775,137 | 836.33333 | 7.5719 | 1935.0 | 0
NA19240_GAIIx_100_chr21.fasta | 32 | 1 | 113,775,137 | 870.33333 | 9.0185 | 1935.0 | 0
dataset_1_7GB.fa | 27 | 1 | 138,568,143 | 1084 | 10.1489 | 1697.0 | 0
dataset_1_7GB.fa | 32 | 1 | 138,568,143 | 1103.66667 | 6.6583 | 1698.0 | 0
dataset_1_9GB.fa | 27 | 1 | 113,775,137 | 841 | 9.5394 | 1935.0 | 0
dataset_1_9GB.fa | 32 | 1 | 113,775,137 | 864.66667 | 7.5719 | 1935.0 | 0

Results for Velvet.

File name | Disk space (GB) | RAM space (KB) | Run time (seconds) | Number of
---|---|---|---|---
50 m.fa | 6.6 | 64,675,804 | 83,900 | 50,000,000+
Ecoli_MG1655_s_6_1_bfast.fasta | 1.9 | 751,460 | 53 | 2,003,258
Ecoli_MG1655_s_6_2_bfast.fasta | 13 | 5,884,072 | 411 | 14,214,324
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr19.sorted.fasta | 7.3 | 6,591,012 | 327 | 17,670,833
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 5.0 | 4,107,472 | 188 | 11,812,904
NA19240_GAIIx_100_chr21.fasta | 7.0 | 3,714,472 | 246 | 15,016,990
dataset_1_7GB.fa | 5.0 | 4,108,000 | 187 | 11,812,904
dataset_1_9GB.fa | 7.0 | 3,714,480 | 242 | 15,016,990

We have conducted

File name | Minia runtime | HaVec 5-byte mf = 1.2 | HaVec 5-byte mf = 1.5 | HaVec 6-byte mf = 1.2 | HaVec 6-byte mf = 1.5
---|---|---|---|---|---
Ecoli_MG1655_s_6_1_bfast.fasta | 100.00 | 86.625 | 0 | 85.875 | 70.25
Ecoli_MG1655_s_6_1_bfast.fasta | 100.50 | 86.125 | 8.75 | 86.875 | 69.375
Ecoli_MG1655_s_6_2_bfast.fasta | 1801.350 | 597.5 | 86.5 | 601.25 | 488.50
Ecoli_MG1655_s_6_2_bfast.fasta | 1756.00 | 589.875 | 476.5 | 592.75 | 480.75
Human1_95G_CASAVA1.8a2_NCBI37_18Jan_chr19.sorted.fasta | 2236.00 | 724.25 | 593.25 | 722 | 601.50
Human1_95G_CASAVA1.8a2_NCBI37_18Jan_chr19.sorted.fasta | 2071.50 | 726.25 | 592.75 | 738.50 | 611.625
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 1087.50 | 449.5 | 370.625 | 444.125 | 371.50
Human1_95G_CASAVA1.8a2_NCBI37_18Jan11_chr21.sorted.fasta | 1124.50 | 447.875 | 374.50 | 450.625 | 374.75
NA19240_GAIIx_100_chr21.fasta | 832.00 | 621 | 513.50 | 623.625 | 508.625
NA19240_GAIIx_100_chr21.fasta | 870.00 | 619.25 | 504.625 | 622.375 | 525.75
dataset_1_7GB.fa | 1078.50 | 446.5 | 368.125 | 450.25 | 395.25
dataset_1_7GB.fa | 1102.00 | 451.25 | 367 | 454.25 | 373.75
dataset_1_9GB.fa | 841.50 | 619.875 | 507.125 | 628.375 | 516.0
dataset_1_9GB.fa | 862.00 | 626 | 507.625 | 624.125 | 511.375

In this paper, we have presented HaVec, a simple and efficient approach to store a de Bruijn graph for genome assembly. HaVec uses a hash table along with an auxiliary vector data structure to store

HaVec can also support the concept of cutoff values by storing the count information of each

Any operation involving a

Before concluding, we briefly discuss another useful feature of HaVec. During assembly, the construction of the de Bruijn graph and the assembly process may need to be run more than once for different cutoff values. In contrast, in the 6-byte implementation of HaVec, we just keep the count of the number of occurrences of each

The major share of the time in the genome assembly process is taken by the graph construction procedure. In this paper, we have presented HaVec, which can do this in a significantly shorter time. Another critical feature of HaVec is that it does not produce any false positive

All the data files in FASTA format can be downloaded from the following link:

This research work was carried out as part of an undergraduate thesis of Md Mahfuzer Rahman, Ratul Sharker, and Sajib Biswas at CSE, BUET under the supervision of M. Sohel Rahman.

The authors declare that they have no competing interests.

M. Sohel Rahman designed the project. Md Mahfuzer Rahman, Ratul Sharker, and Sajib Biswas implemented the project and conducted the experiments. M. Sohel Rahman analyzed and verified the results. All authors wrote and approved the manuscript.