In this paper, we advocate a high-level programming methodology for next generation sequencer (NGS) alignment tools, for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools with their ports onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are freed from the complex aspects of parallel programming, such as synchronisation protocols and task scheduling, gaining more opportunity for seamless performance tuning. In this work, we show use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance on all tested datasets.
Next generation sequencers (NGS) have increased the amount of data obtainable by genome sequencing; an NGS run produces millions of short sequences, called reads.
The rapid evolution of sequencing technologies, each producing different datasets, is boosting the design of new alignment tools. Some of them target specific datasets (e.g., short reads, long reads, and high-quality reads) or even data from specific sequencing technologies. Since the alignment process is computationally intensive (several tools build on the notoriously expensive Smith-Waterman algorithm), many alignment tools are designed as parallel applications, typically exploiting multithreading on multicore platforms; in some cases, SIMD parallelism (via either SSE or GPGPU) is also exploited.
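To make the cost argument concrete, the following is a minimal, unoptimised Smith-Waterman local alignment scorer (no traceback; the scoring parameters are illustrative and not those of any tool discussed here). Its O(nm) dynamic programming table is what makes the algorithm expensive and worth parallelising:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Minimal Smith-Waterman local alignment score (no traceback).
// Scoring scheme (match/mismatch/gap) is illustrative only.
int smith_waterman(const std::string& a, const std::string& b,
                   int match = 2, int mismatch = -1, int gap = -2) {
    const size_t n = a.size(), m = b.size();
    // H[i][j] = best local alignment score ending at a[i-1], b[j-1].
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            int diag = H[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            int up   = H[i-1][j] + gap;
            int left = H[i][j-1] + gap;
            H[i][j] = std::max({0, diag, up, left});  // 0 restarts the local alignment
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

Production aligners vectorise the inner loop (e.g., with SSE) and prune the table; this sketch only shows the quadratic structure being accelerated.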
Due to specialisation, some of these tools provide users with superior alignment quality and/or performance. With the ever-growing number of sequencing technologies, it can be expected that the scenario of specialised alignment tools will widen even further.
Although the market of NGS alignment tools is growing, to date the parallel programming methodologies used to design these tools do not embrace much more than low-level synchronisation primitives, such as mutual exclusion and atomic operations. In the hierarchy of abstractions, this is only slightly above toggling binary switches on the front panel of the machine. In the NGS community, programming multicores for performance is still approached as "the closer to the metal, the faster", thus exclusively focusing on extreme optimisation of the code for a single algorithm and a single platform. We believe that correctness, productivity, time-to-market, and porting of existing legacy codes are equally important targets.
In this paper, we advocate a high-level programming methodology for NGS alignment tools for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of some of the most popular alignment tools (such as Bowtie and BWA), which can all be abstracted by the master-worker paradigm.
This paper is organised as follows. Section
Many algorithms for sequence alignment have been proposed, and different tools have been implemented that exploit multithreading on homogeneous and heterogeneous platforms.
The first step, performed before any alignment, is to create and load an index of the reference genome. The techniques used are hash tables and the Burrows-Wheeler transform [
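To illustrate what a Burrows-Wheeler-based index stores, here is a deliberately naive BWT construction that sorts all rotations of the text. Production indexes such as those in BWA and Bowtie2 instead build a suffix array (in linear time) and never materialise the rotations; this sketch only shows the transform itself:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Naive Burrows-Wheeler transform: sort all rotations of text + sentinel
// and take the last column. For exposition only; real tools derive the
// BWT from a suffix array without materialising rotations.
std::string bwt_naive(std::string text) {
    text += '$';  // sentinel, lexicographically smallest character here
    const size_t n = text.size();
    std::vector<size_t> rot(n);
    for (size_t i = 0; i < n; ++i) rot[i] = i;
    // Sort rotation start positions by comparing rotations character-wise.
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t k = 0; k < n; ++k) {
            char ca = text[(a + k) % n], cb = text[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    // The BWT is the last column of the sorted rotation matrix.
    std::string last;
    for (size_t r : rot) last += text[(r + n - 1) % n];
    return last;
}
```

The resulting string groups equal characters by context, which is what makes the compressed FM-index queries used by these aligners possible.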
Alignment tools generally exploit parallelism via multithreading. As an example, Bowtie, in both versions Bowtie1 and Bowtie2, implements multithreading with Posix Threads. The BWA aligner comes in two parallel flavours: multithreading via a pool of Posix Threads in BWA, and MPI (Message Passing Interface) to target both shared-memory and distributed clusters (pBWA [
There exist several technologies for DNA sequencing, which produce reads of different lengths. Today, the most popular sequencing platforms for long read generation are the Roche 454 and Ion Torrent PGM platforms, whereas for short read generation they are the Illumina and Applied Biosystems platforms. Novel sequencing platforms, such as the SMRT (Single Molecule Real Time sequencing) PacBio RS family [
Multicore platforms are
Parallel software engineering addressed this challenge via high-level sequential language extensions, parallel coding patterns, and algorithmic skeletons aimed at simplifying the porting of sequential codes to parallel architectures while guaranteeing the efficient exploitation of concurrency [
Parallel design patterns have been recognised to have the potential to induce a radical change in the parallel programming scenario, such that new parallel programs will be capable of exploiting the high parallelism provided by hardware vendors [
In this respect, the FastFlow parallel programming framework [
FastFlow is a parallel framework targeting shared memory multi/many-core and heterogeneous distributed systems. It is implemented in C++ on top of the Posix Threads (Pthreads) library and provides developers with a number of efficient parallel patterns [
FastFlow Layered Design.
FastFlow has been used to design a variety of parallel algorithms, including Smith-Waterman [
As discussed in related works (Section
Thanks to specialisation, some of these tools might provide the users with superior alignment quality and/or performance. It is of particular interest to identify and engineer the building blocks needed to develop a parallel alignment tool that is at the same time efficient and portable. Ideally, such building blocks can provide any forthcoming alignment tool with absolute performance, performance portability, and reduced development time.
Indeed, the parallelisation of the sequence alignment problem exhibits a number of distinguishing features. There exists one reference sequence (or, in the future, a set of them), e.g., a genome; it is typically read-only data. There exists a set of reads to be aligned against the reference; it is also read-only. The specific attributes of the reads (e.g., length, quality) depend on the dataset, but reads can in any case be independently aligned against the reference(s). With the growing size of datasets, reads are likely to be available as a stream of data flowing from permanent storage. The assembly of results from independent alignments frequently does not require a complex merging operation; when a merging phase is required (e.g., to provide a global filtering of the data), it is expected to be an online process on the result stream flowing to permanent storage.
These features fit the master-worker parallel paradigm (i.e., a variant of the farm paradigm), or the more general composition of the pipeline and farm paradigms in case the process requires a complex merging operation (e.g., ordered merging). As a matter of fact, all the most popular parallel alignment tools, including Bowtie2, BWA, and BLASR, implement a master-worker paradigm, where each worker cycles over the following three steps: it gets a sequence to align from the shared input file; aligns the read against the genome loaded into the shared index file; and populates shared data structures with results and statistics.
During the first and last steps, shared data structures are accessed in a read/write mode. These accesses are regulated via mutual exclusion (either blocking lock or atomic-based spin-lock, depending on the configuration). Furthermore, during these steps, the memory space to accommodate reads is dynamically allocated and deallocated, which might induce further mutual exclusion operations within the memory allocator. Each worker thread iteratively gets a single read from the input dataset and maps it onto the reference genome. This behaviour, usually named on-demand scheduling, enforces load balancing among worker threads.
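The worker loop described above (get a read, align it, store the result) with on-demand scheduling can be sketched with a plain std::thread pool and an atomic counter. This is a minimal sketch, not code from any of the tools discussed: the alignment step is stubbed out, and all names are our own.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Master-worker with on-demand scheduling: workers repeatedly grab the
// next unprocessed read via an atomic counter (the "get a sequence"
// step), "align" it (stubbed), and write the result into a pre-sized
// slot, so no lock is needed on the output side.
std::vector<int> align_all(const std::vector<std::string>& reads,
                           unsigned nworkers) {
    std::vector<int> results(reads.size());
    std::atomic<size_t> next{0};  // the on-demand scheduling point
    auto worker = [&] {
        for (size_t i = next.fetch_add(1); i < reads.size();
             i = next.fetch_add(1)) {
            // Placeholder "alignment": score = read length.
            results[i] = static_cast<int>(reads[i].size());
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < nworkers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return results;
}
```

Because each worker pulls work only when idle, long reads naturally occupy fewer workers for longer, which is the load-balancing effect the tools rely on.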
Interestingly enough, all of them are developed with extremely low-level programming tools, such as spin-locks and atomic operations. These might provide the applications with low-overhead synchronisation but certainly make them hardly portable across different platforms and operating systems. Furthermore, such low-level optimisations require nontrivial debugging and a large performance-tuning effort.
As shown in the next section, the adoption of an engineered master-worker pattern simplifies the work, guarantees the portability of the application, and provides it with good performance. This approach has been applied to two aligners, Bowtie2 and BWA-MEM. To definitively assert that the proposed pattern is the best for every aligner, we would have to test it on each tool, which is impractical given their constantly increasing number; we can say, however, that by its nature this pattern helps in simplifying both the parallelisation process and further optimisations.
The Bowtie2 (a.k.a. Bowtie version 2) alignment tool can align reads of very different lengths. Loading the human genome requires a fairly limited amount of memory (about 2.3 GB), which makes the tool usable on both workstations and laptops. The original source code of Bowtie2 implements parallelism by using the Posix Threads library according to a master-worker pattern.
Each worker iteratively cycles the three steps described in Section
In order to assess the expressiveness and efficiency of the pattern-based approach, Bowtie2 (version 2.0.6) has been ported on top of the FastFlow library (
The synchronisation schemas of both the original Bowtie2 and Bowtie2-FF are shown in Figure . Thread creation and synchronisation are transparently provided by the parallel pattern. This simplifies the coding and enhances portability across different platforms and threading libraries (e.g., Windows native threads). The pattern's run-time behaviour can be configured according to different scheduling policies (e.g., static, on-demand, and affinity) without changing the code. The lock-free run-time support minimises concurrency overhead due to coherency traffic, thus exhibiting superior speedup on fine-grain and irregular workloads.
Typical thread orchestration in parallel alignment tools. (a) Low-level design (e.g., Bowtie2, BWA); (b) pattern-based design with FastFlow (e.g., Bowtie2-FF, BWA-FF).
Also, the FastFlow framework offers the opportunity to easily couple thread pinning and memory affinity. As an example, in the Bowtie2-FF implementation, each worker's private data structures have been allocated on the memory node attached to the core executing that (pinned) worker. This way the best memory access latency is obtained; that is, each worker thread needs less time to access its private data. To improve access to the genome, it has been allocated with an interleaved policy, i.e., spreading memory pages across all memory nodes of the system (round-robin). This avoids memory hot spots when the genome is concurrently accessed by many cores. To understand the gain breakdown of the different techniques, in Section the following three configurations are compared:

Bowtie2-FF (bt-FF): master-worker with the workload dynamically partitioned among workers;

Bowtie2-FF with thread pinning (bt-FF (pin)): master-worker with threads pinned on cores and memory affinity for private data;

Bowtie2-FF with thread pinning and genome interleaving (bt-FF (pin + int)): master-worker with threads pinned on cores, memory affinity for private data, and an interleaved allocation policy across memory nodes for shared data (the genome).
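On Linux, the thread-pinning part of these configurations can be sketched with pthread_setaffinity_np (a glibc extension; this is not the FastFlow API itself, and the helper name pin_to_core is our own):

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a given core (Linux/glibc-specific).
// Once pinned, memory that the worker touches first is allocated on
// the NUMA node local to that core under Linux's default first-touch
// policy, which supports the memory-affinity scheme of bt-FF (pin).
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

The interleaved genome allocation of bt-FF (pin + int) would additionally use a NUMA allocation API (e.g., libnuma's numa_alloc_interleaved); we omit it here to keep the sketch dependency-free.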
For further implementation details, please refer to [
Bowtie2-FF has been developed as a porting of version 2.0.6 onto the FastFlow library. In February 2014, version 2.2.1 of Bowtie2 was released; it improves index-querying efficiency by using the "population count" instructions available since SSE4.2. That instruction set also added the STTNI (String and Text New Instructions) extensions, which provide several new operations for character search and comparison on two 16-byte operands. The two versions do not differ in the orchestration of threads.
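The "population count" speedup mentioned above can be illustrated with a small helper: FM-index occurrence queries reduce to counting set bits in a bit-packed prefix of a BWT row, which the POPCNT hardware instruction performs in a single instruction. The GCC/Clang builtin below compiles to it when available; the helper name is ours, not Bowtie2's:

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits among the first nbits bits of a 64-bit word,
// as needed for rank/occurrence queries on a bit-packed BWT row.
int count_bits_upto(uint64_t word, unsigned nbits) {
    // Keep only the first nbits bits, then popcount the remainder.
    uint64_t mask = (nbits >= 64) ? ~0ULL : ((1ULL << nbits) - 1);
    return __builtin_popcountll(word & mask);
}
```

A software fallback (table lookup or bit tricks) needs several operations per word, which is why the hardware instruction measurably speeds up index querying.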
The BWA alignment suite includes three algorithms based on a suffix-array (Burrows-Wheeler) representation of the data: BWA-backtrack, BWA-SW, and BWA-MEM [
In all three variants, the BWA tool is designed according to a master-worker paradigm as described in Section
As for Bowtie2, each worker of BWA iteratively cycles the three steps described in Section
Theoretically, the FastFlow master-worker implementation still has a performance edge over the described implementation, since it also avoids all coherency traffic due to atomic operations (thanks to the memory-fence-free/atomic-free design of the FastFlow run-time). However, this edge becomes evident only for very fine grain tasks (hundreds of clock cycles), whereas the typical task grain in BWA is orders of magnitude larger. Still, the master-worker pattern simplifies the design because it implements a transparent dynamic load-balancing strategy and does not require any ad hoc rebalancing strategy.
In this section, the datasets used to compare the performance of Bowtie2 and Bowtie2-FF are first presented (Section
Within this work, we aligned datasets obtained with three different sequencing technologies in order to show how they behave with various read lengths. More precisely, for our analysis of Bowtie2, we selected four short read experiments (SRR027963, SRR078586, SRR502198, and SRR341579) and three long read experiments (SRR003161, Human-Ref19-1, and Human-Ref19-2). The former report genomic sequences from CTCF ChIP-Seq experiments performed on the IMR90 cell line (SRR078586), exome sequencing from phase 1 of the 1000 Genomes Project (SRR502198), and a dataset from a Hi-C assay on K562 cells (SRR341579, SRR027963). For the long read datasets, we chose three whole human genome sequencing experiments, one from phase 2 of the 1000 Genomes Project (SRR003161). Table
Datasets.
Dataset | Platform | Read length (bp) | Reads count |
---|---|---|---|
SRR534301 | Illumina | 101 | 108,749,331 |
SRR072996 | Illumina | 20 | 60,673,318 |
lane2_CTL_qseq | Illumina | 36 | 53,673,423 |
SRR568427 | Illumina | 36 | 53,594,954 |
SRR502198 | Illumina | 36 | 25,675,656 |
SRR078586 | Illumina | 8–48 | 3,101,013 |
SRR003161 | 454 GS FLX | 47–4,931 | 1,376,701 |
SRR341579 | Illumina | 202 | 6,143,624 |
SRR027963 | Illumina | 76 | 18,145,940 |
Human-Ref19-1 | PacBio | 35–35,488 | 25,249 |
Human-Ref19-2 | PacBio | 35–34,583 | 17,797 |
Drosophila M. | PacBio | 55–6,883 | 332,369 |
We also selected portions of two PacBio RS II (Pacific Biosciences) datasets (Human-Ref19-1, Human-Ref19-2). By comparing the lengths of mapped and unmapped reads, we observed that Bowtie2 was able to align reads longer than 10 Kbp (max length 35 Kbp). Contrary to expectations, the majority of shorter reads (less than 3 Kbp) were not mapped by the algorithm (Figure
Alignment percentages on Human-Ref19-1 and Human-Ref19-2.
Subset | Human-Ref19-1 mapped | Human-Ref19-1 reads | Human-Ref19-2 mapped | Human-Ref19-2 reads |
---|---|---|---|---|
1 Kbp | 8.10% | 2,098 | 10.76% | 1,747 |
2 Kbp | 8.20% | 5,231 | 11.68% | 4,788 |
3 Kbp | 9.57% | 2,697 | 12.07% | 7,093 |
5 Kbp | 9.58% | 11,755 | 13.38% | 9,931 |
Length of mapped and unmapped reads for PacBio Human-Ref19-1 (a) and Human-Ref19-2 dataset (b) obtained with bt-FF.
The same analysis was done on BLASR and BWA-MEM alignment results. Figure
Length of mapped and unmapped reads for PacBio Human-Ref19-1 (a) and Human-Ref19-2 dataset (b) obtained with BWA-MEM.
We verified this observation also using the BLASR and BWA-MEM tools. All reads were aligned using the former algorithm (data not shown), while BWA-MEM produced results similar to the Bowtie2 algorithm in terms of the distribution of read lengths (Figure
Notably, using Bowtie2, the fraction of aligned reads increased to 84.3%, but again we observed a fraction of unmapped reads whose length was significantly lower compared to the mapped ones (unpaired
Length of mapped and unmapped reads for PacBio
In this section, the original multithreaded implementations of the Bowtie and BWA alignment tools are compared for performance with their ports onto the FastFlow pattern-based library on different datasets. A key to the tools used, with their versions, is reported in Table
Alignment tools key.
Acronym | Tool | Version | Variant | Technology |
---|---|---|---|---|
bt-2.0.6 | Bowtie2 | 2.0.6 | original | Pthreads |
bt-2.2.1 | Bowtie2 | 2.2.1 | original | Pthreads |
bt-FF | Bowtie2 | 2.2.1 | porting | FastFlow |
BWA-MEM | BWA MEM | 0.7.9a | original | Pthreads |
BWA-MEM-FF | BWA MEM | 0.7.9a | porting | FastFlow |
BLASR | BLASR | 2.1 | original | Pthreads |
Tests were executed on an Intel workstation with 4 eight-core E7-4820 Nehalem processors (64 hardware threads) @2.0 GHz, with 18 MB of L3 cache each and 64 GB of main memory, running Linux x86_64. Each processor uses HyperThreading with 2 contexts per core. bt-2.2.1 was compiled with the
As reported in Figure
Maximum speedup obtained by executing different implementations of Bowtie2 on short reads datasets (see Table
Notice that the version with pinning and interleaving performs better in most cases. This latter version is used for the tests shown in the rest of the paper. Alignment tests with Roche 454 real and synthetic datasets can be found in [
As shown in Figures
Speedup comparisons among bt-2.2.1 and bt-FF on mixed length datasets (see Table
Speedup comparisons testing on short read dataset.
Speedup comparisons testing on long read dataset.
The performances of bt-2.2.1, bt-2.0.6, and bt-FF have also been compared on two uncorrected Human datasets, which exhibit long reads. Notice that the aim of this test is to assess the performance gain due to the high-level programming approach, that is, to compare bt-FF against bt-2.0.6 and bt-2.2.1, not to assess the absolute performance of Bowtie2 on long reads.
Table
Bt-2.2.1, Bt-2.0.6, and Bt-FF execution times on Human-Ref19-1 and Human-Ref19-2.
Tool | Best time | Speedup | Best time | Speedup |
---|---|---|---|---|
PacBio Human-Ref19-1 | PacBio Human-Ref19-2 | |||
bt-2.2.1 | 00:14:48 | 6.34 | 00:20:38 | 5.6 |
bt-2.0.6 | 00:02:20 | 16.30 | 00:03:15 | 15.61 |
bt-FF | 00:02:20 | 16.30 | 00:03:13 | 15.77 |
To exclude the influence of the sequencing error content of the sequences, we compared the speedup achieved by different tools (Bowtie2, BWA, and BLASR) on the PacBio
Figure
Performance comparison among bt-2.2.1, bt-FF, BLASR, BWA-MEM, and BWA-MEM-FF on
BLASR, for which a FastFlow version has not been developed, is reported for the sake of completeness. BLASR exhibits the very same parallel structure as the other two tools (see Section
In all cases, the efficiency of parallelisation hardly reaches the
Metric | bt-2.2.1 | bt-2.0.6 | bt-FF |
---|---|---|---|
CPUs utilised | 28.665 | 19.661 | 24.363 |
CPU-migrations | 1,363 | 3,513 | 57 |
Instructions per cycle | 0.19 | 0.98 | 1.03 |
L1-dcache-load-misses (% of all L1-dcache hits) | 42.46% | 32.91% | 32.14% |
LL-cache-load-misses (% of all LL-cache hits) | 80.87% | 58.66% | 67.91% |
Execution time (s) | 96.87 | 24.41 | 19.05 |
Further information explaining the performance differences between the versions of Bowtie2 can be extracted via
Table
To assess results across different platforms, the tools were also tested on a different platform, an Intel Sandy Bridge with two 8-core sockets (2 HyperThreads per core) @2.2 GHz and 20 MB of L3 cache, running Linux x86_64 (only on a subdataset of Human-Ref19-1, from 1 Kbp up to 5 Kbp). As in the previous tests, bt-2.2.1 was compiled with the
Execution time for each tool version (bt-2.2.1, bt-2.0.6, and bt-FF) on tested PacBio human subdatasets on the Intel Nehalem workstation.
Execution time for each tool version (bt-2.2.1, bt-2.0.6, and bt-FF) on tested PacBio sub-datasets on the Intel Sandy Bridge workstation.
In general, the Sandy Bridge platform is (slightly) higher clocked and, more importantly, exploits SSE/AVX instructions whose vector length is twice that of the Nehalem's SSE instructions. However, the quad-socket Nehalem platform exhibits an aggregate L3 cache of 72 MB (18 MB x 4), whereas the dual-socket Sandy Bridge has only 40 MB (20 MB x 2). For this reason, the working set of the 5 Kbp experiment fits in the Nehalem cache but not in the Sandy Bridge cache. Since Bowtie is a strongly memory-bound application, this impairs performance to such a large degree that it cannot be compensated by the faster processors of the Sandy Bridge. Likewise, bt-2.2.1 is generally slower than bt-2.0.6/bt-FF on the same experiment because it requires a larger working set.
In this paper, we analysed the problem of sequence alignment from a parallel computing perspective; we reviewed the design of three of the most popular alignment tools with parallel computing capabilities: Bowtie2, BWA, and BLASR. All these tools exploit a master-worker parallel orchestration paradigm to process the set of reads in parallel. Some of them also exploit SIMD parallelism to further accelerate the computation of a single task (i.e., a read) using SSE instructions. Each of the analysed tools implements its own version of the master-worker paradigm at a very low level of abstraction, specifically using blocking locks of the Posix Threads library or processor-specific atomic instructions.
We advocate high-level parallel programming as an alternative design strategy for next generation alignment tools. High-level parallel programming aims at reducing development and performance-tuning effort and enhances code and performance portability across different platforms. We demonstrated on two tools (Bowtie2 and BWA-MEM) that the pattern-based design not only simplifies tool engineering but also boosts the speedup of the application beyond the hand-tuned low-level original code. Just as nowadays no developer expects to get a performance advantage by coding an application in assembler, no developer should expect to get more speedup by low-level coding of a parallel application.
We ported Bowtie2 and BWA on top of the pattern-based FastFlow parallel programming framework for C++. The porting required altering a few lines of code (out of several tens of thousands), with an estimated programming effort of a few days. Also, the FastFlow-based versions of the tools proved easier to tune for maximum performance. In particular, scheduling policy, load-balancing strategy, and memory affinity are extrafunctional features of the FastFlow master-worker pattern. Leveraging these features, it has been possible to optimise the tools' parallel behaviour beyond the hand-optimised code of their original versions. As an example, in the case of Bowtie2, which is a memory-bound application, the key optimisation consists in improving the locality of memory accesses and the utilisation of shared memory bandwidth; in terms of programming effort, this just consists in configuring the master-worker pattern to adopt a memory-affine task scheduling.
High-level parallel programming is becoming the mainstream approach for a growing class of applications. Even though our results cannot be considered fully demonstrative of the correctness and efficiency of the parallel pattern applied, we can fairly state that the global structure of an aligner, from the parallelisation viewpoint, can always be mapped onto a master-worker pattern with the suggested optimisations. We do believe this can be an enabling feature for future generation sequence alignment and analysis approaches.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work has been partially supported by the Paraphrase (EC-STREP FP7 no. 288570) and REPARA (EC-STREP FP7 no. 609666) projects.