Long Read Alignment with Parallel MapReduce Cloud Platform

Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing genes sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. Therefore, in this paper, we introduce Burrows-Wheeler Aligner's Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. The proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 considering BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms.


Introduction
Bioinformatics involves the biological, genomic, statistics, mathematics, and computer science disciplines of study. Analysis of genomic and biological sequence data is one of the important parts of bioinformatics [1]. The analysis of genomic sequences enables us to understand the genetic structure, and its bases of drug response and disease. Numerous researchers who are working in this interdisciplinary field perform gene structure and functionality analysis in order to discover new gene sequences and understand gene origin. For the analysis purpose, existing genomic databases such as Google Genomics [2] and NCBI [3] (generally in 100's of GB in size) are used to identify the similarities. Identification of similarities/dissimilarities is achieved by sequence comparison algorithms. Comparisons of biological sequences produce matching alignments and similarity scores. These similarity scores represent the similarities/dissimilarities between the considered biological sequences. The matching alignments and similarity scores are used for secondary structure predictions and multiple sequence alignments which are highly complex operations that rely on the accuracy of the comparison algorithm used. Applications related to cancer research, forensics, agrigenomics, genetic disease identification, microbial research, reproductive health, human wholegenome sequencing, and many more rely on sequence alignment algorithms for analysis.
Genomic sequencing data is obtained from the developed Next-Generation Sequencing (NGS) technologies (e.g., from Illumina, Solexa, and Pacific Bioscience). Most of the genomic data available consists of millions of genomic sequences of short reads [4]. With the advances in sequencing technologies, millions of sequences with greater read lengths > 1000 bp are now being generated. Based on genomic data, the sequencing tools can be basically classified into two categories: 2 BioMed Research International (1) Short read aligners: used to align genomic sequences whose read length is between 32 bp and 200 bp, that is, BWA [4] and Bowtie2 [5].
(2) Long read aligners: used to align genomic sequences whose length is greater than 1000 bp, that is, Illumina.
Existing bioinformatics applications require both management of huge amounts of data and heavy computation for analysis. Nearly 3 billion US dollars and 10 years were required to produce the initial human reference genome containing about 3.5 billion base pairs. The latest developments in sequencing technologies available produce enormous amount of data in terms of gigabytes per run [13]. In NGS large sample size applications, users are required to wait for sufficient resources to become available in which the time needed to complete processing becomes unpredictable. Analyzing such enormous amount of data and its storage are problems that exist. To address these high computation needs, grid or cluster computing platforms have been provided to researchers [14]. The provided grid or cluster computation platforms are constrained by hardware capacity and concurrent access capacity to support multiple users. The ever growing gap between computing capabilities and sequencing throughput is presented in [15].
Using cloud platforms, one can solve the storage and computing problems that exist in gene sequencing. Cloud computing platforms provide flexibility, scalability, and ondemand access to resources. Moreover, adoption of cloud computing technologies eliminates the costs incurred in establishing, maintaining large physical storage systems, and computing clusters. Cloud users pay for the utilized storage and computing resources without worrying about maintenance, availability, and reliability related issues. Although cloud computing provides scalable and flexible infrastructure, parallel computing models on cloud platforms/infrastructure are required to achieve the desired goal of analyzing gene sequences. One such framework for the cloud, called MapReduce model [16], was introduced by Google. Another model developed by University of California at Berkeley proposed the Spark platform [17] for cloud computing. The Pregel framework for cloud environments is presented in [18]. A comprehensive survey of most recent research and development in cloud-based service computing solutions in research of genome informatics is available in [19,20]. In research of genomic analysis, there are a number of cloud computing providers that offer cloud-based services solutions such as Google Cloud Genomics [3], Amazon [21], and Microsoft Azure [22]. The necessity for cloud computing for genomic analysis has been well discussed by leaders in bioinformatics and computational biology [23]. Galaxy [24] is a powerful web-based system for performing genome analysis tasks and one of many models that manage bioinformatics databases. Galaxy built on Amazon's Elastic Compute Cloud (EC2) with CloudMan framework to assist researchers executes their own Galaxy server on cloud infrastructure. However, Galaxy has data storage and data manipulation bottlenecks for large datasets with capability of analyzing one sample at a time and does not utilize the elastic cloud compute capabilities completely. This drawback is caused by its dependence on a single shared file system. A significant bottleneck has been reported in [25] when processing large datasets across distributed compute resources. Among all the cloud computing platforms available, Hadoop MapReduce is by far the most popular choice due to its ease to tune, open source nature, and acceptability by industry and academic organizations. In order to utilize the full potential of cloud infrastructure, public cloud service providers offer virtualized resources in terms of both hardware and software [26]. The virtualization enables user specific customization, flexibility, application execution environments, and cost efficiency and minimizes power consumption [27,28].
The Hadoop framework [29] adopts a MapReduce model for computing a user specific application on cloud platforms. In the MapReduce model, a dual phase execution approach is adopted. In the initial phase, input data to be processed is split into chunks. Each chunk is associated with a mapper or a map worker that provides ⟨Key | Value⟩ pairs as outputs.
The outputs values are sorted on the basis of the Key values associated. The sorted values are provided to reduce workers, that is, ⟨Key | SortedList(Value)⟩. Reduce workers store the results in Hadoop distributed file system. The map and reduce workers are generally virtual machines (VMs) in public cloud environments. A simple MapReduce model deployed on the VM based computing environment is shown in Figure 1.
Next-Generation Sequencing (NGS) tools like CloudAligner [13], Cloud Burst [30], SeqMapReduce [31], and Crossbow [32] adopt the Hadoop framework. The major drawback of these alignment tools is that they are short read aligners. The short read aligners prove to be efficient when single-gap or ungapped alignment is to be computed. Considering long reads, these alignment algorithms exhibit performance degradation and affect accuracy proved by the results presented in [10]. For alignment considering long sequence reads, optimization of the BLAST algorithm and its deployment on Hadoop platform is presented in [33]. For long read aligners, the BWA-SW algorithm in [10] has been found to be efficient and suitable. In [34], the Bwasw-Cloud algorithm considering Hadoop platform is presented. The Bwasw-Cloud adopts BWA-SW algorithm for alignment. Based on the literature reviewed, it is evident that limited work has been carried out considering long read aligners required to support analysis of genomic data generated from current and future sequencing technologies. There is an increasingly urgent need to provide a reliable and scalable support for the ever-increasing convergence of NGS data. However, there is clearly a lack of standardized and affordable NGS management solutions on the cloud to support the growing needs of translational genomics research [20]. This is a major motivating factor for the authors of this paper.
. . . iterative process and Hadoop incurs considerable performance overheads when iterative applications are hosted on the framework [18,35]. The aligners presented in [33,34] need to rely on multiway joins as genomic sequences are split and needed to be merged prior to the reduce phase. The performance of Hadoop suffers when multiway joins are considered [36]. The Hadoop framework considers sequential processing of the map followed by reduce stage that affects performance [37]. Therefore, in this paper, we present Burrows-Wheeler Aligner's Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) to perform long read alignments on a cloud platform. BWASW-PMR adopts the BWA-SW presented in [10] to perform long read alignments of genomic data. MapReduce model is considered for the execution on the public cloud environment. The Smith-Waterman (SW) algorithm [38,39] in BWA-SW is optimized using parallel computation technique. To overcome the drawbacks of Hadoop MapReduce, a parallel execution strategy of the map and reduce workers is considered in BWASW-PMR. The map and reduce functions executions on the VM based worker nodes are parallelized to reduce execution time. The aligner presented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison. The major contribution of our work can be summarized here: (i) Optimization of Smith-Waterman in the BWA-SW long read sequence alignment. (ii) A custom MapReduce framework to support the required computations for long sequence alignment. (iii) Parallel map and reduce workers execution strategy.
(iv) Parallel execution of the map and reduce functions at worker nodes.
This paper is organized as follows. Section 2 presents related work background. Section 3 discusses the proposed Burrows-Wheeler Aligner's Smith-Waterman Alignment on the Parallel MapReduce, BWASW-PMR. The SW algorithm and its considered optimization are also presented in Section 3. The results and the experimental study are shown in Section 4. Comparison notes with related work counterparts are discussed in the penultimate section. The concluding remarks and future work are discussed in the last section.

Smith-Waterman (SW) Algorithm. Burrows-Wheeler
Aligner's Smith-Waterman (BWA-SW) Alignment relies on SW algorithm to align the seed matches of sequences. Let 0 and 1 represent two genomic sequences obtained from the seed matches. The Smith-Waterman algorithm computes the similarity matrix score initially. Using the similarity matrix and backtracking technique, optimal alignment is obtained. Let X represent the size of 0 and Y represents the size of column of are initialized to 0. The remaining elements of (indexed by ( , )) are computed using where ( , ) is a function providing values of exact match or mismatch; that is, if 0 [ ] = 1 [ ] then ( , ) = (identical characters among sequences 0 and 1 ); otherwise, ( , ) = (if characters among sequences 0 and 1 are unique). and represent the matrices in accordance with the affine gap model. The matrices and are used to determine the gaps and are defined as where and represent the first and successive gap penalty.
Backtracking algorithm is used to find the optimal alignment between 0 and 1 . The backtracking begins with a cell in that holds the highest score and proceeds till a zero valued cell is reached. Smith-Waterman algorithm provides optimal alignments and similarity scores. It must be noted that computation of matrix is bounded by the running time of the slowest task due to the dependencies it exhibits.

Burrows-Wheeler Aligner's Smith-Waterman (BWA-SW)
Alignment Algorithm. BWA-SW algorithm constructs a full-text index in minute space (FM-index) [40] of the query sequence and the reference sequence . A prefix directed acyclic word graph ( ) is built using sequence .
is built using the sequence . The prefix trie is a tree representation of sequence . The tree is constructed by concatenating all edge symbols from any node in the graph as a route to the root node providing a unique string. The unique string obtained is a substring of the sequence . Each node of the tree is represented using V . Traversing from nodes generates strings that are lexicographically sorted. A node in represents a string. The V of the node and the string it represents are considered to be equivalent to each other. From , nodes exhibiting identical or similar V are collapsed to form . Each node of established represents one or more substrings of . Figure 2 represents , , and of the input string "BANANA$".
Let PT( ) and DG( ) represent two functions that are used to form and graph. In the BWA-SA algorithm, PT( ) and DG( ) are initially computed. Let be the root node of PT( ), and represents the root node of DG( ). The best score between the sequences and is computed using a dynamic programming mechanism. Initialize = = = null considering the root nodes of PT( ) and DG( ). For each of the parents, that is, , of the th node in DG( ), computation of | , | , and | is considered. The computation of | is where is the gap open penalty and is the gap extension penalty. | is computed using where is the parent of the th node in PT( ). The computation of | is achieved using where ( , ; , ) represents the score between the symbol on edge ( , ) and ( , ). The computation of , , and is defined as all other cases, where = arg max ∈ ( ) | and ( ) is a set containing parent nodes of . Variable represents the best matching score between the substrings, that is, substring and substring . Consider > 0 when the substrings and match. It is known that the Smith-Waterman algorithm provides accurate alignment results but requires large computation time. To optimize the computations, the reverse postorder traversal scheme on DG( ) and PT( ) is adopted. The dynamic programming mechanism in the BWA-SW algorithm enables identifying the seed matches of the genomic sequences. Based on the partial matches, the Smith-Waterman Alignment is considered to the extended matches.
V , that is, ( , ), is formed when the best score, that is, , is high and V size of is within the threshold set. The seed matches are derived from V by analyzing the suffix array of sequences and . Matching of the extended seed matches using Smith-Waterman algorithm is considered only when high matching scores are observed.

Proposed Burrows-Wheeler Aligner's Smith-Waterman Alignment on the Parallel MapReduce, BWASW-PMR
BWASW-PMR provides a cloud platform to perform long read alignments considering genomic data obtained from NGS techniques. The BWASW-PMR adopts the MapReduce computation model for cloud computation. To support scalability in BWASW-PMR, map and reduce worker nodes are deployed on a cloud cluster consisting of VMs. For long read alignments, the BWA-SW algorithm is adopted. The advantages of adopting the BWA-SW algorithm for long read alignments and its advantages over the existing aligners are found in [10]. The BWASW-PMR considers the genomic sequence alignment in dual phases, that is, map and reduce phase. Existing long read aligners for cloud platforms adopt the Hadoop framework. In the Hadoop framework based solutions, the map phase is executed and then the reduce phase is initiated. To overcome the drawbacks of Hadoop, a parallel execution strategy of the map and reduce phases is considered. Optimization of the Smith-Waterman algorithm is an additional feature considered in BWASW-PMR. Execution of the map and reduce functions is modelled to run in parallel utilizing all programming cores available in worker VMs.

Overview and Preliminaries.
Let represent a reference genomic sequence and the query sequence. BWASW-PMR is deployed on a cloud platform comprising a master node, map worker nodes, and reduce worker nodes. The master node of BWASW-PMR initializes map and reduce worker nodes using virtual computing nodes. Each computing node or VM is assumed to have CPU cores available for computation. Let T VM Config represent the time taken to initialize the virtual computing platform. The sequence is split into chunks with overlapping sections (the overlapping sections are depicted in grey in Figure 2). The reference chunk and are sent to map worker nodes as input key, value pairs. The key value pairs considering the chunk are represented as ( , V ), where is a key and V contains the overlapping offset data. The key value pair of is represented as ( , V ), where is a key and V is the query sequence. In each of the map workers, sequence is further split into chunks and stored in the local memory available. Alignment is performed using the BWA-SW algorithm considering and each in a parallel fashion utilizing the cores. The Smith-Waterman algorithm in BWA-SW is optimized by a parallelization technique to reduce execution time. Let T MAP represent the average execution time of the map worker nodes. The map workers after computation provide  Reduce worker nodes obtain intermediate data, that is, list( , V ), to perform the shuffle and sort function. In the reduce phase, aggregation of all alignment locations, that is, list(V ), that are nonredundant and nonoverlapping is considered. The reduce operation can be defined as reduce( , list(V )) → list(V ). Let T REDUCE represent the average time required by reduce worker nodes to perform the shuffle, sort, and reduce computation. The total makespan of the BWASW-PMR cloud platform to align the sequence against is computed as The overview of the BWASW-PMR cloud platform is shown in Figure 3.
In the map phase of the BWASW-PMR cloud platform, alignment locations and corresponding scores between and chunk are computed. The required data, that is, and chunk , is obtained from the cloud memory. Let t Get Data M represent the time taken to obtain the data. To reduce computation time using parallel computing techniques, the genomic sequence is split into chunks. Let t represent the time taken to split sequence into chunks. Sequence alignment considering one chunk of and is performed using the BWA-SW algorithm. DG( ) and PT( ) are initially constructed. The seed matches of sequences and are computed using a dynamic programming mechanism.
The seeds computed are extended to ensure rule alignment of the genomic sequences using the SW algorithm. To reduce computation time, parallelization of the SW algorithm is considered in BWASW-PMR. The parallelization technique to optimize computation is discussed in the latter subsection.
Let t BWASW represent the time taken to align th chunk and using the BWA-SW algorithm. The total time taken to align the total chunks is ( × t BWASW ). As computing cores are available with each worker node, the parallel computation of number of chunks of is possible. The computation time considering all chunks and utilizing cores is defined as (( × t BWASW )/ ). The alignment locations and scores obtained are stored in the cloud for reduce workers. The time taken to store this data per map worker is represented as t Store Data M . The makespan of the th map worker node can be defined as The average execution time of all the map worker nodes is defined as 3.2. Reduce Phase. The master node in BWASW-PMR initializes reduce worker nodes and map worker nodes simultaneously. This parallel initialization mechanism enables reducing the total makespan T. In the reduce phase, the shuffle  Parallelization of the reduce function is achieved by utilizing all computing cores available with worker nodes. Let t Fn R / represent the time taken to compute the reduce function utilizing the computing cores. The reduce function provides the alignment results between sequences and that are written to cloud storage for further analysis. Let t Store Data R represent the time taken by the th reduce worker node to write alignment results into the cloud storage. The makespan of the th reduce worker node can be defined as The average makespan of the reduce worker nodes, that is, T REDUCE , is defined as Using (11) and (9), the total makespan of the BWASW-PMR can be defined as From (12), it can be observed that T (the computing time or makespan of BWASW-PMR) depends on the computation time of map and reduce worker nodes. The core alignment steps (based on the BWA-SW algorithm) are performed by the map worker nodes, that is, T REDUCE ≪ T MAP . The time taken to initialize the map and reduce computing clusters based on VMs, that is, T VM Config , is dependent on the cloud platform considered for deployment and can be neglected. Therefore, it can be stated by the total makespan: Optimizing the SW algorithm in BWA-SW is a possible solution to reduce the total makespan T.

Smith-Waterman Algorithm Optimization.
In the SW algorithm, computation of the similarity matrix score, that is, , requires the maximum time. Adopting a parallelization technique for computation of is considered in BWASW-PMR. Let us consider a query sequence and reference sequence , whose sizes are 4 and 8, respectively. Therefore, the matrix to be computed is of size (5,9) and is shown in Figure 4(a). Data dependencies that exist in computing each element of make the parallelization technique difficult to implement. For computation of 3,4 (shown as black lines in Figure 4(a)), it requires that the computation of 2,3 , 2,4 , and 3,3 must be completed. Based on the sequences and , and values of 2,3 , 2,4 , and 3,3 , the value of 3,4 is obtained. The computation of is more often done serially as shown through the grey arrows in Figure 4(a). To parallelize the computation of , the adoption of CUDA/GPU based techniques was considered by researchers [33,41]. However, the availability and cost of such computing platforms on public clouds are an issue. To maximize resource utilization available with the VM worker node at minimal costs in BWASW-PMR, wavefront based parallelization technique [42] is considered. The parallel computation technique adopted is shown by grey diagonal lines in Figure 4(b). Computation of all cells of that fall under the diagonal lines can be done in a parallelized fashion.  The parallelization technique adopted to compute in SW algorithm enables reducing the makespan of map worker nodes, that is, T MAP .

Evaluation: Experiments
In this section, we study the performance of BWASW-PMR cloud platform. BWASW-PMR is developed using C++ and C#.Net and is deployed on the Azure cloud platform. BWASW-PMR adopts the BWA-SW algorithm with the optimization of SW algorithm. Experiments have been conducted to study the performance of the optimized SW algorithm. The performance of BWASW-PMR is compared with Bwasw-Cloud to perform long read alignments on the cloud platform. All genomic data considered for the experiments are obtained from NCBI database [3] that is publically available.

SW Optimization Analysis.
Optimization of the SW algorithm in BWA-SW aligner for BWASW-PMR is a novel approach considering cloud deployment. To analyze performance of the optimized SW algorithm, comparison with the standard SW algorithm (hereafter referred to as the SW algorithm in this section) available within BWA-SW is considered. The optimization is achieved using a wavefront parallelization technique. The optimized version of SW is hereafter referred to as "SW-Optimized" in this section. Uniform SW parameter settings (i.e., gap penalty, matching, and mismatching score) are considered for analysis. For analysis, the sequences and of equal lengths are considered. Equal length is considered to maximize the computations required for alignments. Section of Homo sapiens chromosome 15, GRCh38.p2 Primary Assembly (NC 000015.10), is considered as the reference . Query sequences considered are obtained from the influenza virus database [43]. The query sequences considered are summarized in Table 1.
Execution time of SW-Optimized and SW algorithm is noted for the four alignment pairs described. The results obtained are graphically shown in Figure 4. A logarithmic (Base 10) representation of the execution is considered in Figure 5. As sequence lengths for alignment are increased, the execution time of SW and SW-Optimized increases. The execution time of SW-Optimized is lower when compared to the considered serial SW algorithm. All experiments are executed on a Quad Core Intel i7 machine with 8 GB of RAM. An average reduction in the execution time of about 91.8% is observed. Results obtained prove SW-Optimized considered in BWASW-PMR outperforms the classical SW algorithm.   [44]. Experiments using a reference genomic sequence and five query sequences of varied lengths are considered. The experiments conducted with the reference and query genomic sequences are summarized in Table 2. Makespan or total execution time is noted and the results obtained are shown in Figure 6. The results obtained prove that the proposed BWASW-PMR cloud aligner deployed on Azure outperforms Bwasw-Cloud deployed on Hadoop. In experiment 1, the speed-up achieved for BWASW-PMR is about 4.5. For longer sequence alignments, that is, in experiment 5, the speed-up was observed to be 7.5. As query length increases, the performance of BWASW-PMR improves. An average speed-up of 6.7 is achieved considering BWASW-PMR when compared to the Bwasw-Cloud.     execution are studied to derive observations and results. The total makespan observed, that is, T, is shown in Figure 7. The results prove that BWASW-PMR exhibits lower makespan time when compared to Bwasw-Cloud. As lengths of sequence reads increase, execution time increases due to the increase in alignment length sequences considered. Long sequence alignment considering BWASW-PMR gains a speed-up of 1.33 when compared to the Bwasw-Cloud aligner. The parallel executions of map and reduce phases along with SW optimization are the main contributing factors to the speed-up observed in this study. Detailed analysis of the experimental data reveals parallel execution of map function and SW optimization aid in reducing map makespans. The major alignment computations are carried out in the map phase, considering both BWASW-PMR and Bwasw-Cloud. An average reduction in the map phase makespan across all the experiments achieved by BWASW-PMR against Bwasw-Cloud is about 30%. The accumulation of the results, that is, aggregation of the alignment locations, is carried out in the reduce phase. No major computation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud. The parallelization of the reduce function adopted in BWASW-PMR achieves an average reduction of 9.3% in T REDUCE when compared to Bwasw-Cloud. To prove efficiency in adopting a parallel map and reduce execution environment for the BWASW-PMR, makespans of the tasks executed at the map and reduce worker nodes are noted. The results obtained are shown in Figure 8. In Figure 8(a), the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown. The makespans of the worker nodes in BWASW-PMR deployed on Azure are shown in Figure 8(b). From Figure 8, it is clear that the uniform tasks executed on BWASW-PMR exhibit lower makespans when compared to the execution on Bwasw-Cloud. In Figure 8(a), the reduce phase is initialized when all the map workers have completed their jobs. In BWASW-PMR, the reduce workers are running in parallel. The reduce phase is initiated when one or more of the map worker nodes have completed their jobs.

Experiments considering BWASW-PMR Cloud and
The SW optimization considered in BWASW-PMR reduces execution time of sequences to be aligned based on the SW algorithm. The experimental study and the results obtained prove that BWASW-PMR is capable of aligning long sequences using the cloud computing platform.

Comparison Notes with Related Work Counterparts
In this section, a comparison of BWASW-PMR with existing cloud aligners for long sequence reads is presented.
Comparisons with Bwasw-Cloud [34], MapReduce-BLAST [33], CloudAligner [13], and the BWA-SW [10] aligners are considered. The BWA-SW sequence aligner [10] for long reads is memory hungry (high RAM requirements). Moreover, execution on the cloud platform is not considered. The CloudAligner [13] adopts the seed-and-extend algorithm for sequence alignment. The BWA-SW and BLAST also adopt similar approach. Though the seed-and-extend based aligners are fast, they suffer from low accuracy [10] and are more suitable for short read alignments. To achieve accurate alignment results considering long reads, the SW algorithm is adopted in the BWA-SW. MapReduce-BLAST [33] is Table 3: Comparison of BWASW-PMR with its related work counterparts.