The all-pairs suffix-prefix matching problem is a basic problem in string processing. It has an application in de novo genome assembly, one of the major problems in bioinformatics. Because the input data are large, fast and space-efficient solutions are crucial. In this paper, we present a space-economical solution to this problem using the generalized Sadakane compressed suffix tree. Furthermore, we present a parallel algorithm that provides additional speed on shared-memory computers. Our sequential and parallel algorithms are optimized by exploiting features of the Sadakane compressed index data structure. Experimental results show that our solution based on Sadakane’s compressed index consumes significantly less space than solutions based on non-compressed data structures such as the suffix tree and the enhanced suffix array. They also show that our parallel algorithm is efficient and scales well with an increasing number of processors.
Given a set
With recent advances in high-throughput genome sequencing technologies, input sizes have become very large in terms of both the number of sequences and the length of fragments. This calls for faster and more memory-efficient solutions to the APSP problem.
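As a point of reference for the problem statement, the following is a minimal brute-force sketch in Python. It is not one of the algorithms presented in this paper; it only spells out the standard definition (for every ordered pair of distinct strings, the longest suffix of the first that is a prefix of the second). The function names are hypothetical, and the quadratic cost per pair is exactly what index-based solutions avoid.

```python
def longest_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def apsp_bruteforce(strings):
    """Overlap length for every ordered pair (i, j) with i != j."""
    return {
        (i, j): longest_overlap(a, b)
        for i, a in enumerate(strings)
        for j, b in enumerate(strings)
        if i != j
    }


# Small example set; only the pair (TTA, AAC) overlaps, by one character.
overlaps = apsp_bruteforce(["AAC", "GAG", "TTA"])
```

Such a baseline is useful mainly for checking the output of a faster implementation on small inputs.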
The APSP is a well-studied problem in the field of string processing. The first non-quadratic solution was introduced by Gusfield et al. [
Ohlebusch and Gog [
In an effort to reduce the space consumption of solving the problem, Simpson and Durbin [
In this paper, we present new methods based on the compressed suffix tree of Sadakane [
To further speed up the solution of APSP, we introduce different parallelization strategies for the sequential algorithm that can be used on multiprocessor shared-memory computers. Our parallelization methods exploit important features and available operations of Sadakane’s compressed suffix tree. Experimental results show that our method is efficient and scales well with the number of processors.
This paper is organized as follows. In Section
We write
The suffix tree of a string
Sadakane’s compressed suffix tree [
The
The
A generalized suffix tree for the string AAC#GAG$TTA%. The numbers below the leaves are the text positions for the string paths which are represented by these leaves.
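To make the relation between leaves and text positions concrete, the following Python sketch lists the suffixes of the concatenated string from the figure in lexicographic order, which is the left-to-right order of the leaves in a suffix tree whose children are sorted. It assumes 0-based text positions and ASCII ordering of the separators (`#` < `$` < `%` < letters); the figure may use different conventions. Sorting whole suffixes is used here only for illustration and is not how the compressed index is built.

```python
text = "AAC#GAG$TTA%"

# Text positions of all suffixes, ordered lexicographically (a plain suffix array).
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

for pos in suffix_array:
    # Each line corresponds to one leaf: its text position and its suffix.
    print(pos, text[pos:])
```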
The following BP functions are used in this paper.
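As an illustration of the kind of operations typically defined on a balanced-parentheses (BP) sequence, the following Python sketch gives naive linear-time versions of rank, leaf selection, matching-parenthesis search, and parent lookup. The function names and signatures are assumptions made for this sketch; a succinct implementation answers such queries in constant time using auxiliary directories rather than by scanning.

```python
def rank_open(bp: str, i: int) -> int:
    """Number of '(' in bp[0..i], inclusive."""
    return bp.count("(", 0, i + 1)


def select_leaf(bp: str, k: int) -> int:
    """Position of the '(' of the k-th leaf (k >= 1); a leaf is the pattern '()'."""
    pos = -1
    for _ in range(k):
        pos = bp.find("()", pos + 1)
        if pos == -1:
            raise ValueError("fewer than k leaves")
    return pos


def find_close(bp: str, i: int) -> int:
    """Position of the ')' matching the '(' at position i."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def enclose(bp: str, i: int) -> int:
    """Position of the '(' of the parent of the node opening at i, or -1 for the root."""
    depth = 0
    for j in range(i - 1, -1, -1):
        if bp[j] == ")":
            depth += 1
        elif depth == 0:
            return j
        else:
            depth -= 1
    return -1


# Root with two children: an internal node with two leaves, and a leaf.
bp = "((()())())"
```

With `select_leaf`, a scan can jump directly from one leaf to the next without visiting the parentheses in between.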
To solve the allpairs suffixprefix problem for a set of
We use an array
The algorithm of [
The compressed suffix tree provides all the information necessary to run the original algorithm of [
For filling the
In the case of using more than one character to encode a distinct separator, it is possible to have an internal node to which a terminal edge is pointing (usually only leaves have this possibility). Accordingly, the text position of a terminal leaf should be added to the
The text position for each leaf which has a terminal edge will be added to the
While
In the second stage, we scan the BP representation again from left to right, but this time we move parenthesis by parenthesis instead of jumping from leaf to leaf. We distinguish three cases.
Case 1: if the scanned node is a leaf and it is representing a starting position of a string
Case 2: if we scan an opening parenthesis for an internal node
Case 3: if we scan a closing parenthesis for an internal node
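The three cases can be told apart directly from the BP sequence. The Python sketch below only classifies what the scan encounters (a leaf, the opening parenthesis of an internal node, or the closing parenthesis of an internal node); the actions taken in each case are those of the algorithm and are not reproduced here. The function name and the event labels are hypothetical, and a leaf is consumed as the two-character pattern `()` for brevity.

```python
def classify_bp(bp: str):
    """Return (kind, position) events for a left-to-right scan of a BP sequence."""
    events = []
    i = 0
    while i < len(bp):
        if bp.startswith("()", i):
            events.append(("leaf", i))    # case 1: a leaf
            i += 2
        elif bp[i] == "(":
            events.append(("open", i))    # case 2: internal node is entered
            i += 1
        else:
            events.append(("close", i))   # case 3: internal node is left
            i += 1
    return events


events = classify_bp("((()())())")
```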
(1)
(2)
(3)
(4)
(5)
(6)
(7) Add the text position of
(8)
(9)
(10)
(11) firstchild = The position of the first child of the root in BP
(12)
(13)
(14)
(15)
(16) Sol
(17)
(18)
(19) Increment
(20)
(21)
(22) Push
(23)
(24)
(25)
(26) Pop
(27)
(28)
(29)
Algorithm
The correctness of the algorithm follows from the proof in [
The construction of the generalized suffix tree consumes
As a result, the solution requires
In terms of space, we need
It is clear the
Another variation for the first method is to keep the first stage as is, but in the second scan we check only the leaves. In this variation we will use the
The running time of the previous method can be improved based on the following two observations of [
All the distinct characters
The terminal leaves (suffixes) sharing a prefix of length
In this method, we scan the BP vector and move from leaf to leaf using the
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9) Sol[
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17) Sol
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27) Pop the stack[
(28) Pop l[
(29)
(30)
(31)
(32)
(33)
(34)
(35)
(36) Push(Stack[
(37) Push(
(38)
(39) Push(
(40)
(41)
(42)
(43)
(44)
(45)
(46)
As in the first method, we ignore parentheses which belong to separators using the Child, Rank, and Select functions. We move from leaf to leaf using the Select function (lines 1–3, Algorithm
To check if a leaf
If the text position of
There is one exception for that; if there is a suffix in
The definition of the same parent depends on the number of characters used to encode the separators; if more than one character is used, then the parent of a leaf is the closest ancestor which does not have a terminal edge (Figure
The second method has the same time complexity as the first method, since the construction of the tree requires
In this section, we introduce parallel versions of the above-described methods for solving the APSP problem. These versions target shared-memory multiprocessor computers. We consider two parallelization strategies: the first, which we will call
The generalized suffix tree is divided into
For Algorithm
For the first case, the subtrees can be easily processed independently in parallel. The
For the second case, we will have some
In Figure
Each processor is working on one branch of the generalized suffix tree for the string AAC#GAG$TTA%. The numbers below the leaves are the text positions for the string paths which are represented by these leaves.
In the previous algorithm, we cannot guarantee that the subtrees are of equal sizes. Therefore, we use two tricks. First, we select
Interestingly, the structure of the CSA allows a more robust strategy, which can lead to better performance. The idea is to distribute the load equally between processors by dividing either the leaves or the BP sequence between them. Each processor starts working from the starting point of its share. The situation is not simple, however; therefore, let us analyze the content of the stack for an internal node in the sequential case when the algorithm reaches that node. It can be observed that each stack contains exactly what was pushed when visiting the node’s ancestors. All other pushes are irrelevant, since each of them is followed by a matching pop before the node is reached.
Therefore, each processor can start from a specific point if its stacks are filled with the values which would be in the stacks if we reach this point while running the sequential algorithm.
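The observation can be checked on a small BP sequence. In the Python sketch below, which is a simplification that tracks only node positions rather than the values the algorithm actually pushes, the stack obtained by replaying the sequential scan up to a starting point equals the list of unmatched opening parentheses before that point, that is, the ancestors. Both function names are hypothetical, and the backward scan stands in for repeated parent queries on the compressed tree.

```python
def stack_by_scan(bp: str, start: int):
    """Stack contents just before position `start` in a sequential left-to-right scan."""
    stack = []
    for i in range(start):
        if bp[i] == "(":
            stack.append(i)   # push when a node is entered
        else:
            stack.pop()       # pop when the node is left
    return stack


def stack_by_ancestors(bp: str, start: int):
    """Same stack, rebuilt from the unmatched '(' before `start` (the ancestors)."""
    ancestors = []
    depth = 0
    for j in range(start - 1, -1, -1):
        if bp[j] == ")":
            depth += 1            # a closed subtree lies to the left; skip it
        elif depth > 0:
            depth -= 1            # '(' belonging to an already closed subtree
        else:
            ancestors.append(j)   # unmatched '(' : an ancestor
    ancestors.reverse()
    return ancestors


bp = "((()())(()()))"
mismatches = [
    p for p in range(len(bp) + 1)
    if stack_by_scan(bp, p) != stack_by_ancestors(bp, p)
]
```

Since the two agree at every position, a processor can prepare its stacks for an arbitrary starting point without replaying the work assigned to the processors before it.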
To apply this concept to the first algorithm, let us analyze its two stages. The first stage is relatively trivial; each leaf, if it has a terminal edge, should push its text position to the
In the second stage, BP vector will be divided equally between processors. Let
Each processor is working on its share of BP or the leaves. The stacks should be filled first for each processor before continuing with the algorithm.
In the second algorithm, the
It is clear that both techniques use
A summary for the discussed algorithms is shown in Table
Comparison between the two methods in terms of time and space complexity. Time and space used for output are ignored.
Algorithm  Used data structures  Time complexity  Space complexity 

First method  BP and CSA 


Second method  BP, LCP and CSA 


To compare our work with previously presented solutions, we downloaded a solution for the all-pairs suffix-prefix problem that uses Kurtz’s implementation of a standard suffix tree, and the implementation presented by Ohlebusch and Gog for an enhanced suffix array, from
SGA can be downloaded from
In our experiments, the implementation of Sadakane compressed tree presented by Välimäki et al. ([
In our solutions, the user can specify the parallel technique on the command line. For each algorithm, we implement both the bottom-up and the top-down parallelization techniques. The number of threads can also be given as a parameter. If the top-down technique is used, the number of threads should be
In our solution, all strings are concatenated together in one text to build a generalized suffix tree. To overcome the limitation of the number of separators, we used 3 characters to encode enough separators for
Our results are obtained by running against randomly generated as well as real data. The random data were generated by a program that outputs random
Data sets used in experiments. Sizes in megabytes.
Data Set  Type  Size  Number of strings 

Generated by 
Random data  10–300  100,000 
EST of 
Real data  167  334,465 
To test our parallel techniques, we used Amazon Web Services (AWS) to obtain an instance with 16 cores. Our parallel implementation uses the OpenMP library.
Experimental results demonstrate that the first method uses around one-third of the space used by a standard pointer-based suffix tree to solve the same problem, while the second method uses less than one-fifth of the space consumed by a standard suffix tree (see Figure
Comparison of space requirements for the three structures (standard suffix tree, enhanced suffix array, and Sadakane compressed suffix tree methods 1 and 2). In addition, the space consumed by the overlap stage in SGA is also shown. The used minimal length is 15.
However, this gain in space has some consequences. Figure
Comparison of time requirements for the three structures (standard suffix tree, enhanced suffix array, and Sadakane compressed suffix tree methods 1 and 2): we could not run the code to build a standard suffix tree for a text larger than 80 MB or an enhanced suffix array for a text larger than 90 MB. The time consumed by SGA in the overlap stage is also shown. The minimal length used is 15.
Despite the impressive space consumption of SGA, our solutions consume less time than SGA. In addition, the performance of SGA depends dramatically on two factors: the maximum length of a sequence and the minimal length of a match. Since the time complexity of our solution depends on
The parallel tests show the following: with random data, all techniques take around 24–26% and 9–11% of the time required by the sequential test, with 4 and 16 cores, respectively. Figures
Time requirements for solving APSP for random data with four different text lengths (10 MB, 50 MB, 100 MB, and 300 MB), using the top-down technique with 1, 4, and 16 cores.
Time requirements for solving APSP for random data with four different text lengths (10 MB, 50 MB, 100 MB, and 300 MB), using the bottom-up technique and various numbers of cores.
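The reported fractions of sequential time for random data can be converted into speedup and parallel efficiency. The short Python sketch below does this arithmetic for the stated ranges (24–26% on 4 cores, 9–11% on 16 cores), using the usual definitions speedup = 1 / fraction and efficiency = speedup / cores; the helper name is hypothetical, and since the reported percentages are approximate, so are the derived values.

```python
def speedup_and_efficiency(time_fraction: float, cores: int):
    """Speedup over the sequential run and efficiency per core."""
    speedup = 1.0 / time_fraction
    return speedup, speedup / cores


# Fraction of sequential time reported for each core count (low, high).
reported = {4: (0.24, 0.26), 16: (0.09, 0.11)}

for cores, (low, high) in reported.items():
    best, best_eff = speedup_and_efficiency(low, cores)
    worst, worst_eff = speedup_and_efficiency(high, cores)
    print(f"{cores} cores: speedup {worst:.1f}-{best:.1f}, "
          f"efficiency {worst_eff:.2f}-{best_eff:.2f}")
```

Under these definitions, the 4-core runs are close to linear speedup (about 3.8 to 4.2), while the 16-core runs reach a speedup of roughly 9 to 11, that is, an efficiency of about 0.57 to 0.69.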
With real data, the bottom-up technique achieves a speedup of 11–13% over the top-down technique. It is also noticeable that the second method [
Time requirements for solving APSP for real data (167,369,577 bytes), for both methods using the top-down and bottom-up techniques.
This paper provides two solutions for the all-pairs suffix-prefix problem using the Sadakane compressed suffix tree, which reduce the high space cost of the suffix tree. In spite of a significant slowdown in performance, the proposed solutions may be preferred when dealing with very large data because of their modest space requirements. To reduce the performance overhead, the paper presented static and new dynamic techniques to parallelize the proposed solutions. The bottom-up technique performs more efficiently when real data is used, while both techniques perform equally well with random data. The presented solutions are not limited to cases with a small number of strings. SGA is superior in terms of space, but it consumes more time than the presented solutions and does not handle long sequences. The paper has demonstrated that it is beneficial to use an enhanced suffix array to solve APSP. It could be worthwhile to explore solving the problem using a compressed suffix array and a compressed longest common prefix (LCP) array by adapting the algorithm presented by Ohlebusch and Gog, which makes the topic a good subject for future study.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This publication was made possible by NPRP Grant no. 414541233 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.