An Efficient Algorithm for LCS Problem between Two Arbitrary Sequences

The longest common subsequence (LCS) problem is a classic computer science problem. For the essential problem of computing LCS between two arbitrary sequences s1 and s2, this paper proposes an algorithm takingO(n+r) space andO(r+n2) time, where r is the total number of elements in the set {(i, j)|s1[i] = s2[j]}.The algorithm can bemore efficient than relevant classical algorithms in specific ranges of r.


Introduction
The longest common subsequence (LCS) problem is a classic computer science problem and still attracts continuous attention [1][2][3][4].It is the basis of data comparison programs and widely used by revision control systems for reconciling multiple changes made to a revision-controlled collection of files.It also has applications in bioinformatics and many other problems such as [5][6][7].For the general case of an arbitrary number of input sequences, the problem is NP-hard [8].When the number of sequences is constant, the problem is solvable in polynomial time [9].For the essential problem of computing LCS between two arbitrary sequences ( 2 ), the complexity is at least proportional to the product of the lengths of sequences according to the conclusion as follows.
It is shown that unless a bound on the total number of distinct symbols [author's note: the size of alphabet] is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings [9].
The sizes of lengths of sequences make the quadratic time algorithms impractical in many applications.Hence, it is significant to design more efficient algorithm in practice.This paper is confined to LCS 2 and is to present an algorithm that can be more efficient than relevant classical algorithms in specific scenarios.
The following introduction is also confined to the case of two input sequences.Chvátal and Sankoff (1975) proposed a Dynamic Programming (DP) algorithm of ( 2 ) space and time [10].It is the basis of the algorithms for LCS problem.Soon in the same year, D.S. Hirschberg (1975) posted a Divide and Conquer (DC) algorithm that is a variation of the DP algorithm taking () space and ( 2 log ) time [11].In 2000, Bergroth, Hakonen, and Raita contributed a survey [12] that shows in the past decades there is no theoretically improved algorithm based on Hirschberg's DC algorithm [11] as it is so brilliant.In 1977, Hirschberg additionally proposed an ( +  log ) algorithm and an (( + 1 − ) log ) algorithm where  is length of LCS [13].The first one is efficient when  is small, while the other one is efficient when p is close to .Both of the two algorithms are more suitable when the length of LCS can be estimated beforehand.Then, Nakatsu, Kambayashi, and Yajima (1982) in [14] presented an algorithm suitable for similar sequences and having bound of (( −  + 1)) and (( −  + 1) log ).Let the two sequences be 1 and 2.Same in 1977, Hunt and Szymanski proposed an algorithm taking () space and (( + ) log ) time, where  is the total number of elements in the set {(, )|1[] = 2[]} [15].The algorithm reduces LCS 2 to longest increasing subsequence (LIS) problem.Apostolico and Guerra (1987)   an algorithm based on [15] taking time ( log + log log ), where  is the number of dominant matches (as defined by Hirschberg [13]) and  is minimum of  and the alphabet size.Further, based on [16], Eppstein (1992) in [17] proposed an ( log  +  log log min(, /)) algorithm when the problem is sparse.If the alphabet size is constant, Masek and Paterson (1980) in [18] proposed an ( 2 / log 2 ) algorithm utilizing the method of four Russians (1970) [19]; Abboud, Backurs, and Williams (2015) in [20] showed an ( 2− ) algorithm where  > 0. ( 2 ( log log )/ log 2 ) algorithms are also proposed by Bille and Farach-Colton (2008) in [21] and Grabowski (2014) in [22], each of which has its own prerequisite.Restrained by the conclusion of [9,20], in these decades an extensive amount of research keeps trying to achieve lower complexity than ( 2 ) of computing LCS between two condition-specific sequences for different applications, which also can be found in the survey [12].For computing the length of LCS between two sequences over constant alphabet size, Allison and Dix (1986) presented an algorithm of ( 2 /), where  is the word-length of computer [23].This algorithm uses bit-vector formula with 6 bitwise operations.Although falling into the same complexity class as simple ( 2 ) DP algorithms, this algorithm is faster in practice.Crochemore, Iliopoulos, Pinzon, and Reid (2001) in [24] proposed a similar approach whose complexity is also ( 2 /).Due to the fact that only 4 bit-wise operations are used by the bit-vector formula, this approach gives a practical speedup over Allison and Dix's algorithm.
Compared with Chvátal-Sankoff algorithm [10], Hirschberg algorithm [11], and Hunt-Szymanski algorithm [15], most of the other algorithms for LCS problem between two sequences have more dependency, such as the following: the length of LCS is estimable beforehand [13,14], two input sequences are similar [14,16], problem is sparse enough [17], or the alphabet size is finite [16,18,20].Some algorithms give speedup over classical algorithms in engineering [23,24].In this paper, an algorithm of ( + ) space and ( +  2 ) time is proposed for  2 , where  is the total number of elements in the set {(, )|1[] = 2[]} assuming the two arbitrary sequences are 1 and 2.The algorithm also reduces LCS 2 to longest increasing subsequence (LIS) problem.Compared with relevant classical algorithms, the algorithm can be more efficient in specific range of .
This paper is organized as follows.In Section 1, the current state of algorithms for LCS problem between two sequences including LCS 2 is introduced.The proposed algorithm of this paper is presented and exemplified in Section 2, where preliminary terminologies needed to understand most of the paper and the theoretical basis of the proposed algorithm are also given.In Section 3, efficiency of the proposed algorithm is analyzed.

Algorithm
The longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences.This subsequence is not necessarily unique or not required to occupy consecutive positions within the original sequences (e.g.,  is a longest common subsequence between  and ).(1, 2) is a defined function that returns a set containing all the LCSes between two sequences, while the longest increasing subsequence (LIS) is a subsequence of a given sequence in which the subsequence's elements are in sorted order, lowest to highest, and in which the subsequence is as long as possible.This subsequence is not necessarily contiguous, or unique (e.g., {1, 2, 3} is a longest increasing subsequence of {1, 4, 2, 3}).() is also a defined function that returns a set containing all the LISs of a sequence.Assume 1 =  and 2 = .For all 1[] = 2[], assume there is a sequence , of which the elements are vectors in the form of (, ) (see Figure 1).The left part of an element of  ([][0]) is the position of a symbol in 1, and the right part of the element ([] [1]) is the position of the symbol in 2. is sorted according to [][0] as the first key in ascending order and according to [] [1] as the second key in descending order.Define with (1, 2), it is bijective mapping between () and (1, 2).Hence, (1, 2) can be reduced to () [25].According to the theoretical basis, Algorithm 1 is proposed for  2 .The algorithm is designed to reduce LCS 2 to LIS problem.

Example.
Reuse 1 = , 2 =  that are given previously.The process of computing LCSes between 1 and 2 using Algorithm 1 is illustrated in Figure 2 and presented as follows.
Scan  from left to right.The right part of 2 2
The right part of 4] is kept unchanged, and the rest of the elements 2  [5] and 2  [6] are not going to be checked.
The rest of the elements of  can be computed in the same way.Figure 2(d) is the final result of  and .
From the auxiliary data , it can be seen that there is only one LIS in .The length of the LIS is 4.

Complexity.
According to the conclusion of [15] (paragraph 3 page 4), we have the following.
Step 1 [author's note: similar to step 1 of Algorithm 1 of this paper] can be implemented by sorting each sequence while keeping track of each element's original position.We may then merge the sorted sequences creating the MACHLISTs [author's note: similar to array  of this paper] as we go.This step takes a total of ( log ) time and () space.
Assume  is the number of match vectors between 1 and 2.
Step 1 of Algorithm 1 is a process of ( + ) space and (max(,  log )) time.As the length of LCS is (), step 3 is a process of () space and () time.Step 4 takes () space and () time.Write operations in 2  for all element of  are listed together in Figure 2(c).In  and  (see Figure 2(d)), the time of write operation is .In 2  , the time of write operation of dark gray block is ; the time of write operation of light gray block is at most ∑ −1 =1  = ( − 1)/2 = ( 2 − )/2, which is illustrated in Figure 3. Therefore, step 2 takes (+) space and ( + ( 2 − )/2) time.Complexities of every step of Algorithm 1 are listed in Table 1.The whole algorithm takes ( + ) space and ( + ( 2 − )/2) = ( +  2 ) time, which is dominated by step 2.

Comparison with Hunt-Szymanski Algorithm.
As the original position in 2 of each element of  is not used in the process of computing, in Figure 4 Hunt-Szymanski algorithm needs to utilize binary search to locate the position in 2  for write operation for each element of .The time of binary search in 2  of Hunt-Szymanski algorithm is at most ∑  =1 log  + ( − )log , which is illustrated in Figure 5.Using Stirling's approximation [26][27][28], ∑  =1 log  + ( − )log  = log∏  =1  + ( − )log  = log(!)+ ( − )log  ≈  log  + ( − )log  =  log .If the demand is only returning one LCS or the length of LCS, array  of the algorithm proposed in this paper can be replaced with the MATCHLIST that is used in Hunt-Szymanski algorithm.Therefore, the algorithm proposed in this paper can take () space that is the same as the one Hunt-Szymanski algorithm takes.The main difference between them is the time consumed in 2  .In Figure 3, the total time of write operation of both dark gray and light gray blocks is at most  + ( 2 − )/2.As 0 ⩽  ⩽  2 , if  + ( 2 − )/2 <  log  ⇒ ( 2 − )/2(log  − 1) <  ⩽  2 , the algorithm proposed in this paper is more efficient in time than Hunt-Szymanski algorithm (see Figure 7).

Comparison with
Chvátal-Sankoff Algorithm.Chvátal-Sankoff algorithm needs  2 times of comparison in  2 space, which is illustrated in Figure 6.To simplify the analysis, only the  + ( 2 − )/2 time consumed in 2  of the algorithm proposed in this paper is going to be compared with the  2 time of Chvátal-Sankoff algorithm.As 0 ⩽  ⩽  2 , if  + ( 2 − )/2 <  2 ⇒ 0 ⩽  < ( 2 + )/2, the algorithm proposed in this paper is more efficient in time than Chvátal-Sankoff algorithm (see Figure 7).In this case of , the proposed algorithm is also more efficient in space than Chvátal-Sankoff algorithm.

Figure 3 :
Figure 3: Maximum write operation of light gray block in 2  of Alggorithm 1.

Table 1 :
Complexity of each procedure of Algorithm 1.