FACC : A Novel Finite Automaton Based on Cloud Computing for the Multiple Longest Common Subsequences Search

Searching for the multiple longest common subsequences (MLCS) has significant applications in areas such as bioinformatics, information processing, and data mining. Although a few parallel MLCS algorithms have been proposed, their efficiency and effectiveness are not satisfactory for biologic data of increasing complexity and size. To overcome the shortcomings of the existing MLCS algorithms, and considering that the MapReduce parallel framework of cloud computing is a promising technology for cost-effective high-performance parallel computing, a novel finite automaton (FA) based on cloud computing, called FACC, is proposed under the MapReduce parallel framework, so as to obtain a more efficient and effective general parallel MLCS algorithm. FACC adopts the ideas of matched pairs and finite automata: it preprocesses the sequences, constructs successor tables and a common subsequence finite automaton, and then searches for the MLCS. Simulation experiments on a set of benchmarks from both real DNA and amino acid sequences have been conducted, and the results show that the proposed FACC algorithm outperforms the current leading parallel MLCS algorithm FAST-LCS.


Introduction
Searching for MLCS is a classic dynamic programming problem. Let Σ be a finite or infinite alphabet, and X = x1 x2 … xm be a finite sequence of symbols drawn from Σ, that is, xi ∈ Σ for i = 1, …, m. A sequence Z = z1 z2 … zk is called a subsequence of X if there exist indices 1 ≤ i1 < i2 < … < ik ≤ m such that zj = x_{ij} for j = 1, …, k, that is, Z = x_{i1} x_{i2} … x_{ik}. For two given sequences X and Y, Z is called a common subsequence (CS) of X and Y if and only if Z is simultaneously a subsequence of both X and Y. When no other common subsequence is longer than Z, Z is named the longest common subsequence (LCS) of X and Y, and |LCS(X, Y)| denotes the length of the LCS of X and Y. Wang et al. [17–19] developed the efficient MLCS algorithms parMLCS and Quick-DP, respectively, based on the dominant-point approach; these reach a near-linear speedup for a large number of sequences. It is worth mentioning that Yang et al. [20], as a new attempt, developed an efficient parallel algorithm on GPUs for the LCS problem; regrettably, however, the algorithm is not suitable for the general MLCS problem.
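The definitions above lend themselves to a direct executable check. The following Python sketch (an illustration of the definitions, not part of FACC; the function names are ours) tests the subsequence relation and computes |LCS(X, Y)| for two sequences with the classic dynamic program:

```python
def is_subsequence(z, x):
    """True iff z is a subsequence of x: the characters of z
    appear in x in order, not necessarily contiguously."""
    it = iter(x)
    return all(ch in it for ch in z)  # 'in' consumes the iterator, enforcing order

def lcs_length(x, y):
    """|LCS(X, Y)| via the classic O(|x|*|y|) dynamic program."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

For the pair of sequences used as a running example later in this paper, `lcs_length("abcfg", "bacgf")` returns 3.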
To meet the needs of practical applications, some researchers have also studied variations of the LCS problem, such as the longest common increasing subsequence (LCIS) problem, the longest increasing subsequence (LIS) problem, and the common increasing subsequence (CIS) problem. Fredman [21] proposed an algorithm for the LIS problem with an optimal time complexity of O(n log n) when the average length of the sequences equals n. By combining LCS with LIS, Yang et al. [22] defined the common increasing subsequence (CIS) and designed a dynamic programming algorithm for the two-sequence CIS problem with space complexity O(mn). Brodal et al. [23] presented an algorithm for finding an LCIS of two or more input sequences. For two sequences of lengths m and n, where m ≥ n, the time and space complexity of the algorithm are O(m + nl log log σ + Sort) and O(m), respectively, where l is the length of an LCIS, σ is the size of the alphabet, and Sort is the time needed to sort each input sequence.
Nevertheless, the aforementioned algorithms have the following disadvantages: (1) most of them are inapplicable to problems with more than two sequences, especially with a considerable number of sequences, a large alphabet, and a long average sequence length; (2) the efficiency and effectiveness of the few parallel algorithms remain to be improved; (3) parallel implementations of the algorithms are difficult because of their complicated concurrency, synchronization, and mutual exclusion, that is, none of the existing algorithms employs a simple and cost-effective high-performance parallel computing framework such as MapReduce; (4) most of the algorithms do not provide an abstract and formal description revealing the inherent properties of the MLCS problem. To overcome these shortcomings, a novel finite automaton based on cloud computing for the MLCS problem is proposed in this paper. The main contributions of this paper are as follows.
(1) All common subsequences (CS) of the n sequences are abstracted as a language over their common alphabet Σ, that is, every CS of the n sequences is a sequence over Σ.

The rest of this paper is organized as follows. Section 2 introduces some notations and concepts used in this paper for convenience of discussion. Section 3 presents the finite automaton (Atm) for recognizing common subsequences and its basic properties. Section 4 proposes a new algorithm, the finite automaton based on cloud computing (FACC), and describes its implementation in detail. Section 5 analyzes the time and space complexity of FACC. Section 6 reports the experiments and analyzes the results. Finally, Section 7 concludes the research.

Notations and Basic Concepts
For convenience, the following notations are adopted in Table 1.
Note that a sequence over some alphabet is a finite sequence of symbols drawn from that alphabet. According to formal language and automaton theory, we can view the common subsequences of all sequences in the set T as a language L over the common alphabet Σ, and then regard the MLCS of the sequences as the one or several longest sentences of the language L. Based on this idea, a novel finite automaton Atm which recognizes/accepts L is designed; the Atm is constructed quickly by the new constructing-searching algorithm proposed in this paper under the MapReduce parallel framework of cloud computing, and the MLCS can then be obtained easily.
For easy understanding, some basic concepts are introduced in the following, and the properties of the Atm are discussed. For example, for the two sequences S1 = abcfg and S2 = bacgf, one can get the following matched pairs by Definition 2.1: (1,2), (3,3), (4,5), (2,1), and (5,4), with their corresponding characters being a, c, f, b, and g, respectively.
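The matched pairs of this example can be enumerated mechanically. The sketch below (illustrative only; the function name is ours, not the paper's) generates every matched pair of a list of sequences by combining, for each common character, the positions at which it occurs:

```python
from itertools import product

def matched_pairs(seqs):
    """All matched pairs (Definition 2.1): 1-based position vectors
    (p1, ..., pd) such that every sequence carries the same character
    at its respective position."""
    common = set(seqs[0]).intersection(*(set(s) for s in seqs[1:]))
    pairs = []
    for ch in sorted(common):
        # positions of ch in each sequence, 1-based
        positions = [[i + 1 for i, c in enumerate(s) if c == ch] for s in seqs]
        pairs.extend((p, ch) for p in product(*positions))
    return pairs
```

For S1 = abcfg and S2 = bacgf this yields exactly the five pairs listed above.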
Definition 2.2. For the sequences S1, S2, …, Sd from T, let p = (p1, p2, …, pd) and q = (q1, q2, …, qd) be two matched pairs. One calls p = q if and only if pi = qi for i = 1, 2, …, d. If p ≠ q and qi < pi for i = 1, 2, …, d, one calls p a successive matched pair of q, denoted as q < p. If q < p and there does not exist a matched pair r = (r1, r2, …, rd) for S1, S2, …, Sd such that q < r < p, one calls p a direct successive matched pair of q, denoted as q → p.

Definition 2.3. For the sequences S1, S2, …, Sd from T, let q = (q1, q2, …, qd) be a matched pair. If there does not exist a matched pair p = (p1, p2, …, pd), p ≠ q, such that p < q, one terms q an initial matched pair. In general, there may be more than one initial matched pair for T.
Taking the above sequences S1 and S2 as an example, the matched pairs (3,3), (4,5), and (5,4) are successive matched pairs of the matched pairs (1,2) and (2,1). Moreover, the matched pair (3,3) is a direct successive matched pair of the matched pairs (1,2) and (2,1), and (1,2) and (2,1) are the only two initial matched pairs of the sequences S1 and S2.
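The successor and initial-pair relations of Definitions 2.2 and 2.3 can likewise be written down directly. The following sketch (our helper names, assuming matched pairs are given as position tuples) reproduces the example:

```python
def precedes(q, p):
    """q < p (Definition 2.2): every coordinate strictly increases."""
    return all(qi < pi for qi, pi in zip(q, p))

def is_direct_successor(q, p, pairs):
    """q -> p: q < p and no matched pair r satisfies q < r < p."""
    return precedes(q, p) and not any(
        precedes(q, r) and precedes(r, p) for r in pairs)

def initial_pairs(pairs):
    """Matched pairs with no strict predecessor (Definition 2.3)."""
    return [q for q in pairs if not any(precedes(p, q) for p in pairs)]

# the matched pairs of S1 = abcfg and S2 = bacgf
pairs = [(1, 2), (2, 1), (3, 3), (4, 5), (5, 4)]
```

Here (3,3) comes out as a direct successive matched pair of (1,2), while (4,5) does not, because (3,3) lies strictly between them.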
Based on the above definitions, the following conclusion can be easily inferred.

Lemma 2.4. The total number of all possible initial matched pairs is less than or equal to |Σ|, regardless of |T|.

Finite Automaton Atm for Recognizing Common Subsequence and Its Basic Properties
It can be seen from the above discussion that the characters in the MLCS of T must be characters corresponding to matched pairs. In the following, based on ideas and concepts from finite automata, we view the matched pairs of T as the states of the Atm and construct an Atm that recognizes/accepts all the CSs of T by defining a specific transition function. The formal definition of the Atm is as follows.
Definition 3.1. The common subsequence finite automaton Atm is a 5-tuple Atm = (Q, Σ, δ, Q0, F). For all a ∈ Σ and Qi ∈ Q, the transition function δ is defined as follows: δ(Qi, a) = Qj if Qj is the direct successive matched pair of Qi whose corresponding character is a, and δ(Qi, a) is undefined otherwise. It can be seen from Definition 3.1 that the Atm is a deterministic finite automaton (DFA), but it differs from a normal DFA: the Atm can be partial, and every state in the Atm can be an initial or final state.
What follows are the formal definitions of L(Atm) and the MLCS recognized/accepted by the Atm.

Definition 3.2. For the Atm = (Q, Σ, δ, Q0, F) defined by Definition 3.1, a character sequence x ∈ Σ* is said to be recognized/accepted by the Atm if and only if δ(Q0, x) ∈ F. Based on Definition 3.2, we can easily deduce the following conclusion.
That is, the MLCS is the set of the longest sequences of L(Atm), the language recognized/accepted by the Atm.
With Definition 3.2, we can obtain the following properties.
Theorem 3.4. For all S ∈ MLCS, let β be the matched pair corresponding to the ith (i > 1) character ci of the sequence S, and let α be the matched pair corresponding to the (i − 1)th character ci−1. Then β must be a direct successive matched pair of α. Furthermore, the first character of S must be the character of an initial matched pair.
Proof. We give the proof by contradiction. For S ∈ MLCS, assume that β is not a direct successive matched pair of α. Then there is a matched pair γ (γ corresponds to a character c) such that α < γ < β. So we can insert c between ci−1 and ci to get a longer common subsequence, denoted MLCS', which contradicts the fact that S is a longest common subsequence.
Next, consider the matched pair λ corresponding to the first character of S. If λ is not an initial matched pair, then according to Definition 2.3 there must exist a matched pair ω with ω < λ. So we can get a longer sequence MLCS' by inserting the character corresponding to ω at the head of S, which again contradicts the fact that S is a longest common subsequence.

Theorem 3.5. The Atm is a directed acyclic graph (DAG).
Proof. The theorem will be proved by contradiction. Suppose that there were a series of states Qi, Qj, and

The Overview of the MapReduce Parallel Framework of Cloud Computing
For convenience, we first briefly review the MapReduce parallel framework of cloud computing.
Cloud computing is a new computing model [24] that develops from distributed computing, parallel computing, and grid computing. MapReduce is a parallel framework of cloud computing and an associated implementation for processing and generating large datasets. It is amenable to a broad variety of real-world tasks and has been widely recognized as a cost-effective high-performance parallel computing model. In this model, based on the "divide and conquer" principle, users only specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules intermachine communication to make efficient use of the network and disks [25, 26]. Figure 1 shows the execution framework of MapReduce [26].
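The map/shuffle/reduce contract can be imitated in-process. The toy sketch below (our own illustration; a real MapReduce job runs distributed on a cluster such as Hadoop) applies the pattern to the character-counting task that FACC's preprocessing stage relies on:

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    """Apply the user-supplied mapper to each record, emitting (key, value) pairs."""
    return chain.from_iterable(mapper(r) for r in records)

def shuffle(pairs):
    """Group intermediate values by key; the framework does this between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key's group of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Count, for every character, how many sequences contain it.
mapper = lambda seq: [(ch, 1) for ch in set(seq)]
reducer = lambda key, values: sum(values)
counts = reduce_phase(shuffle(map_phase(["abecadbca", "acbafcgba"], mapper)),
                      reducer)
```

With the two example sequences used later in the paper, the characters a, b, and c receive count 2 (they occur in both sequences) and all other characters receive count 1.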

The Proposed Algorithm: FACC
In this section, before describing the details of the proposed FACC, we first give its framework as follows.
(1) Preprocessing. Determine the common alphabet Σ of the sequence set T, and preprocess every sequence of T as follows: filter the redundant characters from each sequence, that is, remove every character that does not appear in all the sequences of T (i.e., every character not in Σ). This ensures a quick search for the MLCS of T later.
(2) Construction of successor tables. Based on the MapReduce parallel framework of cloud computing, the successor tables (see Definition 4.1 in Section 4.2) of every preprocessed sequence of T are constructed in parallel. Since any character in the MLCS must correspond to a matched pair, we construct a successor table for each sequence so that all possible matched pairs of T can be found quickly.
(3) Construction of the finite automaton Atm for recognizing/accepting common subsequences. Based on the MapReduce parallel framework, using the matched pairs as states and adding an initial state, we construct the Atm, which recognizes/accepts all the CSs of the sequences of T according to its transition function, with each state recording all of its parent states during the construction of the Atm.
(4) Traversal of the Atm and output of all the MLCS. Search for the MLCS by traversing the Atm in a depth-first manner.
In the following, the proposed FACC and its implementation based on MapReduce will be described step by step in detail.

Preprocessing
Recall that the MLCS of T consists of sequences over the common alphabet Σ. The goal of the preprocessing is to reduce the search time by filtering from each sequence the redundant characters, which do not appear in Σ. After preprocessing, we obtain sequences that contain only characters in Σ. Since preprocessing incurs some time and space cost, the proposed FACC adopts this procedure only when the alphabet Σ is large or unknown.
The idea of the preprocessing is to design a Key-Value table (i.e., a Map data structure), where the Key represents a character α and the Value is the total number of sequences containing the character α. For N sequences, the Value corresponding to Key α equals k if and only if exactly k sequences contain the character α; obviously, k ≤ N. In this situation, we say the value of Key α is k. By this definition, the characters whose value is N constitute the alphabet Σ. Then all sequences are filtered in parallel using the MapReduce framework of cloud computing, and only the characters in Σ are reserved. The sequence obtained from Si after the filtering is denoted Si′ for i = 1, …, N.
Algorithm 1 shows the pseudocode of the preprocessing algorithm. For example, for the sequences S1 = abecadbca and S2 = acbafcgba, the alphabets of the two sequences are Σ1 = {a, b, c, d, e} and Σ2 = {a, b, c, f, g}, respectively. After preprocessing S1 and S2, we get Σ = {a, b, c}, S1 is converted into S1′ = abcabca, and S2 is converted into S2′ = acbacba.
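A minimal sequential sketch of this preprocessing step (Algorithm 1 itself distributes the work over Map functions; the function name below is ours):

```python
def preprocess(sequences):
    """Compute the common alphabet Sigma and filter every sequence down to it.
    A character belongs to Sigma iff its count in the Key-Value table equals
    the number of sequences, i.e., it occurs in every sequence."""
    counts = {}
    for seq in sequences:
        for ch in set(seq):          # count each character once per sequence
            counts[ch] = counts.get(ch, 0) + 1
    sigma = {ch for ch, k in counts.items() if k == len(sequences)}
    filtered = ["".join(ch for ch in seq if ch in sigma) for seq in sequences]
    return sigma, filtered

sigma, (s1, s2) = preprocess(["abecadbca", "acbafcgba"])
```

On the paper's example this reproduces Σ = {a, b, c}, S1′ = abcabca, and S2′ = acbacba.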

Successor Table
Let Tabk denote the successor table of the sequence Sk; its definition is as follows.
Definition 4.1. For a sequence Sk = x1 x2 x3 … xn over the alphabet Σk = {σ1, σ2, …, σt}, the successor table Tabk of the sequence Sk is an irregular two-dimensional table whose element in the ith row and jth column is denoted Tabk(i, j) and defined as follows: Tabk(i, j) is the minimal position p > j of the sequence Sk such that xp = σi; if no such position exists, the entry is left empty.
For the two sequences S1′ = abcabca and S2′ = acbacba, Tables 2 and 3 show their successor tables. To construct the successor tables for T, we dispatch |T| Map functions to build the successor table of each sequence of T in parallel, and then employ a Reduce function to aggregate the successor tables of the sequences of T.
Because the irregular successor tables store only the useful information and are constructed in parallel, they can considerably reduce the time and space complexity of searching for the MLCS.
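A sketch of the successor-table construction of Definition 4.1 (sequential here; FACC builds one table per Map function). Positions are 1-based, columns run j = 0, …, n, and an absent successor is stored as None, where the paper's tables simply omit the entry:

```python
def successor_table(seq, alphabet):
    """Tab[sigma][j] = smallest 1-based position p > j with seq[p] == sigma,
    or None when sigma does not occur after position j."""
    n = len(seq)
    table = {}
    for sigma in alphabet:
        row = [None] * (n + 1)
        nxt = None
        for j in range(n, -1, -1):   # right-to-left scan
            row[j] = nxt             # nearest occurrence strictly after j
            if j > 0 and seq[j - 1] == sigma:
                nxt = j
        table[sigma] = row
    return table

tab1 = successor_table("abcabca", "abc")
```

For S1′ = abcabca, the row for a reads 1, 4, 4, 4, 7, 7, 7 (and empty at column 7), matching the occurrences of a at positions 1, 4, and 7.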
With the constructed successor tables, a direct successive matched pair of a matched pair can be obtained quickly. Take Tab1 and Tab2 as examples. When searching for the successive matched pairs of a matched pair (i, j), all we need to do is search for the matched pairs among Tab1(i) and Tab2(j) (where Tab1(i) and Tab2(j) stand for all the elements of the ith and jth columns of Tab1 and Tab2, respectively), and then remove all the matched pairs that are not direct successive matched pairs. For example, by checking Tab1 and Tab2 in this way, the direct successive matched pairs of a given matched pair can be located directly.

Constructing the Common Subsequence Atm of T
By Definition 3.1 and the aforementioned properties of the Atm, we can build the common subsequence Atm of the sequence set T in parallel with MapReduce. The Build-Atm algorithm is shown in Algorithm 2.
With the algorithm shown in Algorithm 2, taking the two sequences S1′ and S2′ as an example, the main construction process of the Atm is as follows (the process also applies to multiple sequences of T): (1) construct a virtual initial state (0,0) corresponding to the matched pair (0,0) with character ε; (2) determine all the direct successive matched pairs of the matched pair (0,0); (3) use the depth-first search (DFS) method to construct the Atm.
Notice that each state in the Atm must remember all of its parent states during the construction of the Atm. For example, the two sequences S1′ and S2′ illustrated in Figure 2 have a direct successive matched pair (1,1) of the matched pair (0,0). Because the matched pair (1,1) corresponds to the character a, we get the state transition δ((0,0), a) = (1,1). The final common subsequence Atm constructed by Algorithm 2 for the example of Figure 2 is shown in Figure 3, where the states (0,0) and (7,7) are the initial and final states of the Atm of the sequences S1′ and S2′, respectively.
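A compact sketch of this construction (our simplification of Algorithm 2: for each character a, the direct successive matched pair of a state is taken componentwise from the successor tables, and the parent bookkeeping is omitted):

```python
def successor_table(seq, alphabet):
    """Tab[a][j]: smallest 1-based p > j with seq[p] == a (None if absent)."""
    tab = {a: [None] * (len(seq) + 1) for a in alphabet}
    for a in alphabet:
        nxt = None
        for j in range(len(seq), -1, -1):
            tab[a][j] = nxt
            if j > 0 and seq[j - 1] == a:
                nxt = j
    return tab

def build_atm(seqs, alphabet):
    """DFS from the virtual initial state (0, ..., 0); delta(state, a) is the
    componentwise next occurrence of a in every sequence."""
    tabs = [successor_table(s, alphabet) for s in seqs]
    start = tuple(0 for _ in seqs)
    delta, states, stack = {}, {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            nxt = tuple(tabs[i][a][q[i]] for i in range(len(seqs)))
            if None in nxt:
                continue              # a cannot extend any CS from state q
            delta[(q, a)] = nxt
            if nxt not in states:
                states.add(nxt)
                stack.append(nxt)
    return start, delta, states

start, delta, states = build_atm(["abcabca", "acbacba"], "abc")
```

On the running example this reproduces the transition δ((0,0), a) = (1,1) mentioned above, and the final state (7,7) is reachable.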
Algorithm Build-Atm(Pos, dsucSet). Input: Q0, the initial state of the Atm, (0, 0, …, 0); tabSet, the set of successor tables for T over Σ. Output: the common subsequence Atm.

Traversing Atm and Finding the MCLS
By finite automata theory and Definitions 3.1 and 3.2, a character sequence corresponding to a path from Q0 to a state of F in the Atm is a candidate longest common subsequence of T, named LCS′. Hence, the longest of all the candidate sequences are exactly the elements of the MLCS of T. We first design a set named resultSet to store the expected MLCS. Then, by the depth-first search method, the Atm is traversed from Q0 to every state of F in parallel under the MapReduce scheme.
Once we get a candidate LCS′, we perform the following operations: if resultSet is not empty and its elements are longer than LCS′, ignore LCS′; if its elements have the same length as LCS′, insert LCS′ into resultSet; otherwise, clear resultSet and then insert LCS′ into it. Because resultSet is a set, it eliminates redundant elements automatically; hence, we eventually acquire the MLCS from resultSet. In the example of the sequences S1′ and S2′, all the MLCSs are {abcba, ababa, acaca, acbca, acaba, abaca}.
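As an independent check of this example (not the FACC traversal itself), the set of all distinct longest common subsequences of two sequences can be enumerated by backtracking over the classic LCS dynamic-programming table:

```python
def all_lcs(x, y):
    """Return the set of all distinct LCSs of x and y."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))

    def back(i, j):
        # Collect every LCS of the prefixes x[:i], y[:j].
        if i == 0 or j == 0:
            return {""}
        if x[i - 1] == y[j - 1]:
            return {s + x[i - 1] for s in back(i - 1, j - 1)}
        result = set()
        if dp[i - 1][j] == dp[i][j]:
            result |= back(i - 1, j)
        if dp[i][j - 1] == dp[i][j]:
            result |= back(i, j - 1)
        return result

    return back(m, n)
```

For S1′ = abcabca and S2′ = acbacba this yields exactly the six length-5 sequences listed above.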

Time Complexity
In what follows, we first give the time complexity of each stage of FACC, and then the total time complexity.

Preprocessing
It is necessary to traverse every sequence of T once in order to find the common alphabet; therefore, the time complexity of this step is O(|S1| + |S2| + ⋯ + |Sd|),

Constructing the Successor Table
To build the successor tables of all the preprocessed sequences, it is necessary to traverse the sequences once again. Thus, this stage requires the same time complexity as the preprocessing stage under the MapReduce scheme.

Constructing Atm
According to Theorem 3.6, the upper bound for the number of the Atm's states is ∑_{a∈Σ} ∏_{i=1}^{d} |Si|_a + 1,

Space Complexity
Because the storage of the sequences and successor tables is static and proportional to the size of T, its space complexity is O(|S1| + ⋯ + |Sd|). For building the Atm, the storage space is proportional to the number of states; hence, the space complexity for building the Atm is O(∑_{a∈Σ} ∏_{i=1}^{d} |Si|_a).

Dataset and Experimental Results
To fairly test the time performance of FACC and FAST-LCS, we ran the two algorithms on the same hardware platform. Using the DNA and amino acid sample sequence datasets provided by NCBI [27] and DIP [28], we tested the proposed FACC algorithm on a Hadoop cluster with 4 worker nodes, each of which contains 2 Intel X5550 2.67 GHz CPUs, 8 GB of RAM, and 24 GB of local disk allocated to HDFS. Each node in the cluster ran Hadoop version 0.20.0 on Red Hat Enterprise Linux Server release 5.3 and was connected to a commodity switch by 100 M Ethernet. The FAST-LCS algorithm was run on 4 × 2 Intel X5550 2.67 GHz CPUs and 8 GB of RAM, using the same datasets and operating system; the programming environment of both algorithms is JDK 1.7. The time performance comparisons between FACC and FAST-LCS are shown in Tables 4 and 5 and Figures 4 and 5.

Discussion of Experimental Results
Tables 4 and 5 compare various performance indices of FAST-LCS and FACC on 20 DNA sequences (|Σ| = 4) and 20 amino acid sequences (|Σ| = 20) with different lengths of input sequences. The performance indices of FACC are superior to those of FAST-LCS; in particular, the precision of FACC (shown in the Precision column, defined as the ratio of the number of found MLCSs, N_MLCS, to the total number of MLCSs) reaches 100%, while that of FAST-LCS reaches only 95% because of its incorrect pruning operation [2]. Moreover, Tables 4 and 5 and Figures 4 and 5 show that the time performance of FACC considerably outperforms that of FAST-LCS, and with increasing lengths of input sequences the advantage of FACC over FAST-LCS grows significantly. Figure 6 shows the time performance of the proposed FACC with and without preprocessing. It can be seen from Figure 6 that the time performance with preprocessing is obviously superior to that without preprocessing, especially for a large alphabet and a long average sequence length. Furthermore, the larger the number of input sequences, the more efficient the proposed FACC.
In summary, the time performance of the proposed algorithm FACC is much better than that of FAST-LCS.

Conclusions
The efficiency and effectiveness of the existing parallel algorithms for searching for the MLCS are not satisfactory for biologic data of increasing complexity and size; moreover, these algorithms give no abstract and formal description of the MLCS problem and adopt complicated parallel schemes. To overcome these shortcomings, we have proposed a novel finite automaton based on the MapReduce parallel framework of cloud computing, called FACC. The proposed algorithm builds on the MapReduce parallel programming framework of cloud computing, matched pairs, and finite automata (FA), using efficient techniques such as preprocessing, constructing efficient successor tables and the common subsequence Atm, and searching for the MLCS. The theoretical analysis of the proposed algorithm shows that its time and space complexities are max{O(n), O(|Q|)} and O(|Q|), respectively, which is superior to the leading parallel MLCS algorithms. Moreover, simulation experiments of the proposed algorithm on real DNA and amino acid sequence sample datasets were performed, and its performance was compared with that of one of the current leading algorithms, FAST-LCS. The experimental results show that the proposed algorithm is efficient and effective, and its performance is much better than that of FAST-LCS, especially for a large alphabet and a considerable number of long sequences. Meanwhile, the experimental results also verify the correctness of our theoretical analysis.

Figure 3: The common subsequence Atm of the sequences S1′ and S2′.
Here |Si| is the length of the sequence Si and d = |T| is the number of sequences of T. It is also necessary to traverse every sequence of T to filter out the characters not in Σ, which also requires O(|S1| + ⋯ + |Sd|) time. Assuming all the sequences in T have the same length n and the number of Map functions is d, the total time complexity of this stage is O(n) under the MapReduce scheme.

Here |Si|_a stands for the number of times the character a appears in the sequence Si. Because the time complexity of constructing the Atm is proportional to the number of the Atm's states |Q|, the time complexity of this stage is O(|Q|) under the MapReduce scheme. Thus, the total time complexity of FACC is O(n) + O(n) + O(|Q|) = max{O(n), O(|Q|)}.

On the other hand, in recursively constructing the Atm, the average recursion depth is log(∑_{a∈Σ} ∏_{i=1}^{d} |Si|_a), so the temporary space required for constructing and traversing the Atm is O(log ∑_{a∈Σ} ∏_{i=1}^{d} |Si|_a). It follows that the total space complexity of FACC is O(∑_{a∈Σ} ∏_{i=1}^{d} |Si|_a).

Figure 4: The time performance comparison between FACC and FAST-LCS on 20 DNA sequences (|Σ| = 4) with lengths from 25 to 180.

Figure 5: The time performance comparison between FACC and FAST-LCS on 20 amino acid sequences (|Σ| = 20) with lengths from 25 to 320.

Table 1: Notations and their meanings.
Σ* — the Kleene closure of Σ: Σ* = {x | x is a sequence over Σ of zero or more symbols}. L(Atm) — the language recognized/accepted by the Atm.

Definition 2.1. Suppose Sk ∈ T is a sequence over Σ for k = 1, …, d. Let Sk(pk) ∈ Σ denote the pk-th (pk = 1, 2, …, |Sk|) character of the sequence Sk. For the sequences S1, S2, …, Sd from T, the vector p = (p1, p2, …, pd) is called a matched pair of the sequences if and only if S1(p1) = S2(p2) = ⋯ = Sd(pd).

(Continuation of the proof of Theorem 3.5.) …Qk such that Qj is the successive state of Qi, Qk is the successive state of Qj, and Qi is the successive state of Qk. Because the matched pairs pi, pj, and pk correspond to the states Qi, Qj, and Qk, we get pi < pj, pj < pk, and pk < pi, which contradicts Definition 2.2. Therefore, the Atm is a directed acyclic graph.

Theorem 3.6. ∑_{a∈Σ} ∏_{i=1}^{d} |Si|_a + 1 is an upper bound on the number of the Atm's states, where |Si|_a represents the number of occurrences of a in the sequence Si.

Table 4: Performance comparison between FAST-LCS and FACC on 20 DNA sequences (|Σ| = 4) with lengths from 25 to 180.

Table 5: Performance comparison between FAST-LCS and FACC on 20 amino acid sequences (|Σ| = 20) with lengths from 25 to 320. Note: TPR indicates the ratio of the running time of FAST-LCS to that of FACC in Table 5.