

BIN FU, MING-YANG KAO, AND LUSHENG WANG
The similar closest string and substring problems were proved to be NP-hard [3,10]. Some approximation algorithms have been proposed. Li, Ma, and Wang [13] gave an approximation scheme for the closest string and substring problems. The related consensus patterns problem is as follows: given n sequences s_1, ⋯, s_n, find a region of length L in each s_i and a string s of length L so that the total Hamming distance from s to these regions is minimized. Approximation algorithms for the consensus patterns problem were reported in [12]. Furthermore, a number of heuristics and programs have been developed [1,8,9,16,20].
In many applications, motifs are faint and may not be apparent when only two sequences are compared but may become clearer when more sequences are compared at the same time [6]. For this reason, it has been conjectured that comparing more sequences at the same time can help with identifying faint motifs. In this work, we give the first analytical proof for this conjecture. This is a theoretical approach with a rigorous probabilistic analysis.
We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ. A motif G = g_1 g_2 ⋯ g_m is a string of m characters. A probabilistically generated approximate copy of G is implanted into each background sequence. For an approximate copy b_1 b_2 ⋯ b_m of G, every character b_i is probabilistically generated such that the probability that b_i ≠ g_i, an event called a mutation, is at most α. This model was first proposed in [16] and has been widely used to experimentally test motif discovery programs [1,8,9,20]. We note that a mutation in our model converts a character g_i in the motif into a different character b_i with no further probability restriction than the upper bound of α. In particular, a character g_i in the motif may become any character b_i in Σ − {g_i} with unequal probabilities.
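As a concrete illustration of this model, the following sketch generates one such sequence: a random background string with a mutated copy of G implanted at an arbitrary position. All function and parameter names here are hypothetical (this is not the paper's code), and for simplicity each motif character mutates with probability exactly α to a uniformly random different character, which is one particular instance of the "at most α, arbitrary target" model.

```python
import random

def generate_theta_sequence(G, n, alpha, sigma, rng=random.Random(0)):
    """Generate one Theta_alpha(n, G)-sequence: a length-n string over the
    alphabet `sigma` containing an approximate copy of the motif G."""
    m = len(G)
    # Background: n - m independent, uniformly random characters.
    background = [rng.choice(sigma) for _ in range(n - m)]
    # Approximate copy: each character mutates with probability alpha
    # into a uniformly random different character (one instance of the model).
    copy = [rng.choice([a for a in sigma if a != g]) if rng.random() < alpha else g
            for g in G]
    # Implant the copy at an arbitrary position of the background.
    pos = rng.randrange(len(background) + 1)
    return ''.join(background[:pos] + copy + background[pos:])

sigma = 'abcdefghij'
s = generate_theta_sequence('abcabc', 30, 0.1, sigma)
print(len(s))  # 30
```

With alpha = 0 the implanted region is an exact copy of G, which is a convenient sanity check when experimenting with the model.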
We design an algorithm that for a reasonably large k can discover the implanted motif with high probability. Specifically, we prove that for α < 0.1771 and any constant x ≥ 8, there exist constants t_0, δ_0, δ_1 > 0 such that if the length of the motif is at least δ_0 log n, the alphabet has at least t_0 characters, and there are at least δ_1 log n_0 input sequences, then in O(n^3) time the algorithm finds the motif with success probability at least 1 − 1/2^x, where n is the longest length of any input sequence and n_0 ≤ n is an upper bound for the length of the motif. When x is considered as a parameter of order O(log n), the parameters t_0, δ_0, δ_1 do not depend on x. We also show some lower bounds that imply that our conditions for the length of the motif and the number of input sequences are tight to within a constant multiplicative factor. This algorithm's time complexity depends on the length of the input sequences but is independent of the number of the input sequences. This is because for a fixed x, Θ(log n) sequences are sufficient to guarantee a probability of at least 1 − 1/2^x of discovering the motif. In contrast to the NP-hardness of other variants of the common substring problem, motif discovery is solvable in O(n^3) time in this probabilistic model.
Our algorithm is a deterministic algorithm with a provably high probability of returning the exact motif. The only source of randomness for the algorithm is the randomness in the input sequences. The algorithm extracts similar consecutive regions among multiple sequences while tolerating noise. The algorithm needs the motif to be long enough, but it does not need the length of the motif as an input.
In section 2, we elaborate on our model of sequence generation and discuss some basics. We give a brief description of our main algorithm, Find-Noisy-Motif, in section 3. We set up some parameters and constants for the algorithm in section 4.1. The entire Find-Noisy-Motif is described in section 4.2. We analyze Algorithm Find-Noisy-Motif in section 5. Two lower bounds are presented in section 6. We conclude the paper with an open problem in section 7.
2. Notation and the model of sequence generation. For a set A, |A| denotes the number of elements in A. Σ is an alphabet with |Σ| = t ≥ 2. For an integer n ≥ 0, Σ^n is the set of sequences of length n with characters from Σ. For a sequence S = a_1 a_2 ⋯ a_n, S[i] denotes the character a_i, and S[i, j] denotes the substring a_i ⋯ a_j for 1 ≤ i ≤ j ≤ n. |S| denotes the length of the sequence S. We use ∅ to represent the empty sequence, which has length 0.
Let G = g_1 g_2 ⋯ g_m be a fixed sequence of m characters. G is the motif to be discovered by our algorithm. A Θ_α(n, G)-sequence is defined to be a sequence S of length n that contains a probabilistically generated approximate copy b_1 b_2 ⋯ b_m of G, called the motif region ℵ(S) of S; every character of S outside the motif region has probability 1/t to be π for each π ∈ Σ, and each b_i has probability at most α of not being equal to g_i. The motif region of S may start at a probabilistic, arbitrary, or worst-case position in S. Also, a mutation may convert a character g_i in the motif into an arbitrary or worst-case different character b_i subject only to the restriction that g_i will mutate with probability at most α.
For two sequences X_1 and X_2 of the same length, diff(X_1, X_2) denotes the ratio of difference between the two sequences, i.e., the number of positions at which they differ divided by their common length.
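A minimal sketch of this difference ratio (the function name is hypothetical; the paper only writes diff(S_1, S_2) in prose):

```python
def diff(s1, s2):
    """Ratio of mismatched positions between two equal-length sequences:
    the Hamming distance divided by the common length."""
    assert len(s1) == len(s2) and len(s1) > 0
    mismatches = sum(1 for a, b in zip(s1, s2) if a != b)
    return mismatches / len(s1)

print(diff('abcd', 'abce'))  # 0.25
```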
The analysis of our algorithm employs the Chernoff bound [15] and Corollary 2.3 below, which can be derived from that bound (see [13]).
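For reference, the analysis repeatedly uses bounds of the form e^{−ε²ℓ/3}; these follow from the standard multiplicative Chernoff bound, stated here in its common textbook form (this is the generic bound, not a quotation of Corollary 2.3): for independent 0-1 random variables X_1, …, X_ℓ with X = Σ_i X_i and μ = E[X],

```latex
\Pr[X \le (1-\epsilon)\mu] \le e^{-\epsilon^2 \mu / 2},
\qquad
\Pr[X \ge (1+\epsilon)\mu] \le e^{-\epsilon^2 \mu / 3}
\quad (0 < \epsilon \le 1).
```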

3. A sketch of Algorithm Find-Noisy-Motif. Our Algorithm Find-Noisy-Motif has two phases. The first phase exploits the fact that with high probability, the motif regions in some sequences conserve their first and last characters. Furthermore, the middle areas of the motif change with only a small ratio. We will select enough pairs of Θ_α(n, G)-sequences S and S′ and find substrings of S and S′ that match at their leftmost and rightmost characters and have only a relatively small difference in their middle areas. For each such pair S and S′, the matched substring G′ of S′ is extracted.
During the second phase, a new set of Θ_α(n, G)-sequences S_1, S_2, ⋯, S_{k_2} will be used. Each G′ extracted from a pair of sequences in the first phase is used to match a substring in every S_i. Figure 1 shows the situation where the motif regions of S_1, S_2, and S_3 are not aligned; the substrings obtained from matching G′ to all sequences S_1, ⋯, S_{k_2} bring the motif regions into alignment. We prove that with high probability such a G′ exists. The rearrangement of S_1, ⋯, S_{k_2} from Figure 1 to Figure 2 illustrates how we recover the motif via voting.
On the other hand, if |G′| > |G| or G′ does not match G well, then we can prove that the number of nonempty sequences among the matched substrings is small, so G′ will be dropped, since with high probability there exists a candidate G_0 with a good voting performance and the algorithm returns the result from the longest such candidate. Our algorithm's time complexity depends on the length of the input sequences but is independent of the number of the input sequences. This is because for a fixed x, Θ(log n) sequences are sufficient to guarantee that with probability at least 1 − 1/2^x the motif will be discovered. Additional sequences can improve the success probability but are not needed for the high probability guarantee.

4. Algorithm Find-Noisy-Motif.
In this section, we detail Algorithm Find-Noisy-Motif. The algorithm can find any hidden motif G in O(n^3) time with high success probability. It requires that the size of the alphabet be larger than a fixed constant. The performance of the algorithm is stated in the main theorem, Theorem 4.1, whose proof is given in section 5.4.
Theorem 4.1. Assume that the mutation probability upper bound α is less than 0.1771. Then there exist constants t_0, δ_0, and δ_1 such that if the size t of the alphabet Σ is at least t_0 and the length of the motif G is at least δ_0 log n, then, given k independent Θ_α(n, G)-sequences with k ≥ δ_1 log n_0, Algorithm Find-Noisy-Motif outputs G with probability at least 1 − 1/2^x and runs in O(n^3) time, where n is the longest length of any input sequence and n_0 ≤ n is a given upper bound for the length of G.
Some parameters and constants will be used in Algorithm Find-Noisy-Motif. In section 4.1, we give a list of assignments for these parameters and constants. The description of Algorithm Find-Noisy-Motif is given in section 4.2. The analysis of the algorithm is given in sections 5.2-5.4.

4.1. Parameters.
Multiple parameters affect the performance of the main algorithm, Find-Noisy-Motif; we list them below and discuss some useful inequalities.
• Let x be any constant at least 8. The parameter x controls the failure probability of Find-Noisy-Motif to be at most 1/2^x. We will prove that Find-Noisy-Motif has probability at least 1 − 1/2^x of outputting the exact correct motif G.
• Let α be any constant with 0 ≤ α < 0.1771. The parameter α is the upper bound for the mutation probability of each character in the motif region; the bound on α is what makes the constants selected below exist (see inequality (1)).
• Let η = 1/6. The algorithm has five cases in which it may fail. In order to keep the total failure probability at most 1/2^x, we ensure that each such case has failure probability at most η/2^x. As at most five cases can fail, the total failure probability is bounded by 5η/2^x < 1/2^x. We will design a function Extract(S_1, S_2) to output ℵ(S_2) with probability greater than a fixed constant. This parameter controls the probability that Extract(S_1, S_2) derives a substring of S_2 without overlap with the motif region ℵ(S_2) in S_2. It also affects the selection of d, a lower bound on the motif length.
• Let ε > 0 be any constant small enough to satisfy the constraints below. In order to find the motif, we often extract one of two similar substrings from two input sequences. The parameter ε controls the similarity of two substrings (see diff(S_1, S_2) in section 2) and appears in the probability bounds derived from the Chernoff bound (see Corollary 2.3). The existence of ε follows from inequality (1). It also affects the selection of some other parameters.
• Let n be the largest length of an input sequence with n ≥ 3. Let parameter n_0 ∈ [d, n] be a given upper bound on the length of the motif G that will be discovered by Algorithm Find-Noisy-Motif. If n_0 is unknown, we just let n_0 = n. The constants below are chosen to satisfy inequalities (3) and (4).

To satisfy inequalities (3) and (4) above, select d = δ_0 log n. Note that δ_0 is a constant since both η and x are fixed. We require that the length of the motif G is at least d, as stated in Theorem 4.1. The motif G is a pattern unknown to Algorithm Find-Noisy-Motif. Find-Noisy-Motif will attempt to recover G from a series of Θ_α(n, G)-sequences generated by the probabilistic model in section 2, which is controlled by the parameters α, n, and G. The source of randomness for Find-Noisy-Motif comes entirely from the input sequences.
Recall that a sequence S is generated as follows: (1) Generate a sequence S′ with n − |G| characters, in which each character is a random character from Σ. (2) Generate G′ such that for each i, with probability at most α, G′[i] ≠ G[i]. A mutation may create an arbitrary or worst-case G′[i], with no probability restriction except that the mutation occurs with probability at most α. (3) Insert G′, which serves as the motif region ℵ(S) of S, at an arbitrary or worst-case position of S′.
Let Z_0 be a set of 2k_1 sequences that will be used in the first phase of Algorithm Find-Noisy-Motif, and let Z_2 be a set of k_2 sequences that will be used in the second phase. Let k = 2k_1 + k_2 be the total number of Θ_α(n, G)-sequences that are used as the input to Find-Noisy-Motif. Both parameters k_1 and k_2 are determined later (see Definition 4.2).

4.2. Description of Algorithm Find-Noisy-Motif.
The algorithm is detailed in this section.Before presenting the algorithm, we define some constants and notions.
1. Select any constant r_0 > 0 satisfying the constraint below; the existence of r_0 follows from inequality (2). The constant r_0 will be used to select the constants v (which is defined below) and t_0 (which is the lower bound on the size of the alphabet and is defined in Definition 5.1).
2. Let v be the least integer that satisfies the required inequalities, where c = e^{−ε²/3}. Note that the existence of v for inequality (7) follows from inequality (5).

The function Extract(S_1, S_2) tries to find ℵ(S_2) by matching ℵ(S_1) and ℵ(S_2) without shifting (ℵ(S_1)[i] is aligned to ℵ(S_2)[i]). The parameter v is a threshold for the number of characters shifted when matching two motifs from two input sequences (there is one shift if ℵ(S_1)[i] is aligned to ℵ(S_2)[i + 1] for each i).
For the case where the number of shifts is more than v, the Chernoff bound is used to show that the probability is small enough. For the case where the number of shifts is less than v but at least 1, the probability is still small due to the assumption that the size of the alphabet is large enough.
3. The parameter k_1 is a constant independent of the length of the input sequences since both η and x are constants. It is the number of pairs of input sequences (S_1, S′_1), ⋯, (S_{k_1}, S′_{k_1}) used to extract the motif candidates in the subroutine Phase-One of Find-Noisy-Motif.
4. Select a constant δ_1 > 0, and let k_2 = δ_1 log n_0.

The parameter k_2 is the number of input sequences used in the subroutine Phase-Two of Find-Noisy-Motif. The candidates for the motif from Phase-One are used to match the motif regions of the k_2 sequences in Phase-Two.
The original motif G is recovered via voting on the k_2 substrings.
Definition 4.3.
1. Let β = 2α + 2ε. The parameter β controls the similarity between ℵ(S) and the original motif G (see Lemma 5.8).
2. Two sequences X_1 and X_2 of the same length are left matched if their leftmost characters are equal and their difference ratio is small enough; right matched is defined symmetrically with the rightmost characters. Two sequences X_1 and X_2 are matched if X_1 and X_2 are both left and right matched.
Algorithm Find-Noisy-Motif has two phases. The two phases are organized as subroutines Phase-One and Phase-Two, respectively. The input to Phase-One is k_1 pairs of Θ_α(n, G)-sequences collected in the set Z_0. The input to Phase-Two consists of k_2 Θ_α(n, G)-sequences collected in the set Z_2 and the output from Phase-One. All the Θ_α(n, G)-sequences are independent. Recall that k_1 is constant, k_2 = O(log n_0), and n_0 (≤ n) is an upper bound for the length of the motif G, as discussed in Definition 4.2 and section 4.1. Algorithm Find-Noisy-Motif is deterministic, and its probabilistic performance is based on the randomness of the sequences in both Z_0 and Z_2 and the independence in generating them.
The subroutine LoadInputSequences() below generates the input sequences for Find-Noisy-Motif using the probabilistic model in section 2.

End of LoadInputSequences
The function Extract(S 1 , S 2 ) below extracts the longest similar region between two sequences S 1 and S 2 .
Output: a substring of S_2 which is similar to a substring of S_1.
Steps:
for each pair of equal-length substrings S_1[i, i′] and S_2[j, j′] of length at least d, considered from the longest to the shortest, if S_1[i, i′] and S_2[j, j′] are matched (see Definition 4.3), then return S_2[j, j′] and end this function;
return ∅ (output the empty sequence when there is no match found);

End of Extract
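The search in Extract can be sketched as follows. This is a simplified, unoptimized illustration with hypothetical names: the `matched` predicate below substitutes a plain equal-endpoints-plus-difference-ratio test for the exact left/right-matched conditions of Definition 4.3, and the cubic-time enumeration stands in for the paper's O(n^3) search.

```python
def matched(x1, x2, beta):
    """Simplified stand-in for Definition 4.3: equal first and last
    characters and a difference ratio of at most beta overall."""
    if len(x1) != len(x2) or x1[0] != x2[0] or x1[-1] != x2[-1]:
        return False
    mism = sum(1 for a, b in zip(x1, x2) if a != b)
    return mism / len(x1) <= beta

def extract(s1, s2, d, beta):
    """Return the longest substring of s2 (length at least d) that matches
    some equal-length substring of s1, or '' if none exists."""
    for length in range(min(len(s1), len(s2)), d - 1, -1):  # longest first
        for i in range(len(s1) - length + 1):
            for j in range(len(s2) - length + 1):
                if matched(s1[i:i+length], s2[j:j+length], beta):
                    return s2[j:j+length]
    return ''

print(extract('xxabcabcyy', 'zzabcabcww', 4, 0.25))  # abcabc
```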
The following function is Phase-One of Algorithm Find-Noisy-Motif.
Phase-One(Z_0)
Input: a set of pairs of sequences generated at step 1 of Find-Noisy-Motif.
Output: a set W that contains a similar region of each pair in Z 0 .
Steps:
let W = ∅ (empty set);
for each pair of sequences (S, S′) ∈ Z_0, let G′ = Extract(S, S′) and put G′ into W;
return W, which will be used in Phase-Two;
End of Phase-One
After a set W of motif candidates is produced by Phase-One of Find-Noisy-Motif, we use the function Match(G′, S_i) below to match this set with the set Z_2 of input sequences to recover the hidden motif by voting.
Function Match(G′, S_i)
Input: a motif candidate G′, which is returned from the function Extract(), and a sequence S_i from the group Z_2.
Output: either a substring G_i of S_i of the same length as G′ or an empty sequence; G_i will be considered as the motif region ℵ(S_i) of S_i if it is not empty, and the empty sequence means failure in extracting the motif region ℵ(S_i) of S_i.
Steps:
if S_i contains a substring G_i with |G_i| = |G′| such that G_i and G′ are matched (see Definition 4.3), then return G_i;
return ∅;
End of Match
Function Vote(G_1, ⋯, G_{k_2})
Input: the substrings G_1, ⋯, G_{k_2} returned by Match() on the sequences of Z_2.
Output: a sequence G″, which is derived by voting on every position of the input sequences.
Steps:
for each position j, if all characters G_i[j] of the nonempty sequences G_i are the same character a, then let a_j = a; else return "failure" and end this function;
return G″ = a_1 a_2 ⋯;

End of Vote
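The per-sequence matching and the column-wise vote can be sketched together. Again this is a simplified illustration with hypothetical names: `match` uses the same equal-endpoints-plus-threshold stand-in for Definition 4.3 as above, and `vote` implements the unanimity rule of the Vote function (every nonempty region must agree on each position).

```python
def match(g_cand, s, beta):
    """Return the first substring of s with the same length as g_cand that
    matches it (equal endpoints, difference ratio at most beta), else ''."""
    L = len(g_cand)
    for j in range(len(s) - L + 1):
        w = s[j:j+L]
        if w[0] == g_cand[0] and w[-1] == g_cand[-1] and \
           sum(a != b for a, b in zip(w, g_cand)) / L <= beta:
            return w
    return ''

def vote(regions):
    """Recover the motif column by column; all nonempty (equal-length)
    extracted regions must agree on each position, else 'failure' (None)."""
    regions = [g for g in regions if g]          # drop empty sequences
    if not regions:
        return None
    out = []
    for chars in zip(*regions):                   # one column per position
        if len(set(chars)) != 1:
            return None                           # "failure"
        out.append(chars[0])
    return ''.join(out)

regions = [match('abcabc', s, 0.34) for s in
           ['ttabcabctt', 'abcabcgggg', 'ccccabcabc']]
print(vote(regions))  # abcabc
```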
The following function performs Phase-Two of Algorithm Find-Noisy-Motif. It uses the motif candidates derived in Phase-One to extract the motif regions of the set Z_2 of input sequences and recovers the motif by voting.
Phase-Two(Z_2, W)
Input: the set Z_2 = {S_1, ⋯, S_{k_2}} as defined before and W from Phase-One.
Output: G″, which is a recovery of motif G.
Steps:
for each candidate G′ ∈ W, from the longest to the shortest:
let G_i = Match(G′, S_i) for i = 1, ⋯, k_2;
if enough of the G_i are nonempty, return G″ = Vote(G_1, ⋯, G_{k_2}) (which will be proved to be identical to G with probability at least 1 − 1/2^x) and end Phase-Two;
return "failure".

End of Phase-Two
The entire main algorithm is described as follows:
Algorithm Find-Noisy-Motif
Steps:
(Z_0, Z_2) = LoadInputSequences();
W = Phase-One(Z_0);
Phase-Two(Z_2, W);
End of Algorithm Find-Noisy-Motif
5. Analysis of Algorithm Find-Noisy-Motif. Section 5.1 gives an overview of the analysis of Find-Noisy-Motif. Section 5.2 analyzes Phase-One. Section 5.3 analyzes Phase-Two. Section 5.4 gives an overall analysis of Algorithm Find-Noisy-Motif and proves the main theorem, Theorem 4.1.

5.1. Overview of the algorithm analysis. Phase-One derives k_1 candidates G_1, ⋯, G_{k_1} for the motif, one from each pair of sequences in Z_0. Lemmas 5.2, 5.3, 5.5, and 5.6 show that the probability is small that Phase-One returns a substring not from the motif region of an input sequence. Lemma 5.8 shows that each Θ_α(n, G)-sequence S has its motif region ℵ(S) similar to the motif with high probability. In expecting Extract(S_i, S′_i) to return the motif region ℵ(S′_i), we compare the similarity between S_i and S′_i in order to detect the location of ℵ(S′_i) in S′_i. If the two compared substrings are ℵ(S_i) and ℵ(S′_i), there is much similarity between them, and ℵ(S′_i) is found in S′_i. Otherwise, they have high similarity only with a small probability according to Lemmas 5.2, 5.3, 5.5, and 5.6.
After showing that the probability is small for Phase-One returning a nonmotif region, we prove Lemma 5.9 to give an Ω(1) probability lower bound that a motif region ℵ(S′_i) is returned via Extract(S_i, S′_i). Finally, the analysis of Phase-One will show that with high probability one of G_1, ⋯, G_{k_1} has the same length as G and is very similar to the true motif G. The number k_1 of pairs amplifies the success probability exponentially, as shown in Lemma 5.10. Now assume that G_0 is a relatively accurate motif produced by Phase-One.
Phase-Two uses such a G_0 to detect the motif region ℵ(S_i) for each S_i of the k_2 input sequences S_1, ⋯, S_{k_2} via pattern matching between G_0 and substrings of S_i, and then recovers the motif by voting, as shown in Lemma 5.14.
Our main theorem about the correctness and complexity of the algorithm is Theorem 4.1. It combines the analyses for Phase-One and Phase-Two. If a G_i produced by Phase-One is longer than G, it will be dropped during the voting. If G_i is shorter than G, it will also be dropped, since G_0 is longer and has a better voting consensus than G_i.

5.2. Analysis of Phase-One of Find-Noisy-Motif. Lemma 5.2 shows that with only a small probability can a sequence match a random sequence. It will be used to prove that when two substrings of two Θ_α(n, G)-sequences are similar, they are likely to coincide with the motif regions in the two Θ_α(n, G)-sequences, respectively.
Definition 5.1. Let t_0 be a positive constant, chosen sufficiently large with respect to r_0 and v. The parameter t_0 is used as a required lower bound on the alphabet size. In the remainder of this paper, we always assume that the alphabet size t is at least t_0.
Lemma 5.2. Assume that X_1 and X_2 are two independent sequences of the same length and that every character of X_2 is a random character from Σ. Then the following hold: (i) the probability that X_1 and X_2 agree at a fixed position is at most 1/t; (ii) the probability that diff(X_1, X_2) ≤ β is at most e^{−(ε²/3)|X_1|}.
Proof. The two statements are proved as follows. Statement (i). For every character X_2[j] with 1 ≤ j < v, the probability is 1/t that X_2[j] equals X_1[j]. Statement (ii). The expected number of positions where the two sequences X_1 and X_2 agree is |X_1|/t. The probability that diff(X_1, X_2) ≤ β is at most e^{−(ε²/3)|X_1|} by inequality (16), Corollary 2.3, and the fact that t ≥ t_0 (see Definition 5.1).
Function Extract(S_1, S_2) returns a substring of S_2. We expect Extract(S_1, S_2) to return the motif region ℵ(S_2) of S_2. Lemma 5.3 shows that with only a small probability does the region for Extract(S_1, S_2) in S_2 fail to overlap the motif region ℵ(S_2) of S_2.
Lemma 5.3. With probability at most ρ_0, Extract(S_1, S_2) and ℵ(S_2) are nonoverlapping substrings of S_2.
In order to show that Extract(S_1, S_2) can be used to effectively find a motif region in S_2, we give Lemma 5.5 to show that with only small probability may the region of Extract(S_1, S_2) in S_2 shift far from the motif region ℵ(S_2).
Definition 5.4. The constant z is selected so that the shift bound of Lemma 5.5 holds. The parameter z is a threshold for controlling the shift in the analysis of Phase-One of Find-Noisy-Motif. See Lemma 5.5 and Definition 2.1.
Lemma 5.5. The probability is at most H_1 = 2ρ_0 that, for a pair of sequences S_1 and S_2, the region Extract(S_1, S_2) shifts from the motif region ℵ(S_2) by more than z positions, where z is as defined in Definition 5.4.
Proof. Assume that M = S_2[i_2, j_2] = Extract(S_1, S_2) is the matched sequence. By Lemma 5.3, the probability is at most ρ_0 that M does not intersect ℵ(S_2).
Notice that M = S_2[i_2, j_2] is a substring of S_2 according to the function Extract(). Assume that M′ = S_1[i_1, j_1] is the substring of S_1 such that M and M′ are matched in the function Extract. If M extends w characters beyond ℵ(S_2), then M contains a substring N of length w outside ℵ(S_2), and N is either a prefix or a suffix of M. Every character of N is outside ℵ(S_2) and is a random character that has probability 1/t of being equal to any fixed character of Σ. By symmetry, we assume without loss of generality that N is a prefix of M. Then M = N N_1 and M′ = N′ N′_1, where N and N′ have the same length and have a difference ratio at most β according to the conditions in Definition 4.3. By Lemma 5.2, the probability is at most e^{−(ε²/3)w} that N and N′ have the same length and a difference ratio at most β.
Therefore, the probability is at most 4e^{−(ε²/3)w} that the matched region shifts by w characters (recall c = e^{−ε²/3} from Definition 4.2). Therefore, the total probability that the shift exceeds z is bounded as in inequality (18).
Lemma 5.6 will be used to give an upper bound in the probability analysis. It is derived by standard methods in calculus.
Definition 5.7. We say that a Θ_α(n, G)-sequence S contains a stable motif region ℵ(S) if the following conditions hold: (1) the first character of ℵ(S) is not mutated, (2) the last character of ℵ(S) is not mutated, and (3) diff(ℵ(S), G) ≤ α + ε, where ε and m = |G| are as defined in Definition 4.2 and section 2.
Lemma 5.8.With probability at least Q 0 , a Θ α (n, G)-sequence S contains a stable motif region.
Proof. The probability is at least (1 − α)² that conditions (1) and (2) in Definition 5.7 hold. Consider condition (3). Since every character of ℵ(S)[1, m] (notice that m = |G|) has probability at most α to mutate, by Corollary 2.3 the probability is at most c^d (as defined in Definition 4.2) that condition (3) fails. In sum, the probability that S contains a stable motif region is at least Q_0.
Lemma 5.9 below gives a lower bound for the probability that Extract(S_1, S_2) returns the motif region ℵ(S_2) of S_2 and that the motif region ℵ(S_2) does not differ much from the original motif G.
Lemma 5.9. Given two independently generated Θ_α(n, G)-sequences S_1 and S_2, the probability is at least a constant (determined by Q_0, H_1, and c_2) that Extract(S_1, S_2) returns ℵ(S_2) and S_2 contains a stable motif region, where H_1 is defined in Lemma 5.5 and c_2 is a constant defined in (23).
Proof. Let M_2 = Extract(S_1, S_2), and let M_1 be the substring of S_1 that matches the substring M_2 of S_2. For two random Θ_α(n, G)-sequences S_1 and S_2, their motif regions ℵ(S_1) and ℵ(S_2) match well with probability at least Q_0² by Lemma 5.8. We can assume that in the function Extract(S_1, S_2) we consider only shifts h of at most z characters, since otherwise ℵ(S_1) and ℵ(S_2) cannot match well.
Define P L,R (w 1 , w 2 , s) to be the probability that the following three conditions are satisfied.
Condition (1). There are w_1 characters outside ℵ(S_1) on the left side of M_1. In other words, S_1[i_1 + w_1] is the first character of ℵ(S_1), and the w_1 characters S_1[i_1, i_1 + w_1 − 1] are outside ℵ(S_1).
Condition (2). There are w_2 characters outside ℵ(S_2) in the right region of M_2. In other words, S_2[j_2 − w_2] is the last character of ℵ(S_2), and the w_2 characters S_2[j_2 − w_2 + 1, j_2] are outside the region ℵ(S_2).
Condition (3). The position of the first character of ℵ(S_1) in M_1 and the position of the first character of ℵ(S_2) in M_2 have shift s. See Figure 3 for an illustration.
Fig. 3. M_1 and M_2 for Case 1 of the proof of Lemma 5.9.
In a similar way, we define the probabilities P_{L,L}(w_1, w_2, s), P_{R,L}(w_1, w_2, s), and P_{R,R}(w_1, w_2, s). In other words, for P_{A,B}(w_1, w_2, s) with A, B ∈ {L, R}, A = L (or A = R) represents the case that there are w_1 characters outside ℵ(S_1) on the left
(respectively, right) side of M_1, and B = L (or B = R) represents the case that there are w_2 characters outside ℵ(S_2) on the left (respectively, right) side of M_2. Furthermore, the parameter s indicates the shift defined in Condition (3) above. By Lemma 5.5, the probability is at most H_1 that the shift exceeds z. The probabilistic analysis below has 10 cases. Case a.b is the bth subcase of Case a, and Case a.b.c is the cth subcase of Case a.b. We use P_a, P_{a.b}, and P_{a.b.c} to denote the probabilities of Cases a, a.b, and a.b.c, respectively.
Case 1. 0 ≤ w_2 < w_1, M_1 has w_1 characters outside ℵ(S_1) on the left side of M_1, the last character of ℵ(S_2) is in M_2, and M_2 has w_2 characters outside ℵ(S_2) on the right side of M_2. See Figure 3.
We consider only the cases where s = 1, 2, . . ., w 1 + w 2 and M 2 has fewer than w 1 characters outside ℵ(S 2 ) on the left side.The cases where M 2 has at least w 1 characters outside ℵ(S 2 ) are covered by Cases 5 and 9.
If s > w_1 + w_2, the matched region would be shorter than ℵ(S_2). If s ≤ w_1, the first character of ℵ(S_2) is in M_2, and this case is included in Case 4. Therefore, we consider only the range w_1 + 1 ≤ s ≤ w_1 + w_2. For an upper bound for the probability of Case 1, we compute P_1 = Σ_{w_2=0}^∞ Σ_{w_1=w_2+1}^∞ Σ_{s=w_1+1}^{w_1+w_2} P_{L,R}(w_1, w_2, s). There are some subcases.
Fig. 4. M_1 and M_2 for Case 2 of the proof of Lemma 5.9.
Summing the subcases and applying Lemma 5.6, we derive a probability bound for Case 1.
Case 2. 0 ≤ w_2 < w_1, M_2 has w_2 characters outside ℵ(S_2) on the right side of the matched region, and M_1 has w_1 characters outside ℵ(S_1) on the right side of the matched region. See Figure 4.
For a probability upper bound of Case 2, we compute the corresponding sum of P_{R,R}(w_1, w_2, s) over w_1 > w_2 with s = w_1 − w_2. There are some subcases.
Case 2.2. Assume v ≤ w_2 < w_1. By Lemma 5.2, for a fixed w_1 and a fixed s = w_1 − w_2, the probability for Case 2.2 is at most P_{R,R}(w_1, w_2, s) ≤ e^{−(ε²/3)w_1}. The probability for Case 2.2 over all w_1 > w_2 is at most c^v/(1 − c). Therefore, the probability for Case 2 is upper bounded by the sum of the bounds for Cases 2.1 and 2.2.
Case 3. 0 ≤ w_2 < w_1, M_1 has w_1 characters outside ℵ(S_1) on the right side of M_1, the last character of ℵ(S_2) is in M_2, and M_2 has w_2 characters outside ℵ(S_2) on the left side of M_2.
For a probability upper bound of Case 3, we compute the corresponding sum. This case has the same analysis and probability as Case 1. Therefore, we have P_3 = P_1.
Case 4. 0 ≤ w_2 < w_1, S_1 has w_1 characters outside ℵ(S_1) on the left side of M_1, and S_2 has w_2 characters outside ℵ(S_2) on the left side of M_2.
For a probability upper bound of Case 4, we compute the corresponding sum. This case has the same analysis and probability as Case 2. Therefore, we have P_4 = P_2.
Case 6. w_1 = w_2 ≥ 1, and the right sides of both M_1 and M_2 have the same number w_2 of characters outside ℵ(S_1) and ℵ(S_2), respectively.
For the probability upper bound of Case 6, we compute P_6 = Σ_{w_2=1}^∞ P_{R,R}(w_2, w_2, 0). This case has the same analysis and probability as Case 5. Therefore, P_6 = P_5.
Case 7. 0 ≤ w_1 ≤ w_2, M_1 has w_1 characters outside ℵ(S_1) on the left side of M_1, and M_2 has w_2 characters outside ℵ(S_2) on the right side of M_2.
We have only s = 1, 2, …, w_1 + w_2. For a probability upper bound of Case 7, we compute the corresponding sum. There are two subcases.
Case 7.1. w_2 < v. By Lemma 5.2, P_{L,R}(w_1, w_2, s) ≤ 1/t for fixed w_1, w_2, and s. The total probability for this case for fixed w_1 and w_2 and all s is at most (w_1 + w_2)/t.
Case 7.2. v ≤ w_2. By Lemma 5.2, P_{L,R}(w_1, w_2, s) ≤ e^{−(ε²/3)w_2} for a fixed w_2 and a fixed s. The total probability for this case for fixed w_1 and w_2 and variable s = 1, …, w_1 + w_2 is Σ_{s=1}^{w_1+w_2} P_{L,R}(w_1, w_2, s) ≤ (w_1 + w_2)e^{−(ε²/3)w_2}. The total probability for this case for all w_2 ≥ v is bounded via Lemma 5.6.
In summary, the probability for Case 7 is upper bounded as P_7 = P_{7.1} + P_{7.2}.
Case 8. 0 ≤ w_1 < w_2, the last character of ℵ(S_1) is in M_1, there are w_1 characters on the right side of M_1 outside ℵ(S_1), and M_2 has w_2 characters on the right side of M_2 outside ℵ(S_2).
For a probability upper bound of Case 8, we compute the corresponding sum. There are two subcases.
• Case 8.1. 0 < w_2 < v. By Lemma 5.2, the probability for this case for a fixed w_1 and a fixed s = w_2 − w_1 is at most 1/t. The total probability for this case for all w_1 with 0 ≤ w_1 ≤ w_2 − 1 is at most w_2/t.
• Case 8.2. v ≤ w_2. By Lemma 5.2, P_{L,R}(w_1, w_2, s) ≤ e^{−(ε²/3)w_2} for a fixed w_2 and a fixed s = w_2 − w_1. The total probability for this case for all 0 ≤ w_1 ≤ w_2 − 1 is bounded via Lemma 5.6. We have P_8 = P_{8.1} + P_{8.2}.
Case 9. 0 ≤ w_1 < w_2, the last character of ℵ(S_1) is in M_1, S_1 has w_1 characters outside ℵ(S_1) on the right side of M_1, and S_2 has w_2 characters outside ℵ(S_2) on the left side of M_2.
For a probability upper bound of Case 9, we compute the corresponding sum. This case has the same analysis and probability as Case 7. Therefore, we have P_9 = P_7 ≤ 2c_1 · v³(1/t + c^v).
Case 10. 0 ≤ w_1 < w_2, the first character of ℵ(S_1) is in M_1, M_1 has w_1 characters on the left side of M_1 outside ℵ(S_1), and M_2 has w_2 characters on the left side of M_2 outside ℵ(S_2).
For a probability upper bound of Case 10, we compute the corresponding sum. This case has the same analysis and probability as Case 8. Therefore, we have P_10 = P_8.
The proof below bounds the probability that M = Match(G_0, S) includes w_0 ≥ 1 characters outside ℵ(S).
Proof. Assume w_0 ≥ 1. Let w′ be the number of characters outside ℵ(S) on the left side of M, and let w″ be the number of characters outside ℵ(S) on the right side of M. Clearly, w_0 = w′ + w″. Since w_0 ≥ 1, either w′ ≥ 1 or w″ ≥ 1. See Figure 5. Without loss of generality, we assume w′ ≥ 1.
Statement (i). There are two cases. Case (a). 1 ≤ w′ < v. By Lemma 5.2, the probability for this case for a fixed w′ is at most 1/t. Thus, the total probability for this case is at most (v − 1)/t. Case (b). v ≤ w′. By Lemma 5.2, the probability for this case for a fixed w′ is at most e^{−(ε²/3)w′}. The total probability for this case is at most Σ_{w′=v}^∞ e^{−(ε²/3)w′} = c^v/(1 − c). The probability analysis is similar when w″ ≥ 1. Therefore, the probability for w_0 ≥ 1 is at most R = 2((v − 1)/t + c^v/(1 − c)).
Statement (ii). By Lemma 5.8, with probability at least Q_0, S contains a stable motif region. By statement (i) of this lemma, we have probability at least Q_0 − R that, given a random Θ_α(n, G)-sequence S, ℵ(S) = Match(G_0, S).
Lemma 5.14 below shows that we can use G′ to extract most of the motif regions of the sequences in Z_2 if G′ = G_0 (recall that G_0 is close to the original motif G as defined in Definition 5.11).
Proof. Recall that sequence G_0 is selected according to Definition 5.11. When G′ is fixed, G_i = ℵ(S_i) = Match(G′, S_i) and G_j = ℵ(S_j) = Match(G′, S_j) are two independent events due to the independence of S_i and S_j. Thus, we can apply Chernoff bounds in the proof below.
Statement (i). By Lemma 5.13, for every S_i ∈ Z_2, the probability is at least Q_0 − R that G_i = ℵ(S_i). By Corollary 2.3, the probability is at most exponentially small in k_2 that there are fewer than (Q_0 − R − ε)k_2 sequences G_i with G_i = ℵ(S_i).
Statement (ii). By Lemma 5.13, the probability is at most R that G_i ≠ ℵ(S_i).
By Corollary 2.3, the probability is at most exponentially small in k_2 that more than (R + ε)k_2 sequences G_i satisfy G_i ≠ ℵ(S_i).

7. Conclusions.
We have proved that if the mutation probability upper bound α is less than 0.1771, there exist constants t_0 > 0, δ_0 > 0, and δ_1 > 0 such that if the length of the motif is at least δ_0 log n and the alphabet has at least t_0 characters, then there exists an O(n^3)-time algorithm that, given at least δ_1 log n_0 input sequences, can find the motif with high probability, where n is the longest length of any input sequence. Very recently, we have also shown [4] that for any alphabet Σ with |Σ| ≥ 2 and for every motif G ∈ Σ^ρ − Ψ_{ρ,h}(Σ), where Ψ_{ρ,h}(Σ) is a small subset of Σ^ρ, if G has length at least c_0 log n, then G can be recovered with O(n log n) sequences with high probability. This second algorithm is applicable to DNA motif discovery. An interesting open problem is whether there exists an algorithm to recover all the motifs over an alphabet of four characters.

Fig. 2. S_1, S_2, and S_3 have their motif in the same column region.

The probability is at most e^{−(ε²/3)d} by Lemma 5.2 (notice that the length of M is at least d according to Extract()). The probability that M and ℵ(S_2) do not overlap, over all possible [j, j′], is at most n² · e^{−(ε²/3)d} ≤ ρ_0 by inequality (4).
than (Q_0 − R − ε)k_2 sequences G_i with G_i = ℵ(S_i).

• Case 2.1. 0 ≤ w_2 < v.
Case 2.1.1. 0 ≤ w_2 < w_1 < v. By Lemma 5.2, P_{R,R}(w_1, w_2, s) ≤ 1/t for fixed w_1, w_2 and s = w_1 − w_2. The total probability of Case 2.1.1 for a fixed w_2 and all w_1 with w_2 < w_1 ≤ v is at most (v − w_2 − 1)/t.
Case 2.1.2. w_2 < v ≤ w_1. By Lemma 5.2, P_{R,R}(w_1, w_2, s) ≤ e^{−(ε²/3)w_1} for a fixed w_1 and a fixed s = w_1 − w_2. We have Σ_{w_1=v}^∞ e^{−(ε²/3)w_1} ≤ c^v/(1 − c).
Case 5. w_1 = w_2 ≥ 1, and the left sides of both M_1 and M_2 have the same number w_2 of characters outside ℵ(S_1) and ℵ(S_2), respectively. For a probability upper bound of Case 5, we compute P_5 = Σ_{w_2=1}^∞ P_{L,L}(w_2, w_2, 0). There are two subcases.
• Case 5.1. 1 ≤ w_2 < v. By Lemma 5.2, the probability for this case is at most 1/t for a fixed w_2. The total probability for this case for 1 ≤ w_2 < v is P_{5.1} ≤ v/t.
• Case 5.2. v ≤ w_2. By Lemma 5.2, the probability for this case is at most e^{−(ε²/3)w_2} for a fixed w_2. The total probability for this case for v ≤ w_2 is P_{5.2} ≤ c^v/(1 − c). The total probability for Case 5 is upper bounded as P_5 = P_{5.1} + P_{5.2}.