PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.


Introduction
A DNA motif is a conserved pattern occurring in the regulatory region of DNA sequences with small mutations [1]. All occurrences of the motif in the sequences are called motif instances or motif sites, which are usually the sequence fragments with specific biological functions such as transcription factor binding sites (TFBSs) [2]. TFBSs are important regulatory elements that control transcription initiation and transcription efficiency of the associated genes. Identifying motifs in a given set of DNA sequences is the basis for analysis of gene expression regulation [3] and the precursor to identifying disease-associated regulatory variations [4].
Though very important, motif discovery is a challenging computational task. Given a set of DNA sequences, (i) the motif and its instances are unknown; (ii) each of the input DNA sequences is long with hundreds of bases, while the motif is short, generally 5 to 25 bases [5]; (iii) a portion of the input sequences may not contain motif instances; (iv) the input sequences typically contain the disturbance of random overrepresented substrings. In 2003, Evans et al. proved that motif discovery is NP-complete [6]. In addition, with the development of biological experimental techniques, the data used for motif discovery have been changed from traditional promoter sequence data sets to high-throughput sequencing data sets [7]. A traditional data set typically contains only a few to dozens of sequences. A high-throughput sequencing data set is a set of peak regions containing TFBSs obtained through ChIP-seq experiments [8], read mapping [9], and peak calling [10]. It contains hundreds or more sequences, thus forming a large DNA sequence data set, and further increases the difficulty of rapid and accurate identification of motifs.
Currently, there are a lot of motif discovery algorithms to deal with small-scale data sets, such as Weeder [11], PairMotif [12], PairMotif+ [13], MEME [14], PMS8 [15], and qPMS9 [16]; for more algorithms, refer to [7,17]. Because of high time or space complexity, these algorithms cannot be used for motif discovery in high-throughput sequencing data sets directly.
This paper mainly focuses on motif discovery algorithms for high-throughput sequencing data sets. According to motif representation, the algorithms can be divided into two categories. The algorithms in the first category represent motifs as words. Some of these algorithms, such as Fmotif [18] and weeder2 [19], use pattern-driven ideas. They exhaustively verify all possible strings of the motif length over the DNA alphabet and then output the strings that satisfy specified motif property. When verifying motifs, Fmotif and weeder2 use the suffix tree and De Bruijn graph techniques, respectively. Some other algorithms, such as RSAT [20], CisFinder [21], and MCES [22], adopt word counting ideas; namely, they mine the substrings in input sequences with high occurrence frequency and then combine them into motifs. Besides a test set, these algorithms often require a control set to eliminate the disturbance of random overrepresented substrings.
The second category covers the discovery algorithms representing motifs as position weight matrixes (PWMs). A set of aligned sites of the same length in the input sequences can form a PWM. These algorithms often select some initial PWMs with certain means and then update each PWM iteratively until it reaches the maximum score. MEME-ChIP [23] is a well-known motif discovery algorithm for ChIP-seq data sets, which updates initial PWMs using the expectation maximization method. STEME [24], another discovery algorithm based on expectation maximization, uses suffix trees to improve the time performance of motif discovery when implementing expectation maximization. Currently, there is no discovery algorithm completely superior to others, and thus, in order to tackle false positives produced by individual discovery algorithms, ensemble algorithms [25] integrate multiple existing discovery algorithms to improve the quality of identified motifs.
In order to efficiently identify motifs in large DNA data sets, we propose a new algorithm, which identifies motifs by extracting and combining pairs of -mers in the input with relatively small Hamming distance. Comparisons with the state-of-the-art motif discovery algorithms show that the proposed algorithm can find motifs successfully with the shortest running time. Also, the validity of the proposed algorithm has been verified on real data.

Algorithm
Overview. The notations frequently used in this paper are summarized in the Notations. When we say a pair of -mers, we are referring to two -mers that come from two distinct sequences.
Almost all de novo motif discovery algorithms make identification based on the fact that the motif instances of the same motif are similar to each other. In other words, the motif information contained in the input sequences is presented by the similarity among motif instances. In addition to the degree of similarity among motif instances, the motif information also depends on the number of pairs of motif instances contained in the input sequences, denoted by mip . It is calculated by In our previous work, PairMotif [12] and PairMotif+ [13], we mainly process promoter sequences, which correspond to a small . The basic idea is to extract some pairs of -mers in the input, making them contain at least one pair of motif instances, and then refine each pair of -mers to get motifs. Because of the small value of mip , limited motif information can be obtained while retaining a large amount of disturbance information. Thus, in order to ensure good identification accuracy, exhaustive methods based on pattern-driven ideas are used for refinement, which has a poor time performance.
In the current work, we propose a new algorithm called PairMotifChIP, which is used for processing large DNA data sets. Our basic idea is still to extract pairs of -mers in the input. Since the value of mip under large data sets is significantly greater than that under traditional promoter data sets, the advantages are as follows: (i) the extracted pairs of -mers contain sufficient pairs of motif instances and (ii) it can be easier to filter out most of the random overrepresented pairs of -mers; namely, we can distinguish most of the pairs of motif instances and random overrepresented pairs ofmers by probabilistic analysis (see Section 3.1). Therefore, after extracting pairs of -mers, we perform filtration to filter out most of the random overrepresented pairs of -mers and then combine the remaining -mers using clustering methods to obtain motifs while eliminating other random overrepresented -mers.

Extracting Pairs of -mers.
In this step, we need to determine the value of a threshold so that we extract all pairs of -mers and in the input satisfying ( , ) ≤ and design an efficient algorithm to extract pairs of -mers.
Let denote the probability that the Hamming distance between two random -mers is no more than i.
Let denote the expectation of the number of pairs of -mers and in two n-length DNA sequences satisfying ( , ) ≤ .
The threshold is determined by (4), aiming at eliminating random overrepresented substrings in the pairs of -mers extracted from two -length sequences.
Based on this, we describe our method as follows. Figure 1 shows an example in which | | = | | = 20, = 5, = 1. First, convert the two DNA sequences and into binary strings and . Then, fixing , move from left to right gradually 2 bits each time and simultaneously do the exclusive OR for the overlapped substrings of and with length equal to or greater than 2 ; xor( , , , ) denotes the exclusive OR of the overlapped substrings of and corresponding to [ ⋅ ⋅ ⋅ ] and [ ⋅ ⋅ ⋅ ], where − = − . Finally, extract pairs of -mers with Hamming distance no more than by traversing the exclusive OR xor( , , , ) for each overlapped part of and . Specifically, for each ( ≤ ≤ ), we look up a table [12] to obtain the Hamming distance between two -mers [ + − ⋅ ⋅ ⋅ + − 1] and [ + − ⋅ ⋅ ⋅ + − 1], which is equal to the number of 2 bits that is not 00 in xor( + − , + − 1, + − , + − 1). During the traversal, we use (5) to avoid the calculation for some pairs of -mers with Hamming distance greater than .

Filtering Pairs of -mers.
The purpose of this step is to filter those random overrepresented pairs of -mers extracted in the previous step. According to the property of conservation, if an -mer is a motif instance, there may be some other motif instances with Hamming distance no more than from x, and thus there may be relatively more -mers in the whole data set with Hamming distance no more than from . If an -mer is not a motif instance, even if it appears in a random overrepresented pair of -mers, there are not many -mers in the whole data set with Hamming distance no more than from .
For an arbitrary -mer , let occ ( ) denote the number of -mers in the input sequences with Hamming distance no more than from in random case.
For an arbitrary motif instance , let occ ( ) denote the number of motif instances in the input sequences with Hamming distance no more than from in random case.
We perform filtration according to (7). Let occ( ) denote the number of -mers in the input sequences with Hamming distance no more than from an -mer . In the process of extracting pairs of -mers, we can easily record occ( ) for each -mer in the input sequences. For an extracted -mer , we filter it out if it does not satisfy occ ( ) ≥ occ ( ) + occ ( ) .

A G T A G A A C A G T C G T C T G A A T
Let denote the maximum number of positions where a motif differs from its instance. Given the motif length l, we determine the value of in terms of the challenging problem instance of planted motif search [27], which assumes that motif instances are implanted into sequences. When generating a random motif instance from a motif , we select out of positions randomly and then make the character at each of the positions change with a probability of , which is the conservation parameter reflecting the conservation degree of a motif. Based on this, Pr( ( , ) = ) is evaluated as follows: By the theorem of total probability, is evaluated using (10) where the value of ( ( 1 , 2 ) ≤ | ⟨ , ⟩) is calculated in terms of the actual situation [13].
Finally, we calculate occ ( ) as follows:

Combining -mers.
After performing filtration, we combine the remaining -mers by using the clustering method. On the one hand, we further eliminate random overrepresented -mers, as the filtration carried out in the previous step cannot guarantee that all the random overrepresented -mers are filtered out. On the other hand, the input data may contain more than one motif, and thus the clustering method is also used to distinguish the instances of different motifs and gather the instances of the same motif together. The combining method is described in detail as follows: (ii) Cluster the substrings. At first, we build an undirected graph by taking each substring as a vertex. For any two vertices V 1 and V 2 , assume the corresponding substrings are str 1 and str 2 , respectively. If there exist an -mer 1 in str 1 and an -mer 2 in str 2 such that ( 1 , 2 ) ≤ , then we set the weight of the edge connected by V 1 and V 2 to 1, otherwise 0. Secondly, we cluster the vertices in the graph using the MCL clustering algorithm [28]. Finally, we merge the obtained clusters using the MCL clustering algorithm again to further eliminate redundancy.
(iii) For each obtained cluster, align the substrings in it and then fetch the fragment with high information content as an identified motif.
In the combining method, we control the number of elements to be clustered in order to ensure a good time performance. First, merging overlapped -mers into one substring can help to reduce the elements to be clustered. Second, if the number of substrings is still large, we divide them into multiple groups with each group containing at most 1000 substrings. Then, we cluster substrings in each group separately and finally merge the obtained clusters.

Probabilistic Analysis of Extracting Pairs of -mers.
We use probabilistic analysis to demonstrate the feasibility of motif discovery by extracting pairs of -mers with relatively small Hamming distance for processing large DNA data sets. That is, given DNA sequences, we verify whether the extracted pairs of -mers with Hamming distance no more than contain sufficient pairs of motif instances and whether the pairs of motif instances and the random overrepresented pairs of -mers can be distinguished. Table 1 shows a set of data for probabilistic analysis. In generating these data, we set the number of sequences , the sequence length , the probability that each input sequence contains a motif instance, and the motif conservation parameter to 1000, 200, 0.8, and 0.8, respectively. For a fixed , is obtained by (4), which is the maximum making smaller than 1. Let 1 and 2 denote the number of pairs of -mers and that of pairs of motif instances with Hamming distance no more than contained in -length sequences at random, respectively. The notations occ ( ) and occ ( ) are explained in the Notations.
From the values of 1 and 2 , it can be found that the extracted pairs of -mers with Hamming distance no more than contain sufficient pairs of motif instances and meanwhile many random overrepresented pairs of -mers. For a given -mer , the -mer in the input with Hamming distance no more than from is called a k-neighbor of . From the values of occ ( ) and occ ( ), the number of -neighbors of a motif instance is significantly larger than that of random overrepresented -mers. Thus, we can distinguish the pairs of motif instances and the random overrepresented pairs of -mers, so as to filter out most random overrepresented pairs of -mers extracted in step 1. It should be noted that, for Table 1, we set the motif conservation parameter to 0.8, which is in low conservation case; the value of occ ( ) increases with the increase of the conservation degree, which makes it easier to distinguish the pairs of motif instances and the random overrepresented pairs of -mers. In implementing PairMotifChIP, we fix the value of to 15 because it corresponds to a large value of occ ( ).

Data.
We use both simulated and real data to make experiments. The simulated data are helpful for the comparison of different motif discovery algorithms, since the motifs and their locations are known exactly. The real data are mainly used to test whether the proposed algorithm can find motifs under the realistic biological situation.
We generate simulated DNA data according to [29]: generate -length sequences and an l-length motif randomly and then implant a random instance of to each sequence with a probability . Each instance differs from in at most positions; as shown in column 2 of Table 1, the value of under a specific is determined in terms of the challenging problem instance of planted motif search [27]. When generating a random motif instance from , the specific Hamming distance (0 ≤ ≤ ) between and is determined by (8), where is the motif conservation parameter.
Based on this generation method, two groups of simulated data sets are designed, and they can be downloaded at https://sites.google.com/site/feqond/pairmotifchip. For both groups of data sets, is fixed to 200; is fixed to 0.8; l is taken as 9, 15, and 21, representing short, medium, and long motif length, respectively. The other settings for the first group of data sets are as follows: is fixed to 600, which is the largest sequence number that MEME-ChIP supports to process; is taken as 0.2, 0.5, and 0.8, representing high, medium, and low conservation, respectively. For the second group of simulated data sets, we vary from 500 to 3000 to obtain different data scales and fix as 0.5.
The used real data are mouse embryonic stem cells (mESC) ChIP-seq data [30]. Each mESC data set, which corresponds to a specific transcription factor, is a set of 200length sequences with each sequence being a peak region bound by the specific transcription factor. These data sets contain thousands to tens of thousands of sequences, and we used the first 600 sequences to make motif discovery for each data set.

Evaluation.
In the experiments, we compare the running time and identification accuracy of different motif discovery algorithms. For the site-level identification accuracy, we evaluate it by using the site-level performance coefficient sPC [29]. When we say a motif site is in a set of motif sites , there exists a motif site in such that overlaps . Let and represent the published motif sites and the predicted motif sites, respectively. Then, site-level true positive (sTP), false negative (sFN), and false positive (sFP) are the number of motif sites in both and , in but not in , and not in but in , respectively. Based on this, sPC is calculated as follows: For the nucleotide-level identification accuracy, we evaluate it by using the nucleotide-level correlation coefficient (nCC) [31], an integrated assessment of nucleotide-level sensitivity (nSn) and specificity (nSp). The value of nCC ranges from −1 to +1; a high nCC indicates that the predicted motif is more accurate. Let and represent the nucleotide positions covered by the published motif sites and the predicated motif sites, respectively. Then, nucleotide-level true positive (nTP), false negative (nFN), false positive (nFP), and true negative (nTN) are the number of nucleotide positions in both and , in but not in , not in but in , and in neither nor , respectively. Based on this, nCC, nSn, and nSp are calculated as follows: 3.4. Results on Simulated Data. We select four compared algorithms: MEME-ChIP [23], F-motif [18], PairMotif+ [13], and qPMS9 [16]. MEME-ChIP is a widely used motif discovery algorithm for ChIP-seq data based on PWM. F-motif is a motif discovery algorithm for ChIP-seq data based on word. PairMotif+ is a motif discovery algorithm designed in our previous work. qPMS9 is a recently proposed motif discovery algorithm; it is the best one in exact motif discovery algorithms and can identify motif efficiently in traditional promoter sequences. All the algorithms are implemented in C or C++. Except for MEME-ChIP, whose results are produced by its web server  -: the result is not obtained because the running time is over 24 hours; * the result is not obtained because motif sites are not provided by MEME-ChIP on the corresponding data sets. The site-level identification accuracy is evaluated by the site-level performance coefficient sPC. Since qPMS9 and Fmotif report the same motifs and have the same identification accuracy, the results of qPMS9 are not listed in this table.
(http://meme-suite.org/tools/meme-chip), all the algorithms run on the same machine with a 2.67 GHz CPU and a 4 G memory. Each result is the average obtained on three randomly generated data sets with the same settings. Both PairMotifChIP and MEME-ChIP use the default parameters to identify motifs. In executing F-motif, the minimum motif length is set to 8, and the value of is set to 2, 5, and 8 when the length of the identified motif is 9, 15, and 21, respectively. Both PairMotif+ and qPMS9 need specified and of the identified motif. When calculating identification accuracy, the predicated motif sites are needed. For MEME-ChIP, the predicated motif sites are returned by its web server; for other algorithms, the sites are obtained by searching the substring in each input sequence having the smallest Hamming distance from the predicted motif.
For the first group of simulated data sets, the running time, the site-level identification accuracy, and the nucleotide-level identification accuracy are given in Tables 2,  3, and 4, respectively. It can be seen that (i) all these algorithms show good identification accuracy, since large data sets contain sufficient motif information; (ii) PairMotifChIP has the best time performance among these algorithms; namely, it is able to deal with the DNA data set of 600 sequences within one minute; (iii) MEME-ChIP has the second best time performance, and it solves the problem within half an hour; (iv) for other compared algorithms, their running time becomes impractical with the increase of because of exhaustively verifying l-length candidate motifs.
For the second group of simulated data sets, whose data scales range from 500 to 3000 sequences, the comparison of different algorithms is shown in Table 5. Here, MEME-ChIP and qPMS9 are not taken as the compared algorithms, as MEME-ChIP allows the processed data set containing at most 600 sequences and qPMS9 has a poor time performance in processing large data sets. It can be seen that (i) the three motif discovery algorithms still have a good identification accuracy; (ii) PairMotifChIP performs slower than F-Motif when is 9, while it is significantly faster than the other two algorithms when is 15 and 21; (iii) the running time of PairMotifChIP increases with the increase of data scale, but it does not depend on the motif length.
Finally, it is necessary to test the method for extracting pairs of -mers because it plays a key role in the time performance of the whole PairMotifChIP algorithm. Table 6 shows the comparison of the running time of the method in this paper and the method in [26]. When extracting pairs of -mers with Hamming distance no more than in two given -length sequences, although the worst time complexity of the method in this paper is still ( 2 ), it has a better time performance in practice. Specifically, the method in this paper is 10 times faster than that in [26], which makes it possible to process more DNA sequences.

Results on Real Data.
We test PairMotifChIP's validity of motif discovery on the real ChIP-seq data (i.e., the mESC data). To better show the results, we take our previous algorithm PairMotif+ for comparison. For PairMotifChIP, we use default parameters without any prior information; for PairMotif+, we use the same setting = 13, = 4, and = 2 for each data set. Table 7 shows the results on these real data. For each data set, we list the name of the specific transcription factor, the published motif, and the running time and predicted motif of the two algorithms. The motifs are shown in the form of sequence logos [32]. PairMotifChIP runs much faster  than PairMotif+ on these data. Moreover, PairMotifChIP can successfully find a motif overlapping the published motif for each data set, while PairMotif+ fails to make prediction on the n-Myc, Smad1, STAT3, and Tcfcp2I1 data sets.

Conclusions
In order to identify motifs in large DNA data sets, we propose a new algorithm, PairMotifChIP, which is a ChIPtailored version of PairMotif/PairMotif+. The main difference between PairMotifChIP and PairMotif/PairMotif+ is that (i) PairMotifChIP designs a more efficient method for extracting pairs of -mers and (ii) unlike PairMotif/PairMotif+, which obtains motifs by exhaustively verifying candidate motifs generated from extracted pairs ofmers, PairMotifChIP obtains motifs by combining extracted -mers based on clustering methods. The improvements of PairMotifChIP over PairMotif/PairMotif+ are that (i) PairMotifChIP runs much faster when identifying motifs in large data sets and (ii) PairMotifChIP can make motif discovery without any prior information (e.g., the motif length). The executable program of PairMotifChIP is available at https://sites.google.com/site/feqond/pairmotifchip. It should be noted that, limited by the idea of combining overrepresented substrings, PairMotifChIP may not work well on the traditional promoter sequence data set containing dozens of sequences because of the lack of sufficient motif information.

Notations
: The motif length = { 1 , 2 , . . . , }: The input DNA data set; each input sequence is an -length string over the alphabet {A, C, G, T} : The number of input sequences, = | | : The length of each input sequence : Thep r obab ilitytha teachin pu tsequence contains a motif instance : The maximum number of positions where a motif differs from its instance : The motif conservation parameter -mer: An -length string over {A, C, G, T} : The threshold of extracting pairs of -mers, 0 ≤ < occ( ): The number of -mers in the input sequences with Hamming distance no more than from an -mer occ ( ): The number of -mers in the input sequences with Hamming distance no more than from an arbitrary -mer in random case; it is calculated by (6) occ ( ): The number of motif instances in the input sequences with Hamming distance no more than from an arbitrary motif instance in random case; it is calculated by (11) [ ⋅ ⋅ ⋅ ]: A substring of the string starting from the th position to the th position.