An Entropy-Based Position Projection Algorithm for Motif Discovery

Motif discovery problem is crucial for understanding the structure and function of gene expression. Over the past decades, many attempts using consensus and probability training model for motif finding are successful. However, the most existing motif discovery algorithms are still time-consuming or easily trapped in a local optimum. To overcome these shortcomings, in this paper, we propose an entropy-based position projection algorithm, called EPP, which designs a projection process to divide the dataset and explores the best local optimal solution. The experimental results on real DNA sequences, Tompa data, and ChIP-seq data show that EPP is advantageous in dealing with the motif discovery problem and outperforms current widely used algorithms.


Introduction
Motif discovery problem is an issue of discovering short similar nucleotide segments with a common biological function, which is crucial for understanding the structure and function of gene expression. Quickly and accurately locating motif is a challenging problem in computational biology.
A challenge of motif discovery problem is described as follows [1]: find a motif of length in gene sequences. Each sequence is nucleotides long and contains one motif instance with up to d mutations to the true motif.
Over the past decades, numerous algorithms have been proposed to identify motifs in several to dozens of promoter sequences from coregulated or homologous genes [2]. These algorithms can be divided into two categories: One is exact algorithms, which use consensus sequences to represent motifs [3]. Recent exact algorithms mainly concentrate on pattern-driven algorithms [4][5][6][7][8]. They scan all sequence patterns of length with an initial search space of O(4 l ) and report all possible solves. These pattern-driven based algorithms are able to deal with larger amount of sequences like ChIP-seq data [2,9]. However, they are exponential-time algorithms that need a great deal of time to search for the larger l and inefficient for handling dozens of sequences.
The other category is approximate algorithms, which use the position weight matrixes (PWMs) to represent motifs [10]. The approximate algorithms commonly establish probability training model and score a statistical measure to identify biological signals from background. A particularly successful class of approximate algorithms is developed based on Gibbs sampling [11] and MEME [12]. MEME finds motifs by optimizing the PWMs using the Expectation Maximization (EM), which still defines three types of motif discovery sequence model: OOPS, ZOOPS, and TCM, corresponding to one occurrence per sequence, zero or one occurrence per sequence, and zero or more occurrences per sequence, respectively. The probability training algorithms have been widely used due to its simplicity and stability. The primary advantage of approximate algorithms is the speedy runtime and minimal memory consumption. Random Projection [13] is a projection-based approximate algorithm which projects all substrings of length into the buckets by hashing and then derives the consensus sequences to select some valid buckets. VINE [14] is a graph-based motif discovery algorithm which finds motif by clustering cliques in a -graph. APMotif [15] applies Affinity Propagation to cluster and then employs an effective EM refinement to search for optimal motifs. However, the performances of these algorithms strongly depend on the starting positions, which cause the convergence easy to fall into local optimum, and the training iteration executes much slower when the width of motif increases in the larger data. ] (c) f A : In order to overcome these shortcomings, in this paper, we propose an entropy-based position projection algorithm for motif discovery, named EPP. We design a projection method to divide the dataset into candidate subsets by utilizing the relative entropy in each position of motif. Then, EPP filters the candidate subsets and refines the subsets by searching all the possible instances. We consider intramotif dependency in statistics model and calculate the average loglikelihood ratio to combine the short motif. Our algorithm can apply to OOPS, ZOOPS, and TCM sequence model through the threshold setting. Experimental results on real DNA sequences, Tompa data, and ChIP-seq data demonstrate that EPP is advantageous to deal with the motif discovery problem and outperforms current widely used approximate algorithms.

Notations.
Given an input set of sequences S = { | = 1, 2, 3, . . . , } over the alphabet Σ, the length of sequence is , the length of the motif to be discovered is , and the number of mutations allowed is . The substring, = ( , , +1 , . . . , , + −1 ), starting at position of the th sequence is defined as an -mer. For sequence , there are − +1 substrings of length . Let set X be the set of all the substrings of S. is the projection position. Here, |Σ| = 4 for DNA datasets and |Σ| = 20 for the protein sets.

Motif Representation.
Generally, a motif can be drawn from a multinomial distribution [16], represents the probability of nucleotide preference at the th position of the motif and 0 represents the background probability of nucleotide . The position frequency matrix (PFM) F can be obtained by calculating the frequency of each nucleotide ( ∈ Σ) at each aligned site: where is the count of an observed nucleotide at position and indicates the pseudocounts to deal with the zero frequencies. Figure 1 describes how to calculate the PFM through the input sequences.
Information content (IC) is a measure to rank the motif conservation [17]. Motifs with higher IC represent they have more specific binding preferences. Suppose we have a motif built from the PFM of the selected substrings; the information content of the th position of the motif is defined as Due to the independence of the positions of the motif, the information content of motif is The IC can be used to rank motifs with the same length l. However, some researches indicate that the commonly multinomial distribution model may be too simplistic in identifying the binding motifs, while some positions of TF binding motif exert an interdependent effect on binding affinities of TFs [18,19]. To provide a better result of motifs identification, a more sophisticated model that involves the intramotif dependency should be considered. Intramotif dependency considers that the frequency of nucleotide combinations spanning several positions deviates from the expected frequency under the independent motif distribution [20]. For example, if the frequency of two nucleotides, "GT," in a pair of positions is much higher or lower than the product of frequency of "G" in the first position and the frequency of "T" in the second position, we infer that these two positions are dependent. Therefore, the log-likelihood of nucleotides and +1 is where Φ , +1 represents the probability of the nucleotide pair at th and ( + 1)th position of the motif and Φ 0 represents the background probability of the nucleotide pair. Then, the conditional probability of the substring is where 0 ( ) is the joint probability under the corresponding background distribution 0 . In this paper, we use the third-order Markov model to characterize the background sequence and improve the sensitivity and specificity of identifying motifs. The probability of the substring ( , , +1 , . . . , , + −1 ) in the background under a thirdorder Markov model is So the information content can be represented as Based on the substring statistical significance representation, we present a novel entropy-based position projection algorithm (EPP). EPP aims to solve the motif identification problem and make a good trade-off between accuracy and efficiency, which is detailedly described as follows.

EPP Algorithm
Step 1 (the cluster projection process). Since the random initial state contains too much noise information, how to choose a good initial state to make refinement quickly converge to a local optimal solution becomes essential. Obviously, the ( − + 1) ways of selecting the l-mers from all substrings to constitute the initial state are too large. Here, we designed a cluster projection method to initialize the parameters: (1) Draw all the substrings from dataset S to form a new set X, X = { | = ∑( − + 1)}, where represents an -mer.
(2) Calculate the relative entropy of each position in the set X: (3) Select the position of the maximum relative entropy as the projection position, = argmax =1,..., { }. The collection set X is divided into four subsets through the projection process: the first subset 1 contains all the -mers of appearing base "A" in position . Similarly, the subsets 2 , 3 , and 4 contain all the -mers of appearing bases "C," "G," and "T" in position , respectively.
(4) We set two thresholds max size and min size to check the size of the subsets { 1 , 2 , 3 , 4 }. For example, if | 1 | < min size, we abandon 1 . That is, 1 is too small to contain enough motif instances, which means a transcription factor cannot be combined with sufficient sequences; if | 1 | > max size, the subset has much unnecessary background noise, the algorithm should be back to (2), and we find a new projection position to further divide 1 ; if min size ≤ | 1 | ≤ max size, we consider 1 is qualified and store it into a candidate set { }. The setting of max size and min size will be described in next section. Figure 2 shows an example of the cluster projection process. Figure 2(a) describes the set X derived from S; we choose the fifth position for projection. Figure 2(b) shows the four subsets divided from X; the fifth positon of each subset is the observing letters "A," "C," "G," and "T," respectively. Then, we calculated relative entropy and chose the second, the third, and the fourth position of each subset to project. After several projection processes (Figure 2(c)), we obtain a candidate set { } as shown in Figure 2(d).
In the worst case, the maximum number of candidate subsets is /min size is the number of all substrings (mer). However, in practice, the number of candidate subsets will be much less than this number, such that when the number of substrings is 10 5 , the number of candidate subsets is ultimately only a few hundred.
Step 2 (filter the candidate set). The candidate set { } is constituted by a series of cluster subsets which form by the similar substrings of the same letters at several positons. However, the candidate set still contains the useless subsets made up by the background. It will cost a lot to refine these background subsets and it is necessary to filter them.
Because the projection process calculates the relative entropy to choose the position, it can measure the statistical significance but cannot reflect the complexity of substrings. In order to evaluate the complexity of each subset, we employ the common single-string score [21] as another measure.
So we filter each subset of { } by computing the complexity function (9) and the content information (7) as follows: (1) Calculate the complexity score of each subset in { }, denoted by ( ): where | | represents the cardinality of { } and represents the radius of complexity.
(2) Calculate the content information of each class in { }, denoted by ( ): Similarly, let IC be the radius of IC: this subset is considered qualified and saved into G = { V }.    Step 3 (refine the qualified subsets). Assume each qualified subset V corresponds to a motif; the substrings of the qualified subset should be the motif instances. In fact, we found that the qualified subset contains several fake motif instances generated by the background sequences, while some instances may be missed by the projection and filter processes and are not in the qualified subsets. Therefore, in this step, we remove the fake instances and add the missing ones to refine each qualified subset.
As the previous study [22], we know the instances 1 and 2 of the same motif should be satisfied is the function of measuring the hamming distance between two substrings. For each qualified subset in V , if the substring of the qualified subset satisfies the hamming distance less than or equal to 2d from the others, we keep it in the subset; otherwise, we remove it from the subset. For each fixed , the value of is usually set as < = 2. In this way, the real motif instances must be in one qualified subset.
Then, we search all the possible instances from X and add them into V . The possible instances should satisfy the following two conditions. First, the instance satisfies where | V | is the cardinality of V and represents one instance in V . Second, adding the instance increases the information content (7) of V . These limiting conditions greatly reduce the search space, and we can obtain the refinements for each qualified subset after removing and adding the substrings. In addition, if the qualified subset is too small (less than min size), it indeed does not make sense to contain the real motif instances. We will not refine the small qualified subset and drop it.
Step 4 (predict the longer motif). See each qualified subset as a seed, its PWM can be computed by the steps above, while the corresponding motif with high information content can also be calculated. However, the qualified subsets may represent the similar motifs with a few letters varying as previous studies [23,24]. In order to eliminate redundant motif information and expand the short motif to form longer motif, we combine the similar motifs having the long common-overlap segments by utilizing a metric of computing the average log-likelihood ratio (ALLR) [25]: BioMed Research International 5 Input: the dataset S, motif length . Output: the motifs set (1) X ← all the substrings length from S IF | | < min size (8) abandon c (9) IF min size ≤ | | ≤ max size (10) { } ← (11) IF | | > max size (12) select position in are the count and frequency of base at the 1 th/ 2 th position of 1 / 2 . Since the length of predicted motifs may be different, we use the minimum distance between motifs among all possible overlaps of motifs 1 and 2 that the aligned segment is 6. Thus, we calculate the similarity score of 1 and 2 by (17), where denotes the length of the segment: Suppose the number of motifs to find is ; when a new motif is found, we first check whether there is a similar motif. If the similar motif exists, we combine them and obtain the longer motif; if the similar motif does not exist, we keep the new motif and replace the motif with minimum information content. In this way, we ensure the motifs are different which are also have the information contents as high as possible. In practice, we finally combine and generate at least 20 top information content motifs as the outputs.
The whole algorithm of EPP is described in Algorithm 1. In Step 1, lines (1) to (14), we make the projections to obtain candidate sets; then lines (15) to (17) are the step to filter candidate sets to get the qualified subsets; lines (18) to (23) are the step to refine each qualified subset; at last, lines (24) to (26) are the step to combine the similar motifs and output the results.

Results and Discussion
The parameters we can get from the input dataset include the number of sequences and the length of each sequence ( = 1, . . . , ); the motif length is known (6-30 bps). Based on these parameters, we draw the set X and then start the projection process. The times of projection and the number of the candidate subsets are depending on the parameters of max size and min size. We hope that the candidate subsets containing the true motif have the motif instances as more as possible and have less influence by the background. Thus, for different sequence models, the parameters of max size and min size are flexibly setting in this way. For the OOPS model (one occurrence of motif instance per sequence), we take max size = t and min size = 3 /4; for the ZOOPS model (zeroor one-motif occurrences per dataset sequence), the number of motif instances is less than the number of sequences and we take max size = and min size = t/2; for TCM (two-component mixture) model, there are zero or more nonoverlapping occurrences. Generally, we take max size = 3t/2 and min size = .
We first use six real DNA datasets to test the performance of our algorithm, including CREB, CRP, MEF2, MYOD, SRF, 6 BioMed Research International and TBP [26][27][28]. These datasets contain the sequences of different species, in which motif length varies from 6 to 18 and the number of motif instances is from 17 to 95. Note that, in CREM and CPR datasets, some sequences have two motifs, and in MYOD and SRF datasets, the number of motifs is more than two in some sequences. Using these datasets to test, we can check the performance and stability of our algorithm in different species. And the site information tagged in the dataset can help us have a better performance analysis and compare with other algorithms. The information of the six datasets is shown in Table 1.
Where represents the sequence number, is the sequence length, is the motif length, is the number of motif instances in the dataset, and avg is the average number of motif instances in each sequence.
We compare EPP algorithm with the widely used algorithms, MEME [10], GAME [29], VINE [14], and APMotif [15]. In order to achieve a fair comparison, we use the same motif length for each dataset and use the prior information as less as possible. We choose groups of different initiate sites for multirunning MEME because of the sensitive with initiate conditions. For the genetic-based algorithm GAME, the results are influence by the random seeds; thus, we run the algorithm 20 times and take the average. In each run, the search quantity of motif sets of GAME is 3 × 10 7 .
In order to evaluate the performance of the algorithms, we employ an evaluation method mixing the nucleotide level and the site level [30]. That is, if the predict sites and the real sites are shifting in three bases, it is a true instance. We employ three measures, Precision, Recall, and score [31], which are defined as follows: Here, Precision represents the probability of predicted instances which is influenced by false positive instances. Recall represents the probability of true positive instances. And score is a measure which makes a balance between Precision and Recall, which reduces the influence of false positive. A high score means the algorithm has good performance in both Precision and Recall. Table 2 shows the results of MEME, GAME, VINE, APMotif, and EPP. It can be seen that EPP has a good performance of Precision on MYOD (0.78) and SRP (0.95). MEME has a high Precision on CREB (0.93), MEF2 (0.93), and TBP (0.83). VINE has a high Precision on CRP (0.94). In the respect of Recall, our algorithm performs well on CREB (0.90), CRP (0.79), MEF2 (0.94), and SFR (0.97). APMotif has the same Recall (0.94) on MEF2 with EPP. And VINE performs well on MYOD (0.86) and TBP (0.87). On the aspect of Precision and Recall, we can see that EPP has relatively small influence by the background. In the predicted instances, the true motif instances occupy a larger proportion. So on the aspect of score, our algorithm has the best performance among the five algorithms; only APMotif has the same value on MEF2. The comparison of Precision, Recall, and score is shown in Figure 3; we can find EPP has a stable performance on the average and performs well than the current widely used motif finding algorithms. Table 3 shows the amount of subsets and the -mers in each step, including the total -mers, the thresholds of min size and max size, the amount of candidate subsets and qualified subsets, the -mers in the qualified subsets, and the reducing number of -mers. We can see that the our algorithm eliminates most of the candidate subsets by the projection step and the filter step; only dozens of subsets need to be refined. Meanwhile, the amount of -mers has a great reduction, which is more than 90%. Such as TBP dataset, the amount of -mers reduces by 99% and only two subsets need the refinement.
The running times of the datasets testing above are shown in Table 4. We implement EPP and APMotif in MATLAB under Windows. GAME is implemented in C under Linux. MEME and VINE are implemented through the website version. It is unfair to compare these algorithms implemented in different software, especially compared with website version. But the running time can explain that our algorithm can find the motifs in a reasonable and acceptable time. We report the computational time in the same experiment environment (2.67 G Hz CPU and 4 G memory). From Table 4, we can see that GAME and APMotif are obviously slower than EPP. The web version MEME and VINE are faster than EPP for most datasets. However, MEME needs to run several times for the different start points and VINE is a heuristic algorithm which will be slow with the data size increasing. EPP has the best time efficiency for TBP data because of the reduction of 99% redundant information.
Besides the six real DNA datasets, we also use the Tompa data to test our algorithm. Tompa data is a standard data for evaluating new design motif finding method, including three types of data: Real, Generic, and Markov. Here, we select Real data which contains 52 groups of real promoter sequences extracted from TRANSFAC database and involves four species: Drosophila melanogaster (dm), Mouse (mus), Human (hm), and Saccharomyces cerevisiae (yst). It should be noted that some datasets of Tompa only have one sequence, such as dm02r and dm06r. Not each sequence contains the motif, such as dm01r, hm06r, hm11r, mus07r, and yst01r. And for most of the Tompa datasets, each sequence contains more than one motif, like hm08r, hm10r, mus11r, yst03r, and yst05r.     Motifs are difficult to identify for the weak conservation in Tompa data. Thus, we select a part of the datasets to test, which are dm01r, dm02r, dm03r, dm04r, dm05r, and dm06r in Dm species; mus01r, mus03r, mus05r, mus06r, mus11r, and mus12r in Mus species; hm01r, hm07r, hm08r, hm10r, hm17r, hm22r, hm23r, and hm24r in Hm species; yst01r, yst02r, yst03r, yst04r, yst05r, yst06r, yst08r, and yst09r in Yst species (Figure 4). We use the measure based on the nucleotide level to evaluate the performance, because the number of motifs and the length of motifs are different in each sequence.
where TP (true positive) represents the real sites in the predicted sites; FP (false positive) are the fake sites in the predicted sites; FN (false negative) represents the fake sites that do not predict. We also choose MEME as the reference algorithm to compare the performances. The length of motif ranges from 6 to 30 bps, and we output the best result. Figure 3 is the results of EPP and MEME. We can see that both EPP and MEME are hard to find the motifs in the one sequence data sets, such as dm02r and dm06r. For the datasets dm03r, dm04r, and dm05r, some sequences have several motifs but some sequences have no motif; for example, the third sequence of dm05r contains 9 motifs. This motif distribution makes it difficult to identify. Thus, both EPP and MEME have poor effect for the Dm spices. For the Hm spices, one notable feature is that the length of motifs changes a lot; for example, the motifs of hm01r range from 7 to 56 bps. We use the fixed motif length as before which can only predict a part of segment overlapping with the true motif. However, for the data motif length changing relatively small, like hm17r (10-17 bps), both EPP and MEME have the best results. And EPP has a higher accuracy than MEME in the Hm spices. For the Mus and the Yst data, most of the datasets contain less than 10 sequences (expect mus11r, yst03r, yst08r, and yst09r), and most of the sequences have multiple motifs of different lengths. From the experiment results, we find that EPP and MEME have their own advantages for these two species.
Through the experiments above, we can see the existing algorithms have poor performance on Topma data [32]. However, the different algorithms can complement and reinforce each other. For example, for the data mus06r, yst05r, and hm10r, EPP can have an effective prediction but the accuracy of MEME is worse. In recent research, the algorithm like Ensemble which merges the results of different algorithms can improve the accuracy effectively [33]. Moreover, the same results of the different algorithms can also enhance the prediction.
In order to show the effect of our algorithm, we also test the synthetic datasets which contain the low and high conservation positions. The synthetic datasets are generated under the following six combinations of three perspectives: (1) motif width: short (8-10 bp), middle (14-16 bp), and long (19-21 bp); (2) sequence length: 600 and number of sequences: 20; (3) motif conservation: low and high. For each combination, we sample 10 datasets which are generated randomly and embedded with the instances of a random motif. Specifically, in the high conservation aspect, the dominant nucleotide is generated with 0.91 probability on each position of the motif instance (while all other three nucleotides are generated with 0.03 each). In the low conservation aspect, only 60 percent of the positions in the motif instances are as highly conserved as those in the previous high conservation aspect, while the rest 40 percent of the positions are lowly conserved, where the dominant nucleotide is generated only with probability 0.55 (while all other three nucleotides are generated with 0.15 each) in every instance. Table 5 shows the performance coefficient (NPC) of MEME, VINE, and EPP. From the results, we can see that  all these compared algorithms have good performance on the high conservation dataset. Among these compared algorithms, EPP has the best results on three high conservation datasets (0.98, 0.99, and 1), which are higher than the other three algorithms. For the low conservation datasets, EPP has the highest accuracies among these compared algorithms. However, when the width of motif is short, motif instances are hard to distinguish from the background sequences; the accuracies of all the compared algorithms are low.
Meanwhile, we also use 12 TFs in mouse embryonic stem cell ChIP-seq datasets to test our algorithm. ChIP-seq is a technique coupling chromatin immunoprecipitation experiment with high-throughput sequencing [34,35], which provides dataset of one or two magnitudes larger than a typical motif discovery dataset and sequences with a high resolution. Therefore, the tradition motif finding algorithms are hard to solve ChIP-seq data for the huge calculation. In order to improve the efficiency of EPP, the original dataset is equally divided into halves: a training set and a testing set. We run the projection and filter steps on the training set to generate the qualified subsets, and then run the refine step to search the instances and construct longer motifs on the testing set. Table 6 shows the results of 12 TFs in mES ChIP-seq datasets discovered by our algorithm with the motifs found by Chen et al. with Weeder [36]. It can be seen that EPP is able to find the motif similar to the published one.
Chen et al. report a single motif with Weeder. Besides these primary motifs, our algorithm can find multiple motifs for each TF using the same datasets. For instance, Oct4 and Sox2 often form a heterodimer that binds a Oct4 motif located adjacent to a Sox2 motif, called the Sox-Oct motif [37]. In Sox2 and Oct4 dataset, EPP predicts not only the Sox-Oct composite motif bound by Sox2 and Oct4 complex but also the monomer motifs Sox2 (CCATTGTT) and Oct4 (TATGCAAAT). As discussed by Chen et al., Smad1 and Nanog frequently bind the same regions as Oct4 and Sox2, which raises a particular difficulty for motif discovery [38]. In Smad1 dataset, our algorithm finds motif "CCTTTGTC," which matches a Sox2 motif and demonstrates the frequent cobinding relationship of Smad1 and Sox2 TFs. Furthermore, our algorithm was able to find the Nanog motif "CCATCAA," which corresponds to an experimentally validated alternative Nanog motif [39].
In summary, EPP is a competitive algorithm to deal with motif discovery problem; our method has the following advantages: (1) the projection which deals with all the substrings does not miss any information in the data. That is, this step guarantees each substring may exist in a candidate subset. (2) The goal of finding motif is to find the substrings having the maximum IC, and the process of selecting the projection positon is also a part of maximizing IC. (3) The size of candidate subsets depends on the thresholds [min size, max size]. If a candidate subset is too large, it will contain too much background information. We continue to divide it; if a candidate subset is too small, the substring in it may be not enough to represent an effective motif. We abandon it. In some cases, motif instance may exist in the abandoned subset, but it still can make up by other subsets containing the motif instance. In the worst case, the number of the candidate subsets is n/min size, where n is the number of all substrings. However, in practice, this number will drastically reduce. The number of candidate subsets may be only a few hundred for 10 6 substrings. (4) There are often some meaningless DNA segments in real data, such as duplicate "AAAAAAAAAA" or "CGCGCGCGCG." These segments will generate the same duplicate substrings which cause redundant computation. Through the projection step of our algorithm, these segments will be very easy to find and discard.
In addition, the computation complexity of EPP mainly depends on projection step and refinement step. Suppose the time of projection is ℎ, in each projection, the computation complexity of calculating relative entropy is O(nl); then, the computation complexity of the projection step is O(hnl). Since the order of magnitude of ℎ and is 10, and is usually less than 10 6 , the order of magnitude of projection is about 10 8 . In the refinement, the number of qualified subsets is about 10 2 for 10 6 substrings. the computation complexity of refinement in each qualified subset is O(nl). So the order of magnitude of refinement is 10 8 which is totally acceptable.

Conclusions
We propose a new probability algorithm named EPP for identifying motifs in DNA datasets. EPP presents a new entropy-based position projection to divide original dataset and remove a large amount of redundant information. Experimental results show that EPP is able to efficiently and effectively identify motifs in DNA sequences and ChIP-seq datasets. However, the functions of some motifs are still unknown; the analysis of motifs in these complex transcriptional regions is needed. In addition, with the increase of data size, designing the parallel algorithm to handle big data is also a key issue for the future study.