A New Algorithm for Identifying Cis-Regulatory Modules Based on Hidden Markov Model

The discovery of cis-regulatory modules (CRMs) is the key to understanding mechanisms of transcription regulation. Since CRMs have specific regulatory structures that are the basis for the regulation of gene expression, how to model the regulatory structure of CRMs has a considerable impact on the performance of CRM identification. The paper proposes a CRM discovery algorithm called ComSPS. ComSPS builds a regulatory structure model of CRMs based on HMM by exploring the rules of CRM transcriptional grammar that governs the internal motif site arrangement of CRMs. We test ComSPS on three benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them.


Introduction
Eukaryotic gene expression is regulated by cis-regulatory modules (CRMs). A CRM is a DNA sequence that contains multiple binding sites of one or more specific transcription factors (TFs) that regulate together an aspect of a gene expression pattern. A CRM typically has a length range from 100 to 2000 bp. Promoters, enhancers, and silencers are common CRM examples; they perform different regulatory functions and are located in 5 耠 flanking regions, 3 耠 flanking regions, promoter regions, intron regions, and even exon regions of regulated genes. The CRM discovery is the key to revealing mechanisms of gene regulation and understanding mechanisms of development, evolution, and disease.
Although CRMs play important roles in the regulation of eukaryotic gene expression, the discovery of CRMs [1,2] remains a challenging problem in biology. The reasons are as follows. The distribution of CRMs in gene regulatory regions is very wide and some CRMs are even far from the target gene up to several hundreds of kbp; motif sites as functional elements of a CRM are short and degenerate, and their identification in itself is a difficult problem. Moreover, the complex regulatory structure of CRMs, which defines how motif sites within CRMs are organized to form regulatory complexes in what number, order, orientation, and spacing, further increases the difficulty of CRM discovery. CRMs of eukaryotic genes have the complex and varying structure; the motif sites within orthologous CRMs of different genes often mutate and rearrange, and the overall structures of CRMs are not conserved. Unfortunately, the internal regulatory mechanism of CRMs, which governs the organization of structure features, has not yet been fully understood. Thus, it is difficult to deterministically describe the regulatory structure of CRMs.
Current methods predict CRMs based on motif site clustering or modeling the regulatory structure of CRMs along the some clues. Although the overall CRM structure is not conserved, the biological reasons make the arrangement of motif sites within a CRM nonrandom, and the arrangement of these motif sites should be conducive to TFs binding. Many studies have shown that motifs are usually organized into composite elements (CEs) [3] to regulate eukaryotic gene expression. CEs are defined as functional units consisting of two or more motif sites that are located close to each other. The corresponding TFs interact with CEs to enhance or repress a specific transcriptional activation. The presence of a CE within a CRM constrains the ordering and spacing between motif sites within the CRM. Current CRM discovery methods can be divided into the following three categories according to CRM representation.

BioMed Research International
The first category simply views a CRM as a cluster of motif sites in windows and uses the set model to represent CRMs [4][5][6][7][8][9]. Some methods in this category employ statistical approaches to score clusters of motif sites in windows of a predefined given size and to identify the statistically significant site clusters as candidate CRMs. For example, MSCAN [4] uses values to score clusters of motif sites in each window region. MotEvo [7] uses a Bayesian approach to calculate the posterior probabilities of all given motifs in each aligned window. Some other methods use combinatorial approaches to search a combination of motif sites satisfying specific constraints in a window region of a given size. For example, CMStalker [8] and CPModule [10] use a constraint satisfaction formulation and a constraint programming for itemset mining approach, respectively. The advantages of these methods are that they are simple, direct, and easy to implement. But they just define simple uniformity constraints for component motifs of CRMs and ignore the regulatory structure of CRMs, which is unrealistic. Moreover, their performance is usually limited by the window size setting and the hard specification of scoring thresholds.
The second category still regards CRMs as sets of motif sites but directly or indirectly constrains the order and distance of motif sites within CRMs based on phylogenetic conservation. Some methods search possible CRMs in orthologous sequences by looking for a group of predefined CRM patterns, that is, a set of specific motifs and constraints on ordering and intersite distances for these motifs; the corresponding methods are [11,12]. Some other methods use sequences alignment, such as [13,14], or motif site sequences alignment, such as [15], to detect conserved CRMs among species. Although this category of methods has achieved a certain degree of success, they are just applicable to the discovery of conserved CRMs in related species.
The third category uses probabilistic models to represent CRMs [16][17][18][19][20][21][22][23]. Most methods use Hidden Markov Models (HMMs) to model CRMs. The HMM can provide flexible structure representation by defining different states and transitions. Early methods, such as Cister [19] and Cluster-Buster [20], only define the distance constraint between motif sites but do not model any order between motif sites within a CRM. Subsequent methods, such as Stubb [21], BayCis [22] and CORECLUST [23], define transitions between motif sites to infer the possible spatial arrangement of motif sites within a CRM. The difference between these methods is that Stubb defines the correlation between some motifs, but, in inferring CRMs, it uses sliding window technology. The limitation of sliding window technology lies in the fact that the lengths of predicted CRMs are difficult to know in advance and are only based on experience to guess. BayCis and CORECLUST attempt to capture structural characteristics of CRMs by defining the correlation between all given motifs. This introduces a large number of model parameters to be estimated leading to a large number of computations.
This paper presents a new CRM discovery algorithm called ComSPS (the executable program of ComSPS is available at https://sites.google.com/site/onehoare/comsps). The algorithm still uses HMM to represent CRMs. Coexpressed genes often share common expression patterns and CRMs driving these expression patterns usually have an identical regulatory structure. We build a regulatory structure model based on HMM by exploring the rules of CRM transcriptional grammar [24] that governs the arrangement of motif sites within CRMs. These grammatical rules include not only motifs constituting CRMs but also the orientation preferences of motif sites, distributions of motif sites and their spacing, and the arrangement preferences of motif sites. Searching for the CRM transcriptional grammar is helpful to distinguish potential CRMs from background and thus improves the identification performance of CRMs. Moreover, modeling the CRM transcriptional grammars allows ComSPS to be able to analyze the CRM regulatory structure. In addition to motif states, ComSPS explicitly defines CRM states and can automatically determine sizes of CRMs. ComSPS uses the conservation of CRMs across sequences to improve the prediction accuracy without sequence alignment. We test ComSPS on three public benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them.

Overview.
Binding sites of a TF usually mutate and are not identical in different sequences, but they have a common pattern known as a motif. We use a TFBS or a motif site to refer to an occurrence of a motif in sequences. Here, we use position weight matrices (PWMs) [25] to represent motifs.
The basic input of the algorithm includes a set of regulatory sequences, denoted by = { 1 , 2 , . . . , 푁 }, which are from a group of coexpressed genes, and a set of motifs that are represented by PWMs, denoted by = { 1 , 2 , . . . , 퐾 }. These coexpressed genes are assumed to be regulated by similar CRMs with sites of motifs from W. In addition, the input includes some other model parameters. The algorithm outputs positions of found CRMs, along with TF labels of motif sites within the CRMs and their positions in sequences.
ComSPS employs an HMM to model the CRM regulation grammars shared by a group of coexpressed sequences. The specific regulation grammar rules introduced by the model include the number of motif sites, the origination of motif sites, the distance between motif sites, the order of motif sites, and the coassociation of motif sites. These rules are characterized by defining the parameters that are attached to the HMM model structure (state, state transition, and state transmission) and are determined by the learning of model parameters from given datasets. By the inference of the model, the CRMs following the predefined grammatical rules are deduced.
Specifically, states of motifs to probably be involved in the regulation and their reverse complementary counterparts are defined to capture motif sites constituting CRMs and their orientation preference; motif emission probabilities are described by PWMs to characterize the binding affinity of motifs; intermotif background states are defined to describe the distribution of nonmotif nucleotides and their duration distributions reflect the distance constraint between motifs; motif frequencies are added to the model to infer the distribution of motif sites; correlation probabilities are introduced to capture coassociated motif site pairs. The model is trained using an extended Baum-Welch algorithm based on the expectation maximization (EM) algorithm, and most parameters are adjusted automatically. The model uses the Viterbi algorithm to search CRMs by inferring the most likely state path. In addition, to improve the time efficiency of the algorithm, the training and inference of the model on all sequences are parallelized.
Before and after building the model, we performed some additional processing. If the given PWMs are directly from existing motif libraries (such as TRANSFAC [26] and JASPAR [27]) or outputs of third party de novo motif discovery methods, there may be more than one PWM describing motifs of a same TF, and these PWMs may have different lengths; however, the HMM architecture depends on the given PWMs, and many unnecessary states will be introduced when directly using them, thus increasing search space. In such cases, it is necessary to merge PWMs that represent the same TF before constructing the HMM. We further screened CRMs found by the HMM according to their conservation among all sequences to reduce false-positive predictions.
Based on these criteria, the whole algorithm can be simply described as follows: (i) We filter the input PWMs based on clustering of similar PWMs. This step depends on the quality of the given PWMs, as optional preprocessing of the algorithm.
(ii) For the input sequence, we use an HMM to model the regulatory structure of CRMs based on the filtered PWMs (or the given PWMs). We train model parameters using the Baum-Welch algorithm. Based on the trained model, we use the Viterbi algorithm to infer the locations of potential CRMs in sequences.
(iii) For the found candidate CRMs, we further screen them based on their conservation among sequences and output them.
Each specific step of the algorithm is described in detail as follows.

PWM Filtering.
To find the similar PWMs describing motifs of a same TF, we use a general clustering method. The basis of clustering is measuring the similarity between these PWMs. To reduce the impact of uncertainty and noise in biological data and to better measure the similarity between PWMs, we use the FISim algorithm [28] to calculate their similarity. FISim takes into account the relative importance of bases at different positions in a PWM and can give higher scores to more conserved positions, which makes FISim more robust than other methods. The following are specific steps of the filtering process: (i) Use the FISim algorithm to measure the similarities between all given PWMs.
(ii) Take similarities as weights, PWMs as vertices, and PWM pairs whose similarities are above a given threshold as edges to construct a weighted adjacency graph.
(iii) Use the process [29] to find all dense subgraphs whose density are above a given threshold, and return them as the final clusters; let 耠 be the number of clusters ( 耠 ≤ ). Each cluster corresponds to a unique TF.
(iv) Choose the PWM whose length is closest to the average of all PWMs in a cluster as representative of the cluster. After filtering, the PWM set is denoted as

CRM Searching
Based on the HMM. Based on the filtered PWM set 耠 (if the given PWMs does not need filtering, then 耠 = ), we build an HMM to identify CRMs in the given sequences. The HMM characterizes the regulatory structure of CRMs by introducing corresponding states and transitions. For a given sequence, it can be viewed as observations of the HMM. The HMM encodes the regulatory structure of the sequence according to a following hierarchical organization. A sequence can be considered as a mixture of CRMs which have variable lengths and are separated by inter-CRM background sequences; a CRM can be modeled as a concatenation of motif sites and intra-CRM background sequences.
As shown in Figure 1, inter-CRM backgrounds are denoted by the state 푔 ; a CRM is denoted by the state 푠 , which indicates the start, and the state 푒 , which indicates the end. Here, 푠 and 푒 are auxiliary states. The auxiliary states do not emit specific bases and are only used to label the model structure; they are represented by circles in Figure 1. For the given sequence, the model assumes that it is regulated by at most 耠 different TFs. The 耠 states 1 , 2 , . . . , 퐾 represent the corresponding motif sites, and PWMs of these motifs are from 耠 . Considering that motif sites may appear on the reverse complementary strand of the sequence, we define the reverse complementary of the motifs. We use 퐾 +푖 to denote the reverse complementation of a motif 푖 . Thus, the model has 2 耠 motif states. The model introduces an auxiliary state next 푖 to indicate that a motif state next to 푖 is still within a CRM and to establish a transition of the motif 푖 to other motifs to capture the correlation between them and motif frequencies. 푐 (푖,푗) denotes spacers between sites of motifs 푖 and 푗 , also known as intra-CRM backgrounds; when not specifically referring to spacers between two specific motifs, the superscript is removed and expressed as 푐 . The model assumes that emission probabilities of these background states follow the same distribution.
The transition probabilities are regarded as unknowns in the model and are defined as follows. For each position of the sequence, a decision is made to determine whether to initiate a CRM or generate a segment of background, from the CRM model with probability 푟 or the background model with probability 1 − 푟 , respectively. If the model starts a CRM at the current position, then the current state becomes 푠 , indicating the start of a CRM. From the state 푠 , there is a probability 푖 to initiate the CRM's first motif site 푖 , and each position in the following region with the length 푖 has Figure 1: The ComSPS HMM state transition diagram. Emission states are represented by shaded circles. The auxiliary states are represented by circles, which are only used to model the structure (not to emit specific bases). and are the initiating and terminating states of the model, respectively. Only the transition probabilities marked by dotted lines need estimating in the model training.
the same state 푖 , where 푖 is the length of the motif 푖 . Then, the model transits to 푒 , with the probability 0 , to end the CRM and return to the background 푔 ; or the model can alternatively continue the CRM with the probability 1 − 0 and then chooses a next motif 푗 with the probability 푖,푗 , implanting the corresponding background 푐 (푖,푗) between the sites of 푖 and 푗 . Figure 1 shows the HMM architecture, with the transition probabilities marked at the arrows.
To capture the spatial correlations of coassociated motif sites as done in Stubb we introduce a parameter 푖,푗 , which describes the probability that, along a DNA strand, a motif 푗 site is located downstream of a motif 푖 site and characterizes their specific arrangement, as shown in Figure 2. All motif state pairs are initialized to be noncorrelated; when the model detects that the cooccurrences of motifs 푖 and 푗 are statistically significant, the correlation between motifs 푖 and 푗 is identified and the parameter 푖,푗 is added to the model parameters. Under this definition, the transition probability between any motif sites is calculated as follows. If there exists correlation between 푖 and 푗 , their transition probability 푖,푗 is 푖,푗 ; if they are not correlated, 푖,푗 is renormalized to ensure that ∑ 2퐾 푗=1 푖,푗 = 1. Thus, the model only considers the correlation probabilities of the motif pairs which frequently cooccur and the estimated model parameters are reduced. Let 푖 = { | cooccurrence times of motifs 푖 and 푗 be statistically significant}, and 푖,푗 can be specifically expressed as To determine whether the cooccurrence of motifs 푖 and 푗 is statistically significant on a given sequence set, a -score [21] for their cooccurrence times 푖 푗 is defined as follows: where 푖 푗 and 푖 푗 are the expectation and standard deviation of 푖 푗 , respectively. When 푖 푗 and 푖 푗 are greater than the given thresholds, the model determines the correlation between the two motifs. Figure 2: Correlation of coassociated motifs. In gene regulation, a TF usually works synergistically with other TFs by interactions to regulate a highly specific expression pattern. To facilitate such interactions, their binding sites are located adjacent to each other and form modules, also called composite elements (CEs) [3]. The same CEs perform similar functions in different genes, and they should be conserved in sequences and have a preferred arrangement. To capture coassociated motifs constituting CEs, the model defines the correlation probabilities. Based on CE' conservation assumptions, such motif pairs may cooccur repeatedly in regulatory sequences of genes, which are marked in rectangles in the example. The region within each pair of brackets represents a CRM. Each polygon represents a motif site within a CRM.
The emission probabilities of the model are considered as known. In our HMM model, each state can emit a string of bases with variable length instead of a single base; this type of HMM is called the HMM with duration [30]. In the HMM, the probability of each state emitting a base sequence with a specific length is expressed as the product of the probability that the state generates any sequence of the length and the probability that the state generates the base sequence given the length.
Given a motif 푘 , its PWM and length 푘 are known. Let 1:푙 = 1 2 ⋅ ⋅ ⋅ 푙 be a site of the motif and let 푘 [ , ] be the probability of base at position of its PWM. The probability generating the motif site is expressed as The inter-CRM background state 푔 and the intra-CRM background 푐 are modeled as the th order and 耠 th order local Markov chains with the parameters 0 and 1 , respectively. The two parameters are easily estimated from the given sequences. For the state 푐 , we assume that its length satisfies a geometric distribution with the expectation ℎ, and the corresponding geometric distribution parameter ℎ = 1/ℎ is taken as a parameter of the algorithm to be specified in the configuration. Under the distribution, the probability of an intra-CRM background segment has a length and is represented as follows: For the state 푔 , its length reflects the distance between CRMs and is approximated by a geometric distribution with the expectation 1/ 푟 .

Inference and Training of the Model.
Given a training set , the model assumes that these sequences are independent. Therefore, the estimation of parameters can be done separately on each sequence. Let = { 푟 , 0 , 1 , . . . , 퐾 , . . . , 푖,푗 , . . .} and Θ = 耠 ∪ { 0 , 1 }. In these parameters, other parameters except for 푖,푗 are derived from the existing HMM. To get 푖,푗 from a function, we follow the same process as done in [31]. Specifically, 푖,푗 is estimated on the training set as follows: where 耠 푖 is the complement of 푖 and 푖 푗 are cooccurrence times of motifs 푖 and 푗 on the given training set. Under the assumption of sequence independence, 푖 푗 is the sum of cooccurrence times 푖 푗 ( ) of motifs 푖 and 푗 on each sequence : ∈ . 푖 푗 and the second term ( 2 푖 푗 ) of 푖 푗 can be estimated on the training set as follows: where is a state path and 푖 푗푘 ( ) is the indicator variable for the event where motif 푖 is followed by motif 푗 located at position in a sequence of the training set .
To estimate , the Baum-Welch algorithm is extended to maximize the likelihood log ( | , Θ). For a sequence ∈ , we define and as a state path of the model and the state duration sequence, respectively. Finally, log ( | , Θ) is represented as follows: To maximize the likelihood log ( | , Θ), we turned to maximize the function of log ( , , | , Θ) following the process in [31]. The function is solved by the EM algorithm to iteratively update and finally converge to a locally optimum .
Based on the trained model, we use the Viterbi algorithm [31] to infer the most possible state path in each given sequence to be searched. The algorithm finds a group of CRMs in the sequences. For a given CRM , we give a score to it by a log likelihood ratio as follows: where ( | 1 , , 耠 ) and ( | 0 ) are the probabilities of generated by the CRM and background models, respectively.

Implementation.
The implementation is as follows: (i) In the determination of the correlation between motifs, thresholds of the 푖 푗 score and the expectation 푖 푗 are set to 1. (ii) In the model, for two local Markov chains used by background models, the default values of their orders and 耠 are 1 and 2, respectively, and their parameters 0 and 1 are off-line calculated in a bookkeeping way. (iii) Considering the fact that data sparsity can lead to overfitting, for the probability 푟 to initiate a CRM and the probability 0 to terminate a CRM, they are specified in the configuration without using the Baum-Welch algorithm for estimation. In the model, 푟 and 0 are set to 0.001 and 0.1, respectively, which works well in most cases. For the expectation ℎ of the distance between motif sites within a CRM, ℎ is set to 50 as a default; for a CRM with dense clustering of motif sites, ℎ is set to a small value, such as 20.
(iv) To improve time performance of ComSPS, we perform parallel optimization for the training and inference processes of the model. Since the model assumes that all sequences are independent, some steps of the two processes on a set of sequences can be performed concurrently. Specifically, for the model training, during each iteration, each of the calculations for expectations of state transition counts and the revaluation of sequence likelihoods can be executed concurrently on each sequence. The inferring process on each sequence is an independent task and thus it can be simultaneously performed on multiple sequences. In the concrete implementation, we perform the parallel acceleration by multithreading technology.

CRM Screening.
Before further processing, the CRMs predicted by the HMM are first filtered according to a weight threshold 푐 , leaving CRMs above the threshold as candidate CRMs. The threshold 푐 is set in the configuration file. These CRMs fit to the model and are statistically significant. Real CRMs should be conserved across multiple genes. So we further screen the candidates based on their conservation among given sequences. This step is executed only when the sequence set to be searched contains more than three sequences with predicted CRMs.
To explore the conserved CRMs between multiple sequences, we integrate all information on similarities between CRMs into one graph. In the graph, each node represents a CRM, and an edge between two nodes indicates that the corresponding CRMs coming from different sequences are sufficiently similar (their similarity score is above a given threshold). According to the definition, the node sets containing CRMs conserved across multiple sequences form a clique in the graph, and a maximum clique of the graph corresponds to a possible type of CRMs, as shown in Figure 3. Therefore, the steps of identifying conserved CRM can be described briefly as follows.
Firstly, an undirected adjacency graph to describe the similarity relation between CRMs is constructed. In the model, the similarity between two CRMs is scored based on the contained motifs and motif sites without considering intra-CRM background. Specifically, the similarity score ( 푖 , 푗 ) between two CRMs 푖 and 푗 is defined as follows: where 푚 is a weight coefficient as a configurable parameter, 푐 and MS 푐 are the motif set and the motif site set of a CRM 푖 , respectively.
Then, the process [29] is performed to enumerate all the maximal cliques with three or more than three nodes.
Lastly, we combine these found CRM cliques and then output all CRMs in these cliques. The conservation CS( type ) of a type of CRMs type from a clique with the node set can be scored as follows: where is the sequence number of the whole sequence set.

Results and Discussion
We selected five compared methods, CPModule [10], MotEvo [7], Cluster-Buster [20], Stubb [21], and BayCis [22], and evaluated these methods on three public benchmark datasets: XIE dataset [32], Muscle dataset [33], and REDfly dataset [34]. Using public benchmark datasets is more convenient to measure the similarity between the prediction results of the methods and the actual results and objectively evaluate their performances.

Performance Evaluation.
Given sequences to be searched, all CRM discovery methods output positions of predicted CRMs. The prediction of a method for CRMs can be viewed as classifying for the base at each position in the sequences. The bases are predicted to belong to a CRM, and they are annotated as Positive (P); the bases are predicted to belong to background, and they are annotated as Negative (N). Each annotation is either True (T) or False (F); thus, corresponding annotations are divided into four categories. Counting each category of annotations, they are denoted as nTP, nFP, nTN, and nFN. Obviously, the more true annotations of nTP and nTN relative to the false annotations of nFP and nFN, the higher the similarity between the prediction result and the actual result.
To quantify the similarity, some measures are designed. Let nPP (Predicted Positives) represent the number of bases that are predicted as CRMs, where nPP = nTP + nFP. Let nAP (Actual Positives) represent the number of bases that the actual CRMs contain, where nAP = nTP + nFN. The ratios of nTP to nPP and nAP define two important measures, sensitivity (Sn) and precision (Pr), respectively.
However, from their definitions, we can easily see that the two measures oppose each other. In this sense, if a method is expected to highlight a single measure by only changing the tightness of the search conditions but keeping the search strategy, it will inevitably decrease the other measure. Thus, to comprehensively evaluate the performance of a method, it needs to consider them together, for example, using P/R curve analysis or introducing balanced measures of Sn and Pr, such as F1-score and ASP.
In particular, the correlation coefficient (CC) as a common overall performance measure is often used to measure statistical correlation between the prediction results and the actual results. As a special case of the Pearson correlation coefficient of two variables, it was introduced by Burset and Guigó [35] to evaluate gene structure prediction and later widely used in other aspects of bioinformatics. The value of CC ranges between −1 and +1.
All the above measures are based on the classification for bases, and they are referred to as the measures at base (or nucleotide) level. A similar classification can be made at motif site level, when an evaluated method gives all motif site information within predicted CRMs. However, since not all methods give the specific motif sites within CRMs, our evaluation is only limited at base level.
For the experiment on the XIE and Muscle datasets, we used CC as the measure to evaluate the methods. For the experiment on the REDfly dataset, we evaluated the methods according to the evaluation protocol of Ivan et al. [34]. The evaluation protocol is simply described as follows. On each subdataset, the average length of CRMs in the subdataset is calculated and then provided to CRM discovery methods as a parameter. These methods are required to output the CRMs closest to the length. Limited by this evaluation framework, it has nPP = nAP; thus, Sn = Pr. Hence, it can use only Sn as the measure. Furthermore, to evaluate the statistical significance of the prediction results, an empirical value is introduced. Specifically, the empirical value is defined as the probability that Sn of a stochastic prediction is greater than that of the prediction result. This value is calculated by using the stochastic simulation. When the value of Sn is less than 0.05, the prediction is considered statistically significant.

XIE Dataset.
The dataset was constructed by Xie et al. [32]. The dataset contains 22 sequences. Each of these sequences is 1000 bp in length and they are from human, chicken, and mouse genomes. Among these sequences, there are 20 sequences with each containing an implanted CRM and two sequences containing no implanted CMRs. These CRMs have the length of at most 164 bp and contain binding sites of three TFs: Oct4, Sox2, and FoxD3. The distance between these TFBSs follows a Poisson distribution with expectation of 10.
The benchmark constructed four PWM test sets by adding different number of decoy PWMs. Specifically, noise ratios of these PWM test sets are 7/10, 17/20, 27/30, and 37/40, respectively. Each PWM test set contains 10 collections independently sampled from 516 TRANSFAC PWMs. This dataset can be downloaded from the web site provided by [9].

Testing
Results. On this dataset, following the way of three other HMM-based methods, Cluster-Buster, Stubb, and BayCis, all sequences are used as the training set and the testing set simultaneously for ComSPS. For Stubb, CPModule, and MotEvo, which depend on window size settings, they were tested with three window size settings, 100 bp, 150 bp, and 200 bp, which are around the average length of CRMs. Other parameters of all these methods remain default.  Figure 4 shows the means and variances of CCs of all methods at different noise levels on the dataset. Overall, ComSPS performed more stably than other methods and made the best predictions at different noise levels.
The figure also shows that, with increasing noise, the performances of all methods inevitably decreased, and their sensitivities to noise revealed great differences. ComSPS and Stubb performed the most stably, while Cluster-Buster was most sensitive to noise. Moreover, at the same noise level, these methods also showed a consistent trend for the collections of different decoy PWMs. The window clustering methods showed different performance trend for different window size settings. MotEvo seemed to perform better for small window size settings and CPModule was the opposite, while Stubb tended to perform best with the window size setting of 200 bp close to the average length of CRMs on the dataset. This may be the result of their different scoring or search strategies for CRMs. [36]. Later, it was extended by Klepper et al. [33] and used for the evaluation for CRM discovery methods in [33]. The dataset contains 24 sequences, as well as five motifs (Mef2, Tef, Srf, Sp1, and Myf), which play an important role in muscle regulation. The 24 genomic sequences are from mouse, cow, rat, chicken, and human, and their lengths range from 269 bp to 1000 bp (the average is 851 bp). The dataset has one CRM in each sequence. These CRMs range in length from 14 bp to 194 bp, with an average of 120 bp. Each CRM contains two to eight motif sites (the average is 3.5). The dataset and CRM annotations are taken from the companion web site of [33]. other three HMM-based methods, Cluster-Buster, Stubb, and BayCis. For the window clustering methods, we chose four different window sizes, 100 bp, 150 bp, 200 bp, and 300 bp, which are close to the median, mean, third quartile, and the maximum value of CRM lengths on the dataset, respectively. Other parameters of all these methods remain default.

Testing
Counting all TPs, TNs, FPs, and FNs predicted by each method on each sequence, we calculated their CC scores on the whole sequence set. Moreover, for the window clustering methods, we calculated the CC scores under each window size setting. The results are shown in Figure 5.
On this dataset, the prediction performance of ComSPS significantly outperformed the other methods. Overall, the HMM-based methods were superior to window clustering methods. Specifically, Cluster-Buster and BayCis made good predictions, which are second only to ComSPS. The window clustering methods showed a similar trend to that on the XIE dataset and still had a greatly different performance under different window settings. For example, Stubb made a good prediction close to ComSPS under the 200 bp window setting but performed slightly better than the lowest CPModule under the 100 bp window setting.

Cross Validation.
To evaluate the ability of ComSPS to learn the parameters, we performed 10-fold cross validation on the dataset. Since other HMM-based methods cannot explicitly specify the training set and the test set, we only verified ComSPS. The CRM prediction is not a simple binary classification task. The structural flexibility of CRMs leads to the great difference in performance of methods even to identify different occurrences of similar CRMs in different sequences, as shown in Table 1. As a result, variation in the prediction performances of methods when using different training sets will be obscured by inherent differences in prediction performance for different occurrences of CRMs. We performed 10 times 10-fold cross validation by shuffling and splitting the given sequence set into different training and testing sets. As shown in Table 1, the variance of the performances of ComSPS on most of sequences was small, which indicates that ComSPS was not sensitive to varying training sets. So it demonstrated that the process of parameter estimation of ComSPS stably converges to some values around a local optimum in most cases.

REDfly Dataset.
Compared to the XIE and Muscle datasets, the REDfly dataset [34] is a larger dataset and it contains longer sequences and more complex CRMs. The dataset contains a total of 33 subdatasets. These subdatasets contain 53 PWMs and 719 sequences in total. Each subdataset has 4-77 sequences, with an average of 16, and these sequences have a total length of about 5.7 Mbp. On each subdataset, each sequence contains only one CRM. Each CRM has a different length and overall their average lengths have a range of 442-1248 bp. These CRMs perform functions in the early development of Drosophila and their annotations are derived from the REDfly database [37]. For the statistics on each subdataset, please refer to Table 1 in [34].

Testing
Results. On this dataset, for ComSPS, all sequences were used simultaneously as the training set and  the test set. For the methods that depend on a window size, according to evaluation protocol in [34], we precomputed the average length of real CRMs on each subdataset and provided the length for the methods as a predetermined window size parameter. These methods outputted CRMs that have the length of the specified window size and have the highest scores as the final prediction results. Other methods still maintained their default settings.
Sensitivities and empirical values of the prediction results of all methods are shown in Table 2. The results indicate that ComSPS performed very well on 16 of the 33 subdatasets, was superior to other methods, and made significant predictions on approximately half of subdatasets.
In its all statistically significant predictions ( value ≤ 0.05), the range of the sensitivity of ComSPS is 15-56%, with an average of 26%. Since the evaluation protocol limits the length of CRMs to be predicted, 100% sensitivity was very hard to get. Actually, from the table, we can see that the maximum possible average sensitivity on all subdatasets is approximately 77%. It assumes that there is a 100 bp CRM to be predicted in a 1000 bp sequence. The average sensitivity of 26% means that the overlap of a predicted CRM with the real CRM is about 26 bp. According to the definition of the maximum possible sensitivity, the real CRM has 77 bp to be predictable. Thus, in the sense, the overlap of the prediction of ComSPS with it is more than 1/3. From a biological point of view, this resolution is sufficient to recover real CRMs. Specifically, we can first roughly locate CRMs by ComSPS and then accurately determine the specific positions of CRMs by experimental means.
Regarding other methods, their prediction performances were greatly different on the dataset. Stubb and MotEvo showed excellent prediction performance, as they made statistically significant predictions on 12 and 10 subdatasets, respectively; BayCis and CPModule made statistically significant predictions on 8 and 7 subdatasets, respectively; Cluster-Buster performed worst and made random predictions on most of the subdatasets.

Conclusions
CRMs play an important role in the transcriptional regulation of eukaryotic genes, and their identification is the key to understanding the mechanisms of gene transcription regulation. To improve the identification performance of CRMs, this paper presents a new CRM discovery algorithm from the perspective of exploring the rules of CRM transcriptional grammar to build a regulatory structure model of CRMs. Experimental results revealed that the proposed algorithm performed better than compared methods on these tested benchmark datasets.
CRM discovery algorithms have been developed for many years and have experienced great progress, but they are far from being mature and still require further improvement. Currently, chromatin immunoprecipitation sequencing (ChIP-Seq) technology provides a large amount of data that can be used for the identification of motifs and CRMs. With the help of these new data, we believe that the prediction accuracy for CRMs can be further improved. However, these data are often short in length and huge in number; thus, they bring new challenges to existing methods. After processing ChIP-Seq results, the data analysis requires special algorithms. Developing new algorithms that are able to effectively identify CRMs from ChIP-Seq data is the focus of our future research.