SimpLiFiCPM: A Simple and Lightweight Filter-Based Algorithm for Circular Pattern Matching

This paper deals with the circular pattern matching (CPM) problem, which appears as an interesting problem in many biological contexts. CPM consists in finding all occurrences of the rotations of a pattern 𝒫 of length m in a text 𝒯 of length n. In this paper, we present SimpLiFiCPM (pronounced “Simplify CPM”), a simple and lightweight filter-based algorithm to solve the problem. We compare our algorithm with the state-of-the-art algorithms and the results are found to be excellent. Much of the speed of our algorithm comes from the fact that our filters are effective but extremely simple and lightweight.


Introduction
The classical pattern matching problem is to find all the occurrences of a given pattern P of length m in a text T of length n, both being sequences of characters drawn from a finite character set Σ. This problem is interesting as a fundamental computer science problem and is a basic requirement of many practical applications. The circular pattern, denoted by C(P), corresponding to a given pattern P = P_1 ⋯ P_m, is formed by connecting P_1 with P_m and forming a sort of cycle; under this notion the same circular pattern can be seen as different linear patterns, which would all be considered equivalent. In the circular pattern matching (CPM) problem, we are interested in pattern matching between the text T and the circular pattern C(P) of a given pattern P. We can view C(P) as a set of patterns starting at positions j ∈ [1 : m] and wrapping around the end. In other words, in CPM, we search for all "conjugates" (two words x, y are conjugate if there exist words u, v such that x = uv and y = vu) of a given pattern in a given text.
The problem of circular pattern matching has been considered in [1], where an O(n)-time algorithm is presented. A naive solution with quadratic complexity consists in applying a classical algorithm for searching a finite set of strings after having built the trie of rotations of P. The approach presented in [1] consists in preprocessing P by constructing a suffix automaton of the string PP, by noting that every rotation of P is a factor of PP. Then, by feeding T into the automaton, the lengths of the longest factors of PP occurring in T can be found by the links followed in the automaton in time O(n). In [2], the authors have presented an optimal average-case algorithm for CPM, by also showing that the average-case lower bound for the (linear) pattern matching of O(n log_σ m / m) also holds for CPM, where σ = |Σ|. Recently, in [3], the authors have presented two fast average-case algorithms based on word-level parallelism. The first algorithm requires average-case time O(n log_σ m / w), where w is the number of bits in the computer word. The second one is based on a mixture of word-level parallelism and q-grams. The authors have shown that with the addition of q-grams, and by setting q = O(log_σ m), an optimal average-case time of O(n log_σ m / m) can be achieved. Very recently in [4], the authors have presented an efficient algorithm for CPM that runs in O(n) time on average. To the best of our knowledge, this is the fastest running algorithm for CPM in practice to date.
Notably, indexing circular patterns [5] and variations of approximate circular pattern matching under the edit distance model [6] have also been considered in the literature. Approximate circular pattern matching has also been studied recently in [4,7]. In this paper however, we focus on the exact version of CPM.
Apart from being interesting from the pure combinatorial point of view, CPM has applications in areas like geometry, astronomy, computational biology, and so forth. For example, the following application in geometry was discussed in [5]. A polygon may be encoded by spelling out its coordinates. Now, given the data stream of a number of polygons, we may need to find out whether a desired polygon exists in the data stream. The difficulty in this situation lies in the fact that the same polygon may be encoded differently depending on its "starting" coordinate and hence, there exist k possible encodings, where k is the number of vertices of the polygon. Therefore, instead of traditional pattern matching, we need to resort to CPM. This problem seems to be useful in computer graphics as well and hence may be used as a built-in function in graphics cards handling polygon rendering.
CPM in fact appears in many biological contexts. This type of circular pattern occurs in the DNA of viruses [9,10], bacteria [11], eukaryotic cells [12], and archaea [13]. As a result, as has been noted in [14], algorithms on circular strings seem to be important in the analysis of organisms with such structures. Circular strings have also been studied in the context of sequence alignment. In [15], basic algorithms for pairwise and multiple circular sequence alignment have been presented. These results have later been improved in [16], where an additional preprocessing stage is added to speed up the execution time of the algorithm. In [17], the authors also have presented efficient algorithms for finding the optimal alignment and consensus sequence of circular sequences under the Hamming distance metric.
Furthermore, as has been mentioned in [5], this problem seems to be related to the much studied swap matching problem (in CPM, the patterns can be thought of as having a swap of two parts of it) [18] and also to the problem of pattern matching with address error (the circular pattern can be thought of as having a special type of address error) [19,20]. For further details on the motivation and applications of this problem in computational biology and other areas the readers are kindly referred to [9][10][11][12][13][14][15][16][17] and references therein.
In this paper, we present SimpLiFiCPM (pronounced Simplify CPM), which is a fast and efficient algorithm for the circular pattern matching problem based on some filtering techniques. In particular, we employ a number of simple and effective filters to preprocess the given pattern and the text. After this preprocessing, we get a text of reduced length on which we can apply any existing state-of-the-art algorithms to get the occurrences of the circular pattern. So, as the name sounds, SimpLiFiCPM, in some sense, simplifies the search space of the circular pattern matching problem.
We have conducted extensive experiments to compare our algorithm with the state-of-the-art algorithms; in these experiments it achieves at least a threefold speed-up for all the pattern sizes tested. Our algorithm turns out to be much faster in practice because of the huge reduction in the search space through filtering. Also, the filtering techniques we use are simple and lightweight but, as can be seen from the results, extremely effective.
The rest of the paper is organized as follows. Section 2 gives a preliminary description of some terminologies and concepts related to stringology that will be used throughout this paper. In Section 3 we describe our filtering algorithms. Section 4 presents the experimental results. Section 5 concludes the paper and outlines some future research directions.

Preliminaries
Let Σ be a finite alphabet. An element of Σ* is called a string. The length of a string x is denoted by |x|. The empty string ε is a string of length 0; that is, |ε| = 0. Let Σ+ = Σ* − {ε}. For a string x = uvw, the strings u, v, and w are called a prefix, factor (or, equivalently, substring), and suffix of x, respectively. The i-th character of a string x is denoted by x[i] for 1 ≤ i ≤ |x|, and the factor of a string x that begins at position i and ends at position j is denoted by x[i : j] for 1 ≤ i ≤ j ≤ |x|. For convenience, we assume x[i : j] = ε if j < i. A k-factor is a factor of length k.
A circular string of length m can be viewed as a traditional linear string which has the leftmost and rightmost symbols wrapped around and stuck together in some way. Under this notion, the same circular string can be seen as m different linear strings, which would all be considered equivalent. Given a string P of length m, we denote by P^i = P[i + 1 : m]P[1 : i], 0 < i < m, the i-th rotation of P, and P^0 = P.
Here we consider the problem of finding occurrences of a pattern string P of length m with circular structure in a text string T of length n with linear structure. For instance, the DNA sequence of many viruses has a circular structure. So if a biologist wishes to find occurrences of a particular virus in a carrier's DNA sequence, which may not be circular, (s)he must locate all positions in T where at least one rotation of P occurs. This is the problem of circular pattern matching (CPM).
We consider the DNA alphabet, that is, Σ = {A, C, G, T}. In our approach, each character of the alphabet is associated with a numeric value as follows. Each character is assigned a unique number from the range [1 ⋯ |Σ|]. Although this is not essential, we conveniently assign the numbers from the range [1 ⋯ |Σ|] to the characters of Σ following their inherent lexicographical order. We use num(x), x ∈ Σ, to denote the numeric value of the character x. So, we have num(A) = 1, num(C) = 2, num(G) = 3, and num(T) = 4.
Example 2. Suppose we have a pattern P = ATCGATG. The numeric representation of P is 1423143. And this numeric representation has the following rotations: P^1 = 4231431, P^2 = 2314314, P^3 = 3143142, P^4 = 1431423, P^5 = 4314231, and P^6 = 3142314.
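The numeric encoding and the rotations of Example 2 can be reproduced with a few lines of Python. This is only an illustrative sketch; the names `NUM`, `encode`, and `rotations` are assumptions introduced here and do not come from the paper's implementation.

```python
# Numeric values follow the lexicographic order of the DNA alphabet (assumed mapping name: NUM).
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(s):
    """Return the numeric representation of a DNA string as a list of ints."""
    return [NUM[c] for c in s]

def rotations(s):
    """Return all rotations of s; entry i moves the first i symbols to the end."""
    return [s[i:] + s[:i] for i in range(len(s))]

pattern = "ATCGATG"
print("".join(map(str, encode(pattern))))  # 1423143
for i, rot in enumerate(rotations(pattern)):
    # Prints the rotation index followed by the numeric form of that rotation.
    print(i, "".join(map(str, encode(rot))))
```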
The problem we handle in this paper can be formally defined as follows.
Problem 3 (circular pattern matching (CPM)). Given a pattern P of length m and a text T of length n ≥ m, find all factors F of T such that F = P^i for some 0 ≤ i < m. And if we have F = T[j : j + m − 1] = P^i for some 0 ≤ i < m, then we say that the circular pattern C(P) matches T at position j.
In the context of our filter-based algorithm the concept of false positives and negatives is important. So, we briefly discuss this concept here. Suppose we have an algorithm A to solve a problem B. Now suppose that S_true represents the set of true solutions for problem B. Further suppose that A computes the set S_A as the set of solutions for B. Now assume that S_true ≠ S_A. Then, the set of false positives can be computed as S_A \ S_true, where "\" refers to the set difference operation. In other words, the set computed by A contains some solutions that are not true solutions for problem B. These are the false positives, because A falsely marked them as solutions (i.e., positive). On the other hand, the set of false negatives can be computed as S_true \ S_A. In other words, false negatives are those members of S_true that are absent in S_A. These are false negatives because A falsely marked them as nonsolutions (i.e., negative).
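As a reference point for the problem definition, the following is a minimal brute-force sketch. It relies on the fact, noted in the Introduction, that every rotation of P is a factor of PP. The function name `cpm_naive` is an assumption made for illustration, and this is not one of the algorithms compared in the paper.

```python
def cpm_naive(pattern, text):
    """Return the 1-based positions j at which some rotation of pattern equals
    the length-m factor of text starting at j."""
    m = len(pattern)
    doubled = pattern + pattern               # every rotation of P is a factor of PP
    rots = {doubled[i:i + m] for i in range(m)}
    return [j + 1 for j in range(len(text) - m + 1) if text[j:j + m] in rots]

print(cpm_naive("ATCGATG", "CCTGATCGAAC"))    # [3]
```

Each window is tested against the set of rotations directly, so the sketch is useful only for checking the output of faster methods on small inputs.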

Our Approach
As has been mentioned above, our algorithm is based on some filtering techniques. Suppose we are given a pattern P and a text T. We will frequently and conveniently use the expression "C(P) matches T at position i" (or, equivalently, "P circularly matches T at position i") to indicate that one of the conjugates of P matches T at position i. We start with a brief overview of our approach below.

Overview of SimpLiFiCPM.
In SimpLiFiCPM, we first employ a number of filters to compute a set N of candidate indexes of T at which C(P) may match T. As will be clear shortly, our filters are unable to compute the true set of indexes and hence N may have false positives. However, our filters are designed in such a way that there are no false negatives. Hence, for all j ∉ N, we can be sure that there is no match. On the other hand, for all i ∈ N, we may or may not have a match; that is, we may have false positives. So, after we have computed N, we compute T′, a reduced version of T, by concatenating all the factors T[i ⋯ i + m − 1], i ∈ N, putting a special character $ ∉ Σ in between the factors. One essential detail is as follows. There can be i, j ∈ N such that 0 < j − i < m. In other words, there can exist overlapping candidate factors. However, this can be handled easily through simple bookkeeping, as will be evident from our algorithm in later sections. Clearly, once we have computed the reduced text T′ we can employ any state-of-the-art algorithm to solve CPM on T′ to get the actual occurrences. So the most essential and useful feature of SimpLiFiCPM is the application of filters to get a reduced text on which any existing algorithm can be applied to solve CPM.
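The construction of the reduced text from a candidate set can be sketched as follows. The function name `reduced_text` and its parameters are hypothetical, and the bookkeeping shown (merging overlapping windows and writing `$` only between windows that do not overlap) is one straightforward reading of the description above rather than the authors' code.

```python
def reduced_text(text, candidates, m, sep="$"):
    """Concatenate the windows text[i : i+m-1] for the 1-based positions in candidates.

    Overlapping windows are merged so that no character is copied twice;
    a separator is written between windows that do not overlap.
    """
    out = []
    covered_end = 0                       # 1-based index of the last character copied so far
    for i in sorted(candidates):
        end = i + m - 1
        if out and i > covered_end:
            out.append(sep)               # no overlap with the previous window
        start = max(i, covered_end + 1)   # skip characters that were already copied
        out.append(text[start - 1:end])
        covered_end = max(covered_end, end)
    return "".join(out)

print(reduced_text("ABCDEFGHIJKLMNOPQRSTUVWXYZ", [2, 3, 10], 3))  # BCDE$JKL
```

A practical implementation would also record where each block starts in the original text so that positions reported on the reduced text can be mapped back.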

Filters of SimpLiFiCPM.
In SimpLiFiCPM, we employ 6 filters. In this section we describe these filters. We also discuss the related notions and notations needed to describe these filters. In what follows we describe our filters in the context of two strings of equal length m, namely, P and T, where the former is a circular string and the latter is linear. We will devise and apply different functions on these strings and present observations related to these functions which in the sequel will lead us to our desired filters. The key to our observations and the resulting filters is the fact that each function we devise yields the same output for every rotation of a circular string. For example, consider a hypothetical function X. We will always have the relation that X(P) = X(P^i) for all 1 ≤ i < m. Recall that P^0 actually denotes P. For the sake of conciseness, for such functions, we will abuse the notation a bit and use X(C(P)) to represent X(P^i) for all 0 ≤ i < |P|.

Filter 1.
We define the function sum on a string P of length m as follows:
sum(P) = ∑_{i=1}^{m} num(P[i]).
Our first filter, Filter 1, is based on this function. We have the following observation.
Observation 1. Consider a circular string P and a linear string T both having length m. If C(P) matches T, then we must have sum(C(P)) = sum(T).
Example 4. Consider P = ATCGATG and T = TGATCGA. As can be easily verified, here P circularly matches T. In fact the match is due to the conjugate P^5. Now the numeric representation of T is 4314231 and sum(T) = 18. Then, according to Observation 1, we must have sum(C(P)) = 18. This can indeed be verified easily. Now consider another string T′ = ATAGCTG, which is slightly different from T. It can be easily verified that C(P) does not match T′. Now, the numeric representation of T′ is 1413243 and hence here also we have sum(T′) = 18 = sum(C(P)). This is an example of a false positive with respect to Filter 1.
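The check in Example 4 can be replayed with a short sketch. The strings are the ones whose numeric forms appear in the example (1423143, 4314231, and 1413243); the names `NUM` and `filter1_sum` are assumptions made for illustration.

```python
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def filter1_sum(s):
    """Filter 1: sum of the numeric values of all characters of s."""
    return sum(NUM[c] for c in s)

for s in ("ATCGATG", "TGATCGA", "ATAGCTG"):
    print(s, filter1_sum(s))              # each line ends with 18
```

All three strings give 18, although only the second is a rotation of the first, which is exactly the false-positive situation described in the example.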

Filters 2 and 3.
Our second and third filters, that is, Filters 2 and 3, depend on a notion of distance between consecutive characters of a string. The total distance between consecutive characters of a string P of length m is defined by
total_distance(P) = ∑_{i=1}^{m−1} (num(P[i]) − num(P[i+1])).
We also define an absolute version of it:
abs_total_distance(P) = ∑_{i=1}^{m−1} abs(num(P[i]) − num(P[i+1])),
where abs(x) returns the magnitude of x ignoring the sign. Before we apply these two functions on our strings to get our filters, we need to do a simple preprocessing on the respective string, that is, P in this case, as follows. We extend the string P by concatenating the first character of P at its end. We use ext(P) to denote the resultant string. So, we have ext(P) = P P[1]. Since ext(P) can simply be treated as another string, we can easily extend the notation and concept of C(P) over ext(P), and we continue to abuse the notation a bit for the sake of conciseness as mentioned at the beginning of Section 3.2 (just before Section 3.2.1). Now we have the following observation which is the basis of our Filter 2.
Observation 2. Consider a circular string P and a linear string T both having length m and assume that A = ext(P) and B = ext(T). If C(P) matches T, then we must have abs_total_distance(C(A)) = abs_total_distance(B). Note carefully that the function abs_total_distance() has been applied on the extended strings.
Example 5. Consider the same two strings of Example 4, that is, P = ATCGATG and T = TGATCGA. Here P circularly matches T (due to the conjugate P^5). Now consider the extended strings and assume that A = ext(P) and B = ext(T). The numeric representation of T is 4314231. Hence that of B is 43142314. Hence, abs_total_distance(B) = 14. It can be easily verified that abs_total_distance(C(A)) is also 14. Now consider another string T′ of the same length, which is slightly different from T. It can easily be checked that C(P) does not match T′. However, assuming that B′ = ext(T′) we find that abs_total_distance(B′) is still 14. So, this is an example of a false positive with respect to Filter 2.
Now we present the following related observation which is the basis of our Filter 3. Note that Observation 2 differs from Observation 3 only in that it uses the absolute version of the function used in the latter.
Observation 3. Consider a circular string P and a linear string T both having length m and assume that A = ext(P) and B = ext(T). If C(P) matches T, then we must have total_distance(C(A)) = total_distance(B). Note carefully that the function total_distance() has been applied on the extended strings.
Example 6. Consider the same two strings of previous examples, that is, P = ATCGATG and T = TGATCGA. Here P circularly matches T (due to the conjugate P^5). Now consider the extended strings and assume that A = ext(P) and B = ext(T). The numeric representation of T is 4314231. Hence that of B is 43142314. Hence, total_distance(B) = 0. It can be easily verified that total_distance(C(A)) is also 0. Now consider another string T′ of the same length, which is slightly different from T. It can easily be checked that C(P) does not match T′. However, assuming that B′ = ext(T′) we find that total_distance(B′) is still 0. So, this is an example of a false positive with respect to Filter 3.
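The values 14 and 0 reported in Examples 5 and 6 can be recomputed with the sketch below. The function names `ext`, `total_distance`, and `abs_total_distance` are assumptions chosen to mirror the prose; the definitions are reconstructed from the worked values in the examples.

```python
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def ext(s):
    """Extended string: s followed by its own first character."""
    return s + s[0]

def total_distance(s):
    """Signed sum of differences between consecutive characters (Filter 3)."""
    x = [NUM[c] for c in s]
    return sum(a - b for a, b in zip(x, x[1:]))

def abs_total_distance(s):
    """Sum of absolute differences between consecutive characters (Filter 2)."""
    x = [NUM[c] for c in s]
    return sum(abs(a - b) for a, b in zip(x, x[1:]))

b = ext("TGATCGA")                        # numeric form 43142314
print(abs_total_distance(b))              # 14
print(total_distance(b))                  # 0
```

Because the signed sum telescopes to the difference between the first and last numeric values, it evaluates to 0 on any extended string, which agrees with the value reported in Example 6.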

Filter 4.
Filter 4 uses the sum() function used by Filter 1, albeit in a slightly different way. In particular, it applies the sum() function on individual characters. So, for x ∈ Σ we define sum_x(P) = ∑_{1≤i≤|P|, P[i]=x} num(P[i]). Now we have the following observation.
Observation 4. Consider a circular string P and a linear string T both having length m. If C(P) matches T, then we must have sum_x(C(P)) = sum_x(T) for all x ∈ Σ.
Example 7. Consider the same two strings of previous examples, that is, P = ATCGATG and T = TGATCGA. Recall that P circularly matches T (due to the conjugate P^5). It is easy to calculate that sum_A(T) = 2, sum_C(T) = 2, sum_G(T) = 6, and sum_T(T) = 8. Hence according to Observation 4, the individual sum values for all the conjugates of P must also match these. It can be easily verified that this is indeed the case. Now consider the other string T′ of the same length, which is slightly different from T. It can easily be checked that C(P) does not match T′. However, as we can see, we still have sum_A(T′) = 2, sum_C(T′) = 2, sum_G(T′) = 6, and sum_T(T′) = 8. This is an example of a false positive with respect to Filter 4.
Notably, a similar idea has been used by Kahveci et al. in [21] for indexing large strings with a goal to achieve fast local alignment of large genomes. In particular, for a DNA string, Kahveci et al. compute the so-called frequency vector that keeps track of the frequency of each character of the DNA alphabet in the string.
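A minimal sketch of the per-character sums used by Filter 4 is given below; `NUM` and `per_char_sums` are hypothetical names. Since each entry is just the character's frequency multiplied by its numeric value, the result carries the same information as a frequency vector.

```python
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def per_char_sums(s):
    """Filter 4: for each character x, the sum of num(x) over its occurrences in s."""
    sums = {c: 0 for c in NUM}
    for c in s:
        sums[c] += NUM[c]
    return sums

print(per_char_sums("TGATCGA"))           # {'A': 2, 'C': 2, 'G': 6, 'T': 8}
```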

Filter 5.
Filter 5 depends on a modulo operation between two consecutive characters. The sum of the modulo operations between consecutive characters of a string P of length m is defined as follows:
sum_modulo(P) = ∑_{i=1}^{m−1} (num(P[i]) mod num(P[i+1])).
Observation 5. Consider a circular string P and a linear string T both having length m and assume that A = ext(P) and B = ext(T). If C(P) matches T, then we must have sum_modulo(C(A)) = sum_modulo(B). Note carefully that the function sum_modulo() has been applied on the extended strings.
Example 8. Consider the same two strings of previous examples, that is, P = ATCGATG and T = TGATCGA. Recall that P circularly matches T (due to the conjugate P^5). Now consider the extended strings and assume that A = ext(P) and B = ext(T). The numeric representation of T is 4314231. Hence that of B is 43142314. Hence, sum_modulo(B) = 5. Now consider another string T′ of the same length, which is different from T. It can easily be checked that C(P) does not match T′. However, assuming that B′ = ext(T′) we find that sum_modulo(B′) is still 5. So, this is an example of a false positive with respect to Filter 5.
Algorithm 1: Exact circular pattern signature using Observations 1-6 in a single pass.
    ...
    initialize all defined variables to zero
    initialize fixed array to {1, 2, 3, 4}
    for i ← 1 to |ext(P)| do
        if i ≠ |ext(P)| then
            calculate the filtering values via Observations 1 and 4 and make a running sum
        end if
        calculate the filtering values via Observations 2, 3, 5, and 6 and make a running sum
    end for
    return all observation values
end procedure
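The single-pass idea of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' C++ code: the name `signature`, the tuple layout, and the way consecutive pairs are indexed are all assumptions made here.

```python
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def signature(p):
    """Return (sum, abs distance, distance, per-character sums, modulo sum, xor sum)
    for pattern p, using one pass over the extended string p + p[0]."""
    x = [NUM[c] for c in p]
    x.append(x[0])                        # extended string
    total = abs_d = d = mod = xr = 0
    per_char = [0, 0, 0, 0]               # running sums for A, C, G, T
    for i in range(len(p)):
        a, b = x[i], x[i + 1]             # consecutive pair in the extended string
        total += a                        # Observation 1
        per_char[a - 1] += a              # Observation 4
        abs_d += abs(a - b)               # Observation 2
        d += a - b                        # Observation 3
        mod += a % b                      # Observation 5
        xr += a ^ b                       # Observation 6
    return (total, abs_d, d, tuple(per_char), mod, xr)

print(signature("ATCGATG"))               # (18, 14, 0, (2, 2, 6, 8), 5, 28)
```

Every rotation of the pattern yields the same tuple, because each component is a sum over the characters or over the cyclic sequence of consecutive pairs.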

Filter 6.
In Filter 6 we employ the xor() operation. The sum of the bitwise exclusive-OR (xor()) operations between consecutive characters of a string P of length m is defined as follows:
sum_xor(P) = ∑_{i=1}^{m−1} (num(P[i]) xor num(P[i+1])).
Now we present the following observation which is the basis of Filter 6. Note that this observation is applied on the extended versions of the respective strings.
Observation 6. Consider a circular string P and a linear string T both having length m and assume that A = ext(P) and B = ext(T). If C(P) matches T, then we must have sum_xor(C(A)) = sum_xor(B). Note carefully that the function sum_xor() has been applied on the extended strings.
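Filters 5 and 6 can be compared side by side with the sketch below. The names `ext`, `sum_modulo`, and `sum_xor` are assumptions, with definitions reconstructed from the values 5 and 28 reported in the examples. The string ATAGCTG is the one whose numeric form 1413243 appears in Example 4; it is reused here only to show that the two filters can disagree.

```python
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def ext(s):
    """Extended string: s followed by its own first character."""
    return s + s[0]

def sum_modulo(s):
    """Filter 5: sum of num(s[i]) mod num(s[i+1]) over consecutive characters."""
    x = [NUM[c] for c in s]
    return sum(a % b for a, b in zip(x, x[1:]))

def sum_xor(s):
    """Filter 6: sum of num(s[i]) xor num(s[i+1]) over consecutive characters."""
    x = [NUM[c] for c in s]
    return sum(a ^ b for a, b in zip(x, x[1:]))

for s in ("ATCGATG", "TGATCGA", "ATAGCTG"):
    print(s, sum_modulo(ext(s)), sum_xor(ext(s)))
```

For the pattern and its rotation the values are 5 and 28, whereas ATAGCTG gives 6 and 28: it passes the XOR filter but is rejected by the modulo filter, which illustrates why several inexpensive filters are combined.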
Example 9. Consider the same two strings of previous examples, that is, P = ATCGATG and T = TGATCGA. Recall that P circularly matches T (due to the conjugate P^5). Now consider the extended strings and assume that A = ext(P) and B = ext(T). The numeric representation of T is 4314231. Hence that of B is 43142314. Hence, sum_xor(B) = 28. Now according to Observation 6, we must also have sum_xor(C(A)) = 28. As can be verified easily, this is indeed the case. Now consider another string T′ of the same length, which is different from T. It can easily be checked that C(P) does not match T′. However, assuming that B′ = ext(T′) we find that sum_xor(B′) is still 28. So, this is an example of a false positive with respect to Filter 6.
At this point a brief discussion with respect to our preliminary work in [8] is in order. To reduce the text T, we also employed six filters in [8]. While Filter 1 and Filter 4 remain identical, in SimpLiFiCPM, we have changed and improved Filters 2, 3, 5, and 6 to get better results. In particular, we have introduced the concept of the extended string here and modified the filters accordingly. Much of the efficiency of these new filters comes from the fact that in the preliminary version, without the extended strings, we had to deal with a set of values as the output of the functions, creating a small bottleneck. In contrast, SimpLiFiCPM now needs to deal with only one value as the output of each of the functions of Filters 2, 3, 5, and 6. This makes SimpLiFiCPM even faster than its predecessor, as is evident from the experimental results presented later. Notably, this has required some further changes in the overall algorithm. In particular, in the searching phase of the algorithm we now need to apply the corresponding filters on the extended strings. But the overall improvement outweighs this extra work by a wide margin.

Circular Pattern Signature Using the Filters.
In this section, we discuss an O(m)-time algorithm that SimpLiFiCPM uses to compute the signature of the circular pattern C(P) corresponding to a pattern P of length m. This signature is used at a later stage to filter the text. Here, we need five variables to save the output of the functions used for Filters 1, 2, 3, 5, and 6 (based on Observations 1, 2, 3, 5, and 6). And we need a list of size 4 to save the values of the function used in Filter 4 (Observation 4). We start with the extended string ext(P) = P[1 : m]P[1] and compute the values according to Observations 1 to 6. The algorithm iterates m + 1 times and hence the overall runtime of the algorithm is O(m). The algorithm is presented as Algorithm 1.

Reduction of Search Space.
The reduction of the search space in the text is done using the same technique that is applied in Algorithm 1. We apply a sliding window approach with window length m and calculate the values applying the functions according to Observations 1-6 on the factor of T captured by the window. Note that, for Observations 2, 3, 5, and 6, we need to consider the extended string and hence the factor of T within the window needs to be extended accordingly for calculating the values. After we calculate the values for a factor of T, we check them against the values returned by Algorithm 1. If they match, then we output the factor to a file. Note that, in case of overlapping factors (e.g., when consecutive windows need to output their factors to the file), the procedure outputs only the nonoverlapping characters. And it uses a $ marker to mark the boundaries of nonconsecutive factors, where $ ∉ Σ.
Now note that we can compute the values of consecutive factors of T using the sliding window approach quite efficiently as follows. For the first factor, that is, T[1 ⋯ m], we exactly follow the strategy of Algorithm 1. When it is done, we slide the window by one character and we only need to remove the contribution of the leftmost character of the previous window and add the contribution of the rightmost character of the new window. The functions are such that this can be done very easily using simple constant time operations. The only other issue that needs to be taken care of is due to the use of the extended string in four of the filters. But this too does not need more than simple constant time operations. Therefore, the overall runtime of the algorithm is O(n). The algorithm is presented as Algorithm 2.
Algorithm 2.
    ...
    save the return values of Observations 1-6 for further use here
    define an array of size 4 to keep the fixed values of A, C, G, T
    initialize fixed array to {1, 2, 3, 4}
    ... ← 1
    for i ← 1 to m do
        calculate the filtering values in T[1 : m] via Observations 1-6 and make a running sum
    end for
    if the Observation 1-6 values of P[1 : m] and of T[1 : m] match then
        ⊳ Found a filtered match
        output T[1 : m] to file
        ...
    end if
    for i ← 1 to n − m do
        calculate ...
        ...
            ⊳ Found a filtered match
            if ... > ... then
                output an end marker $ to file
            end if
            ...
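The constant-time window update described above can be sketched as follows. All names (`pair_values`, `signature`, `candidate_positions`) are hypothetical, and the bookkeeping is one possible realization under the stated definitions: the pairs inside the window are kept as running sums, and the single wrap-around pair introduced by the extended string is added on the fly.

```python
NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def pair_values(a, b):
    """Contributions of one consecutive pair to Filters 2, 3, 5 and 6."""
    return (abs(a - b), a - b, a % b, a ^ b)

def signature(x):
    """Filter values of a numeric window x, computed from scratch on x + x[0]."""
    per_char = [0, 0, 0, 0]
    for a in x:
        per_char[a - 1] += a
    pairs = [0, 0, 0, 0]
    for a, b in zip(x, x[1:] + x[:1]):
        for k, v in enumerate(pair_values(a, b)):
            pairs[k] += v
    return (sum(x), tuple(per_char), tuple(pairs))

def candidate_positions(pattern, text):
    """Return 1-based positions whose window has the same signature as pattern."""
    p = [NUM[c] for c in pattern]
    t = [NUM[c] for c in text]
    m, n = len(p), len(t)
    if m == 0 or m > n:
        return []
    target = signature(p)
    total = sum(t[:m])
    per_char = [0, 0, 0, 0]
    for a in t[:m]:
        per_char[a - 1] += a
    inner = [0, 0, 0, 0]                  # pairs inside the window, without the wrap pair
    for a, b in zip(t[:m], t[1:m]):
        for k, v in enumerate(pair_values(a, b)):
            inner[k] += v
    hits = []
    for j in range(n - m + 1):
        wrap = pair_values(t[j + m - 1], t[j])      # pair added by the extended string
        pairs = tuple(i + w for i, w in zip(inner, wrap))
        if (total, tuple(per_char), pairs) == target:
            hits.append(j + 1)
        if j + m < n:                               # slide the window by one character
            out_c, in_c = t[j], t[j + m]
            total += in_c - out_c
            per_char[out_c - 1] -= out_c
            per_char[in_c - 1] += in_c
            old = pair_values(t[j], t[j + 1])
            new = pair_values(t[j + m - 1], t[j + m])
            inner = [i - o + a for i, o, a in zip(inner, old, new)]
    return hits

print(candidate_positions("ATCGATG", "CCTGATCGAACATAGCTGTT"))  # [3]
```

In this small text the window ATAGCTG at position 12 has the right character counts but is discarded because its modulo sum differs from that of the pattern.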

The Combined SimpLiFiCPM Algorithm.
In this section we combine the algorithms presented so far and present the complete view of SimpLiFiCPM. We have already described the two main components of SimpLiFiCPM, namely, the procedures of Algorithm 1 and Algorithm 2, where the latter in fact calls the former. Now Algorithm 2 provides a reduced text T′ (say) after filtering. At this point SimpLiFiCPM can use any algorithm that can solve CPM and apply it over T′ and output the occurrences. Now, suppose SimpLiFiCPM uses algorithm A at this stage, which runs in O(f(|T′|)) time. Then, clearly, the overall running time of SimpLiFiCPM is O(n) + O(f(|T′|)). For example, if SimpLiFiCPM uses the linear time algorithm of [1], then clearly the overall theoretical running time of SimpLiFiCPM will be O(n).
In our implementation, however, we have used the recent algorithm of [4], which is a linear time algorithm on average and the fastest algorithm in practice to the best of our knowledge. In particular, in [4], the authors have presented an approximate circular string matching algorithm with k-mismatches (ACSMF-Simple) via filtering. They have built a library for the ACSMF-Simple algorithm. The library is freely available and can be found in [22]. In this algorithm, if we set k = 0, then ACSMF-Simple works for the exact matching case. In what follows, we will refer to this algorithm as ACSMF-SimpleZero. We have implemented SimpLiFiCPM using ACSMF-SimpleZero; that is, we have used the ACSMF-Simple algorithm simply by putting k = 0.
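The overall flow (filter, build the reduced text, run a plug-in solver, map positions back) can be sketched end to end as below. This is a simplified illustration under several assumptions: the names `signature`, `cpm_naive`, and `simplifi_cpm` are hypothetical, the signature is recomputed from scratch for every window instead of being updated in constant time, and a brute-force solver stands in for the algorithm of [4].

```python
from bisect import bisect_right

NUM = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed name for the numeric mapping

def signature(s):
    """Six filter values of s, computed from scratch on the extended string."""
    x = [NUM[c] for c in s]
    pairs = list(zip(x, x[1:] + x[:1]))
    return (sum(x),
            tuple(sum(a for a in x if a == v) for v in (1, 2, 3, 4)),
            sum(abs(a - b) for a, b in pairs),
            sum(a - b for a, b in pairs),
            sum(a % b for a, b in pairs),
            sum(a ^ b for a, b in pairs))

def cpm_naive(pattern, text):
    """Stand-in exact CPM solver: 0-based starts of windows that are rotations."""
    m = len(pattern)
    rots = {(pattern + pattern)[i:i + m] for i in range(m)}
    return [j for j in range(len(text) - m + 1) if text[j:j + m] in rots]

def simplifi_cpm(pattern, text, solver=cpm_naive):
    """Return (1-based match positions in text, reduced text)."""
    m = len(pattern)
    target = signature(pattern)
    blocks = []                                   # [start in text, block string]
    for j in range(len(text) - m + 1):
        if signature(text[j:j + m]) != target:
            continue                              # filtered out: cannot be a match
        if blocks and j < blocks[-1][0] + len(blocks[-1][1]):
            start = blocks[-1][0]
            blocks[-1][1] = text[start:j + m]     # overlap: extend the last block
        else:
            blocks.append([j, text[j:j + m]])
    reduced = "$".join(b for _, b in blocks)
    offsets, pos = [], 0                          # start of each block inside reduced
    for _, b in blocks:
        offsets.append(pos)
        pos += len(b) + 1
    hits = []
    for r in solver(pattern, reduced):
        k = bisect_right(offsets, r) - 1          # block containing reduced position r
        hits.append(blocks[k][0] + (r - offsets[k]) + 1)
    return hits, reduced

print(simplifi_cpm("ATCGATG", "CCTGATCGAACATAGCTGTT"))  # ([3], 'TGATCGA')
```

Passing the solver as a parameter mirrors the plug-in design described above: any exact CPM routine with the same input and output convention could be substituted for `cpm_naive`.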
An Illustrative Example.
Now that we have fully described SimpLiFiCPM, in this section we present the simulation of SimpLiFiCPM on a particular example. We only show the simulation up to the output of Algorithm 2, that is, the output of the reduced text, because afterwards we can employ any state-of-the-art algorithm within SimpLiFiCPM. Consider a pattern P =

Experimental Results
We have implemented SimpLiFiCPM and conducted extensive experiments to analyze its performance. We have coded SimpLiFiCPM in C++ using a GNU compiler with General Public License (GPL). Our code is available at [23]. As has been mentioned already above, our implementation of SimpLiFiCPM uses ACSMF-SimpleZero [4]. ACSMF-Simple [4] has been implemented as library functions in the C programming language under the GNU/Linux operating system. The library implementation is distributed under the GNU General Public License (GPL). It takes as input the pattern P of length m, the text T of length n, and the integer threshold k < m and returns the list of starting positions of the occurrences of the rotations of P in T with k-mismatches as output. In our case we use k = 0.
We have used real genome data in our experiments as the text string, T. This data has been collected from [24]. Here, we have taken 299 MB of data for our experiments. We have generated random patterns of different lengths by a random indexing technique in these 299 MB of text string.
We have conducted our experiments on a PowerEdge R820 rack server with a 6-core Intel Xeon processor of the E5-4600 product family and 64 GB of RAM under GNU/Linux. With the help of the library used in [4], we have compared the running time of our preliminary work in [8] (referred to as Filter-CPM henceforth), ACSMF-SimpleZero of [4], and SimpLiFiCPM. Table 2 reports the elapsed time and speed-up comparisons for various pattern sizes (500 ≤ m ≤ 3000). As can be seen from Table 2, Filter-CPM [8] runs faster than ACSMF-SimpleZero in all cases. In fact Filter-CPM [8] achieves a minimum of twofold speed-up for all the pattern sizes. Again, referring to the same table, SimpLiFiCPM runs even faster than ACSMF-SimpleZero in all cases, achieving a minimum of threefold speed-up for all the pattern sizes.
In order to analyze and understand the effect of our filters we have run a second set of experiments as follows.

Conclusions
In this paper, we have employed some effective lightweight filtering techniques to reduce the search space of the circular pattern matching (CPM) problem. We have presented SimpLiFiCPM, an extremely fast algorithm based on the abovementioned filters. We have conducted extensive experimental studies to show the effectiveness of SimpLiFiCPM. In our experiments, SimpLiFiCPM has achieved a minimum of threefold speed-up compared to the state-of-the-art algorithms. Much of the speed of our algorithm comes from the fact that our filters are effective but extremely simple and lightweight. The most intriguing feature of SimpLiFiCPM is perhaps its capability to plug in any algorithm to solve CPM and take advantage of it. We are now working towards adapting the filters so that they can work for the approximate version of CPM.