Mining Association Rules in Dengue Gene Sequence with Latent Periodicity

Mining periodic patterns in dengue databases is an interesting research problem, since such patterns can be used to predict the future evolution of dengue viruses. In this paper, we propose an algorithm called Recurrence Finder (RECFIN) that uses a suffix tree to detect periodic patterns in dengue gene sequences. RECFIN also finds palindromes, which indicate possible protein formation. Further, this paper computes the periodicity of nucleic acid and amino acid sequences of any length. The periodicity-based association rules are used to diagnose the type of dengue. The time complexity of the proposed algorithm is $O(n^2)$. We demonstrate the effectiveness of the proposed approach by comparing experimental results on a dengue virus serotype dataset with those of the NCBI-BLAST algorithm.


Introduction
Periodicity is the tendency where the sequences of events or values recur at particular intervals [1]. Periodicity plays an important role in discovering interesting frequent patterns in any sequence including genomic sequence that is made of amino acids present in the human cells. Latent periodicity refers to the presence of hidden or reverse subsequence in the given sequence during the particular interval. Finding the latent periodicities or regularities among gene sequences will be helpful for the drug designers in predicting the future evolution of viruses that cause the particular disease.
Cells of the human body have a central core called nucleus, which is packaged in units known as chromosomes. Humans have 23 pairs of chromosomes, which are together known as genome. Genes are a specific region of the genomes, which is the molecular unit of heredity of a living organism. Gene sequence contains a sequence of nucleic and amino acids. Nucleic acid consists of a chain of linked units called nucleotide. Nucleic acid sequence has the combination of nucleotide bases within deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). DNA is a chain of four types of molecules adenine (A), cytosine (C), guanine (G), and thymine (T). A sample DNA sequence may be like TCCTGAT AAGTCAG TGTCTCCT. RNA is represented as the combination of four nucleotide bases adenine (A), cytosine (C), guanine (G), and uracil (U). RNA sequence may be like UCCUGAU AAGUCAG UGUCUCCU.
DNA and RNA play a major role in the formation of proteins. The constituents of proteins are amino acids, which are represented using 20 English letters, that is, all letters except B, J, O, U, X, and Z. A sample protein sequence may look like CFPUEQGHILDCLKSTFEWEGHILDWES. Protein sequences are shorter than DNA sequences [2].
Although the proposed work can be applied on any gene sequence such as ebola and chikungunya, with suitable modification, the main focus is shown on the dengue gene sequence alone owing to its significance in the recent years.
The incidence of dengue has grown dramatically around the world in recent decades. Over 2.5 billion people, 40% of the world's population, are now at risk of dengue. The World Health Organization (WHO) currently estimates that there may be 50-100 million dengue infections worldwide every year [3]. As per the medical record of Government ... Under these circumstances, research on the dengue virus genome sequence plays a vital role in the diagnosis of the disease. Therefore, it is necessary to predict the presence of co-occurrence patterns, which are similar elements present in dengue gene sequences. This work derives periodic association rules (PAR) that reveal the possible occurrence of similar disease patterns, using a novel technique called Periodic Association Rule Mining (PARM).
Periodicity in genome sequence can be classified into two types, namely, element periodicity and subsequence periodicity. Element periodicity deals with the repetition of individual elements of gene sequence during a particular period whereas subsequence periodicity deals with the periodicity of the entire sequence or some portion of the given sequence.
A palindrome is a sequence of letters or words such as racecar and madam I madam which are read the same in forward as well as in reverse direction [5]. The RECFIN finds the presence of palindrome in the given sequence which will be helpful in identifying the formation of protein. Each protein adopts a unique 3-dimensional structure, which is decided by its amino acid sequence. A slight change in the sequence can drastically change the functioning of the protein. In case of dengue gene sequences the presence of latent regularities affects the formation of proteins [6].
The dengue virus belongs to Flaviviridae family that is transmitted to people through the bite of the Aedes aegypti or Aedes albopictus mosquitoes. There are four types of dengue virus serotypes that cause the disease [7]. Serotypes refer to the subdivisions of a virus that are classified based on their cell surface. They are listed in Table 1.
There are three main types of dengue infection, namely, classic dengue fever (CD), dengue hemorrhagic (DH) fever, and dengue shock syndrome (DSS) [8]. All the types of dengue fever begin with noticeable symptoms within four to seven days after the Aedes aegypti mosquito's bite. The symptoms of CD include headache, pain behind the eyes, joints, and muscles, vomiting, and body rash. It also reduces the count of white blood cells (WBC). DH fever includes all the classic symptoms with higher fever and sharp decrease in the number of platelets in the blood. Platelets are small, disk shaped fragments that are the natural source of growth factors. They are circulated in the blood and involved in the formation of blood clots. As a result of this, victims bleed from the nose, gums, and skin. DSS is the most severe form of the disease which causes massive bleeding and fall in the blood pressure [9]. Each virus type has its own characteristics.
RECFIN evaluates the element and subsequence periodic patterns, including palindromes, in the given dengue sequences. RECFIN comprises three parts. The first part deals with the formation of the suffix tree used to find the periodic patterns. In the second part, a recurrence identification procedure is proposed to find the periodic patterns and, in the third part, a novel palindrome detection procedure is presented to find palindromes in the given sequence. Based on the resultant patterns, the periodic association rules are generated using PARM. These rules are ... In Section 2, the work related to dengue and periodicity detection is outlined. Section 3 demonstrates the methodologies related to the prediction of periodic patterns and palindromes in dengue gene sequences. Section 4 exhibits the experimental results that were obtained using the dengue virus serotype dataset [10]. Section 5 illustrates the comparative analysis of the results. Finally, Section 6 presents the conclusion.

Review of Related Works
The causes and effects of dengue have been focused on by the research community for the past two decades. Current research on dengue aims to provide better surveillance to limit the effect of dengue outbreak. Basic research includes a wide range of studies focused on learning how the dengue virus is transmitted and how it infects cells and causes disease. Further many research works investigate several aspects of dengue viral biology that includes exploration of the interactions between the virus and humans as well as the repetition of dengue virus serotypes. Researchers have also been studying the dengue viruses to understand the factors that are responsible for transmitting the virus to humans. They found that specific viral sequences are associated with severe dengue symptoms [11].
In a similar direction, we propose here an approach to find the latent periodicities and periodical associations in dengue virus serotypes in order to diagnose the dengue syndrome. The major works related to the identification of the latent periodicities in the time series and biological sequences [9] are described below.
Indyk et al. [12] presented the periodic trends algorithm, which finds the subsequence periodicity alone by analyzing the recurrence of a sequence of elements in a given time series. A time series is a sequence of values observed over certain time intervals. They developed an algorithm whose time complexity was $O(n \log^2 n)$, where $n$ is the length of the time series. They used the linear distance measure for finding latent periods.
Elfeky et al. [13] presented two algorithms to find symbol and segment periodicities in the time series. The complexity of their algorithm was $O(n \log n)$. They used the fast Fourier transformation and convolution for discovering element and subsequence periodicities.
Rasheed et al. [14] proposed an algorithm that considers the periodicity of alternative substrings and introduced the concept of relaxed range window (RRW) for detecting periodic occurrences in biological sequences. This approach provides equal treatment for A and T and also for C and G. For example, the sequence TTACGAATGGTAGT has the periodicity for alternative string group (TT, AA, and TA) with period 4. The strings TA, TT, and AA are parts of an alternative group and the presence of any of these is counted as a valid repetition. Another example of the RRW concept is the sequence abdadbacc. Here, "a" is periodic with period 3 starting from position 0 with a periodic strength of 100%. They computed the periodicity of individual symbols and combined the results by considering their starting positions. They used the suffix tree representation for detecting the periodicities in DNA sequences by modifying the algorithms of Elfeky et al. [13] and Ma and Hellerstein [15].
The algorithm of Ma and Hellerstein [15] computed the symbol periodicity with time tolerance window which is used to accommodate various types of noise in the data. They used the edit distance measure for discovering periods of the element's occurrence. The result of the element periodicity was used to find the approximation of subsequence periodicity.
Huang and Chang [16] presented their algorithm for finding similar periodic patterns, by varying the time limit of the sequence. They used the dynamic time warping (DTW) method for discovering the periods. DTW is a technique for measuring similarity between two temporal sequences which may vary in time or speed. DTW has been applied to temporal sequences of audio, video, and graphics data. The warping function was used to compute the distance between any two elements.
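As a point of reference for the DTW measure mentioned above, the following is a minimal sketch of the textbook dynamic-programming formulation, not the method of Huang and Chang [16]. The function name dtw_distance and the use of the absolute difference as the local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences.

    Assumption: the local cost between two elements is their absolute
    difference; other warping functions can be substituted.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, or match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Two sequences that differ only in speed, such as [1, 2, 3] and [1, 2, 2, 3], have a DTW distance of zero under this cost, which is the property that makes the measure tolerant to variations in time.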
Pujeri and Karthik [17] proposed the constraint-based periodicity mining (CBPM) algorithm that uses a frequent pattern growth (FPG) tree in time series databases. For constraint-based association rule mining, the user can specify various types of constraints, which include constraints based on knowledge, data, dimension, level, interestingness, and rule. By specifying CBPM, the user can evaluate a one-dimensional rule such as buy (school bag) → buy (uniform), where the dimension is buy. Also, the user can evaluate a rule such as occupation (student) → buy (textbook), which has two dimensions, occupation and buy. Further, multidimensional rules can be evaluated in a similar manner. The time complexity of the CBPM algorithm depends on the length of the input sequence and the length of the periodic pattern.
Apart from the above works, there are many research works in the field of biological science that are related to the dengue sequence. Some of the works that are relevant to the current work are furnished below.
Kececioglu and DeBlasio [18] developed a software tool for searching the similarity based on sequence alignment algorithms (SAA). SAA include local, global, and multiple sequence alignment for providing accurate results while analyzing the sequence.
Prada-Arismendy and Castellanos [19] presented a technique called Forensic Investigation Analysis which uses the information related to existing protein structure and predicts the formation of proteins by using visualization techniques.
Mairiang et al. [20] focused on the combined analysis of protein interactions. They tested each identified host protein against the proteins of all four serotypes of dengue and identified the interactions that are conserved across serotype. Their contribution was useful in understanding the interplay between dengue and its hosts.
Bletchly [21] proposed the pathogen analysis which helps to explore the human immune response to dengue virus infection and to analyze the antigen and structure of the protein. Pathogen is an infectious agent that causes disease or illness to its host. This analysis examines both the human immune response system and the circulation of the serum of infected patients.
Though, there are various techniques available to find the periodic patterns in time series and other sequences, the works related to the biological sequences are very limited. Further, the existing works concentrate mainly on element periodicity or subsequence periodicity. Therefore, there is a need for holistic approach that computes all kinds of periodicities and their associations.
In the current work, we propose an approach called RECFIN to compute several periodicities including latent periodicity. The RECFIN algorithm adopts the suffix tree technique. PARM generates the periodic association rules from frequent item sets. Though our algorithm has a worst-case time complexity of $O(n^2)$, it is helpful in predicting the future evolution of dengue virus types accurately. For a gene sequence of given length, the RECFIN algorithm computes the element and subsequence periodicities. In addition, it finds palindromes, which is helpful in predicting the formation of proteins. The proposed approach utilizes the suffix tree (ST) data structure. Based on the occurrence of element and subsequence periodicities, the PAR is generated.

Element Periodicity.
In a DNA sequence D, an element $e$ is said to be periodic with a period $p$ if $e$ exists at almost every $p$th interval. For example, in the DNA sequence $D_1$ = ACGACCACGC, the symbol C is periodic with period 4 since C exists every four intervals (i.e., in positions 1, 5, and 9). Moreover, the element A is periodic with period 3 since A exists almost every three intervals (i.e., in positions 0, 3, and 6, but not 9). The element periodicity is defined as follows.
Let $D$ be a sequence of length $n$. Then $\pi_{p,l}(D)$ denotes the projected sequence that contains the elements of $D$ starting at position $l$ and taken at every period $p$; it can be shown as $\pi_{p,l}(D) = e_l, e_{l+p}, e_{l+2p}, \ldots, e_{l+(m-1)p}$, where $0 \le l < p$ and $m = \lceil (n-l)/p \rceil$. For example, if $D_1$ = ACGACCACGC, then $\pi_{4,1}(D_1)$ = CCC and $\pi_{3,0}(D_1)$ = AAAC. Naturally, the ratio of the number of occurrences of an element in a certain $\pi_{p,l}(D_1)$ to the length of this projection indicates how often this element occurs every $p$ intervals.
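The projection and the occurrence ratio described above can be sketched in a few lines. This is an illustrative sketch only; the function names projection and periodic_ratio are hypothetical and are not part of RECFIN, and positions are 0-based as in the example.

```python
def projection(seq, p, l):
    """Elements of seq at positions l, l+p, l+2p, ... (0-based).

    Assumption: 0 <= l < p, as in the definition of the projected sequence.
    """
    if not 0 <= l < p:
        raise ValueError("start position l must satisfy 0 <= l < p")
    return seq[l::p]


def periodic_ratio(seq, element, p, l):
    """Fraction of the projection occupied by the given element."""
    proj = projection(seq, p, l)
    return proj.count(element) / len(proj) if proj else 0.0


d1 = "ACGACCACGC"
print(projection(d1, 4, 1))           # projection with period 4, start 1
print(periodic_ratio(d1, "A", 3, 0))  # share of A in the period-3 projection
```

For the sequence ACGACCACGC, the element C fills the whole projection with period 4 and start 1, whereas A fills three of the four positions of the projection with period 3 and start 0, which is why A is described as periodic at almost every interval.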

Subsequence Periodicity.
Unlike element periodicity, which focuses on elements that may each have a different period, subsequence periodicity focuses on the repetition of a sequence of values. The DNA sequence $D$ is said to be periodic with a period $p$ if $D$ can be divided into equal-length subsequences, each of length $p$, which are almost similar. For example, the DNA sequence $D_2$ = ACGACGACG is clearly periodic with a period 3; likewise, the DNA sequence $D_3$ = ACGACTACG is partially periodic with a period 3 for the same subsequence, despite the fact that its second subsequence is not identical to the other subsequences.
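One simple way to quantify full and partial subsequence periodicity is to split the sequence into blocks of the candidate period and measure how many blocks agree with the most frequent one. The sketch below assumes exact block matching and ignores a trailing incomplete block; the name subsequence_periodicity is hypothetical, and the approach is an illustration rather than the RECFIN procedure.

```python
from collections import Counter


def subsequence_periodicity(seq, p):
    """Return (dominant block, fraction of blocks equal to it) for period p.

    Assumption: blocks are compared exactly; a trailing block shorter
    than p is ignored.
    """
    blocks = [seq[i:i + p] for i in range(0, len(seq) - p + 1, p)]
    if not blocks:
        return None, 0.0
    pattern, count = Counter(blocks).most_common(1)[0]
    return pattern, count / len(blocks)


print(subsequence_periodicity("ACGACGACG", 3))  # fully periodic case
print(subsequence_periodicity("ACGACTACG", 3))  # partially periodic case
```

A fraction of 1 corresponds to a fully periodic sequence, and a fraction below 1 (two of three blocks for ACGACTACG) corresponds to partial periodicity.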

Latent Periodicity.
The detection of hidden regularity patterns such as palindromes in DNA sequences plays a major role in the classification of dengue virus serotypes such as DEN1 and DEN2. Consider the DNA sequence $D_4$ = CAGGAC, which is a palindrome. The rearranged sequence of $D_4$ is $D_4'$ = GACGAC. The periodic interval of each element in $D_4'$ is 3. $D_4$ is said to be a complicated palindrome [2].

Periodicity Detection with RECFIN Algorithm.
The RECFIN algorithm has four steps as described below.

Suffix Tree Based Representation.
A suffix tree (ST) is a nonlinear data structure that has proved to be very useful in string processing [14]. It is useful in searching for a substring of the original string and in finding frequent substrings. Each branch of the suffix tree represents a suffix of the original string. Hence, a suffix tree for a string of length n has n branches and, thus, n leaf nodes.
Each leaf node in the tree has an integer value showing the starting position, in the original string, of the substring obtained along the path from the root to that leaf. Since there are exactly n suffixes for a string, each starting at one of the index positions, there are n leaf nodes in the tree. Each internal node has a value representing the length of the substring obtained so far while traversing from the root to the node. In a suffix tree, each node contains a unique field called index. It identifies the starting index of a substring in the multiple sequences.
Consider the DNA sequence t = CAGTCAGG. The sequence can be written based on its index. A symbol $ is added to t as a termination indicator. The construction of the ST is illustrated in Figure 1, where the non-leaf nodes are generated based on the first occurrence of the subsequence in reverse order of t. The leaf nodes are generated based on the occurrence of the parent node as well as in the indexed order till the end of the sequence. Therefore, the ST is useful in the identification of all subsequences such as G$, GG$, AGG$, CAGG$, TCAGG$, GTCAGG$, AGTCAGG$, and CAGTCAGG$.
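To make the suffix enumeration concrete, the following sketch lists the suffixes of t = CAGTCAGG and inserts them into a naive, uncompressed suffix trie. This is an assumption-laden illustration: a trie built this way takes quadratic time and space and is not the construction shown in Figure 1; the names suffixes, build_suffix_trie, and count_leaves are hypothetical, and start positions are 1-based.

```python
def suffixes(t, terminator="$"):
    """(1-based start index, suffix) pairs, shortest suffix first."""
    s = t + terminator
    return [(i + 1, s[i:]) for i in range(len(t) - 1, -1, -1)]


def build_suffix_trie(t, terminator="$"):
    """Naive suffix trie as nested dicts; each leaf stores its start index."""
    s = t + terminator
    root = {}
    for i in range(len(t)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["index"] = i + 1  # leaf value: starting position of the suffix
    return root


def count_leaves(node):
    """Number of leaves, i.e. nodes carrying an 'index' value."""
    total = 1 if "index" in node else 0
    for key, child in node.items():
        if key != "index":
            total += count_leaves(child)
    return total


t = "CAGTCAGG"
for start, suffix in suffixes(t):
    print(start, suffix)
print(count_leaves(build_suffix_trie(t)))  # one leaf per suffix
```

The leaf count equals the length of the sequence, in line with the statement that a string of length n yields n leaf nodes.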

Element and Subsequence Periodicity.
After the construction of the suffix tree, the tree traversal process is performed in a bottom-up fashion. During the traversal, each leaf node passes its value to its parent. A subsequence starting at position i can be found by traversing the corresponding leaf node that contains the value i and its parent nodes till the root is reached. Consider the sequence starting with index 2, that is, AGTCAGG$. To get the sequence, the traversal is performed from the leaf node 2 towards the root through the parent nodes as shown in Figure 2. Similarly, the traversal for the subsequence AGG$ is performed from the starting leaf node 6. The resultant sequence must be reversed in order to get the required subsequence. The traversal process from leaf node to root needs to be performed recursively and is known as recurrence calculation. In the algorithm, the reccal procedure is used for this. In a suffix tree, a leaf can represent more than one parent. The total number of parents can be calculated as (n + 1) − i, where n is the length of t and i is the index value. For i = 3 and n = 8, the possible parent values can be calculated as (n + 1) − i; that is, (8 + 1) − 3 = 6. Therefore, six combinations such as G, GT, GTC, GTCA, GTCAG, and GTCAGG are possible. Hence, the value of reccal is incremented by 1. The count represents the frequency of the occurrence of a sequence [22]. Thus, the suffix tree based representation helps us to find the element and subsequence periodicities simultaneously for the given sequence.
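The count (n + 1) − i and the notion of occurrence frequency can be checked directly on the string, without a tree. The sketch below is only an illustration under the assumption of 1-based indices; prefixes_from and occurrence_positions are hypothetical helper names and do not reproduce the reccal procedure.

```python
def prefixes_from(t, i):
    """All substrings of t that start at 1-based position i."""
    return [t[i - 1:j] for j in range(i, len(t) + 1)]


def occurrence_positions(t, sub):
    """1-based start positions of (possibly overlapping) occurrences of sub."""
    return [k + 1 for k in range(len(t) - len(sub) + 1) if t.startswith(sub, k)]


t = "CAGTCAGG"
combos = prefixes_from(t, 3)
print(len(combos), combos)             # (8 + 1) - 3 substrings starting at index 3
print(occurrence_positions(t, "CAG"))  # positions whose difference suggests a period
```

The gap between successive occurrence positions of a repeated subsequence gives a candidate period, and the number of occurrences gives its frequency.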

Latent Periodicity in Suffix Tree.
Apart from finding the subsequence periodicity in the forward direction, the occurrence of palindromes can also be found.
If the second half of a string is the reverse of its first half, the string is said to have latent periodicity. For example, the DNA sequence $D_4$ = CAGGAC is a palindrome whose second half contains the values of the first half in rearranged (reversed) order. Thus, the presence of a palindrome is found. In the algorithm, the procedure polycheck is used for this purpose.
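The palindrome test described above can be sketched as a comparison of the two halves of a sequence. This is not the polycheck procedure itself; the function names mirrored_halves and palindromic_substrings are hypothetical, and the check is a purely textual palindrome (no base complementing), as in the CAGGAC example.

```python
def mirrored_halves(seq):
    """Return (first half, second half) if the second half reverses the first.

    Assumption: for odd lengths the middle element is ignored.
    """
    half = len(seq) // 2
    first, second = seq[:half], seq[len(seq) - half:]
    return (first, second) if half > 0 and second == first[::-1] else None


def palindromic_substrings(seq, min_len=4):
    """Brute-force list of (0-based start, substring) textual palindromes."""
    found = []
    n = len(seq)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            piece = seq[start:end]
            if piece == piece[::-1]:
                found.append((start, piece))
    return found


print(mirrored_halves("CAGGAC"))         # halves CAG and GAC mirror each other
print(mirrored_halves("GACGAC"))         # a plain repeat is not a palindrome
print(palindromic_substrings("CAGGAC"))  # includes the inner palindrome AGGA
```

The brute-force scan is quadratic in the number of substrings examined and is meant only to illustrate the definition on short examples.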

Periodic Association Rules.
A further step in this direction is the prediction of co-occurrence patterns among the dengue gene sequences. This can be done by evaluating the rules that can reveal the occurrence of an element or subsequence. Such rules are called periodic association rules, and the corresponding technique is called Periodic Association Rule Mining. PARM is similar to market basket analysis. In PARM terminology, the nucleic or amino acids may be considered as items and the gene subsequences as the baskets that contain the items. In traditional association rules, only the number of frequent items is calculated, whereas PARM calculates the occurrence order of frequent item sets along with their periodic positions.
To obtain a periodic association rule, the frequencies of nucleic or amino acids are computed in each dengue gene sequence. The rule can be expressed as A → C, where A and C are the associated items. The rule states that if a nucleic acid is present in a given sequence with periodicity $p_1$, then there will be another nucleic acid that has a similar periodicity with respect to their respective initial positions. The PARM procedure enables finding the periodicity $p_1$ along with its starting positions. Let $I = \{i_1, \ldots, i_k\}$ be a set of elements, called items. Let $B = \{b_1, \ldots, b_n\}$ be a set of subsets of $I$. We call each $b_i$ a transaction. In the market basket application [22], the set $I$ denotes the items stocked by a retail outlet and each basket is the set of items of a transaction. Similarly, in the case of a gene sequence, the set $I$ denotes the elements of nucleic or amino acid and the basket is the orderly subsequences. The order and frequency of the elements can be evaluated using the suffix tree. The PAR is intended to capture the orderly dependence among the elements of the dengue virus dataset, and the rule can be represented as $i_1 \to i_2$ along with the period and starting position of $i_1$ and $i_2$, provided that the following conditions hold: (1) $i_1$ and $i_2$ occur at regular intervals in the sequence for at least $s$% of the $n$ baskets, where $s$ is the support and $n$ is the number of subsequences; (2) of all the subsequences containing $i_1$, at least $c$% also contain $i_2$, where $c$ is the confidence [22].
Algorithm 1 (opening lines). Objective: to mine PAR. Input: gene sequence of dengue virus D, minimum support s, and confidence c. Output: periodic association rules. Method: (1) Construction of suffix tree: (a) read the given input.
The above definition can be extended to form multidimensional periodic association rules such as AC → GT, where AC and GT are elements of nucleic acid with periodic dependence. The association rules are considered to be interesting if they satisfy both the minimum support and the minimum confidence thresholds. The threshold values are set by users based on their domain expertise [22].
To evaluate the PAR, we propose the RECFIN algorithm. The following steps are involved in the RECFIN algorithm: (1) based on the occurrence positions, the elements are mapped into integers; (2) based on the support threshold, the element periodicity is found; the set of elements that satisfies the minimum support threshold is called the frequent item set; (3) the frequent item sets are used to generate association rules; for example, consider the item set {A, C, G}; the following rules can be evaluated using the given item set: Rule 1 is as follows: A ∧ C → G; Rule 2 is as follows: C ∧ G → A; Rule 3 is as follows: A ∧ G → C; Rule 4 is as follows: G ∧ A → C; Rule 5 is as follows: C ∧ A → G; Rule 6 is as follows: G ∧ C → A.
In the above rules, the elements that appear on the left-hand side are called the antecedent and the element on the right-hand side is called the consequent; the confidence is computed as the conditional probability of the consequent given the antecedent. For example, the confidence of Rule 1 is computed as confidence(A ∧ C → G) = support(A ∧ C ∧ G) / support(A ∧ C); if the confidence is equal to or greater than a given confidence threshold, the rule is considered to be an interesting rule; (4) based on the support and confidence, the PAR is generated.
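The support and confidence computations in steps (2) to (4) can be illustrated with a small sketch. It uses the standard market-basket definitions and treats each subsequence as an unordered basket of elements, so the period and starting position carried by a PAR are deliberately left out, and rules that differ only in the order of the antecedent (such as A ∧ C → G and C ∧ A → G) are treated as one. The function names and the toy baskets are assumptions made for illustration, not data from the experiments.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= set(b)) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent) / support(antecedent)."""
    s_ante = support(antecedent, baskets)
    if s_ante == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / s_ante


def rules_from_itemset(itemset, baskets, min_conf):
    """Rules with a single-element consequent that meet min_conf."""
    items = sorted(itemset)
    rules = []
    for consequent in items:
        antecedent = tuple(x for x in items if x != consequent)
        conf = confidence(antecedent, (consequent,), baskets)
        if conf >= min_conf:
            rules.append((antecedent, consequent, conf))
    return rules


# Hypothetical baskets (subsequences), chosen only to exercise the functions.
baskets = ["ACG", "ACT", "ACG", "GTA"]
print(support("AC", baskets))
print(confidence("AC", "G", baskets))
print(rules_from_itemset("ACG", baskets, min_conf=0.7))
```

With a confidence threshold of 70%, as used in the experiments, only the rules whose confidence reaches the threshold are retained as interesting; a full PAR would additionally record the period and starting position of each element.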

RECFIN Algorithm.
In this section, we describe the pseudocode of the RECFIN algorithm in Algorithm 1, which covers the entire process: element and subsequence periodicities, palindrome checking, and the generation of periodic association rules.

Experimental Results
To demonstrate the functionality of the RECFIN algorithm, the dengue gene sequence datasets of NCBI have been used [10]. These datasets contain four different dengue viruses, namely, DEN1, DEN2, DEN3, and DEN4. This experiment utilizes the DNA sequence of DEN4 as the input sequence with a support threshold of 50% and a confidence threshold of 70%. The partial DNA sequence of DEN4 is shown in Box 1. The length of the input sequence is 10,735 characters. For the demonstration, consider the following given sequence: CATCATGG. The suffix tree of the given sequence is illustrated in Figure 3. Figure 4 illustrates the periodic occurrences of the given sequence AGAA. ... the partial number of output periods along with the latent periodicity. The final step of the RECFIN algorithm is to evaluate the PAR. The PAR that is generated by RECFIN contains interesting as well as extraneous patterns. Therefore, pruning is necessary to extract the useful patterns. An interesting pattern is a pattern that has strong periodic dependence with high support and confidence. The interesting PAR covers the rules of similar periodic intervals among the different dengue virus serotypes shown in Table 3.
The PAR contains the elements along with their starting position, periodicity values, and their dependence with support and confidence. Rule 1 of Table 3 reveals the element periodicity. Further, the occurrence of elements A, G, and C with periodicity 21 reveals the periodic dependence of element T.

Comparative Analysis
The NCBI-GenBank database, which has 171 million sequences as of February 2014 [10], is used for the comparative analysis. For the comparative analysis of the algorithm, we have used DNA sequences of four different dengue virus serotypes. The sequences in this dataset vary in length. The length of each dengue virus serotype sequence is listed in Table 4. The result of the NCBI-Basic Local Alignment Search Tool (BLAST) algorithm is compared with that of our proposed algorithm through the experiments. The most important aspect is the accuracy with respect to the discovered periods of the proposed algorithm, as discussed in Section 5.1. Then, the time performance of the proposed algorithm is displayed in Section 5.2.

Accuracy.
The accuracy measure is the ability of the algorithm to detect the periodicities in the given sequence. To accurately discover a period, the periods discovered with a high periodicity threshold value are better candidates than those discovered with a lower periodicity threshold value. Therefore, we examine the accuracy by measuring the closeness of the periodic values estimated by the algorithm. The accuracy is measured by the average periodic intervals. Table 5 shows the average periodic interval of element, subsequences, and latent periodicities in the given sequence (DEN4). Figure 5 shows that the periodicity is increasing in level when more intervals are included.

Time Performance.
To evaluate the time performance of the proposed RECFIN algorithm, Figure 6 exhibits the sequential characteristics of the algorithms with respect to the sequence length. ... most relevant results of our RECFIN algorithm. NCBI-BLAST draws on an enormous collection of datasets and compares the given sequence with the existing online dataset. The output of the proposed RECFIN algorithm closely matches the NCBI-BLAST result. The alignments of the sequences are also compared and are shown in Figures 7(a) and 7(b).
The entire length of sequences is marked as Query. The color key can be used to show the score of alignment. In this case, the entire sequence will be aligned exactly and is represented as color key for alignment score ≥200 in Figure 7(a). The attributes, score, and identities show the alignment of the entire sequence in Figure 7(b).
PAR is generated based on the occurrence of both element and subsequence periodicities along with latent periodicity. After the analysis of the results, we have obtained some of the interesting frequent patterns based on the periodic intervals.

Conclusion
In this paper, we have derived PAR to predict the dengue serotype and have defined three types of periodicities. The element periodicity addresses the periodic intervals among the elements; the subsequence periodicity addresses the periodic intervals among the subsequences along with the latent periodic patterns. The proposed RECFIN algorithm, which is based on the suffix tree method, detects each type of periodicity in $O(n^2)$ time for a gene sequence of length $n$. Finally, our algorithm is used to define periodic association rules for each dengue virus serotype with the interestingness measures of support and confidence thresholds, which helps to predict the future evolution of dengue virus serotypes.