Finding Top-k Covering Irreducible Contrast Sequence Rules for Disease Diagnosis

Diagnostic genes are usually used to distinguish different disease phenotypes. Most existing methods for finding diagnostic genes are based on either the individual or the combinatorial discriminative power of genes. However, both ignore the common expression trends among genes. In this paper, we devise a novel kind of sequence rule, namely, top-k covering irreducible contrast sequence rules (TopkIRs for short), which helps to build a sample classifier of high accuracy. Furthermore, we propose an algorithm called MineTopkIRs to efficiently discover TopkIRs. Extensive experiments conducted on synthetic and real datasets show that MineTopkIRs is significantly faster than the previous methods and achieves higher classification accuracy. Additionally, many of the diagnostic genes discovered provide new insight into disease diagnosis.


Introduction
It has been shown that many diseases are closely related to genes [1][2][3]. In bioinformatics, such genes are called diagnostic genes. Capturing these genes is an important task, which helps in the diagnosis, prediction, and treatment of diseases [4].
According to biological theory, only a small number of genes are directly related to a certain disease [5]. Biologists therefore want to use fewer genes to achieve higher disease prediction accuracy. In practice, picking out these diagnostic genes from a massive amount of gene expression data to distinguish different disease phenotypes is often an intractable problem.
Many studies have shown that contrast rules are very promising for this problem. Contrast rules are rules that appear frequently in one class but rarely in the other classes, denoted as $G \rightarrow c$, where $G$ represents the diagnostic genes and $c$ represents a certain disease. Most such methods fall into two categories, that is, single discrimination based [6] and combinatorial discrimination based [7]. The former evaluates every gene according to its individual discriminative power with respect to the target classes and then selects the top-ranked genes. The latter often models the problem as a subset search problem and focuses on the combinatorial discriminative power of a set of genes. However, neither of the two exploits the relationship among genes, so some important diagnostic genes may be missed.
In this paper, we tackle the problem by utilizing the order relationship among genes. Below is a real example that illustrates our basic idea.
Example 1. Figure 1 consists of two subfigures. In the top subfigure, 4 genes are expressed over 25 samples. Samples 1∼16 are cancerous and samples 17∼25 are normal. In the bottom subfigure, another set of 3 genes is expressed over the same set of samples. The existing singleton or combination discriminability-based methods cannot distinguish the two phenotypes. Since most genes have similar average expression values in the two phenotypes, they will not be selected by the singleton approach. Moreover, all genes are expressed in both phenotypes. Thus, the combination approach based on the co-occurrence of genes will not select them either. Both methods ignore the hidden interrelation among genes. In the top subfigure, the gene order over the samples of the cancerous phenotype is always $g_4 \prec g_3 \prec g_2 \prec g_1$, whereas this order is disturbed in the normal phenotype. In the bottom subfigure, the gene order in the normal phenotype is $g_5 \prec g_6 \prec g_7$, while no such order exists in the cancerous phenotype. Based on the ordered expression values, the disease phenotypes (the two shadowed "blocks") are well identified.
Example 1 indicates that contrast sequence rules may be a promising solution to the mentioned problem. Another advantage of incorporating sequence rules into diagnostic gene finding is that we may obtain higher disease prediction accuracy with fewer genes. Intuitively, this is because the order contains both individual information and combinatorial information. In [8], we proposed a contrast sequence rule mining algorithm, namely, NRMINER, and showed its effectiveness and efficiency. However, some issues still demand further consideration.
Given $n$ genes, there are up to $2^n$ subsets of genes. Moreover, each subset of $i$ genes corresponds to $i!$ permutations. Thus, the number of candidate contrast sequence rules is at least $\sum_{i=1}^{n} \binom{n}{i} \cdot i! > n!$ in theory. On one hand, such a massive number of rules makes it very hard for biologists to interpret and validate the results. On the other hand, mining all of them may take so much time that the method is not practically feasible. In practice, we often need only a small set of representative contrast sequence rules instead of all the rules. This is the so-called top-k problem in the database and data mining communities. Accordingly, the goal of this paper is to discover top-k covering irreducible contrast sequence rules (TopkIRs for short) from a given gene expression dataset.
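To give a feeling for how fast this search space grows, the count above can be evaluated directly. The following is a minimal sketch; the function name `count_candidate_sequences` is ours and is not part of the paper. Since $\sum_{i=1}^{n} \binom{n}{i} \cdot i! = n! \sum_{j=0}^{n-1} 1/j!$, the count approaches $e \cdot n!$ as $n$ grows.

```python
from math import comb, factorial


def count_candidate_sequences(n: int) -> int:
    """Number of ordered gene sequences over all non-empty subsets of n genes.

    A subset of i genes can be chosen in C(n, i) ways and ordered in i! ways.
    """
    return sum(comb(n, i) * factorial(i) for i in range(1, n + 1))


# Compare the count with n! for a few small gene counts.
for n in (3, 5, 10):
    print(n, count_candidate_sequences(n), factorial(n))
```

Even for 10 genes there are already millions of candidate antecedents, before each one is paired with a class label.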
Compared with the existing methods, our contributions in this paper are as follows.
(1) We propose the concept of top-k covering irreducible contrast sequence rules, which greatly reduces the burden on biologists to interpret and validate the results and makes an efficient diagnostic gene finding method practical.
(2) We devise the criteria for ranking irreducible contrast sequence rules. Based on these criteria, we can pick out shorter and fewer but more representative rules to build a classifier with higher classification accuracy.
(3) We develop a novel algorithm called MineTopkIRs to directly discover top-k covering irreducible contrast sequence rules without postprocessing. To the best of our knowledge, few works address this problem in the context of sequence mining.
The rest of this paper is organized as follows. In Section 2, we introduce some preliminaries and give our problem definition. Section 3 introduces the criteria for ranking rules. Section 4 details the MineTopkIRs algorithm. Section 5 presents the experimental results and analysis. Finally, Section 6 concludes this paper.

Preliminary
In this section, we first introduce some basic concepts useful for further discussion and then formalize the problem to be addressed in this paper.

Basic Concepts. A microarray dataset $D$ is an $m \times n$ matrix, with samples $S = \{s_1, s_2, \ldots, s_m\}$ and genes $\mathcal{G} = \{g_1, g_2, \ldots, g_n\}$. A real value $d_{ij}$ in $D$ represents the expression value of gene $g_j$ on sample $s_i$. An example microarray dataset of 7 genes and 6 samples is shown in Table 1, where the last column lists the class label for each sample.
As mentioned, we want to tackle the problem from the gene order perspective. Accordingly, we propose the EWave model, a sequence model to represent the gene expression data. Next are some necessary concepts.

Definition 2. Given an expression matrix of a sample set $S = \{s_1, s_2, \ldots, s_m\}$ and a gene set $\mathcal{G} = \{g_1, g_2, \ldots, g_n\}$, if, for a nonnegative grouping threshold and some sample $s \in S$, there exists a subset of genes satisfying both (1) and (2), we call this subset an equivalent dimension group (EDG for short) of the sample $s$. Specifically, we call a gene satisfying (1) but excluded from an EDG by (2) a "breakpoint." The method of creating EDGs is detailed in [8]. It is worth noting that no order is considered within an EDG, since the expression values in it have no significant differences.
An EWave model can be used to represent the sequences of EDGs. Figure 2 shows the EWave model corresponding to the running example in Table 1, where the grouping threshold is 0.5. Unlike traditional sequence-like data, the EWave model allows different EDGs to overlap, so a gene in an EDG can also belong to several other EDGs at the same time. Given a sample $s$ and a gene $g$, the sequence of EDGs of $s$ is denoted as $\$_s$. Then, we call the index of the first EDG in $\$_s$ containing $g$ the head position of $g$ with respect to $s$ and the index of the last EDG in $\$_s$ containing $g$ the tail position of $g$ with respect to $s$, denoted as $h_s(g)$ and $t_s(g)$, respectively.
Further, we refer to a gene sequence $G$ in which no pair of genes is in the same EDG as a significant chain.
Example 5. In Figure 2, $g_6 g_7 g_5$ is a significant chain of $\$_{s_1}$, but $g_6 g_2 g_5$ is not, since $g_2$ and $g_5$ coexist in the same EDG.
As mentioned above, we aim to capture the difference among sample phenotypes from a sequence point of view. The benefit of the EWave model is twofold. On one hand, gene expression data are very noisy, and gene expression values are sometimes very close. If we only consider significant chains, the difference between genes is large enough that the order among genes can be determined reliably. On the other hand, the high dimensionality of gene expression data is largely reduced at the same time. Next, we introduce some concepts related to the contrast sequence rule under the EWave model.
Definition 6. Let $D$ be EWave modeled gene expression data. Then, for a given sequence rule $r$, denoted as $G \rightarrow c$, where $G$ is a significant chain and $c$ is a given class label, the support of $r$, denoted as supp($r$), is defined as the number of the sequences of EDGs in $D$ that contain $G$ and carry the class label $c$; the set of the corresponding samples is called the sample support set of $r$. The confidence of $r$ is defined as the ratio of supp($r$) to the number of the sequences of EDGs containing $G$, denoted as conf($r$) = supp($r$)/supp($G$). $r$ is called a contrast sequence rule if supp($r$) and conf($r$) are no less than the minimum support threshold and the confidence threshold, respectively.
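The support and confidence of Definition 6 can be sketched as follows. This is only an illustration under stated assumptions: each sample is represented as a list of EDGs (Python sets of gene names) plus a class label, supp($r$) counts the class-$c$ samples whose EDG sequence contains the antecedent as a significant chain, and all function names and the toy data are ours.

```python
def head_tail(edgs, gene):
    """Return (head, tail): indices of the first and last EDG containing gene."""
    hits = [i for i, edg in enumerate(edgs) if gene in edg]
    return (hits[0], hits[-1]) if hits else None


def contains_chain(edgs, chain):
    """True if chain occurs in edgs with every gene strictly after the previous one."""
    prev_tail = -1
    for gene in chain:
        ht = head_tail(edgs, gene)
        # The gene must exist and start after the previous gene has ended.
        if ht is None or ht[0] <= prev_tail:
            return False
        prev_tail = ht[1]
    return True


def supp_antecedent(dataset, chain):
    """Number of samples (any class) whose EDG sequence contains the chain."""
    return sum(contains_chain(edgs, chain) for edgs, _ in dataset)


def supp_rule(dataset, chain, label):
    """Number of samples of class `label` whose EDG sequence contains the chain."""
    return sum(contains_chain(edgs, chain) for edgs, lab in dataset if lab == label)


def conf_rule(dataset, chain, label):
    """conf(chain -> label) = supp(rule) / supp(chain); 0.0 if the chain never occurs."""
    total = supp_antecedent(dataset, chain)
    return supp_rule(dataset, chain, label) / total if total else 0.0


# Toy data (hypothetical, not taken from Table 1 or Figure 2).
dataset = [
    ([{"g1"}, {"g2", "g3"}, {"g4"}], "c1"),
    ([{"g1"}, {"g3"}, {"g2", "g4"}], "c1"),
    ([{"g2"}, {"g1", "g3"}, {"g4"}], "c2"),
]
print(conf_rule(dataset, ("g1", "g2"), "c1"))
```

In the toy data, the chain `("g1", "g2")` occurs only in the two `c1` samples, so the rule has support 2 and confidence 1, whereas `("g1", "g4")` occurs in every sample and therefore discriminates poorly.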
Definition 12. A contrast sequence rule $r: G \rightarrow c$ is called an irreducible contrast sequence rule if every proper subrule $G' \rightarrow c$ ($G' \sqsubseteq G$) has a confidence below the confidence threshold. In other words, no subrule of an irreducible contrast sequence rule is itself a contrast sequence rule.
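The irreducibility condition of Definition 12 can be checked by brute force on small rules. The sketch below assumes that a subrule is obtained by deleting one or more genes from the antecedent while keeping the order of the rest, and that the confidence of any antecedent is available through a callable `conf_of`; both the names and the confidence values are hypothetical.

```python
from itertools import combinations


def proper_subsequences(chain):
    """Yield every non-empty proper subsequence of chain, order preserved."""
    for size in range(1, len(chain)):
        for idx in combinations(range(len(chain)), size):
            yield tuple(chain[i] for i in idx)


def is_irreducible(chain, conf_of, min_conf):
    """True if chain reaches min_conf while all proper subrules stay below it."""
    if conf_of(chain) < min_conf:
        return False
    return all(conf_of(sub) < min_conf for sub in proper_subsequences(chain))


# Hypothetical confidence values for rules with the same consequent.
conf_table = {("g2", "g5"): 1.0, ("g2",): 0.5, ("g5",): 0.6}
print(is_irreducible(("g2", "g5"), conf_table.get, min_conf=0.8))
```

A rule of length $l$ has $2^l - 2$ proper subrules, so this exhaustive check is only practical for short antecedents.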

Definition 14. Given $D$, an EWave modeled gene expression dataset, the top-k covering irreducible contrast sequence rules for a sample $s$ are the set of rules $\{r_{s,i}\}$ ($1 \leq i \leq k$), where the antecedent of every $r_{s,i}$ is contained by $s$, any two rules in the set have different sample support sets, and there exists no rule $r' \notin \{r_{s,i}\}$ such that $r'$ can substitute any rule in $\{r_{s,i}\}$ based on the rule priority. For brevity, we will use the abbreviation TopkIRs to refer to the top-k covering irreducible contrast sequence rules for each sample.
Example 15. Suppose $k = 2$. Then, for sample $s_1$ in Figure 2, the top-k covering irreducible contrast sequence rules are the set of rules { 1 ,1 : and 1 ,2 are irreducible contrast sequence rules, and (3) there is no other rule which can substitute

Problem Description. Given (1) a gene expression dataset where each sample is attached with a unique class label, (2) the equivalent threshold, (3) the minimum support threshold, and (4) the confidence threshold, the problem is to efficiently discover the set of top-k covering irreducible contrast sequence rules for each sample.

Criteria of Ranking Rules
In this section, we introduce the criteria of ranking rules. In order to evaluate the (dis)similarity between sequences, we propose the concept of projection distance which is more suitable for EWave modeled gene expression data. The reason is that projection distance takes into account not only the difference on the same position between two sequences but also the displacement between the two items.
Assume $G$ is a gene sequence and $\$_s$ is the gene sequence corresponding to sample $s$. The projection of $G$ on $\$_s$, denoted as $G|_s$, refers to the sequence of all elements in $G$, permuted according to their relative order in $\$_s$. Further, if a pair of items in $G$, denoted as $(g_i, g_j)$, has the reverse relative order in $G|_s$, we call it a reverse pair. Then, for an item $g$, if it is at the $i$th locus in $G$ and at the $j$th locus in $G|_s$, we call $|i - j|$ the displacement of $g$ between $G$ and $G|_s$, denoted as $\mathrm{dist}_g(G, G|_s)$.
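The three notions just introduced (projection, reverse pair, and displacement) can be illustrated with a short sketch. It assumes that the sample sequence is a total order of genes, that every gene of $G$ occurs in it, and that loci are counted from zero; the function names and the toy sequences are ours.

```python
def projection(seq, sample_order):
    """Elements of seq, permuted according to their relative order in sample_order."""
    rank = {gene: i for i, gene in enumerate(sample_order)}
    return sorted(seq, key=lambda gene: rank[gene])


def reverse_pairs(seq, proj):
    """Pairs (a, b) with a before b in seq but a after b in proj."""
    pos = {gene: i for i, gene in enumerate(proj)}
    return [
        (seq[i], seq[j])
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if pos[seq[i]] > pos[seq[j]]
    ]


def displacement(seq, proj, gene):
    """Absolute difference between the loci of gene in seq and in proj."""
    return abs(seq.index(gene) - proj.index(gene))


G = ["g1", "g2", "g3"]
sample_order = ["g3", "g5", "g1", "g4", "g2"]  # hypothetical sample sequence
proj = projection(G, sample_order)
print(proj, reverse_pairs(G, proj), displacement(G, proj, "g3"))
```

Here `g3` moves from the last locus of `G` to the first locus of the projection, which creates two reverse pairs and a displacement of 2 for `g3`.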
Definition 16. Given a gene sequence $G$ and the sequence $\$_s$ corresponding to sample $s$, the projection distance between $G$ and $G|_s$ is defined by (3), where the Boolean function applied to a pair of items $(g_i, g_j)$ equals 1 if $(g_i, g_j)$ is a reverse pair and 0 otherwise.
Now, we adopt a similarity function based on the concept of projection distance (or simply PD) to measure the (dis)similarity between a sequence and its projection on sample $s$. The similarity function is formally defined as follows.
Definition 17. Given a gene sequence $G$ and the gene sequence corresponding to sample $s$, the PD similarity between $G$ and $G|_s$, denoted as $\mathrm{Sim}_{PD}(G, G|_s)$, is defined by (4), where $|G|$ is the length of gene sequence $G$.
From (4), we can see that the smaller the projection distance between two sequences, the higher their similarity. If $\mathrm{PD}(G_1, G_2) = 0$, then $\mathrm{Sim}_{PD}(G_1, G_2) = 1$, which means the two sequences are exactly the same. Next, we introduce the criteria for ranking rules in two cases.
Definition 18. The priority within the same rule group: given two rules 1 :
From (5), we can conclude that the more the antecedent of a rule differs from the gene sequences in its nonsupport set, the higher the priority of the rule.

The MineTopkIRs Algorithm
In this section, we present our algorithm, called MineTopkIRs, to solve the problem given in the Problem Description. First, we give a naive method to construct a classifier based on contrast sequence rules.
Step 1. Discover all the frequent sequence patterns with a low minimum support threshold.
Step 2. Combine each sequence pattern with a class label to generate a sequence rule. Then, pick out the contrast sequence rule with highest confidence for each sample in the dataset.
Obviously, this naive two-step mining method generates too many rules in Step 1, which takes too much time. Moreover, selecting only one rule for each sample is often not enough. In contrast, our algorithm is a one-pass process, which is much more efficient. Further, each sample is guaranteed to be covered by its top-k covering irreducible contrast sequence rules. In what follows, we detail the proposed MineTopkIRs algorithm.

Head-Tail Matrix. The Head-Tail matrix is a useful structure for quickly detecting whether a sequence is a significant chain with respect to some sample template sequence $\$_s$, which is a necessary condition for the sequence to be the antecedent of a contrast sequence rule. Table 2 gives the Head-Tail matrix corresponding to the model shown in Figure 2, where each row represents a considered sample and each column represents a remaining gene. Every entry $(s, g)$ in the matrix records a two-dimensional vector $(h, t)$, where $h$ denotes the head position of the gene $g$ in $\$_s$, that is, $h_s(g)$, and $t$ denotes the tail position of the gene $g$ in $\$_s$, that is, $t_s(g)$. For example, in Figure 2, $h_{s_3}(g_1) = 2$ and $t_{s_3}(g_1) = 3$, so the entry at row 3 and column 1 of Table 2 records (2, 3).
An efficient way to decide whether a sequence is a significant chain with respect to $\$_s$, the sequence of EDGs of sample $s$, is to consider only neighboring pairs of genes $g_i$ and $g_j$ in the sequence: if $t_s(g_i) < h_s(g_j)$ holds for every such pair, the sequence must be a significant chain for $\$_s$.
Note: While computing the support of a gene sequence, we use the Head-Tail matrix built with a grouping threshold greater than 0, which makes the order between genes in the sequence significant enough. However, when computing the projection distance of a gene sequence for some $\$_s$, we use the Head-Tail matrix built with a grouping threshold of 0, which makes the displacement of a reverse pair easy to determine.
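The neighboring-pair test can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Head-Tail matrix is stored as a nested dictionary, positions are 1-based as in the text, and the EDG sequence of the toy sample is hypothetical rather than copied from Figure 2.

```python
def build_head_tail(edg_sequences, genes):
    """Map sample -> gene -> (head, tail), the 1-based first and last EDG index."""
    matrix = {}
    for sample, edgs in edg_sequences.items():
        row = {}
        for gene in genes:
            positions = [i for i, edg in enumerate(edgs, start=1) if gene in edg]
            if positions:
                row[gene] = (positions[0], positions[-1])
        matrix[sample] = row
    return matrix


def is_significant_chain(matrix, sample, chain):
    """Check tail(g_i) < head(g_j) for every neighboring pair (g_i, g_j) of chain."""
    row = matrix[sample]
    if any(gene not in row for gene in chain):
        return False
    return all(row[a][1] < row[b][0] for a, b in zip(chain, chain[1:]))


# Hypothetical EDG sequence in which g2 belongs to two overlapping EDGs.
edg_sequences = {"s1": [{"g6"}, {"g7", "g2"}, {"g2", "g5"}]}
ht = build_head_tail(edg_sequences, ["g2", "g5", "g6", "g7"])
print(ht["s1"]["g2"])
print(is_significant_chain(ht, "s1", ["g6", "g7", "g5"]))
print(is_significant_chain(ht, "s1", ["g6", "g2", "g5"]))
```

Checking only neighboring pairs suffices because head positions never exceed tail positions, so the inequalities chain together: if every gene ends before its successor starts, it also ends before all later genes start.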

The Mining Algorithm. The search space of enumerating all gene sequences is prohibitively large. Thus, a suitable traversal framework with some effective pruning strategies is necessary.
In this paper, we adopt a breadth-first traversal framework. Most sequence pattern mining methods, such as BIDE [9] and FEAT [10], adopt a depth-first traversal. Its advantage is that, by exploiting the antimonotonicity of support, the depth-first traversal can directly prune the search space based on the current sequence without generating a candidate set. However, depth-first traversal is not suitable for the problem raised in this paper, for two reasons. (1) The confidence of an irreducible contrast sequence rule is not antimonotonic, which requires us to check whether all subrules of the current rule satisfy the condition in Definition 12, that is, whether the confidence of every subrule is below the confidence threshold. For example, if the length of the current sequence rule is $l$, we need to check all of its subrules, so the computation is very expensive. (2) Without an index over the visited rules, many rules may be accessed repeatedly. Both cases are very time-consuming. In contrast, the breadth-first traversal handles both problems well. We only need to check whether all the $(l - 1)$-size subrules meet the conditions. Further, these subrules can be obtained by directly accessing the current rule candidate set, which is more efficient.
Formally, the algorithm is shown in Algorithm 1. The algorithm has four input parameters: the original dataset $D$, the equivalent threshold, the minimum support threshold, and the confidence threshold. Because we solve the problem from a gene sequence perspective, the algorithm first transforms $D$ into the EWave model and then constructs the Head-Tail matrix, which accelerates the calculation of rule support. At the same time, the list of top-k covering irreducible contrast sequence rules for each sample with consequent $c$ is initialized. Also, we put all the 1-size rules, each consisting of a single gene, into the rule candidate set Candi R. Then the function breadthfirst search is called to perform the breadth-first traversal and find the top-k rules for each sample.
The function breadthfirst search takes in four parameters: the rule candidate set Candi R, the minimum support threshold, the confidence threshold, and the rule size $l$. When the algorithm reaches level $l$, it generates all the $(l + 1)$-size rules based on the rules in Candi R (line 2). For each $(l + 1)$-size rule, the algorithm applies three pruning rules (lines 4, 6, and 11) to decide whether the rule is put into Candi R for further extension (line 8), used to update the top-k covering rules for the samples in its support set (line 13), or simply pruned. It is worth noting that the confidence of all the rules in Candi R must be below the confidence threshold.
Algorithm 1: function breadthfirst search.
if every $l$-size subrule of $r$ exists in Candi R then
(4) if supp($r$) > ... then Pruning rule 1;
(5) if conf($r$) < ... then
(6) Pruning rule 2;
(7) add $r$ into Candi R;
(8) else
(9) Check the $k$th covering rule for each sample in the sample support set of $r$ to find the lowest confidence minconf and the corresponding support sup;
(10) if ... then
(11) Pruning rule 3;
(12) Update the top-k rule list of each sample in the sample support set of $r$ with $r$ based on Definitions 18 and 20;
(13) end
(14) end
(15) Delete all the $l$-size rules in Candi R;
(16) $l$++;
(17) end

Pruning Strategies. We next illustrate the pruning techniques that are used in MineTopkIRs. With the help of these pruning rules, we can find the top-k covering irreducible contrast sequence rules for each sample efficiently.

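Pruning Rule 1. Let $r: G \rightarrow c$ be the currently considered sequence rule; if there exists a sequence rule $r': G' \rightarrow c$ with $G' \sqsubseteq G$, $G' \neq G$, whose confidence conf($r'$) is not below the confidence threshold, the rule $r$ itself and all its super rules can be pruned.
Proof. By the definition of an irreducible contrast sequence rule, if a sequence rule $r: G \rightarrow c$ is an irreducible contrast sequence rule, then every proper subrule $r': G' \rightarrow c$ ($G' \sqsubseteq G$) must have conf($r'$) below the confidence threshold. Thus, if any of its subrules does not satisfy this condition, the sequence rule $r: G \rightarrow c$ cannot be an irreducible contrast sequence rule. Similarly, none of its super rules can be an irreducible contrast sequence rule, since each of them has the same violating subrule.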
In our algorithm, we store in Candi R, for further extension, each rule whose own confidence and whose subrules' confidences are all below the confidence threshold. When deciding whether a newly generated $l$-size rule is to be pruned, we only need to test whether all of its $(l - 1)$-size subrules are in Candi R. If not, we can safely prune this sequence rule and all its super rules.
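The membership test described above can be sketched as one level-wise extension step. The sketch makes two assumptions that the text does not fix: antecedents are tuples of gene names, and a rule is extended by appending a single gene at its end. The function names are ours, and the support and confidence tests of the full algorithm are deliberately left out.

```python
def size_minus_one_subsequences(chain):
    """All subsequences obtained by deleting exactly one gene from chain."""
    return [chain[:i] + chain[i + 1:] for i in range(len(chain))]


def survives_pruning_rule_1(chain, candidates):
    """Keep chain only if every (l-1)-size subsequence is still a candidate."""
    return all(sub in candidates for sub in size_minus_one_subsequences(chain))


def extend_level(candidates, genes):
    """Generate the next-level antecedents that pass the candidate-set test."""
    next_level = set()
    for chain in candidates:
        for gene in genes:
            if gene in chain:
                continue  # a gene occurs at most once in a sequence
            new_chain = chain + (gene,)
            if survives_pruning_rule_1(new_chain, candidates):
                next_level.add(new_chain)
    return next_level


# Hypothetical 2-size candidates whose confidence is still below the threshold.
candidates = {("g1", "g2"), ("g2", "g3"), ("g1", "g3")}
print(extend_level(candidates, ["g1", "g2", "g3"]))
```

Storing the candidates in a set makes each subrule lookup a constant-time operation on average, which is the practical benefit of consulting the candidate set of the previous level instead of recomputing subrule confidences.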

Pruning Rule 2. Let $r: G \rightarrow c$ be the currently considered sequence rule. If supp($G \rightarrow c$) is below the minimum support threshold, then the current rule and all its super rules are pruned.
Proof. It is immediately derived from the Apriori property of sequences [11] and Definition 12.
In MineTopkIRs, we can also use the top-k constraint to prune rules. Combined with Definition 20, we compute minconf and sup, the critical point of the TopkIRs thresholds for the samples in the sample support set of the current rule, where minconf is the minimum confidence value of the discovered TopkIRs of all these samples and sup is the corresponding support. Assume the top-k covering irreducible contrast sequence rules of each sample are ranked according to the priority between rule groups such that $r_1 \prec r_2 \prec \cdots \prec r_k$; minconf and sup are then computed according to (6).
Computational and Mathematical Methods in Medicine 7
Pruning Rule 3. Given the currently considered sequence rule $r: G \rightarrow c$ whose confidence is not below the confidence threshold, and minconf and sup computed according to (6), if $r$ is less prior, based on the priority between rule groups (Definition 20), than a rule with confidence minconf and support sup, then $r$ and all its super rules cannot become a rule in the top-k covering irreducible contrast sequence rule list of any sample and can be safely pruned.
If the current sequence rule $r: G \rightarrow c$ cannot be pruned by Pruning Rule 3, there are two situations for every sample $s$ in the sample support set of $r$. On one hand, when no rule in the current top-k list of $s$ has the same sample support set as $r$, we only need to check whether $r$ is more prior than the $k$th rule in the list; if so, we substitute $r$ for it. On the other hand, because we want to find top-k rules with different sample support sets for each sample, when some rule in the list has the same sample support set as $r$, we need to check whether $r$ is more prior than that rule based on the priority within the same rule group (Definition 18); if so, we replace that rule with $r$, which guarantees that the rules currently in the list have the highest priority.
In addition, another optimization is utilized in Pruning Rule 3. If we find that all TopkIRs have 100% confidence and the lowest support value among these rules is larger than the minimum support threshold, we dynamically increase the user-specified support threshold.

Performance Studies
In this section, we look at both the efficiency of our algorithm in discovering TopkIRs and the usefulness of the discovered rules. All our experiments were performed on an HP PC with a 2.33 GHz Intel Core 2 CPU, 2 GB RAM, and a 160 GB hard disk running Windows XP. The algorithms were coded in Standard C.
Datasets. We use four real gene expression datasets for the experimental studies: Leukemia [1], DLBCL Tumor [2], Hereditary Breast Cancer (HBC) [3], and Prostate Cancer (PC) [12]. Table 3 shows the characteristics of the four datasets: the number of samples (#sample), the number of genes (#gene), and the class labels. The number of samples in every class is shown in the last column. Moreover, we generate the synthetic datasets by using a specialized dataset generator [8].

Efficiency of MineTopkIRs. In terms of efficiency, we compare MineTopkIRs with R-FEAT and NRMINER [8]. On one hand, R-FEAT is adapted from the sequence generator mining algorithm FEAT [10]. Briefly, we apply FEAT to a given dataset; when a generator $G$ is found, we decide whether $G \rightarrow c$ could be a result by checking whether all rules $G' \rightarrow c$, where $G' \sqsubseteq G$, satisfy the conditions of Definition 14. On the other hand, the NRMINER algorithm adopts a template-driven method to find all the interesting nonredundant contrast sequence rules, which are then checked against the conditions of Definition 14. We should point out that the rules discovered by MineTopkIRs are a subset of those discovered by the two existing methods.
In Figure 3, we study how the running time varies with #sample and #gene on the synthetic datasets, first by increasing #sample from 10 to 30 while fixing #gene to 100 and then by increasing #gene from 20 to 100 while fixing #sample to 30. Figures 3(a) and 3(b) show that the running time becomes longer as #sample and #gene increase, because the search space also becomes larger. However, MineTopkIRs is always much faster than the other two methods, because it discovers the results directly in one step. The other two are two-step mining methods, which first discover a bigger result set and then postprocess it. Further, as the search space grows, the number of rules produced by the first mining step grows exponentially, which makes these methods very time-consuming.
Figures 4 and 5 show the effect of two further parameters on the running time. We observe similar tendencies on all datasets: the running time of MineTopkIRs increases monotonically with the parameter varied in Figure 4 and decreases monotonically with the parameter varied in Figure 5.
Figures 6 and 7 show the effect of varying the minimum support threshold and the confidence threshold on the four real gene expression datasets. Note that the running-time axes in Figures 6 and 7 are in logarithmic scale. We run MineTopkIRs with $k = 10$. Figures 6(a)-6(d) show the running time as the minimum support threshold varies, where the confidence threshold and the equivalent threshold are set to 0.8 and 0, respectively. In Figure 7, the confidence threshold changes from 70% to 90% while the equivalent threshold is 0 and the minimum support threshold is fixed for every dataset. As seen from Figure 6, the running time decreases as the minimum support threshold increases, because a higher threshold prunes more useless rules. We also find that MineTopkIRs is usually one order of magnitude faster than the other two algorithms, especially at low minimum support.
The reason MineTopkIRs outperforms the other two algorithms is that R-FEAT and NRMINER discover a large number of rules at lower minimum support, while the number of rules discovered by MineTopkIRs is bounded. Besides, MineTopkIRs can use Pruning Rule 1 to prune the search space, whereas R-FEAT and NRMINER cannot exploit this property. Figure 7 shows that the running time of both NRMINER and R-FEAT does not change significantly as the confidence threshold increases, because the pruning strategies of these methods are mainly based on the support threshold. The running time of MineTopkIRs, however, increases slightly with the confidence threshold. This is because, as the confidence threshold increases, the number of rules whose confidence is below it also increases, so the pruning ability decreases a little. Even so, MineTopkIRs is still faster than the other two algorithms, for the reasons given above for Figure 6.

Effectiveness of MineTopkIRs. In terms of the effectiveness of MineTopkIRs, the classification accuracy and the complexity are used as the performance criteria for evaluation. Moreover, the biological significance of the discovered genes is also discussed.

Accuracy and Complexity. We build a classifier, called the TopkIR classifier, based on the rules that MineTopkIRs discovers. The TopkIR classifier is composed of $k$ subclassifiers, denoted as $IR_1, \ldots, IR_k$. Each subclassifier $IR_j$ is built from the top-$j$ rules of every sample in the dataset. We call $IR_1$ the main classifier, and $IR_2, \ldots, IR_k$ are backup classifiers. We apply the subclassifiers in order until the test sample is successfully classified. Besides the main and backup classifiers, we set a default class, which is the majority class of the training data. If a test sample cannot be classified by any of the subclassifiers, we put it into the default class.
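The cascade of a main classifier, backup classifiers, and a default class can be sketched as follows. The sketch rests on several assumptions that are ours, not the paper's: a test sample is given as a total order of gene names, a rule matches when its antecedent is a subsequence of that order, and each rule carries a precomputed numeric score that stands in for the score function used by the authors.

```python
from collections import Counter


def is_subsequence(antecedent, gene_order):
    """True if the genes of antecedent appear in gene_order in the same order."""
    remaining = iter(gene_order)
    return all(gene in remaining for gene in antecedent)


def classify(gene_order, subclassifiers, default_class):
    """Try each subclassifier in order; fall back to the default class."""
    for rules in subclassifiers:
        matched = [(score, label) for antecedent, label, score in rules
                   if is_subsequence(antecedent, gene_order)]
        if matched:
            # The matched rule with the highest score decides the class.
            return max(matched)[1]
    return default_class


# Hypothetical rules as (antecedent, class label, score) triples.
ir_1 = [(("g4", "g3"), "cancer", 0.9)]
ir_2 = [(("g5", "g6"), "normal", 0.7)]
training_labels = ["cancer", "cancer", "normal"]
default_class = Counter(training_labels).most_common(1)[0][0]

print(classify(["g5", "g4", "g6", "g3"], [ir_1, ir_2], default_class))
print(classify(["g3", "g5", "g6", "g4"], [ir_1, ir_2], default_class))
print(classify(["g3", "g6", "g4", "g5"], [ir_1, ir_2], default_class))
```

The first sample is decided by the main classifier, the second falls through to the backup classifier, and the third matches no rule and therefore receives the majority class of the training data.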
When building each subclassifier, the score function in (7) [13] is adopted, which is defined over the rules of a class that match the test sample and over all the rules of that class. The class to which a test sample is assigned is decided by the matched rule of the highest score.
In the experiments, we adopt 10-fold cross validation to test the average classification accuracy of the TopkIR classifier and compare it with the NR [8], CBA, and IRG [14] classifiers. The results in Table 4 show that the TopkIR classifier performs much better than the CBA and IRG classifiers. Compared with CBA, which is built with the Top-1 covering irreducible contrast sequence rules, TopkIR classifies far fewer test data using the default class. The IRG classifier is built on association rules, which illustrates that sequence rules can reflect data characteristics better. The TopkIR classifier is more accurate than the NR classifier on most datasets, although it uses far fewer rules than NR to build the classifier. In our experiment, $k = 10$, whereas the rules used in the NR classifier usually number more than ten thousand [8]. Furthermore, we find that the average length (AL for short) of the sequence rules used in the TopkIR classifier is shorter than that of IRG. This result verifies that MineTopkIRs can provide high diagnostic accuracy using as few genes as possible, which is very valuable for biologists who follow up with biological or clinical validation of the selected genes [15].

Biological Significance. Different from the traditional methods, MineTopkIRs characterizes the pathogenesis of a disease from a sequence-like point of view, which incorporates the order among genes and can be seen as a disease-causing pathway. In this part, by showing some interesting results from the Leukemia dataset [1], we emphasize that MineTopkIRs finds not only the genes revealed by the traditional methods but also some genes ignored by them. Table 5 lists the top-10 genes occurring most frequently in the discovered TopkIRs for the diagnosis of "AML" samples and "ALL" samples, where the genes marked with " * " are also included in the benchmark that results from eight statistics-based gene ranking methods [16]. The two most frequent genes in Table 5 also appear in the benchmark. Gene TIMP2 is a member of the TIMP gene family, which encodes proteins that are natural inhibitors of the matrix metalloproteinases. Reference [17] reveals that the transcription of TIMP2 in SHI-1 cells of AML is higher than in other leukemic cells. Gene ZFP36 expression is upregulated in human T-lymphotropic virus 1 (HTLV-1) infected cells, and HTLV-1 is associated with adult T-cell leukemia/lymphoma [18]. In addition, the genes without " * " cannot be ignored, even though they are not in the benchmark. For example, the gene sequence ⟨ 2 5⟩, which includes the frequent gene CCT5 in Table 5, appears in most "ALL" samples but in fewer "AML" samples. However, none of its subsequences is able to distinguish the samples, which indicates that every gene in ⟨ 2 5⟩ is irreducible and that the sequence reflects the synergy between the genes well. Thus, these genes also have potentially important value for biologists to explain further.

Conclusion
In this paper, we study an important problem in bioinformatics, that is, discovering diagnostic gene patterns from gene expression data. Unlike previous work on this topic, we tackle the problem by exploiting the ordered expression trends of genes, which can better reflect the gene regulation pathway. In order to achieve more accurate diagnosis with as few rules as possible, we propose the concept of top-k covering irreducible contrast sequence rules for each sample of gene expression data. Further, an efficient method called MineTopkIRs is developed to find all TopkIRs. Considering the noise in real gene expression data, we first use an EWave model, which, unlike the current models, characterizes gene expression data from a sequence-like perspective. Then, we use MineTopkIRs to discover the bounded number of TopkIRs in one mining process, and these rules can be used directly to build a classifier. Extensive experiments conducted on both synthetic and real datasets show that MineTopkIRs is both effective and efficient. It may offer biologists a new point of view on diagnostic gene discovery.