Evolutionary Approach for Relative Gene Expression Algorithms

Relative Expression Analysis (RXA) uses ordering relationships in a small collection of genes and has been successfully applied to classification of microarray data. As checking all possible subsets of genes is computationally infeasible, RXA algorithms require feature selection and multiple restrictive assumptions. Our main contribution is a specialized evolutionary algorithm (EA) for top-scoring pairs, called EvoTSP, which allows finding more advanced gene relations. We unify the major variants of relative expression algorithms through the EA and introduce weights to the top-scoring pairs. Experimental validation of EvoTSP on publicly available microarray datasets showed that the proposed solution significantly outperforms other relative expression algorithms in terms of accuracy and allows exploring a much larger solution space.


Introduction
Extracting accurate and simple rules that exploit marker genes is crucial for understanding and identifying causal relationships between specific genes. Finding a meaningful and robust classification rule is a real challenge, especially when different studies of the same cancer consider diverse genes to be markers [1,2].
Relative Expression Analysis (RXA) was first proposed by Geman et al. in [3] and represents a simple yet powerful set of classifiers. It is based on the relative orderings among the expressions of a small number of genes. Instead of using expression values directly, only ranks of the expression data are used, making the algorithms insensitive to data normalization procedures. Moreover, using the ordering relationships for a small collection of genes has potential for identifying gene-gene interactions with plausible biological interpretation and direct clinical applicability [4]. A major and well-known drawback of RXA is its high computational complexity, which grows exponentially with the size of the collection of genes.
In this paper, we propose an Evolutionary Top-Scoring Pairs (EvoTSP) solution that combines the power of the evolutionary approach with the simplicity of relative expression algorithms. We unify different top-scoring extensions, limit their restrictions, and, by applying an EA, explore a larger solution space. We have also replaced the unweighted TSP voting by introducing a weight for each gene pair.
The rest of the paper is organized as follows. In the next section the relative expression algorithms are briefly recalled. Section 3 describes our motivation and Section 4 presents the EvoTSP solution in detail. Next, experimental validation on real-life microarray datasets is performed. The paper is concluded in the last section, where possible future work is also sketched.

Background
The first and most popular RXA solution is called Top-Scoring Pair (TSP) [3]. It is based on pairwise comparisons of gene expression values. Discrimination between two classes depends on finding the one pair of genes that achieves the highest ranking value, called the "score." Consider a gene expression microarray dataset consisting of $P$ genes and $M$ samples. Let the data be represented as a $P \times M$ matrix in which the expression value of the $i$th gene from the $n$th sample is denoted as $x_{in}$. Each row represents observations of a particular gene over the $M$ training samples, and each column represents a gene expression instance composed of $P$ genes. For simplicity of presentation, let us assume that there are only two classes, $C_1$ and $C_2$, and that instances with indexes from 1 to $M_1$ ($M_1 < M$) belong to the first class ($C_1$) and instances from the range $\langle M_1 + 1, M \rangle$ to the second class ($C_2$).
The TSP method focuses on gene pair matching $(i, j)$ ($i, j \in \{1, \ldots, P\}$, $i \neq j$) for which there is the highest difference in the probability of the event $x_{in} < x_{jn}$ ($n = 1, 2, \ldots, M$) between classes $C_1$ and $C_2$. For each pair of genes $(i, j)$ two probabilities are calculated, $p_{ij}(C_1)$ and $p_{ij}(C_2)$:
$$p_{ij}(C_1) = \frac{1}{|C_1|} \sum_{n=1}^{M_1} I(x_{in} < x_{jn}), \qquad p_{ij}(C_2) = \frac{1}{|C_2|} \sum_{n=M_1+1}^{M} I(x_{in} < x_{jn}),$$
where $|C_1|$ denotes the number of instances from class $C_1$ and $I(x_{in} < x_{jn})$ is the indicator function defined as
$$I(x_{in} < x_{jn}) = \begin{cases} 1 & \text{if } x_{in} < x_{jn}, \\ 0 & \text{otherwise.} \end{cases}$$
TSP is a rank-based method; therefore, for each pair of genes $(i, j)$ the "score" denoted $\Delta_{ij}$ is calculated as
$$\Delta_{ij} = \left| p_{ij}(C_1) - p_{ij}(C_2) \right|.$$
In the next step, the algorithm chooses the pair with the highest score. There should be only one top pair in the TSP method; however, it is possible that multiple gene pairs achieve the same top score. In that case a secondary ranking proposed in [5] is used to eliminate draws. It is based on the rank differences in classes and samples.
In the literature, the TSP solution is extended in several directions, each having its pros and cons. In one of the first extensions, called k-TSP [5], the number of top-scoring pairs included in the final prediction was increased. The classifier uses no more than $k$ top-scoring disjoint gene pairs that have the highest score. The parameter $k$ is determined by internal cross-validation, and a simple majority vote is used to make the final decision.
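As an illustration of the pairwise score described above, the following minimal Python sketch computes the two class-conditional probabilities and the score for every gene pair, then picks the best one by exhaustive search. The function names and the toy matrix are hypothetical and chosen only for this example; the secondary ranking used to break ties [5] is not implemented, and only pairs with i < j are scanned.

```python
from itertools import combinations

def pair_probability(x, i, j, samples):
    """Fraction of the given samples in which gene i is expressed below gene j."""
    return sum(1 for n in samples if x[i][n] < x[j][n]) / len(samples)

def tsp_score(x, i, j, class1, class2):
    """Score of the gene pair (i, j): |p_ij(C1) - p_ij(C2)|."""
    return abs(pair_probability(x, i, j, class1) - pair_probability(x, i, j, class2))

def top_scoring_pair(x, class1, class2):
    """Exhaustively scan all gene pairs and return (score, i, j) for the best one."""
    return max((tsp_score(x, i, j, class1, class2), i, j)
               for i, j in combinations(range(len(x)), 2))

# Toy data (hypothetical): 3 genes (rows) x 4 samples (columns).
# Samples 0-1 belong to the first class, samples 2-3 to the second.
x = [
    [1.0, 2.0, 5.0, 6.0],
    [3.0, 4.0, 1.0, 2.0],
    [2.5, 0.5, 2.5, 0.5],
]
best = top_scoring_pair(x, class1=[0, 1], class2=[2, 3])
print(best)
```

Here genes 0 and 1 reverse their ordering between the two classes in every sample, so that pair receives the maximal score of 1.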
A different approach to extending TSP is discussed in [4], where the authors, instead of using several pairs of genes, compare relationships among three genes. A three-gene version of RXA called Top-Scoring Triplet (TST) [4] was proposed as potentially more discriminating than TSP, since there are six possible orderings that must be analyzed. With the TST solution the authors successfully predict germline BRCA1 mutations in breast cancer. This method was later extended in [6], where the idea of pairwise or triplet rank comparisons was generalized. The top-scoring N (TSN) algorithm uses generic permutations and dynamically adjusts the size $N$ to control both the permutation and combination space available for classification. The variable $N$ denotes the size of the classifier; therefore, for $N = 2$ the TSN algorithm simply reduces to the TSP method, and for $N = 3$ TSN can be seen as TST. The classifier's size can be defined by the user or by internal cross-validation, which checks classification accuracy for different values of $N$ (on training data, in a range specified by the user) and selects the classifier with the highest score.
A hybrid solution combining k-TSP and a top-down induced decision tree is proposed in [7]. In each node of the decision tree, called TSPDT, a test analogous to the k-TSP method is searched. Then, the set of instances is divided according to the decision of the best pair (or pairs) of genes in the current node, and each derived subset goes to the corresponding branch. The process is recursively repeated for each branch until a leaf node is reached. This solution was recently extended by global induction of the decision tree, called GTSPDT [8]. Preliminary experiments showed that this hierarchical evolutionary method can also be a good alternative to traditional relative expression algorithms. Figure 1 illustrates the extensions of the relative expression algorithms. We can observe that EvoTSP unifies the two main extensions of the TSP solution: the use of multiple pairs of genes instead of one, and the comparison of relationships among more than two genes.
There exist other solutions in RXA, like Weight k-TSP [9], which focuses on the ratio of two genes in order to find more accurate top-scoring pairs. A different look at ranking the genes in microarray classification was also proposed in [10].
RXA can be used for feature selection in more complex classifiers [11][12][13] and as a protein expression classifier [14]. Multiple implementations of TSP-family solutions may be found as a package [15] or as a stand-alone application [16].

Motivation
The first drawback of RXA is the enormous computational requirement, as the complexity of the aforementioned algorithms is $O(k \cdot P^N)$, where $k$ is the number of top-scoring groups, $P$ is the number of features, and $N$ is the size of the group of genes whose ordering relationships are compared. In the literature, there are some attempts at improving TSP performance by parallelizing the algorithm on a graphics processing unit (GPU) [17]. Although the improvement is significant, the parameter $k$ and/or $N$ still must be small: the highest tested value of $N$ equals 4 with $k = 1$, and only when $P$ was significantly reduced by feature selection. This illustrates how computationally demanding RXA is.
Finding accurate values of the parameters $k$ and $N$ is the second problem. The TSP extensions define them ad hoc or by internal cross-validation. The first way is strongly dependent on the analyzed dataset, and the second one is extremely time consuming and decreases the size of the training dataset, which is usually very small in the case of microarray data. In addition, it is not clear which extension should be preferred: k-TSP or TSN. It should be noted that the k-TSP algorithm cannot be replaced by TST, as k-TSP is restricted to disjoint gene pairs. On the other hand, k-TST or k-TSN has not even been analyzed in the literature, probably due to its computational complexity.
In this paper, we would like to limit the aforementioned drawbacks of TSP extensions through the evolutionary approach. Our goal is to improve classification accuracy and the identification of marker gene interactions. We let the EA search for the best multiple pairwise comparisons of the gene expression values. The number of top-scoring pairs is also determined by the evolution and, with no restriction to disjoint gene pairs, EvoTSP may compare relationships among more than two genes, as in TSN. Applying an EA to RXA allows exploring a larger solution space within reasonable computation time.
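To give a feeling for this growth, the short sketch below counts how many distinct gene groups an exhaustive search would have to score for a single top-scoring group. The gene counts (1,000 and 20,000) are hypothetical round numbers chosen for illustration, not values taken from the tested datasets, and the different orderings within each group are not counted.

```python
from math import comb

def candidate_groups(num_genes, group_size):
    """Number of distinct gene groups of the given size (orderings not counted)."""
    return comb(num_genes, group_size)

# Hypothetical gene counts: a filtered gene set and a full microarray.
for num_genes in (1_000, 20_000):
    for group_size in (2, 3, 4):
        print(num_genes, group_size, candidate_groups(num_genes, group_size))
```

Already for pairs, moving from 1,000 to 20,000 genes raises the number of candidates from about half a million to about two hundred million, which is why exhaustive RXA methods rely on feature selection.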

Evolutionary Top-Scoring Pairs
In this section, we propose EvoTSP, an evolutionary algorithm for top-scoring pairs. Evolutionary algorithms [18] belong to a family of metaheuristic methods for solving a wide variety of difficult optimization problems. The general framework of an EA (see Figure 2) is inspired by biological mechanisms of evolution. The algorithm operates on a population of individuals, and each individual represents a candidate solution to the target problem. Individuals are assessed using a quality measure named the fitness function, which measures their performance, and those with higher fitness are usually selected for reproduction more often. Genetic operators such as mutation and crossover modify individuals, producing new offspring. This guided random search (offspring usually inherit some traits from their ancestors) is stopped when some convergence criteria are satisfied.
EvoTSP is able to represent the TST solution with 2 top-scoring pairs that involve only three genes. In an analogous way, TSN, k-TSP, or even variations of k-TSN can be represented in EvoTSP.
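The idea that pairs sharing a gene can express a relation among three genes can be sketched as follows. This is only an illustration under the assumption that a model is stored as a plain list of gene-index pairs; the names `relations`, `tst_like`, and `ktsp_like` are hypothetical.

```python
def relations(pairs, sample):
    """Evaluate each pair (i, j) as the boolean relation sample[i] < sample[j]."""
    return [sample[i] < sample[j] for i, j in pairs]

# Two non-disjoint pairs over genes 0, 1, 2: when both relations hold,
# the sample satisfies the three-gene ordering x0 < x1 < x2.
tst_like = [(0, 1), (1, 2)]

# Two disjoint pairs over four genes, as in a k-TSP model with k = 2.
ktsp_like = [(0, 1), (2, 3)]

sample = [1.0, 2.0, 3.0, 0.5]
print(relations(tst_like, sample))   # both relations over three genes
print(relations(ktsp_like, sample))  # relations over four distinct genes
```

Dropping the disjointness restriction is what lets one list-of-pairs representation cover both kinds of model.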

Representation and
In this paper, we also propose an additional parameter for each pair of genes that represents its weight. This way, some gene pairs have a higher influence than others on the final decision. This idea is completely new in TSP, as the aforementioned algorithms use simple majority voting in which each top-scoring pair's vote has the same weight. The use of unweighted voting in TSP and all its extensions was probably dictated by the necessity of limiting computational requirements. Figure 3 shows an example EvoTSP model, which includes a possible representation of the k-TSP and TST solutions. We could generate the initial population randomly to cover the entire range of possible solutions; however, due to the large solution space, we decided to speed up the evolutionary search and seed the initial population with good solutions (the default number of individuals in the population equals 100).
Each initial individual has a random number $k$ of gene pairs ($0 < k \leq 5$) created with the mixed dipole strategy [19] and constructed as follows. Among feature vectors located in the node two objects from different classes are randomly chosen. Next, an effective top-scoring pair is constructed with 2 randomly selected genes. By an effective top-scoring pair, we understand a pair of genes which separates two objects from different classes. In other words, genes $i$ and $j$ can constitute an effective top-scoring pair only if there are at least two instances $m$ and $n$ that are from different classes and one of the relations is satisfied: $x_{im} < x_{jm}$ and $x_{in} \geq x_{jn}$, or the opposite: $x_{im} \geq x_{jm}$ and $x_{in} < x_{jn}$. This operation is repeated until $k$ pairs are selected. All created gene pairs have equal weights (parameter $w_i = 1$, where $i = 1, \ldots, k$). With this strategy we are able to limit the number of initial individuals which select only one class.
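The effect of per-pair weights on the final decision can be sketched as below. This is a minimal illustration, not the exact decision rule of EvoTSP: it assumes that a pair votes for the first class when its first gene is expressed below its second gene and for the second class otherwise, and that ties go to the first class. The class `WeightedPair` and the function `predict` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class WeightedPair:
    i: int                # index of the first gene
    j: int                # index of the second gene
    weight: float = 1.0   # influence of this pair on the final vote

def predict(pairs, sample):
    """Weighted vote: a pair votes for class 1 if sample[i] < sample[j], else class 2."""
    votes = {1: 0.0, 2: 0.0}
    for p in pairs:
        votes[1 if sample[p.i] < sample[p.j] else 2] += p.weight
    # Ties are resolved in favour of class 1; this is an arbitrary choice.
    return 1 if votes[1] >= votes[2] else 2

# Three pairs over four genes; genes 0 and 3 are reused (non-disjoint pairs).
pairs = [WeightedPair(0, 1, 1.0), WeightedPair(2, 3, 1.0), WeightedPair(0, 3, 4.0)]
sample = [5.0, 3.0, 9.0, 7.0]

weighted = predict(pairs, sample)
unweighted = predict([WeightedPair(p.i, p.j) for p in pairs], sample)
print(weighted, unweighted)
```

In this toy sample two pairs vote for the second class and one for the first, so simple majority voting and weighted voting disagree once the third pair carries a weight of 4.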
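The seeding procedure can be sketched as follows. This is a simplified reading of the description above rather than a faithful implementation of the mixed dipole strategy [19]: the retry loop, the `max_tries` limit, and the tuple layout `(i, j, weight)` are assumptions made for the example, and all names are hypothetical.

```python
import random

def is_effective(x, i, j, m, n):
    """True if the relation x[i] < x[j] holds in exactly one of the samples m and n."""
    return (x[i][m] < x[j][m]) != (x[i][n] < x[j][n])

def random_effective_pair(x, labels, rng, max_tries=1000):
    """Draw two samples from different classes, then two genes that separate them."""
    class1 = [n for n, c in enumerate(labels) if c == 1]
    class2 = [n for n, c in enumerate(labels) if c == 2]
    for _ in range(max_tries):
        m, n = rng.choice(class1), rng.choice(class2)
        i, j = rng.sample(range(len(x)), 2)
        if is_effective(x, i, j, m, n):
            return (i, j, 1.0)  # every initial pair starts with weight 1
    raise RuntimeError("no effective pair found")

def initial_individual(x, labels, rng, max_pairs=5):
    """An initial individual is a list of 1..max_pairs effective pairs."""
    k = rng.randint(1, max_pairs)
    return [random_effective_pair(x, labels, rng) for _ in range(k)]

def initial_population(x, labels, rng, size=100):
    """Seed the population; 100 is the default population size stated in the text."""
    return [initial_individual(x, labels, rng) for _ in range(size)]
```

Here `x` is a genes-by-samples matrix stored as a list of rows and `labels` holds the class (1 or 2) of each sample. Because every seeded pair separates at least one pair of samples from different classes, no seeded pair gives the same answer on all training samples.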

Fitness Function.
The fitness function is one of the most important and sensitive elements in the design of an evolutionary algorithm. It drives the evolutionary search by measuring how well a single individual meets the problem objective. Direct minimization of the prediction error measured on the learning set usually results in overfitting and leads to spurious results. In the case of EvoTSP, we need to balance the classification error against the number of genes that build the classifier. We have applied an idea similar to the cost complexity pruning used in the CART system [20]. The fitness function is maximized; it combines Reclass, the reclassification quality on the training set, with a complexity term that depends on the number of gene pairs and on the number of unique genes in the top-scoring pairs used to build the classifier. The relative importance of the complexity term is a parameter specified by the user (default value is 0.005). The penalty associated with classifier complexity increases proportionally with the number of genes that constitute the top pairs. To reduce overfitting and to encourage searching for relations among more than two genes, unique genes are doubly penalized. It should be noted that no single value of this parameter is optimal for all datasets, and tuning it may improve classifier results for a specific problem. Further research on setting this parameter automatically for particular training data is planned.
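A fitness of this kind might look like the sketch below. The exact expression is not reproduced here, so the linear penalty used in the code (number of pairs plus twice the number of unique genes) is purely an assumption chosen to match the verbal description; only the default value 0.005 of the complexity weight comes from the text. The names `fitness` and `alpha` are hypothetical.

```python
def fitness(reclass, pairs, alpha=0.005):
    """Assumed form: training accuracy minus alpha * (pairs + 2 * unique genes).

    reclass: reclassification quality on the training set, in [0, 1]
    pairs:   list of (i, j, weight) tuples forming the classifier
    alpha:   relative importance of the complexity term (default from the text)
    """
    unique_genes = {g for i, j, _w in pairs for g in (i, j)}
    # ASSUMPTION: this linear combination is illustrative, not the published formula.
    complexity = len(pairs) + 2 * len(unique_genes)
    return reclass - alpha * complexity

disjoint = [(0, 1, 1.0), (2, 3, 1.0)]  # four unique genes
shared = [(0, 1, 1.0), (1, 2, 1.0)]    # three unique genes, gene 1 reused
print(fitness(0.9, disjoint), fitness(0.9, shared))
```

Under this assumed penalty, two models with the same training accuracy and the same number of pairs are ranked so that the one reusing a gene scores higher, which is the behaviour the text describes as encouraging relations among more than two genes.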

Genetic Operators.
To maintain genetic diversity, two specialized genetic operators corresponding to the classical cross-over and mutation are applied. Each evolutionary iteration starts with selecting the individuals from the population that will be affected by the genetic operators. The probability of applying the cross-over operator equals 0.5 for each individual. The mutation operator can be applied with the same probability. Next, one of the variants of the chosen genetic operator is selected.
We propose two variants of recombination: (i) a randomly chosen pair of genes is exchanged between two affected individuals; this variant is applied with probability 0.9; (ii) a randomly chosen pair from the best individual found so far replaces a random pair in the affected individual; in this variant only one individual is modified, and it is applied with probability 0.1.
If the mutation operator is chosen, one of the following variants, each with equal probability of being drawn, is applied to the individual: (i) add a new pair of genes created with the mixed dipole strategy; (ii) remove a randomly chosen pair; (iii) replace a randomly chosen pair by a new one created with the mixed dipole strategy; (iv) exchange one feature of a randomly chosen pair; (v) increase/decrease the weight of a randomly chosen pair (by multiplying or dividing it by 2); (vi) switch the relation sign of a randomly chosen pair.
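The six mutation variants can be sketched as one function. Several details are assumptions made for the example: an individual is a list of `[i, j, weight]` entries with the relation "gene i below gene j" implied, switching the relation sign is modelled by swapping the two genes, removal never empties the model, and `new_pair` is a hypothetical callable standing in for the mixed dipole construction.

```python
import random

def mutate(individual, num_genes, new_pair, rng):
    """Apply one of six mutation variants, drawn uniformly, to a copy of the individual."""
    child = [list(p) for p in individual]
    variant = rng.randrange(6)
    if variant == 0:                      # (i) add a new pair
        child.append(list(new_pair()))
    elif variant == 1:                    # (ii) remove a randomly chosen pair
        if len(child) > 1:                # assumption: keep at least one pair
            child.pop(rng.randrange(len(child)))
    else:
        p = rng.choice(child)
        if variant == 2:                  # (iii) replace a pair by a new one
            p[:] = new_pair()
        elif variant == 3:                # (iv) exchange one gene of the pair
            pos = rng.randrange(2)
            candidates = [g for g in range(num_genes) if g not in (p[0], p[1])]
            p[pos] = rng.choice(candidates)
        elif variant == 4:                # (v) double or halve the weight
            p[2] = p[2] * 2 if rng.random() < 0.5 else p[2] / 2
        else:                             # (vi) switch the relation sign
            p[0], p[1] = p[1], p[0]       # "i below j" becomes "j below i"
    return child

rng = random.Random(1)
parent = [[0, 1, 1.0], [2, 3, 1.0]]
child = mutate(parent, num_genes=10, new_pair=lambda: [4, 5, 1.0], rng=rng)
print(child)
```

Working on a copy keeps the parent intact, which matters when the same individual is selected more than once in a generation.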

Selection and Termination Condition.
Ranking linear selection [18] is applied as the selection mechanism. In each iteration, the single individual with the highest value of the fitness function in the current population is copied to the next one (elitist strategy). In addition, this strategy is partially boosted by possible cross-over of individuals from the current population with the best individual found so far. Evolution terminates when the fitness of the best individual in the population does not improve during a fixed number of generations (default value: 1000). In case of slow convergence, a maximum number of generations is also specified (default value: 10000), which allows us to limit the computation time.
Table 1: Details of tested gene expression datasets.
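The selection scheme and the two stopping rules can be sketched as below. This is a simplified illustration: selection probability is taken to be directly proportional to the fitness rank (one particular form of linear ranking), and `vary` is a hypothetical callable standing in for the genetic operators, receiving the selected parent and the best individual found so far. Only the defaults of 1000 and 10000 generations come from the text.

```python
import random

def linear_ranking_select(population, scores, rng):
    """Pick one individual with probability proportional to its fitness rank."""
    order = sorted(range(len(population)), key=lambda idx: scores[idx])
    ranks = list(range(1, len(population) + 1))   # worst gets rank 1, best rank n
    return population[rng.choices(order, weights=ranks, k=1)[0]]

def evolve(population, fitness_fn, vary, rng, patience=1000, max_generations=10000):
    """Elitist loop that stops after `patience` generations without improvement."""
    best = max(population, key=fitness_fn)
    stale = 0
    for _ in range(max_generations):
        if stale >= patience:
            break
        scores = [fitness_fn(ind) for ind in population]
        elite = population[scores.index(max(scores))]   # copied unchanged
        offspring = [vary(linear_ranking_select(population, scores, rng), best, rng)
                     for _ in range(len(population) - 1)]
        population = [elite] + offspring
        champion = max(population, key=fitness_fn)
        if fitness_fn(champion) > fitness_fn(best):
            best, stale = champion, 0
        else:
            stale += 1
    return best
```

Any representation can be plugged in, since the loop only relies on `fitness_fn` and `vary`; for a quick check one can evolve integers toward a target value with a `vary` function that adds -1, 0, or 1.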

Results and Discussions
In this section, all performed experiments are presented. First, we give some details about the datasets and the settings of the tested algorithms. Next, we validate and discuss the overall performance of the EvoTSP solution and its competitors with respect to classification accuracy and model size.

Datasets and Setup.
The performance of the classifiers was investigated on several publicly available microarray datasets deposited in NCBI's Gene Expression Omnibus [21] and summarized in Table 1. All datasets are binary classification problems and mainly refer to studies of human cancer. As the data was not predivided, we used typical 10-fold cross-validation, as it was the only option in the AUREA software [16]. We compare EvoTSP with three competitors: the primary solution TSP and its two main extensions, k-TSP and TST (TSN with $N = 3$). To obtain comparison results, we used the AUREA software, which is an open-source system for identification of relative expression molecular signatures [16]. Classification was performed with default parameters for all algorithms on all datasets, and to ensure stable results the average score of 20 runs is shown. A statistical analysis of all obtained results was performed with the Friedman test and the corresponding Dunn's multiple comparison test (significance level equal to 0.05), as recommended by Demšar [22].
The AUREA software sets the maximum number of top-scoring pairs (parameter $k$) for k-TSP to 10 by default. In addition, all algorithms except EvoTSP operate on a subset of genes selected for analysis based on the differential expression of the presented gene set (the Wilcoxon signed-rank test was used to choose the most differentially expressed genes between the defined classes). The authors of [16] state that this feature selection step has a dramatic effect on the computational complexity of the algorithms and that, by limiting the set of genes, the problem of overfitting can be mitigated. In the case of EvoTSP we have decided not to use any feature selection and to allow searching for relations among all high- and low-ranked genes. Table 2 summarizes the classification performance of the proposed solution EvoTSP and its competitors: TSP, TST, and k-TSP. The model size of TSP and TST is not shown, as it is fixed and equals 2 and 3, respectively. We had to use an approximation of the k-TSP size, as the AUREA software did not allow checking the value of $k$ during cross-validation; therefore, the value of $k$ obtained on the full dataset treated as a training set is presented.

Comparison of Top-Scoring Family Algorithms.
Results show that, in general, the existing extensions, TST and k-TSP, outperform TSP in terms of accuracy. The price for better performance is the higher complexity of the classification model, which for k-TSP is 5.75 times larger (an average value from 8 datasets) than the TSP size and almost 4 times larger than the TST size. A slightly larger classification model is not a problem, as all tested algorithms are simple to analyze; however, checking several different genes per model may be considered difficult in biological interpretation, which is the case for k-TSP.
In the last two columns of Table 2 we present the results of the proposed solution. We can observe that its accuracy is the highest in 6 out of 8 datasets. However, for the last two datasets the EvoTSP accuracy is slightly lower than that of k-TSP. Additional experiments showed that the convergence of the EA in EvoTSP is too slow for those particular datasets. When the maximum number of generations in the EA was increased, the proposed algorithm achieved similar accuracy to k-TSP or even outperformed it on both datasets.
According to the Friedman test, there is a statistically significant difference (p-value of 0.0003) in the accuracy of all versions. Based on Dunn's multiple comparison test, there is a statistically significant difference in classification quality between EvoTSP and both the TSP and TST algorithms. Although there were no statistical differences in accuracy between EvoTSP and k-TSP, there is one in the size of their models. The size of the classification model of the proposed solution remains small, in contrast to k-TSP, making EvoTSP a good tool for identifying gene-gene interactions with direct clinical applicability. In Table 2 we can also observe that the standard deviation of accuracy was at a similar level for all solutions.
The total time to build an EvoTSP model varies between 1 and 8 minutes on a typical PC (Intel Core i5, 4 GB RAM), depending on the dataset, and is a few times longer than for the AUREA software, which took from tens of seconds to a minute. However, it should be noted that EvoTSP works without any feature selection, which is a must for the AUREA software (checking all combinations of pairs would take many orders of magnitude longer).

Conclusion
In this paper, we propose the EvoTSP system for solving classification problems using microarray data. Our approach is a hybrid solution that combines the power of EAs and relative expression algorithms. We have designed several variants of specialized operators to mutate and cross over individuals and a fitness function that helps mitigate the overfitting problem. With the new weighted voting of gene pairs and an extended representation of top-scoring pairs that covers different variants of TSP, we were able to significantly improve on TSP accuracy while keeping the classification model relatively small. Applying EAs allows exploring a much larger solution space and searching for different, more complex relations between genes.
In this paper we focus only on the general concept of EvoTSP as an effective tool; therefore, we do not include any biological aspects of the rules generated by the proposed system or case studies on particular datasets. Further improvement is still required, especially in terms of fitness functions that handle cost-sensitive and multiclass problems. Speeding up the convergence of the EA is also desirable and can be achieved by applying local optimizations (memetic algorithms), new specialized operators, and self-adaptive parameters. Finally, more work on preprocessing datasets, gene selection, and using additional problem-specific knowledge is also required to improve EvoTSP classification accuracy and rule discovery.