1. Introduction

COMPLEXITY

Complexity

1099-0526 1076-2787

Hindawi

10.1155/2017/5024867

5024867

Research Article

FAACOSE: A Fast Adaptive Ant Colony Optimization Algorithm for Detecting SNP Epistasis

http://orcid.org/0000-0002-9694-8191

Yuan

Lin

¹ Yuan

Chang-An

http://orcid.org/0000-0002-6759-2691

Huang

De-Shuang

¹ Wang

Jianxin

Institute of Machine Learning and Systems Biology

School of Electronics and Information Engineering

Tongji University

Caoan Road 4800

Shanghai 201804

China

tongji.edu.cn

Science Computing and Intelligent Information Processing of Guang Xi Higher Education Key Laboratory

Guangxi Teachers Education University

Nanning

Guangxi 530001

China

gxtc.edu.cn

2017

792017

2017 31 03 2017 24 07 2017 792017

2017

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The epistasis is prevalent in the SNP interactions. Some of the existing methods are focused on constructing models for two SNPs. Other methods only find the SNPs in consideration of one-objective function. In this paper, we present a unified fast framework integrating adaptive ant colony optimization algorithm with multiobjective functions for detecting SNP epistasis in GWAS datasets. We compared our method with other existing methods using synthetic datasets and applied the proposed method to Late-Onset Alzheimer’s Disease dataset. Our experimental results show that the proposed method outperforms other methods in epistasis detection, and the result of real dataset contributes to the research of mechanism underlying the disease.

National Natural Science Foundation of China

61520106006

31571364

61732012

61532008

U1611265

61672382

61402334

61472280

61472173

61572447

61672203

61472282

61373098

China Postdoctoral Science Foundation

2014M561513

2015M580352

2017M611619

2016M601646

Guangxi Bagui Scholars Program Special Fund

1. Introduction

Accompanied by the rapid development of genomics and gene chip technology, Genome-Wide Association Studies (GWAS) predicted massive genetic variations related to complex traits [1, 2]. Although this method has achieved great success. It can only explain a small part of the mechanism under the complex diseases known as “missing heritability” [3]. That is to say, marginal genetic effects of GWAS identified single nucleotide polymorphisms (SNPs) account for small part of pathogenic causes. For single-locus SNPs related disease [4], GWAS can identify SNPs that are responsible for disease trait. However, complex diseases are often due to the small and complex effects of large SNPs, such as type 2 diabetes [5], prostate cancer, and rheumatoid arthritis (RA) [6]. More and more studies have shown that epistasis exists in SNPs interaction. Many SNPs will interact with each other in the process of affecting the disease traits [7]. Some SNPs will affect the disease and dominate the effect of others. The relationship of one SNP repressing the effect of another SNP is known as epistasis. In many complex human diseases, the effect of epistasis among complex human diseases is unclear. The proposed methods for SNP related disease may have poor performance due to failure to identify epistasis.

During the past decade, a lot of approaches have been proposed to detect epistasis. Some methods focus on the interaction between two certain SNPs. Zhang et al. [8] proposed a Bayesian partition method for epistatic eQTL modules. Kang et al. [9] proposed four different models to measure epistasis effect between two loci and suggest a statistical strategy to infer the hierarchical relationships. Recently, Lin et al. [10] reported forty-five SNP-SNP interaction models by considering the inheritance modes and model structures. Though these methods have been successful in studying epistasis between two SNPs. The GWAS data is high dimension data which contains hundreds of thousands or even million SNPs; at the same time, GWAS data only contains dozens or hundreds of individual sample data, for example, the small number sample data and the high dimension features; it needs vast amounts of time to identify the interaction between each pair of SNPs [11–13]. The computational burden is out of bounds.

More and more machine learning methods are applied to research epistasis. Many methods were proposed to model epistasis effect from the perspective of the overall data. Moore et al. [14] applied regression method to identify the relationship between gene expression and epistasis effect. Michael et al. [15] applied Bayesian networks to identify the epistasis effect network from the original SNPs data. Although these methods solved some problems, they still did not show significant effects with the large scale Genome-Wide Association Study datasets owing to the same “high-dimensional small sample size problem.” With the rapid development of multiobjective optimization method and machine learning discipline, ant colony optimization (ACO) algorithm was applied to epistasis research. Wang et al. [16] proposed AntEpiSeeker; AntEpiSeeker combines heuristic search with the ant colony optimization to identify SNPs which dominate other SNPs. Experimental results on real rheumatoid arthritis dataset show that AntEpiSeeker is better than other methods. The drawback of this method is that other methods show different performance on different disease models. Zhang and Liu [17] developed the Bayesian inference method which identifies the epistatic interactions in case-control studies. However, the BEAM method needs a lot of time in GWAS dataset. In this paper we extend SNP epistasis study to a fast adaptive ant colony optimization algorithm for detecting SNP epistasis. We search SNP epistasis with two-objective functions and fast adaptive ant colony optimization.

The experiments on several simulated datasets show the good performance of our method. We also compare our method with the benchmark methods, including BEAM, generic ACO, and AntEpiSeeker. Experimental results show that our method has better performance in GWAS datasets containing epistasis effect among SNPs.

2. Methods 2.1. Ant Colony Optimization

In the research of artificial intelligence and large scale problem solving, the ant colony optimization (ACO) algorithm is inspired by the ants food search behaviour in nature. Assume that the food search paths constitute a graph; the ant colony optimization algorithm can reduce time of search paths through graphs [18]. This algorithm with other ant colony optimization algorithms is kind of swarm intelligence methods, and it is member of metaheuristic optimizations. Marco Dorigo proposed the ant colony optimization algorithm in 1992 in his Ph.D. thesis. In the GWAS datasets, the datasets often contain tens of hundreds to millions of SNPs. It is not feasible to identify the relationship of every pair of SNPs within an acceptable time. ACO algorithm was used here to reduce the complexity of exhaustive search. In kingdom of insects, in the process of finding food, ants look like they are walking randomly, and in the back and forth path of searching for food, the ants will leave pheromones on the path. If the path is found by other ants, other ants tend to follow the path but not walk randomly; going further, if they find food through this path, they will also leave pheromones; the pheromone value on this path is enhanced. Subject to other factors in nature, pheromone value starts to evaporate and the path’s attractive strength starts to decrease. The longer the path is, the more the time the ants are looking for food. As a comparison, the time the ants take to walk through the short path is greatly shortened, and pheromone values will be larger on shorter paths than longer paths. Pheromone evaporation results in dynamic changes in the path. Path dynamic changes can avoid the convergence of solutions to a locally optimal solution. If there is no pheromone values evaporation, the food search path selected by first ants would tend to be the only path or the most attractive path. This phenomenon will lead to limitation of the solution space. The mechanism of pheromone evaporation in ant colony is unclear, but pheromone evaporation is a very important application in artificial intelligence systems. Though the ant colony optimization algorithm has achieved great success in application [19–21].

The travelling salesperson problem (TSP) is a problem with some cities and physical distances between each pair of cities. The question is what is the shortest possible path where travelling salesperson visits each city once and finally returns to the origin city? Suppose there are n cities; there are n-1!/2 solutions to the problem. The feasible solutions will increase exponentially when the number of city increases, making the computation impractical. Obviously, it is an NP-hard problem of combinatorial optimizations.

Suppose that m ants are randomly placed in n cities, the kth ant in the ith city; the probability if ant chooses the next city j is(1)pijk=τijαtηijβt∑o∈candidatekτioαtηioβt,j∈candidatek,0,otherwise,ηijt=1dij,where τijt indicates the surplus information on path ij in moment t. ηijt indicates the heuristic function. dij indicates the physical distance between city i and city j. tabuk indicates the cities set which indicates ant k has visited. candidatek indicates the set of cities which ant k can visit next.

Over time, after n moments, the ants complete a cycle; the information of each path should be adjusted according to (2)τijt+n=1-ρτijt+Δτij,Δτij=∑k=1mΔτijk,where Δτij indicates information increment of path ij after this cycle. (3)Δτij=Qlk,ij∈Lk0,otherwise,where Lk indicates ant k’s paths in this cycle. lk indicates the path length of ant k in this cycle. The parameters needed to be determined are α,β,ρ,m,Q; the number of ants is less than or equal to city number; Q is a large suitable number. ACO is always used in large scale data problems. However, slowness is still a bottleneck in the application of the ant colony algorithm for large scale search optimization problems. Pheromone update strategy is one of the keys to determine the convergence rate.

In the process of applying ant colony optimization to specific problems, the search space should be as large as possible. At the same time, ACO should consider time efficiency. ACO should balance the optimal solutions and solve speed. On the basis of previous studies [22–24]. We only consider pheromone evaporation factor ρ and pheromone importance factor α. In (2), ρ is used to balance the effects of old pheromone value and current pheromone value. When ρ is too small, the residual pheromone value is too much and leads to local minimum solution. We adopt adaptive ρ, when the algorithm does not improve the current optimal solution within n iterations.(4)ρt+n=gρt,ρt≤ρmaxρmax,otherwise,where ρmax equals 0.85 in practice. g equals 1.02 as tune parameter. When the pheromone value reaches the critical value, the pheromone importance factor begins to play a role. With the increase of pheromone importance factor α, the algorithm will jump out of the local optimal solution and has ability to search for global optimal solution.(5)αt+n=g1αt,αt≤αmaxαmax,otherwise,where g1 is a constant larger than one and αmax is less than or equal to five. In the process of calculation, first, we follow the standard ant colony optimization algorithm for N iterations. N is predefined number. If the current optimal solution is not improved after N iterations, update the parameters according to formulas (4) and (5). Then update all pheromone value according to (2).

Given pheromone values and transfer rules, we can use the ant colony optimization algorithm to find a group of SNPs which affect the disease. Assume there are P SNPs in the global Genome-Wide Association Studies dataset, we can construct a p-dimensional symmetric matrix M to store every ant’s pheromone value. The element mij of matrix M denotes the interaction which is related to disease between ith SNP and jth SNP. At the beginning of our method, every element of matrix M is assigned to a constant value m0; equivalent value shows the epistasis in every pair of SNPs and there is equal possibility relationship between the SNPs and disease.

At the final pheromone iteration, the ACO algorithm will obtain the optimal solutions through forward selection strategy. The advantage of ACO algorithm in this paper is that the result contains nondominated solutions which have the potentially equivalent possibility and potentially highest related strength with disease and omit dominated solutions.

The disadvantages of traditional ant colony optimization algorithm are long search time and tendency to fall into the local optimal solution. The drawback of this working mode is that the current pheromone evaporation factor and pheromone importance factor are predefined. As an improved strategy, we extended the “dynamic adaptive strategy” to ant colony optimization. The advantage of this strategy is the fast convergence rate and searching for global optimization solution. Compared with traditional ACO, the new strategy can provide more accurate result.

2.2. Two-Objective Function Optimization

The results of ant colony optimization need to be evaluated. We combine two-objective methods to assess the final epistasis results. In general, one of two-objective functions combines Akaike Information Criterion (AIC) score and logistic regression function to measure relationship between phenotypic trait and genotype data; Akaike Information Criterion indicates the effectiveness and complexity of the model [25, 26]. In our method, on the basis of the standard logistic regression, following the North et al. [27] strategy, we use ADDINT logistic regression model to search the relationship between disease and SNP nodes. The second objective function uses frequency measurement based on mutual information theory to model the relationship between genotype data and phenotypic trait from the perspective of information theory. The second objective function used to represent the selected SNP subsets can explain how much information is about the disease trait. Our proposed method obtains information from data rather than a lot of priori information. The above two-objective functions are designed from the different perspective to measure the quality of the search results, and the simulation data experiment results show that our two-objective functions have a better performance than other methods on simulated and real biological datasets.

In order to avoid the bad impact of high dimension small size sample problem, the identification of disease-associated SNPs is known as a heuristic optimization problem. In our proposed method, proposed method yields optimal solutions which is nondominated solutions; the proposed two-objective functions method actually is kind of multiobjective optimization; the proposed method uses ant colony optimization to search for optimal solution [28].

Our proposed fast adaptive ACO framework contains two stages. In the first stage, we use modified ACO optimization algorithm with two-objective functions to search for nondominated SNP subset. After generating the nondominated SNP subset, we apply Fisher exact test [29, 30] to the dataset containing nondominated SNP generated in the algorithm first stage. The Fisher exact test will be used to identify the relationship between disease and SNPs.

2.2.1. AIC Score

The Akaike Information Criterion (AIC) is used to measure quality of dataset statistical models. AIC is from information theory, and it estimates loss of information when a statistical model is used to express the data generation process. The mechanism of Akaike Information Criterion is that it deals with the trade-off between the goodness of fit of the model and the complexity of the model. Based on the nature of the AIC, we construct AIC model from the perspective of GWAS dataset. The goal of our method is to measure the relationship between the genotype data of genome and phenotype disease trait. Logistic regression is widely used to quantitatively analyze the correlation between dependent variable and independent variable. Based on above methods, we construct AIC score model containing logistic regression and gradient penalty function. Logistic regression can compute the maximized log-likelihood of the model; k is used to express the number of free parameters. AIC score deals with the trade-off between the fitness effect of the model and the complexity of the model. We follow Jing and Shen [28] strategy:(6)AICscore=2k-2log⁡lik,where k denotes the number of free parameters.

2.2.2. Explanation Score

In GWAS research, the relationship between two loci and disease, in SNP research, each locus has three values, 0, 1, and 2; 0 means major allele homozygous, 1 means heterozygote, and 2 means minor allele homozygous [31]. For two loci, there are nine cases of their combination; the disease related SNP locus often changes when the disease occurs. In the case of double locus combination, xi means the number of ith combinations of two SNP loci, Y means case or control state, y1 means state case, and y2 means state control. The potential interrelationships of two discrete random variables X and Y are defined as H(X;Y); the relationship between locus combination and disease is measured based on the information of locus frequency. H(X;Y) is described as below:(7)HX;Y=∑i=1Ixiy1-xiy2,where I means the total number of locus combinations. To avoid unbalanced sample, the size affects score. For example, if data size of case is larger than control, we extract the same size of control data from case samples randomly. To avoid the impact of randomness, we extract sample several times and average the results. The large value H means the potential association probability between disease and SNPs is large. Equation can also be applied to more than two locus combinations. We name this score explain score.

2.3. Pareto Optimality for SNP Epistasis Detection

Pareto optimality defines such a situation. Pareto optimality is proposed to solve the following questions where it is impossible to make all objective function values of multiobjective optimization optimal values [32, 33]. Pareto optimality is first applied to the area of income distribution and economy. Now Pareto optimality has been extended to engineering and multiobjective optimization research. On the basis of previous proposed methods, the modified ant colony optimization algorithm with first objective function and second objective function, the first objective function is AIC score with logistic regression and related parameters; the second objective function is explain score. For the first objective functions, the lower score of the objective function indicates the high potential relationship between disease phenotype trait and SNPs [34]. For the second objective functions, the higher score of the objective function indicates the high potential relationship between the disease phenotype trait and SNPs. The target of fast two-stage ant colony optimization algorithm is to find the epistasis effect among SNPs and extract real SNP subset with respect to the above proposed methods.

In the real GWAS datasets, an identified SNP subset may perform the best compared with other method solutions in terms of one-objective function, but SNP subset may perform poorly in terms of another objective function. Thus, the target is how to select better SNP subset with respect to both objective functions. In practical application, rare subset performs better than other solutions while satisfying both conditions. Thus, for a framework with two-objective functions, it is hard and even impossible to calculate the global optimal solution. On the basis of previous studies [28, 34, 35], we adopt Pareto optimality to find the practical optimal solution. We first compare the two solutions, in terms of GWAS SNP subset, a solution named S1, and another solution named S2; comparing S1 and S2 only have two consequences; one result is one solution dominates the other; another result is S1 does not dominate S2; in turn, the solution S2 does not dominate S1. Based on the mind of Pareto optimality, we consider S1 dominates S2 if they satisfy the following two conditions. The first condition is the value of fe(S1) is not higher than fe(S2) for those two-objective functions. The second condition is the objective function fe(S1) is lower than fe(S2) for at least one-objective function. The function fe denotes the objective function: modified AIC score objective function and explain score objective function. The e equal to one denotes the first objective function; the e equal to two denotes the second objective function. If solutions S1 and S2 satisfy the above two conditions, we say solution S1 is a nondominated solution; in turn, we say solution S2 is a dominated solution. Based on above Pareto optimality approach and two-objective functions, all solutions can be divided into two kinds; one is nondominated set and another is dominated set. Finally, nondominated sets contain many solutions and all the solutions from our proposed method with respect to two-objective functions; now our goal is to find a nondominated set which is the best under certain conditions.

Next, we will use the judgment rule mentioned earlier to sort the solutions of nondominated sets to find the optimal nondominated set. Specifically, in the first case, f1(S2) is larger than f1(S1); at the same time, fe(S2) is larger than f2(S1). In the second case, f1(S2) equals f1(S1); at the same time, f2(S2) is larger than f2(S1). In the third case, f1(S2) is larger than f1(S1); at the same time, f2(S2) equals f2(S1).

2.4. Fisher Exact Test for Experimental Results

Fisher exact test is used in contingency tables to get a statistical significance [36–38]. Although in practice it is used in small size sample, it is can also be used in all sample sizes. Ronald Fisher first proposed this method and Fisher exact test is one kind of exact tests.

In terms of our GWAS datasets research article, on the basis of unified framework which contains fast adaptive ant colony optimization (ACO) algorithm, Akaike Information Criterion (AIC) score, explain score, and Pareto optimality, we can obtain the final result which is a nondominated SNP set; in this section, we will use Fisher exact test to exhaustively search for the epistasis effect. Fisher exact test is based on hypergeometric distribution; the P value in the Fisher exact test is accurate for all individual samples. Fisher exact test is used on the basis of contingency table. The null hypothesis is that the identified SNP subset and disease are not associated. The alternative hypothesis is that SNP subset affects the expression of the disease when the Fisher exact test’s P value is significant, when P value is less than predetermined value such as 0.05 or smaller value. Our proposed method will identify significance SNP subsets.

2.5. Power Test

In previous section, we introduce each part of our proposed fast adaptive ant colony optimization algorithm for detecting SNP epistasis. Our proposed unified framework contains fast adaptive ant colony optimization algorithm, Akaike Information Criterion (AIC) score, explain score, Pareto optimality, and modified Fisher exact test. In this section, we introduce how to verify the significance of the results. We construct 100 datasets according to the same parameters. Then we use the traditional power test to measure the effect of methods. The power test is defined as follows:(8)Power=SD100,where SD denotes the number of disease related datasets which were correctly selected from 100 datasets. Only using the single test criterion may not clearly show the quality of results. We use precision recall standard to measure true positive rate and false positive rate. Precision recall criteria have been widely used in classification model evaluation model [39, 40]. In pattern recognition and information retrieval with binary classification, precision, also called positive predictive value, is the fraction of retrieved instances that are relevant; while recall, also known as sensitivity, is the fraction of relevant instances that are retrieved [26]. Both precision and recall are therefore based on an understanding and measure of relevance. We use precision recall criteria to determine whether the classification results are good or bad. The precision recall criteria can avoid the imbalance problem of precision recall numbers. In our research, the number of precision and recall always differs greatly. In terms of the SNP epistasis research, precision is also known as positive predictive value, equivalent to the true disease related SNP subsets; recall is also known as sensitivity or negative, equivalent to the true disease unrelated SNP subsets. If we use only one judgment criterion, thus false positive rate, single indicator cannot make the real result clear. We use false positive rate and true positive rate to measure the real result. This is why we use precision and recall. We also use F1 score (also F score or F measure) to measure the precision recall test accuracy. The precision and recall will be introduced next with confusion matrix (Figure 1).(9)recall=TPTP+FN,precision=TPTP+FP,F1=precision·recallprecision+recall.

Figure 1

Precision recall explanation matrix.

The precision, also known as specificity, denotes true positive number ratio in the result through the number of true positives divided by the sum of true positive number and false positive number; precision is often used to report false positive rate of an algorithm’s false positive rate. The recall, also known as sensitivity, denotes true positive ration in the sum of true positives and false negative. In terms of SNPs selection problem, the larger the recall value is, the larger the number of real true disease-related SNP combinations can be found. Simultaneously, the larger the precision value, the larger the number of real true disease-related SNP combinations account for a high proportion of the identified SNP combinations. The criterion F measure is the harmonic mean of precision and recall, which is a synthesized measure combining both precision and recall [41].

3. Simulation Experiments 3.1. Compared with One-Objective Function

In this section, we use simulation data to compare our proposed method with other existing methods. In order to avoid data favor caused by the model, we adopt BEAM package to generate simulation datasets [17]. Data was simulated following three genetic models: (1) additive model, (2) epistatic interactions with multiplicative effects, and (3) epistatic interactions with threshold effects. In order to introduce our experiments, the additive model is referred to as ADDME. The model about epistatic interactions with multiplicative effects is referred to as EIME. The epistatic interactions with threshold effects are referred to as EITEME. In the next section, we will use the short name to indicate the corresponding data model.

Because our method is two-objective-based SNP epistasis search method, first, we compared our proposed method with existing single objective-based exhaustive SNP epistasis search method to demonstrate the effectiveness of two-objective function SNP epistasis subset search method. Second, we compare our proposed method with recently proposed method BEAM [17], generic ACO algorithm, and AntEpiSeeker [16]. In the one-objective function SNP epistasis search method, the objective function is used to score every SNP combinations; in general, the score for every SNP combination is not the same. Based on the nature of the method, low score indicates the association between SNP combination and disease is relatively small; high score indicates the association between SNP combination and disease is relatively large. Then the one-objective function ranks all SNP combinations based on the scores. However, the two-objective-based SNP epistasis search method is to find a set of nondominated results, and every nondominated SNP epistasis results’ score is the same. To ensure fairness, for the one-objective function, we collect the same number as two-objective-based SNP epistasis search method from the top of one-objective-based SNP rank. The comparing results show that the two-objective-based SNP epistasis search method is better than one-objective-based SNP epistasis search method in three simulation data models. In terms of two single objective-based SNP epistasis search methods, the results of one-objective-based SNP epistasis search methods are similar with the other one-objective-based SNP epistasis search methods. The simulation data experiment results show the effectiveness of two-objective-based SNP epistasis search method, and the poor experimental results show the insufficiency of one-objective functions. The experiment results are shown in Figure 2. The abscissa of Figure 2 is minor allele frequency (MAF) which is assigned 0.1, 0.2, and 0.5. We generate the simulate dataset and study the parameter setting following many previous studies [17, 42–44]. For each simulate dataset of parameter combination, we generated 100 datasets which contain 2,000 experimental samples (1,000 case samples and 1,000 control samples) and 1000 SNPs were simulated. We evaluate the algorithm performance through calculating the ratio of real number identified following the significance level 0.01 which is adjusted after Bonferroni correction. The parameter λ was set to 0.3 for ADDME and 0.2 for EIME and EITEME. The parameter range of linkage disequilibrium between SNPs is r2 from 0.7 to 1.

Figure 2

Power test comparisons between one-objective and two-objective methods on three different model with MAF value 0.1, 0.2, and 0.5.

3.2. Compared with Benchmark Methods

After comparing with single objective function. We compare our proposed method with existing method. The performance of our proposed method was evaluated by comparison with benchmark methods [45]. In many previous studies, the authors have already discussed the parameter settings problem. In this section, we set the parameters according to the existing strategy. We evaluated performance of FAACOSE by comparing with two recent methods, BEAM, generic ACO algorithm, and the AntEpiSeeker; we use BEAM package and previous parameter strategy to generate simulate dataset. Be aware of the fact that the generic ACO algorithm could not select larger size SNP set. We use simulated dataset introduced in Section 3.1. We evaluate the algorithm performance through calculating the ratio of real number identified following the significance level 0.01 which is adjusted after Bonferroni correction. We generate simulate datasets following three genetic models: ADDME, EIME, and EITEME. Other parameters for data simulation were the effective size λ, a measure of marginal effects as defined by Marchini et al. [42], linkage disequilibrium between SNPs measured by r2, and minor allele frequencies (MAFs). λ was set to 0.3 for ADDME and 0.2 for EIME and EITEME. For r2, two values (0.7 and 1.0) were used for each model. For MAFs, three values (0.1, 0.2, and 0.5) were considered. The parameters for BEAM were set as default. The parameter settings for AntEpiSeeker were large dataset size = 6, small dataset size = 3, count large = 150, count small = 300, epistasis model = 2, ant count = 1000, α=1, ρ=0.05, and τ0=100 (also available in the software package documentation of AntEpiSeeker). The parameters of the generic ACO algorithm were set as ant count = 1000, α=1, ρ=0.05, τ0 = 100, count (number of iterations) = 900, and epistasis model = 2. The comparison of detection power for BEAM, genetic ACO algorithm, and the AntEpiSeeker is presented in Figure 3. The results show that FAACOSE outperforms BEAM and the generic ACO in all parameter settings and is superior to AntEpiSeeker in most parameter settings.

Figure 3

Power comparisons between existing methods and FAACOSE on three models.

In this section, we compare our proposed method with benchmark methods. First, we use power test to detect how many real SNP subsets can be found with our proposed method. Second, we use precision, recall, and F1 score to evaluate the results. Precision denotes how many right SNP subsets in the total final identified SNP subsets. Recall denotes the number of right SNP subsets that are identified. F1 score is an indicator used in statistics to measure the accuracy of two classification models. It takes into account the precision and recall of the classification model simultaneously. F1 score can be seen as a weighted average of precision and recall, its maximum is 1, and minimum is 0.l. We show the results of FAACOSE with other methods on r2 = 0.7 and MAF = 0.2 in Table 1.

Table 1

F 1 score comparison between FAACOSE and other methods.

Model	Method	Recall	Precision	F 1 score
ADDME	BEAM	0.29	0.15	0.20
	gACO	0.45	0.36	0.40
	AntEpiSeeker	0.6	0.55	0.57
	FAACOSE	0.82	0.74	0.78

EIME	BEAM	0.3	0.45	0.36
	gACO	0.35	0.32	0.33
	AntEpiSeeker	0.34	0.56	0.42
	FAACOSE	0.9	0.82	0.86

EITEME	BEAM	0.1	0.14	0.12
	gACO	0.15	0.20	0.17
	AntEpiSeeker	0.54	0.46	0.50
	FAACOSE	0.65	0.62	0.63

The F1 score of FAACOSE is better than other methods. We run the same experiment on datasets with different parameter combination. In all eighteen datasets FAACOSE has the highest F1 score in fifteen of them. In real GWAS dataset experiment, the sample size of real dataset is huge. The efficiency of the method is also to be considered. The experimental results indicate that our proposed method is more effective method in real GWAS dataset. AntEpiSeeker is the most efficient algorithm among three methods. In different data samples, we compare run time of AntEpiSeeker and FAACOSE. And averaging the results, FAACOSE is faster 30% than AntEpiSeeker.

4. Application to Real SNP Dataset

Late-Onset Alzheimer’s Disease (LOAD) is the most frequent form of Alzheimer’s disease, which is frequently identified in people older than 65 years; the LOAD or AD is a kind of chronic neurodegenerative diseases which is frequently not obvious in the onset of the disease and slowly changes dementia over time. It is the cause of 60% to 70% of cases of dementia. The most common early symptom is difficulty in remembering recent events (short-term memory loss). As the disease advances, symptoms can include problems with language, disorientation (including easily getting lost), mood swings, loss of motivation, not managing self-care, and behavioural issues. LOAD is a multifactor genetic disease; its etiology and pathogenesis have not yet been fully understood. The apolipoprotein (APOE) gene is a definite risk factor for LOAD. The APOE gene has three forms. The ε2, ε3, and ε4; the effect of ε2 is positive; ε2 can effectively prevent the occurrence of the disease. There has been research report that genetic variant ε4 has induced effect on disease. Between 40 and 80% of people with AD possess at least one APOE ε4 allele [46]. Previous studies have reported some significant SNPs in the field of Genome-Wide Association Studies [47]. Reference [47] reported that 10 SNPs in the area of GAB2 gene have an epistasis effect with APOE e4 in relation to Late-Onset Alzheimer’s Disease. We applied our proposed method to the LOAD GWAS dataset from website https://www.tgen.org/ [47]. After data preprocessing, the real biological dataset contains 1368 samples [48, 49]. Of these, 836 samples were identified case studies; the remaining 532 samples were normal sample [50, 51]. Each sample of real biological dataset contains 309,316 SNPs with genotype information, APOE status, and LOAD status [52]. For the next calculation, we code the APOE gene state with a binary variable; the value 1 represents the ε4 variant and in turn the value 0 represents the other three variants [53]. An SNP locus was coded as a quaternary variable considering the missing state. The high potential LOAD disease related SNP is shown in Table 2.

Table 2

The number of selected SNPs of FAACOSE in LOAD dataset.

SNP rs#
rs7756992	rs611154	rs191840	rs7294919
rs1887922	rs304900	rs1999764	rs1385600
rs2373115	rs7101429	rs609812	rs613375
rs1007837	rs2510038	rs4945261	rs10793294
rs520227	rs191740	rs7924284	rs829465
rs602106	rs7174511	rs606889	rs602192

5. Discussions

In this paper, we proposed a novel ant colony optimization based fast search method for the discovery of epistasis interactions in large scale real GWAS dataset. FAACOSE was evaluated through comparison with existing three approaches on both simulated and real datasets. FAACOSE, which adopts a fast adaptive optimization procedure, is a modified algorithm derived from the generic ACO. And with two-objective function, to demonstrate the advantages of fast adaptive ant colony optimization algorithm, we also compared the performance of the FAACOSE with that of the generic ACO.

In future studies, we intend to find more powerful modeling approaches, ant colony optimization algorithm with faster convergence, objective functions which can better measure data structure of GWAS dataset, more efficient optimal SNP subset search, and identification strategies that can be combined and flexibly embedded into our SNP epistasis search framework to find more accurate SNP subset. With the rapid development of bioinformatics, more and more biological information related to disease is identified. More and more studies will consider prior knowledge. An important future research direction is that we will try to apply expert prior knowledge to GWAS dataset with our proposed method, that is, the fast adaptive ant colony optimization algorithm for detecting SNP epistasis. Expert prior knowledge can improve the power and efficiency of epistasis detection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is partly supported by National Natural Science Foundation of China (Grant nos. 61520106006, 31571364, 61732012, 61532008, U1611265, 61672382, 61402334, 61472280, 61472173, 61572447, 61672203, 61472282, and 61373098) and China Postdoctoral Science Foundation (Grant nos. 2014M561513, 2015M580352, 2017M611619, and 2016M601646) Guangxi Bagui Scholars Program Special Fund.

Hirschhorn

J. N.

Daly

M. J.

Genome-wide association studies for common diseases and complex traits

Nature Reviews Genetics 2005 6 2 95 108

2-s2.0-13144306071

10.1038/nrg1521

Howie

B. N.

Donnelly

Marchini

A flexible and accurate genotype imputation method for the next generation of genome-wide association studies

PLoS Genetics 2009 5 6

2-s2.0-67651222400

10.1371/journal.pgen.1000529

e1000529

Manolio

T. A.

Collins

F. S.

Cox

N. J.

Goldstein

D. B.

Hindorff

L. A.

Hunter

D. J.

McCarthy

M. I.

Ramos

E. M.

Cardon

L. R.

Chakravarti

Cho

J. H.

Guttmacher

A. E.

Kong

Kruglyak

Mardis

Rotimi

C. N.

Slatkin

Valle

Whittemore

A. S.

Boehnke

Clark

A. G.

Eichler

E. E.

Gibson

Haines

J. L.

MacKay

T. F. C.

McCarroll

S. A.

Visscher

P. M.

Finding the missing heritability of complex diseases

Nature 2009 461 7265 747 753

2-s2.0-70349956433

10.1038/nature08494

Shastry

B. S.

SNP alleles in human disease and evolution

Journal of Human Genetics 2002 47 11 561 566

2-s2.0-0036955891

12436191

10.1007/s100380200086

Stubbs

Vancampfort

De Hert

Mitchell

A. J.

The prevalence and predictors of type two diabetes mellitus in people with schizophrenia: a systematic review and comparative meta-analysis

Acta Psychiatrica Scandinavica 2015 132 2 144 157

10.1111/acps.12439

2-s2.0-84965177898

Liao

K. P.

Cardiovascular disease in patients with rheumatoid arthritis

Trends in Cardiovascular Medicine 2017 27 2 136 140

2-s2.0-84995480729

10.1016/j.tcm.2016.07.006

Mao

London

N. R.

Dvorkin

Detection of SNP epistasis effects of quantitative traits using an extended Kempthorne model

Physiological Genomics 2006 28 1 46 52

2-s2.0-34247848075

10.1152/physiolgenomics.00096.2006

Zhang

Zhu

Schadt

E. E.

Liu

J. S.

A Bayesian partition method for detecting pleiotropic and epistatic eQTL modules

PLoS Computational Biology 2010 6 1

e1000642

10.1371/journal.pcbi.1000642

MR2601383

Kang

Zhang

Chun

H.-W.

Ding

Liu

Gao

EQTL epistasis: Detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways

Bioinformatics 2015 31 5 656 664

10.1093/bioinformatics/btu727

2-s2.0-84928994263

Lin

Chen

Huang

Liu

Ochoa

Zabaleta

Mercante

D. E.

Fang

Sellers

T. A.

Pow-Sang

J. M.

Cheng

Eeles

Easton

Kote-Jarai

Amin Al Olama

Benlloch

Muir

Giles

G. G.

Wiklund

Gronberg

Haiman

C. A.

Schleutker

Nordestgaard

B. G.

Travis

R. C.

Hamdy

Pashayan

Khaw

Stanford

J. L.

Blot

W. J.

Thibodeau

S. N.

Maier

Kibel

A. S.

Cybulski

Cannon-Albright

Brenner

Kaneva

Batra

Teixeira

M. R.

Pandha

Park

J. Y.

SNP interaction pattern identifier (SIPI): an intensive search for SNP–SNP interaction patterns

Bioinformatics 2016

10.1093/bioinformatics/btw762

Prentice

R. L.

Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation

Biostatistics 2006 7 3 339 354

2-s2.0-33745453811

10.1093/biostatistics/kxj020

Zbl1170.62398

Deng

S.-P.

Zhu

Huang

D.-S.

Mining the bladder cancer-associated genes by an integrated strategy for the construction and analysis of differential co-expression networks

BMC Genomics 2015 16 3, article no. S4

10.1186/1471-2164-16-S3-S4

2-s2.0-84965184081

Deng

S.-P.

Huang

D.-S.

SFAPS: An R package for structure/function analysis of protein sequences based on informational spectrum method

Methods 2014 69 3 207 212

10.1016/j.ymeth.2014.08.004

2-s2.0-84927135636

Moore

J. H.

Lamb

J. M.

Brown

N. J.

Vaughan

D. E.

A comparison of combinatorial partitioning and linear regression for the detection of epistatic effects of the ACE I/D and PAI-1 4G/5G polymorphisms on plasma PAI-1 Levels

Clinical Genetics 2002 62 1 74 79

2-s2.0-0036665198

10.1034/j.1399-0004.2002.620110.x

Michael

B. M.

Neapolitan

R. E.

Jiang

Shyam

Learning genetic epistasis using Bayesian network scoring criteria

BMC Bioinformatics 2011 12 1 89

Wang

Liu

Robbins

Rekaya

AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm

BMC Research Notes 2010 3, article 117

10.1186/1756-0500-3-117

2-s2.0-77953271464

Zhang

Liu

J. S.

Bayesian inference of epistatic interactions in case-control studies

Nature Genetics 2007 39 9 1167 1173

10.1038/ng2110

2-s2.0-34548352849

Dorigo

Birattari

Blum

Ant colony optimization and swarm intelligence

SpringerVerlag 2004 5217 8 767 771

Stützle

López-Ibáñez

Pellegrini

Maur

Montes De Oca

Birattari

Dorigo

Parameter adaptation in ant colony optimization

Autonomous Search 2012 9783642214349 191 215

2-s2.0-84929860736

10.1007/978-3-642-21434-9_8

10.1007/978-3-642-21434-9

Blum

Sampels

An ant colony optimization algorithm for shop scheduling problems

Journal of Mathematical Modelling and Algorithms 2004 3 3 285 308

10.1023/B:JMMA.0000038614.39977.6f

MR2231451

Zbl1146.90405

2-s2.0-18744404504

Musa

Arnaout

J.-P.

Jung

Ant colony optimization algorithm to solve for the transportation problem of cross-docking network

Computers and Industrial Engineering 2010 59 1 85 92

10.1016/j.cie.2010.03.002

2-s2.0-77955273868

Varela

G. N.

Sinclair

M. C.

Ant colony optimisation for virtual-wavelength-path routing and wavelength allocation

Proceedings of the 1999 Congress on Evolutionary Computation (CEC '99)

July 1999

Washington, DC, USA

1809 1816

10.1109/CEC.1999.785494

2-s2.0-84901474801

Sim

K. M.

Sun

W. H.

Ant colony optimization for routing and load-balancing: survey and new directions

Systems Man & Cybernetics Part A Systems Humans IEEE Transactions on 2003 33 5 560 572

10.1109/TSMCA.2003.817391

Ngo

S.-H.

Jiang

Horiguchi

Adaptive routing and wavelength assignment using ant-based algorithm

Proceedings of the 2004 12th IEEE International Conference on Networks, ICON 2004 - Unity in Diversity

November 2004

482 486

10.1109/ICON.2004.1409214

2-s2.0-21644461688

Vrieze

S. I.

Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)

Psychological Methods 2012 17 2 228 243

10.1037/a0027127

2-s2.0-84866456803

Huang

D.-S.

J.-X.

A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks

IEEE Transactions on Neural Networks 2008 19 12 2099 2115

10.1109/TNN.2008.2004370

2-s2.0-57749092656

North

B. V.

Curtis

Sham

P. C.

Application of logistic regression to case-control association studies involving two causative loci

Human Heredity 2005 59 2 79 87

2-s2.0-21044450087

10.1159/000085222

Jing

P.-J.

Shen

H.-B.

MACOED: A multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies

Bioinformatics 2015 31 5 634 641

2-s2.0-84928987535

10.1093/bioinformatics/btu702

Ryman

CHIFISH: A computer program testing for genetic heterogeneity at multiple loci using chi-square and Fisher's exact test

Molecular Ecology Notes 2006 6 1 285 287

2-s2.0-33644852119

10.1111/j.1471-8286.2005.01146.x

Mehta

C. R.

Patel

N. R.

A network algorithm for performing Fisher's exact test in r × c contingency tables

Journal of the American Statistical Association 1983 78 382 427 434

10.1080/01621459.1983.10477989

MR711119

Sobrino

Brión

Carracedo

SNPs in forensic genetics: A review on SNP typing methodologies

Forensic Science International 2005 154 2-3 181 194

2-s2.0-25444527442

10.1016/j.forsciint.2004.10.020

Shoval

Sheftel

Shinar

Hart

Ramote

Mayo

Dekel

Kavanagh

Alon

Evolutionary trade-offs, pareto optimality, and the geometry of phenotype space

Science 2012 336 6085 1157 1160

2-s2.0-84861663597

10.1126/science.1217405

Huang

D.-S.

Jiang

A general CPL-AdS methodology for fixing dynamic parameters in dual environments

IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2012 42 5 1489 1500

10.1109/TSMCB.2012.2192475

2-s2.0-84866490627

Zhu

Guo

W.-L.

Deng

S.-P.

Huang

D.-S.

ChIP-PIT: enhancing the analysis of chip-seq data using convex-relaxed pair-wise interaction tensor decomposition

IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016 13 1 55 63

10.1109/TCBB.2015.2465893

2-s2.0-84962038941

Angione

Carapezza

Costanza

Lio

Nicosia

Pareto optimality in organelle energy metabolism analysis

IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013 10 4 1032 1044

2-s2.0-84890513236

10.1109/TCBB.2013.95

Fisher

R. A.

On the Interpretation of χ2 from Contingency Tables, and the Calculation of P

Journal of the Royal Statistical Society 1922 85 1 87

10.2307/2340521

Agresti

A survey of exact inference for contingency tables

Statistical Science 1992 7 1 131 153

2-s2.0-84972545970

10.1214/ss/1177011454

Zbl0955.62587

Wenzheng

Yuehui

Dong

Prediction of protein structure classes with flexible neural tree

Bio-Medical Materials and Engineering 2014 24 6 3797 3806

2-s2.0-84907261302

10.3233/BME-141209

Zhu

You

Z.-H.

Huang

D.-S.

Wang

t-LSE: a novel robust geometric approach for modeling protein-protein interaction networks

PLoS ONE 2013 8 4

e58368

10.1371/journal.pone.0058368

2-s2.0-84875669119

Zheng

C.-H.

Zhang

V. T.-Y.

Shiu

C. K.

Huang

D.-S.

Molecular pattern discovery based on penalized matrix decomposition

IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011 8 6 1592 1603

10.1109/TCBB.2011.79

2-s2.0-80052888681

Huang

D.-S.

H.-J.

Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids

IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013 10 2 457 467

10.1109/tcbb.2013.10

2-s2.0-84882954143

Marchini

Donnelly

Cardon

L. R.

Genome-wide strategies for detecting multiple loci that influence complex diseases

Nature Genetics 2005 37 4 413 417

2-s2.0-16844366786

10.1038/ng1537

Jiang

Tang

A random forest approach to the detection of epistatic interactions in case-control studies

BMC Bioinformatics 2009 10 1, article S65

10.1186/1471-2105-10-S1-S65

2-s2.0-60849093174

Kruppa

Ziegler

König

I. R.

Risk estimation and risk prediction using machine-learning methods

Human Genetics 2012 131 10 1639 1654

2-s2.0-84866731649

10.1007/s00439-012-1194-y

Huang

D.-S.

Zheng

C.-H.

Independent component analysis-based penalized discriminant method for tumor classification using gene expression data

Bioinformatics 2006 22 15 1855 1862

10.1093/bioinformatics/btl190

2-s2.0-33747865502

Mahley

R. W.

Weisgraber

K. H.

Huang

Apolipoprotein E4: a causative factor and therapeutic target in neuropathology, including Alzheimer's disease

Proceedings of the National Academy of Sciences of the United States of America 2006 103 15 5644 5651

10.1073/pnas.0600549103

2-s2.0-33645808672

Reiman

E. M.

Webster

J. A.

Myers

A. J.

Hardy

Dunckley

Zismann

V. L.

Joshipura

K. D.

Pearson

J. V.

Hu-Lince

Huentelman

Craig

D. W.

Coon

K. D.

Liang

W. S.

Herbert

R. H.

Beach

Rohrer

K. C.

Zhao

A. S.

Leung

Bryden

Marlowe

Kaleem

Mastroeni

Grover

Heward

C. B.

Ravid

Rogers

Hutton

M. L.

Melquist

Petersen

R. C.

Alexander

G. E.

Caselli

Kukull

Papassotiropoulos

Stephan

D. A.

GAB2 alleles modify Alzheimer's Risk in APOE ε4 carriers

Neuron 2007 54 5 713 720

10.1016/j.neuron.2007.05.022

2-s2.0-34249745769

Zheng

C.-H.

Huang

D.-S.

Zhang

Kong

X.-Z.

Tumor clustering using nonnegative matrix factorization with gene selection

IEEE Transactions on Information Technology in Biomedicine 2009 13 4 599 607

10.1109/titb.2009.2018115

2-s2.0-67749108622

Deng

S.-P.

Zhu

Huang

D.-S.

Predicting hub genes associated with cervical cancer through gene co-expression networks

IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016 13 1 27 35

10.1109/TCBB.2015.2476790

2-s2.0-84961994343

Zhu

Deng

S.-P.

Huang

D.-S.

A two-stage geometric method for pruning unreliable links in protein-protein networks

IEEE Transactions on Nanobioscience 2015 14 5 528 534

10.1109/TNB.2015.2420754

2-s2.0-84939186112

Huang

D.-S.

Zhang

Han

Deng

Yang

Zhang

Prediction of protein-protein interactions based on protein-protein correlation using least squares regression

Current Protein and Peptide Science 2014 15 6 553 560

10.2174/1389203715666140724084019

2-s2.0-84906904242

Huang

D.-S.

Systematic Theory of Neural Networks for Pat-tern Recognition May 1996

Publishing House of Electronic Industry of China

Huang

D.-S.

Radial basis probabilistic neural networks: model and application

International Journal of Pattern Recognition and Artificial Intelligence 1999 13 7 1083 1101

10.1142/s0218001499000604

2-s2.0-0000759063