Epistasis is prevalent in SNP interactions. Some existing methods focus on constructing models for two SNPs, while others search for SNPs using a single objective function. In this paper, we present a unified, fast framework that integrates an adaptive ant colony optimization algorithm with multiobjective functions for detecting SNP epistasis in GWAS datasets. We compared our method with existing methods on synthetic datasets and applied it to a Late-Onset Alzheimer’s Disease dataset. Our experimental results show that the proposed method outperforms the other methods in epistasis detection, and the results on the real dataset contribute to research on the mechanism underlying the disease.
Funding: National Natural Science Foundation of China (61520106006, 31571364, 61732012, 61532008, U1611265, 61672382, 61402334, 61472280, 61472173, 61572447, 61672203, 61472282, 61373098); China Postdoctoral Science Foundation (2014M561513, 2015M580352, 2017M611619, 2016M601646); Guangxi Bagui Scholars Program Special Fund.
1. Introduction
Accompanied by the rapid development of genomics and gene chip technology, Genome-Wide Association Studies (GWAS) have identified massive numbers of genetic variations related to complex traits [1, 2]. Although this approach has achieved great success, it can explain only a small part of the mechanism underlying complex diseases, a gap known as “missing heritability” [3]. That is to say, the marginal genetic effects of the single nucleotide polymorphisms (SNPs) identified by GWAS account for only a small part of the pathogenic causes. For diseases related to a single SNP locus [4], GWAS can identify the SNPs responsible for the disease trait. However, complex diseases, such as type 2 diabetes [5], prostate cancer, and rheumatoid arthritis (RA) [6], are often due to the small and complex effects of a large number of SNPs. More and more studies have shown that epistasis exists in SNP interactions. Many SNPs interact with each other in the process of affecting disease traits [7]. Some SNPs affect the disease and dominate the effect of others; the relationship in which one SNP represses the effect of another is known as epistasis. In many complex human diseases, the effect of epistasis remains unclear, and existing methods for SNP-related disease may perform poorly because they fail to identify epistasis.
During the past decade, many approaches have been proposed to detect epistasis. Some methods focus on the interaction between two specific SNPs. Zhang et al. [8] proposed a Bayesian partition method for epistatic eQTL modules. Kang et al. [9] proposed four different models to measure the epistasis effect between two loci and suggested a statistical strategy to infer the hierarchical relationships. Recently, Lin et al. [10] reported forty-five SNP-SNP interaction models by considering the inheritance modes and model structures. These methods have been successful in studying epistasis between two SNPs, but GWAS data are high dimensional, containing hundreds of thousands or even millions of SNPs, while often including only dozens or hundreds of individual samples. With so few samples and so many features, identifying the interaction between each pair of SNPs requires a vast amount of time [11–13], and the computational burden becomes prohibitive.
More and more machine learning methods are being applied to epistasis research. Many methods model the epistasis effect from the perspective of the overall data. Moore et al. [14] applied a regression method to identify the relationship between gene expression and epistasis effect. Michael et al. [15] applied Bayesian networks to identify the epistasis effect network from the original SNP data. Although these methods solved some problems, they still did not perform well on large scale GWAS datasets owing to the same “high-dimensional small sample size problem.” With the rapid development of multiobjective optimization and machine learning, the ant colony optimization (ACO) algorithm was applied to epistasis research. Wang et al. [16] proposed AntEpiSeeker, which combines heuristic search with ant colony optimization to identify SNPs that dominate other SNPs. Experimental results on a real rheumatoid arthritis dataset show that AntEpiSeeker is better than other methods. The drawback of this method is that performance differs across disease models. Zhang and Liu [17] developed BEAM, a Bayesian inference method that identifies epistatic interactions in case-control studies. However, BEAM needs a lot of time on GWAS datasets. In this paper, we extend SNP epistasis research with a fast adaptive ant colony optimization algorithm that searches for SNP epistasis using two objective functions.
The experiments on several simulated datasets show the good performance of our method. We also compare our method with the benchmark methods, including BEAM, generic ACO, and AntEpiSeeker. Experimental results show that our method has better performance in GWAS datasets containing epistasis effect among SNPs.
2. Methods
2.1. Ant Colony Optimization
The ant colony optimization (ACO) algorithm, widely used in artificial intelligence and large scale problem solving, is inspired by the food search behaviour of ants in nature. If the food search paths are viewed as a graph, ACO reduces the time needed to find good paths through that graph [18]. ACO is a swarm intelligence method and belongs to the family of metaheuristic optimization algorithms. Marco Dorigo proposed it in 1992 in his Ph.D. thesis. GWAS datasets often contain hundreds of thousands to millions of SNPs, so it is not feasible to examine the relationship of every pair of SNPs within an acceptable time. We use ACO here to reduce the complexity of exhaustive search.
In nature, ants searching for food appear to walk randomly, and they leave pheromones along the path as they travel back and forth. Other ants that encounter the path tend to follow it rather than walk randomly; if they find food through this path, they also leave pheromones, so the pheromone value on the path is reinforced. Pheromone also evaporates over time, which reduces the path’s attractive strength. The longer a path is, the more time the ants need to traverse it; a short path is traversed much more quickly, so pheromone values are larger on shorter paths than on longer ones. Evaporation makes the paths change dynamically, which helps avoid convergence to a locally optimal solution. Without evaporation, the path selected by the first ants would tend to remain the most attractive or the only path, which would restrict the solution space.
The mechanism of pheromone evaporation in real ant colonies is unclear, but pheromone evaporation plays a very important role in artificial intelligence systems. The ant colony optimization algorithm has achieved great success in application [19–21].
The travelling salesperson problem (TSP) is defined by a set of cities and the physical distance between each pair of cities. The question is: what is the shortest possible path on which the travelling salesperson visits each city exactly once and finally returns to the origin city? With n cities, there are $(n-1)!/2$ distinct tours. The number of feasible solutions grows factorially as the number of cities increases, making exhaustive computation impractical. The TSP is an NP-hard combinatorial optimization problem.
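To make the growth of the search space concrete, the following minimal sketch counts the distinct closed tours for a few values of n, assuming symmetric distances. The function name `tsp_tour_count` is illustrative and not part of the original method.

```python
from math import factorial


def tsp_tour_count(n: int) -> int:
    """Number of distinct closed tours over n cities with symmetric distances."""
    if n < 3:
        # With one or two cities there is only a single trivial tour.
        return 1
    # Fix the start city (n - 1)! orderings, halve for direction symmetry.
    return factorial(n - 1) // 2


for n in (5, 10, 15, 20):
    print(n, tsp_tour_count(n))
```

Already at n = 20 the count exceeds 6 × 10^16, which is why heuristic search such as ACO is used instead of enumeration.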
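Suppose that m ants are randomly placed in n cities and that the kth ant is in the ith city. The probability that ant k chooses city j as its next city is
$$p_{ij}^{k}(t)=\begin{cases}\dfrac{\tau_{ij}^{\alpha}(t)\,\eta_{ij}^{\beta}(t)}{\sum_{o\in \mathrm{candidate}_k}\tau_{io}^{\alpha}(t)\,\eta_{io}^{\beta}(t)}, & j\in \mathrm{candidate}_k,\\[2ex] 0, & \text{otherwise},\end{cases}\qquad \eta_{ij}(t)=\frac{1}{d_{ij}},\tag{1}$$
where $\tau_{ij}(t)$ indicates the residual pheromone information on path ij at moment t, $\eta_{ij}(t)$ indicates the heuristic function, and $d_{ij}$ indicates the physical distance between city i and city j. $\mathrm{tabu}_k$ indicates the set of cities that ant k has already visited, and $\mathrm{candidate}_k$ indicates the set of cities that ant k can visit next.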
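The transition rule in Equation (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the nested-list representation of the pheromone and distance matrices, and the default values of `alpha` and `beta` are assumptions.

```python
import random


def transition_probabilities(i, candidates, tau, dist, alpha=1.0, beta=2.0):
    """Return {j: p_ij} over the candidate cities, following Equation (1)."""
    weights = {
        # Pheromone term tau_ij^alpha times heuristic term (1 / d_ij)^beta.
        j: (tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
        for j in candidates
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_next_city(i, candidates, tau, dist, alpha=1.0, beta=2.0, rng=random):
    """Sample the next city in proportion to the transition probabilities."""
    probs = transition_probabilities(i, candidates, tau, dist, alpha, beta)
    cities = list(probs)
    return rng.choices(cities, weights=[probs[j] for j in cities], k=1)[0]


# Toy example: three cities, uniform pheromone, ant currently in city 0.
dist = [[0, 1, 2], [1, 0, 4], [2, 4, 0]]
tau = [[1.0] * 3 for _ in range(3)]
print(transition_probabilities(0, [1, 2], tau, dist))
```

With uniform pheromone the choice is driven only by distance, so the nearer city 1 receives the larger probability; cities outside `candidates` implicitly receive probability zero.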
After n moments, the ants complete a cycle, and the pheromone on each path is adjusted according to
$$\tau_{ij}(t+n)=(1-\rho)\,\tau_{ij}(t)+\Delta\tau_{ij},\qquad \Delta\tau_{ij}=\sum_{k=1}^{m}\Delta\tau_{ij}^{k},\tag{2}$$
where $\Delta\tau_{ij}$ indicates the pheromone increment of path ij after this cycle. The contribution of ant k is
$$\Delta\tau_{ij}^{k}=\begin{cases}\dfrac{Q}{l_k}, & ij\in L_k,\\[1.5ex] 0, & \text{otherwise},\end{cases}\tag{3}$$
where $L_k$ indicates the path travelled by ant k in this cycle and $l_k$ indicates the length of that path. The parameters to be determined are α, β, ρ, m, and Q; the number of ants is less than or equal to the number of cities, and Q is a suitably large constant. ACO is often used for large scale data problems. However, slow convergence is still a bottleneck when applying the ant colony algorithm to large scale search optimization problems, and the pheromone update strategy is one of the keys that determine the convergence rate.
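A minimal sketch of the update in Equations (2) and (3) is given below. It assumes closed tours stored as lists of city indices and a symmetric pheromone matrix stored as nested lists; the function name and the default values of `rho` and `Q` are illustrative assumptions.

```python
def update_pheromone(tau, tours, lengths, rho=0.05, Q=100.0):
    """Apply evaporation and deposit (Equations (2) and (3)) in place."""
    n = len(tau)
    delta = [[0.0] * n for _ in range(n)]
    for tour, length in zip(tours, lengths):
        # Each ant deposits Q / l_k on every edge of its closed tour.
        for a, b in zip(tour, tour[1:] + tour[:1]):
            delta[a][b] += Q / length
            delta[b][a] += Q / length  # keep the matrix symmetric
    for i in range(n):
        for j in range(n):
            tau[i][j] = (1.0 - rho) * tau[i][j] + delta[i][j]
    return tau


# Toy example: one ant walks the tour 0 -> 1 -> 2 -> 0 of length 7.
tau = [[1.0] * 3 for _ in range(3)]
update_pheromone(tau, tours=[[0, 1, 2]], lengths=[7.0], rho=0.5, Q=7.0)
print(tau[0][1], tau[0][0])
```

Edges on the tour gain pheromone while unused entries only evaporate, which is the mechanism that biases later ants toward shorter tours.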
When applying ant colony optimization to a specific problem, the search space should be as large as possible, yet the algorithm must also be time efficient; ACO has to balance solution quality against solving speed. On the basis of previous studies [22–24], we only consider the pheromone evaporation factor ρ and the pheromone importance factor α. In (2), ρ balances the effects of the old pheromone value and the current pheromone value. When ρ is too small, too much residual pheromone remains, which leads to a locally optimal solution. We therefore adapt ρ when the algorithm does not improve the current optimal solution within n iterations:
$$\rho(t+n)=\begin{cases} g\,\rho(t), & \rho(t)\le\rho_{\max},\\ \rho_{\max}, & \text{otherwise},\end{cases}\tag{4}$$
where $\rho_{\max}$ equals 0.85 in practice and the tuning parameter g equals 1.02. When the pheromone value reaches the critical value, the pheromone importance factor begins to play a role. As the pheromone importance factor α increases, the algorithm can jump out of a locally optimal solution and search for the globally optimal solution:
$$\alpha(t+n)=\begin{cases} g_1\,\alpha(t), & \alpha(t)\le\alpha_{\max},\\ \alpha_{\max}, & \text{otherwise},\end{cases}\tag{5}$$
where $g_1$ is a constant larger than one and $\alpha_{\max}$ is less than or equal to five. In the calculation, we first follow the standard ant colony optimization algorithm for N iterations, where N is a predefined number. If the current optimal solution is not improved after N iterations, we update the parameters according to (4) and (5) and then update all pheromone values according to (2).
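The adaptive rule in Equations (4) and (5) can be sketched as below. Two points are assumptions rather than statements from the text: the value `g1 = 1.05` is a placeholder (the text only requires a constant larger than one), and the sketch clamps with `min`, a slightly stricter reading of the piecewise rule that never lets a parameter exceed its cap.

```python
def adapt_parameters(rho, alpha, stalled,
                     g=1.02, g1=1.05, rho_max=0.85, alpha_max=5.0):
    """Grow rho and alpha after a stall, capped at rho_max and alpha_max.

    `stalled` is True when the best solution has not improved for N iterations.
    g1 = 1.05 is a hypothetical value chosen only for illustration.
    """
    if not stalled:
        return rho, alpha
    return min(g * rho, rho_max), min(g1 * alpha, alpha_max)


# Example: parameters stay fixed while the search improves and grow on a stall.
rho, alpha = 0.05, 1.0
rho, alpha = adapt_parameters(rho, alpha, stalled=False)
rho, alpha = adapt_parameters(rho, alpha, stalled=True)
print(rho, alpha)
```

In a full run, the caller would keep a counter of iterations without improvement and pass `stalled=True` once that counter reaches N.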
Given pheromone values and transfer rules, we can use the ant colony optimization algorithm to find a group of SNPs that affect the disease. Assume there are P SNPs in the GWAS dataset; we construct a P × P symmetric matrix M to store the pheromone values. The element $m_{ij}$ of M denotes the disease-related interaction between the ith SNP and the jth SNP. At the beginning of our method, every element of M is assigned a constant value $m_0$; equal values mean that every pair of SNPs is initially considered equally likely to show epistasis and to be related to the disease.
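As a small illustration of this data structure, the sketch below builds the symmetric pheromone matrix with a constant initial value and reinforces one SNP pair. The helper names and the nested-list storage are assumptions made for the example; a real GWAS-scale run would need a more compact representation.

```python
def init_pheromone(num_snps, m0=1.0):
    """Create a num_snps x num_snps matrix with every entry set to m0."""
    return [[m0] * num_snps for _ in range(num_snps)]


def reinforce(M, i, j, amount):
    """Add pheromone to the SNP pair (i, j), keeping M symmetric."""
    M[i][j] += amount
    M[j][i] = M[i][j]


# Four SNPs, all pairs equally likely at the start.
M = init_pheromone(4, m0=0.5)
reinforce(M, 0, 2, 1.0)
print(M[0][2], M[2][0], M[1][3])
```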
At the final pheromone iteration, the ACO algorithm obtains the optimal solutions through a forward selection strategy. The advantage of the ACO algorithm in this paper is that the result contains only nondominated solutions, which are equally plausible candidates with the highest potential strength of association with the disease, and omits dominated solutions.
The disadvantages of the traditional ant colony optimization algorithm are its long search time and its tendency to fall into a locally optimal solution. The drawback of this working mode is that the pheromone evaporation factor and the pheromone importance factor are predefined. As an improvement, we extend ant colony optimization with a “dynamic adaptive strategy.” The advantages of this strategy are a fast convergence rate and the ability to search for the globally optimal solution. Compared with traditional ACO, the new strategy can provide more accurate results.
2.2. Two-Objective Function Optimization
The results of ant colony optimization need to be evaluated. We combine two objective functions to assess the final epistasis results. The first objective function combines the Akaike Information Criterion (AIC) score with a logistic regression function to measure the relationship between the phenotypic trait and the genotype data; the AIC reflects both the effectiveness and the complexity of the model [25, 26]. On the basis of standard logistic regression and following the strategy of North et al. [27], we use the ADDINT logistic regression model to search for the relationship between disease and SNP nodes. The second objective function uses a frequency measurement based on mutual information theory to model the relationship between the genotype data and the phenotypic trait; it represents how much information about the disease trait the selected SNP subset can explain. Our proposed method obtains information from the data rather than relying on a large amount of prior information. The two objective functions measure the quality of the search results from different perspectives, and the experimental results show that using both performs better than other methods on simulated and real biological datasets.
To limit the impact of the high-dimensional small sample size problem, the identification of disease-associated SNPs is treated as a heuristic optimization problem. The proposed method yields optimal solutions in the form of nondominated solutions; the two-objective approach is a kind of multiobjective optimization, and ant colony optimization is used to search for the optimal solutions [28].
Our proposed fast adaptive ACO framework contains two stages. In the first stage, we use the modified ACO algorithm with two objective functions to search for a nondominated SNP subset. In the second stage, we apply the Fisher exact test [29, 30] to the nondominated SNPs generated in the first stage to identify the relationship between disease and SNPs.
2.2.1. AIC Score
The Akaike Information Criterion (AIC) is used to measure the quality of statistical models for a dataset. AIC comes from information theory and estimates the information lost when a statistical model is used to represent the process that generated the data. It deals with the trade-off between the goodness of fit of the model and the complexity of the model. Based on this property, we construct an AIC model for GWAS datasets. The goal is to measure the relationship between the genotype data and the phenotypic disease trait. Logistic regression is widely used to quantitatively analyze the correlation between a dependent variable and independent variables. We therefore construct an AIC score model containing logistic regression and a gradient penalty function: logistic regression provides the maximized log-likelihood of the model, and the penalty depends on the number of free parameters. We follow the strategy of Jing and Shen [28]:
$$\mathrm{AIC}_{\mathrm{score}}=2k-2\log(\mathrm{lik}),\tag{6}$$
where k denotes the number of free parameters and lik denotes the maximized likelihood of the logistic regression model.
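Equation (6) can be illustrated with a short sketch. It assumes the fitted case probabilities are already available from some logistic regression model; fitting the ADDINT model itself is not shown, and the function names are illustrative.

```python
import math


def log_likelihood(labels, probs):
    """Bernoulli log-likelihood of 0/1 labels under fitted case probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(labels, probs)
    )


def aic_score(k, loglik):
    """AIC score of Equation (6): 2k - 2 log(lik); lower is better."""
    return 2 * k - 2 * loglik


# Toy comparison: an intercept-only model versus a two-parameter model.
labels = [1, 0, 1, 0]
null_fit = [0.5, 0.5, 0.5, 0.5]   # hypothetical fitted probabilities
snp_fit = [0.9, 0.2, 0.8, 0.1]    # hypothetical fitted probabilities
print(aic_score(1, log_likelihood(labels, null_fit)))
print(aic_score(2, log_likelihood(labels, snp_fit)))
```

The second model pays a larger penalty for its extra parameter but fits the labels much better, so it obtains the lower AIC score in this toy case.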
2.2.2. Explanation Score
In SNP research, each locus takes one of three values, 0, 1, and 2: 0 means homozygous for the major allele, 1 means heterozygous, and 2 means homozygous for the minor allele [31]. For two loci, there are nine possible genotype combinations, and the genotype at a disease-related SNP locus often differs when the disease occurs. In the two-locus case, $x_i$ denotes the number of samples carrying the ith combination of the two SNP loci, Y denotes the case or control state, $y_1$ denotes the case state, and $y_2$ denotes the control state. The potential interrelationship of the two discrete random variables X and Y is defined as H(X;Y), which measures the relationship between locus combination and disease based on locus frequency information:
$$H(X;Y)=\sum_{i=1}^{I}\left|x_i^{y_1}-x_i^{y_2}\right|,\tag{7}$$
where $x_i^{y_1}$ and $x_i^{y_2}$ are the counts of the ith combination among cases and controls, and I is the total number of locus combinations. To prevent unbalanced sample sizes from affecting the score, we balance the two groups: for example, if there are more case samples than control samples, we randomly extract from the case samples a subset of the same size as the control set. To reduce the impact of randomness, we repeat the extraction several times and average the results. A large value of H means that the potential association between disease and SNPs is strong. Equation (7) can also be applied to combinations of more than two loci. We call this score the explain score.
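The explain score can be sketched as follows, assuming Equation (7) sums the absolute difference between case and control counts of each genotype combination. The data layout (one list of 0/1/2 genotypes per sample, label 1 for case and 0 for control) and the function name are assumptions, and the repeated balancing by random subsampling described above is omitted for brevity.

```python
from collections import Counter


def explain_score(genotypes, labels, loci):
    """Sum over genotype combinations of |count in cases - count in controls|."""
    case_counts = Counter()
    control_counts = Counter()
    for row, y in zip(genotypes, labels):
        combo = tuple(row[locus] for locus in loci)
        if y == 1:
            case_counts[combo] += 1
        else:
            control_counts[combo] += 1
    combos = set(case_counts) | set(control_counts)
    # A Counter returns 0 for combinations it has never seen.
    return sum(abs(case_counts[c] - control_counts[c]) for c in combos)


# Toy data: three cases followed by three controls, two loci per sample.
genotypes = [[0, 1], [0, 1], [2, 2], [0, 1], [1, 0], [2, 2]]
labels = [1, 1, 1, 0, 0, 0]
print(explain_score(genotypes, labels, loci=(0, 1)))
```

Because `loci` is a tuple of arbitrary length, the same function covers combinations of more than two loci.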
2.3. Pareto Optimality for SNP Epistasis Detection
Pareto optimality addresses situations in which it is impossible to make all objective function values of a multiobjective optimization optimal at the same time [32, 33]. It was first applied to income distribution and economics and has since been extended to engineering and multiobjective optimization research. Our modified ant colony optimization algorithm has two objective functions: the first is the AIC score with logistic regression and related parameters, and the second is the explain score. For the first objective function, a lower score indicates a stronger potential relationship between the disease phenotype trait and the SNPs [34]. For the second objective function, a higher score indicates a stronger potential relationship between the disease phenotype trait and the SNPs. The target of the fast two-stage ant colony optimization algorithm is to find the epistasis effect among SNPs and extract the truly associated SNP subset with respect to these two objective functions.
In real GWAS datasets, an identified SNP subset may perform best in terms of one objective function but poorly in terms of the other. The question is therefore how to select a better SNP subset with respect to both objective functions. In practice, a subset rarely outperforms all other solutions on both objectives, so for a framework with two objective functions it is hard, and often impossible, to compute a single global optimum. On the basis of previous studies [28, 34, 35], we adopt Pareto optimality to find practical optimal solutions. Consider two solutions, S1 and S2, each a GWAS SNP subset. Comparing them has only two possible outcomes: either one solution dominates the other, or neither dominates the other. Following Pareto optimality, S1 dominates S2 if two conditions hold. First, $f_e(S_1)$ is not higher than $f_e(S_2)$ for both objective functions. Second, $f_e(S_1)$ is strictly lower than $f_e(S_2)$ for at least one objective function. Here $f_e$ denotes an objective function: e = 1 denotes the modified AIC score and e = 2 denotes the explain score. If S1 and S2 satisfy these two conditions, S1 is nondominated with respect to S2 and S2 is a dominated solution. With this approach and the two objective functions, all solutions can be divided into two kinds: the nondominated set and the dominated set.
The nondominated set may contain many solutions with respect to the two objective functions; our goal is now to find the nondominated solutions that are best under certain conditions.
Next, we use the judgment rule mentioned earlier to sort the solutions of the nondominated set. Specifically, there are three cases. In the first case, $f_1(S_2)$ is larger than $f_1(S_1)$ and $f_2(S_2)$ is larger than $f_2(S_1)$. In the second case, $f_1(S_2)$ equals $f_1(S_1)$ and $f_2(S_2)$ is larger than $f_2(S_1)$. In the third case, $f_1(S_2)$ is larger than $f_1(S_1)$ and $f_2(S_2)$ equals $f_2(S_1)$.
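The dominance rule can be sketched as below. The sketch assumes every objective is expressed so that lower is better; since the explain score is better when higher, it is negated here before comparison, which is an assumption of this example rather than a step stated in the text. Function names are illustrative.

```python
def dominates(a, b):
    """True if objective vector a dominates b (every objective minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [
        s for s in solutions
        if not any(dominates(t, s) for t in solutions)
    ]


# Each tuple is (AIC score, negated explain score) for one hypothetical SNP subset.
candidates = [(10.0, -5.0), (12.0, -7.0), (12.0, -4.0), (9.0, -5.0)]
print(nondominated(candidates))  # [(12.0, -7.0), (9.0, -5.0)]
```

The two surviving candidates trade off against each other: one has the better AIC score and the other the better explain score, so neither dominates.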
2.4. Fisher Exact Test for Experimental Results
The Fisher exact test is applied to contingency tables to assess statistical significance [36–38]. Although in practice it is mostly used for small samples, it is valid for all sample sizes. It was first proposed by Ronald Fisher and belongs to the class of exact tests.
Our unified framework, which contains the fast adaptive ant colony optimization (ACO) algorithm, the Akaike Information Criterion (AIC) score, the explain score, and Pareto optimality, yields a nondominated SNP set as its result. In this section, we use the Fisher exact test to exhaustively search this set for epistasis effects. The Fisher exact test is based on the hypergeometric distribution and is computed from a contingency table; its P value is exact for all sample sizes. The null hypothesis is that the identified SNP subset and the disease are not associated. The alternative hypothesis, that the SNP subset affects the expression of the disease, is accepted when the P value is significant, that is, less than a predetermined value such as 0.05 or smaller. In this way, our proposed method identifies significant SNP subsets.
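A minimal sketch of a two-sided Fisher exact test on a 2 × 2 table, computed directly from the hypergeometric distribution, is shown below. Collapsing the genotype data to a 2 × 2 table (for instance, carriers versus non-carriers of one genotype combination, by case versus control) is an assumption of this example; the text does not specify the table layout, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum all tables at least as extreme as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in probs if p <= p_observed * (1 + 1e-9))


p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```

A P value this large would not reject the null hypothesis of no association at the 0.05 level.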
2.5. Power Test
In the previous sections, we introduced each part of our proposed fast adaptive ant colony optimization algorithm for detecting SNP epistasis. The unified framework contains the fast adaptive ant colony optimization algorithm, the Akaike Information Criterion (AIC) score, the explain score, Pareto optimality, and the modified Fisher exact test. In this section, we introduce how to verify the significance of the results. We construct 100 datasets with the same parameters and use the traditional power test to measure the effect of the methods. The power is defined as
$$\mathrm{Power}=\frac{SD}{100},\tag{8}$$
where SD denotes the number of datasets, out of 100, in which the disease-related SNPs were correctly identified. A single test criterion may not clearly show the quality of the results, so we also use precision and recall, which have been widely used in evaluating classification models [39, 40]. In pattern recognition and information retrieval with binary classification, precision, also called positive predictive value, is the fraction of retrieved instances that are relevant, while recall, also known as sensitivity, is the fraction of relevant instances that are retrieved [26]. Both precision and recall are therefore based on an understanding and measure of relevance. We use them to judge whether the classification results are good or bad; they remain informative when the numbers of positive and negative instances differ greatly, as is always the case in our research. In terms of SNP epistasis research, precision is the proportion of identified SNP subsets that are truly disease related, and recall is the proportion of truly disease-related SNP subsets that are identified. A single indicator, such as the false positive rate alone, cannot make the real result clear.
We take both false positives and true positives into account to measure the real result, which is why we use precision and recall. We also use the F1 score (also called F score or F measure) to summarize precision and recall. Precision and recall are defined from the confusion matrix (Figure 1):
$$\mathrm{recall}=\frac{TP}{TP+FN},\qquad \mathrm{precision}=\frac{TP}{TP+FP},\qquad F_1=\frac{2\cdot\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}.\tag{9}$$
Figure 1: Precision recall explanation matrix.
Precision, also known as positive predictive value, is the number of true positives divided by the sum of true positives and false positives; it reflects how many false positives an algorithm reports. Recall, also known as sensitivity, is the number of true positives divided by the sum of true positives and false negatives. In terms of the SNP selection problem, the larger the recall, the more of the truly disease-related SNP combinations are found; the larger the precision, the higher the proportion of truly disease-related SNP combinations among the identified combinations. The F measure is the harmonic mean of precision and recall and thus combines both into a single measure [41].
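The evaluation measures of Equations (8) and (9) can be computed as in the sketch below. The F1 score is written as the harmonic mean of precision and recall, as described in the text; the function names and the example counts are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def detection_power(detected):
    """Fraction of simulated datasets in which the true SNPs were detected."""
    return sum(detected) / len(detected)


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
print(precision_recall_f1(tp=8, fp=2, fn=2))
# Hypothetical run: true SNPs detected in 82 of 100 simulated datasets.
print(detection_power([True] * 82 + [False] * 18))
```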
3. Simulation Experiments
3.1. Compared with One-Objective Function
In this section, we use simulated data to compare our proposed method with existing methods. To avoid bias caused by the data model, we adopt the BEAM package to generate the simulated datasets [17]. Data were simulated following three genetic models: (1) additive model, (2) epistatic interactions with multiplicative effects, and (3) epistatic interactions with threshold effects. We refer to the additive model as ADDME, to the model of epistatic interactions with multiplicative effects as EIME, and to the model of epistatic interactions with threshold effects as EITEME. In the following, we use these short names to indicate the corresponding data models.
Because our method is a two-objective SNP epistasis search method, we first compare it with existing single-objective exhaustive SNP epistasis search methods to demonstrate the effectiveness of the two-objective search. Second, we compare it with the recently proposed BEAM [17], the generic ACO algorithm, and AntEpiSeeker [16]. In a one-objective SNP epistasis search method, the objective function scores every SNP combination, and in general the scores differ. A low score indicates that the association between the SNP combination and the disease is relatively weak; a high score indicates that it is relatively strong. The one-objective method then ranks all SNP combinations by score. The two-objective method, in contrast, finds a set of nondominated results, all of which are treated as equally good. To ensure fairness, for each one-objective method we take from the top of its ranking the same number of SNP combinations as the two-objective method returns. The comparison shows that the two-objective method is better than the one-objective methods on all three simulation data models, while the two single-objective methods give results similar to each other. These results show the effectiveness of the two-objective search and the insufficiency of a single objective function. The results are shown in Figure 2, whose abscissa is the minor allele frequency (MAF), set to 0.1, 0.2, and 0.5.
We generate the simulated datasets and choose the parameter settings following many previous studies [17, 42–44]. For each parameter combination, we generated 100 datasets, each containing 2,000 samples (1,000 cases and 1,000 controls) and 1,000 simulated SNPs. We evaluate the algorithm performance by calculating the proportion of true associations identified at the significance level 0.01 after Bonferroni correction. The parameter λ was set to 0.3 for ADDME and 0.2 for EIME and EITEME. The linkage disequilibrium between SNPs, measured by r2, ranges from 0.7 to 1.
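As a rough illustration of how strict the corrected threshold becomes, the sketch below assumes that the Bonferroni correction divides the 0.01 level by the number of SNP pairs among 1,000 SNPs. The text does not state the number of tests used for the correction, so this count is an assumption.

```python
from math import comb


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


n_pairs = comb(1000, 2)  # 499500 possible two-SNP combinations
threshold = bonferroni_threshold(0.01, n_pairs)
print(n_pairs, threshold)
```

Under this assumption a pair must reach a P value of roughly 2 × 10^-8 to be reported.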
Figure 2: Power test comparisons between one-objective and two-objective methods on the three models with MAF values 0.1, 0.2, and 0.5.
3.2. Compared with Benchmark Methods
After comparing with single objective functions, we compare our proposed method, FAACOSE, with existing benchmark methods [45]. Many previous studies have already discussed the parameter setting problem, and we set the parameters according to the existing strategies. We evaluated the performance of FAACOSE by comparing it with three methods: BEAM, the generic ACO algorithm, and AntEpiSeeker. Note that the generic ACO algorithm could not select larger SNP sets. We use the simulated datasets introduced in Section 3.1, generated with the BEAM package following the three genetic models ADDME, EIME, and EITEME, and we evaluate the algorithm performance by calculating the proportion of true associations identified at the significance level 0.01 after Bonferroni correction. Other parameters for data simulation were the effective size λ, a measure of marginal effects as defined by Marchini et al. [42], linkage disequilibrium between SNPs measured by r2, and minor allele frequencies (MAFs). λ was set to 0.3 for ADDME and 0.2 for EIME and EITEME. For r2, two values (0.7 and 1.0) were used for each model. For MAFs, three values (0.1, 0.2, and 0.5) were considered. The parameters for BEAM were set as default. The parameter settings for AntEpiSeeker were large dataset size = 6, small dataset size = 3, count large = 150, count small = 300, epistasis model = 2, ant count = 1000, α = 1, ρ = 0.05, and τ0 = 100 (also available in the software package documentation of AntEpiSeeker). The parameters of the generic ACO algorithm were set as ant count = 1000, α = 1, ρ = 0.05, τ0 = 100, count (number of iterations) = 900, and epistasis model = 2. The comparison of detection power for FAACOSE, BEAM, the generic ACO algorithm, and AntEpiSeeker is presented in Figure 3.
The results show that FAACOSE outperforms BEAM and the generic ACO in all parameter settings and is superior to AntEpiSeeker in most parameter settings.
Figure 3: Power comparisons between existing methods and FAACOSE on three models.
In this section, we compare our proposed method with benchmark methods in two ways. First, we use the power test to determine how many true SNP subsets can be found by our proposed method. Second, we use precision, recall, and F1 score to evaluate the results. Precision denotes the proportion of correct SNP subsets among all identified SNP subsets. Recall denotes the proportion of true SNP subsets that are identified. The F1 score is an indicator used in statistics to measure the accuracy of binary classification models. It takes into account precision and recall simultaneously and can be seen as their harmonic mean; its maximum is 1 and its minimum is 0. The results of FAACOSE and the other methods for r2 = 0.7 and MAF = 0.2 are shown in Table 1.
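Table 1: F1 score comparison between FAACOSE and other methods.
| Model | Method | Recall | Precision | F1 score |
| --- | --- | --- | --- | --- |
| ADDME | BEAM | 0.29 | 0.15 | 0.20 |
| ADDME | gACO | 0.45 | 0.36 | 0.40 |
| ADDME | AntEpiSeeker | 0.60 | 0.55 | 0.57 |
| ADDME | FAACOSE | 0.82 | 0.74 | 0.78 |
| EIME | BEAM | 0.30 | 0.45 | 0.36 |
| EIME | gACO | 0.35 | 0.32 | 0.33 |
| EIME | AntEpiSeeker | 0.34 | 0.56 | 0.42 |
| EIME | FAACOSE | 0.90 | 0.82 | 0.86 |
| EITEME | BEAM | 0.10 | 0.14 | 0.12 |
| EITEME | gACO | 0.15 | 0.20 | 0.17 |
| EITEME | AntEpiSeeker | 0.54 | 0.46 | 0.50 |
| EITEME | FAACOSE | 0.65 | 0.62 | 0.63 |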
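As a small check of how the F1 column relates to the other two columns, the following Python sketch recomputes F1 as the harmonic mean of precision and recall from the values in Table 1. The function name `f1_score` and the dictionary layout are illustrative choices, not part of the evaluation code behind the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) copied from Table 1.
table1 = {
    ("ADDME", "BEAM"): (0.29, 0.15, 0.20),
    ("ADDME", "gACO"): (0.45, 0.36, 0.40),
    ("ADDME", "AntEpiSeeker"): (0.60, 0.55, 0.57),
    ("ADDME", "FAACOSE"): (0.82, 0.74, 0.78),
    ("EIME", "BEAM"): (0.30, 0.45, 0.36),
    ("EIME", "gACO"): (0.35, 0.32, 0.33),
    ("EIME", "AntEpiSeeker"): (0.34, 0.56, 0.42),
    ("EIME", "FAACOSE"): (0.90, 0.82, 0.86),
    ("EITEME", "BEAM"): (0.10, 0.14, 0.12),
    ("EITEME", "gACO"): (0.15, 0.20, 0.17),
    ("EITEME", "AntEpiSeeker"): (0.54, 0.46, 0.50),
    ("EITEME", "FAACOSE"): (0.65, 0.62, 0.63),
}

for (model, method), (recall, precision, reported) in table1.items():
    computed = f1_score(precision, recall)
    print(f"{model:7s} {method:13s} computed={computed:.2f} reported={reported:.2f}")
```

The recomputed values agree with the reported F1 scores to two decimal places.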
The F1 score of FAACOSE is higher than that of the other methods for all three models. We ran the same experiment on datasets with different parameter combinations; FAACOSE has the highest F1 score in fifteen of the eighteen datasets. Because the sample size of a real GWAS dataset is huge, the efficiency of a method must also be considered. AntEpiSeeker is the most efficient of the three existing methods, so we compared the run times of AntEpiSeeker and FAACOSE on different data samples. Averaged over these samples, FAACOSE is about 30% faster than AntEpiSeeker, which indicates that our proposed method is better suited to real GWAS datasets.
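The eighteen datasets presumably correspond to the combinations of the three genetic models, two r² values, and three MAF values used for the simulation. Under that assumption, the grid can be enumerated with a short Python sketch; the variable names and the dictionary layout are hypothetical.

```python
from itertools import product

# Simulation parameters as stated in the text; the effect size lambda depends on the model.
MODELS = {"ADDME": 0.3, "EIME": 0.2, "EITEME": 0.2}
R2_VALUES = (0.7, 1.0)
MAF_VALUES = (0.1, 0.2, 0.5)

# One entry per simulated-dataset setting (assumed to be the full Cartesian product).
settings = [
    {"model": model, "lambda": lam, "r2": r2, "maf": maf}
    for (model, lam), r2, maf in product(MODELS.items(), R2_VALUES, MAF_VALUES)
]

print(len(settings))  # 18
```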
4. Application to Real SNP Dataset
Late-Onset Alzheimer’s Disease (LOAD) is the most frequent form of Alzheimer’s disease (AD) and is typically identified in people older than 65 years. It is a chronic neurodegenerative disease whose onset is frequently not obvious and which slowly progresses to dementia over time. It is the cause of 60% to 70% of cases of dementia. The most common early symptom is difficulty in remembering recent events (short-term memory loss). As the disease advances, symptoms can include problems with language, disorientation (including easily getting lost), mood swings, loss of motivation, not managing self-care, and behavioural issues. LOAD is a multifactor genetic disease whose etiology and pathogenesis are not yet fully understood. The apolipoprotein E (APOE) gene is an established risk factor for LOAD. The APOE gene has three forms: ε2, ε3, and ε4. The ε2 form is reported to have a protective effect against the disease, whereas the ε4 variant has been reported to increase the risk of the disease. Between 40 and 80% of people with AD possess at least one APOE ε4 allele [46]. Previous Genome-Wide Association Studies have reported some significant SNPs [47]; in particular, [47] reported that 10 SNPs in the region of the GAB2 gene have an epistatic effect with APOE ε4 in relation to LOAD.
We applied our proposed method to the LOAD GWAS dataset from the website https://www.tgen.org/ [47]. After data preprocessing, the real biological dataset contains 1368 samples [48, 49]. Of these, 836 samples are cases and the remaining 532 samples are controls [50, 51]. Each sample contains 309,316 SNPs with genotype information, APOE status, and LOAD status [52]. For the subsequent calculations, we coded the APOE status as a binary variable: the value 1 represents the ε4 variant and the value 0 represents the other variants [53]. Each SNP locus was coded as a quaternary variable to account for the missing state. The SNPs with high potential association with LOAD are shown in Table 2.
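The coding scheme can be illustrated with a minimal Python sketch. The text states only that each SNP is a four-state variable including a missing state and that APOE is binary; the specific mapping below (0, 1, 2 for the number of minor alleles, 3 for missing) and the function names are assumptions made for illustration.

```python
from typing import Optional

MISSING_CODE = 3  # assumed code for a missing genotype call


def encode_genotype(minor_allele_count: Optional[int]) -> int:
    """Map a genotype to one of four states: 0, 1, or 2 minor alleles, or missing."""
    if minor_allele_count is None:
        return MISSING_CODE
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1, 2, or None")
    return minor_allele_count


def encode_apoe(has_e4: bool) -> int:
    """Binary APOE status: 1 for the e4 variant, 0 otherwise."""
    return 1 if has_e4 else 0


# Hypothetical sample with four SNPs, one of which has no genotype call.
sample = [encode_genotype(g) for g in (0, 2, None, 1)]
print(sample)             # [0, 2, 3, 1]
print(encode_apoe(True))  # 1
```

Keeping the missing state as its own category avoids discarding samples with incomplete genotype calls.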
Table 2: SNPs selected by FAACOSE in the LOAD dataset.
SNP rs#
rs7756992
rs611154
rs191840
rs7294919
rs1887922
rs304900
rs1999764
rs1385600
rs2373115
rs7101429
rs609812
rs613375
rs1007837
rs2510038
rs4945261
rs10793294
rs520227
rs191740
rs7924284
rs829465
rs602106
rs7174511
rs606889
rs602192
5. Discussions
In this paper, we proposed a novel fast search method based on ant colony optimization for discovering epistatic interactions in large-scale real GWAS datasets. FAACOSE was evaluated through comparison with three existing approaches on both simulated and real datasets. FAACOSE, which adopts a fast adaptive optimization procedure with a two-objective function, is a modified algorithm derived from the generic ACO. To demonstrate the advantages of the fast adaptive ant colony optimization algorithm, we also compared the performance of FAACOSE with that of the generic ACO.
In future studies, we intend to explore more powerful modeling approaches, ant colony optimization algorithms with faster convergence, objective functions that better measure the data structure of GWAS datasets, more efficient search for the optimal SNP subset, and identification strategies that can be combined and flexibly embedded into our SNP epistasis search framework to find more accurate SNP subsets. With the rapid development of bioinformatics, more and more disease-related biological information is being identified, and more studies will take prior knowledge into account. An important future research direction is therefore to incorporate expert prior knowledge when applying our proposed method, the fast adaptive ant colony optimization algorithm for detecting SNP epistasis, to GWAS datasets, since such knowledge may improve the power and efficiency of epistasis detection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is partly supported by the National Natural Science Foundation of China (Grant nos. 61520106006, 31571364, 61732012, 61532008, U1611265, 61672382, 61402334, 61472280, 61472173, 61572447, 61672203, 61472282, and 61373098), the China Postdoctoral Science Foundation (Grant nos. 2014M561513, 2015M580352, 2017M611619, and 2016M601646), and the Guangxi Bagui Scholars Program Special Fund.
References
[1] Hirschhorn J. N., Daly M. J. Genome-wide association studies for common diseases and complex traits.
[2] Howie B. N., Donnelly P., Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.
[3] Manolio T. A., Collins F. S., Cox N. J., Goldstein D. B., Hindorff L. A., Hunter D. J., McCarthy M. I., Ramos E. M., Cardon L. R., Chakravarti A., Cho J. H., Guttmacher A. E., Kong A., Kruglyak L., Mardis E., Rotimi C. N., Slatkin M., Valle D., Whittemore A. S., Boehnke M., Clark A. G., Eichler E. E., Gibson G., Haines J. L., MacKay T. F. C., McCarroll S. A., Visscher P. M. Finding the missing heritability of complex diseases.
[4] Shastry B. S. SNP alleles in human disease and evolution.
[5] Stubbs B., Vancampfort D., De Hert M., Mitchell A. J. The prevalence and predictors of type two diabetes mellitus in people with schizophrenia: a systematic review and comparative meta-analysis.
[6] Liao K. P. Cardiovascular disease in patients with rheumatoid arthritis.
[7] Mao Y., London N. R., Ma L., Dvorkin D., Da Y. Detection of SNP epistasis effects of quantitative traits using an extended Kempthorne model.
[8] Zhang W., Zhu J., Schadt E. E., Liu J. S. A Bayesian partition method for detecting pleiotropic and epistatic eQTL modules.
[9] Kang M., Zhang C., Chun H.-W., Ding C., Liu C., Gao J. EQTL epistasis: detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways.
[10] Lin H., Chen D., Huang P., Liu Y., Ochoa A., Zabaleta J., Mercante D. E., Fang Z., Sellers T. A., Pow-Sang J. M., Cheng C., Eeles R., Easton D., Kote-Jarai Z., Amin Al Olama A., Benlloch S., Muir K., Giles G. G., Wiklund F., Gronberg H., Haiman C. A., Schleutker J., Nordestgaard B. G., Travis R. C., Hamdy F., Pashayan N., Khaw K., Stanford J. L., Blot W. J., Thibodeau S. N., Maier C., Kibel A. S., Cybulski C., Cannon-Albright L., Brenner H., Kaneva R., Batra J., Teixeira M. R., Pandha H., Lu Y., Park J. Y. SNP interaction pattern identifier (SIPI): an intensive search for SNP–SNP interaction patterns.
[11] Prentice R. L., Qi L. Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation.
[12] Deng S.-P., Zhu L., Huang D.-S. Mining the bladder cancer-associated genes by an integrated strategy for the construction and analysis of differential co-expression networks.
[13] Deng S.-P., Huang D.-S. SFAPS: an R package for structure/function analysis of protein sequences based on informational spectrum method.
[14] Moore J. H., Lamb J. M., Brown N. J., Vaughan D. E. A comparison of combinatorial partitioning and linear regression for the detection of epistatic effects of the ACE I/D and PAI-1 4G/5G polymorphisms on plasma PAI-1 levels.
[15] Michael B. M., Neapolitan R. E., Jiang X., Shyam V. Learning genetic epistasis using Bayesian network scoring criteria.
[16] Wang Y., Liu X., Robbins K., Rekaya R. AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm.
[17] Zhang Y., Liu J. S. Bayesian inference of epistatic interactions in case-control studies.
[18] Dorigo M., Birattari M., Blum C. Ant colony optimization and swarm intelligence.
[19] Stützle T., López-Ibáñez M., Pellegrini P., Maur M., Montes De Oca M., Birattari M., Dorigo M. Parameter adaptation in ant colony optimization.
[20] Blum C., Sampels M. An ant colony optimization algorithm for shop scheduling problems.
[21] Musa R., Arnaout J.-P., Jung H. Ant colony optimization algorithm to solve for the transportation problem of cross-docking network.
[22] Varela G. N., Sinclair M. C. Ant colony optimisation for virtual-wavelength-path routing and wavelength allocation. Proceedings of the 1999 Congress on Evolutionary Computation (CEC '99), July 1999, Washington, DC, USA, pp. 1809-1816. doi: 10.1109/CEC.1999.785494. Scopus: 2-s2.0-84901474801.
[23] Sim K. M., Sun W. H. Ant colony optimization for routing and load-balancing: survey and new directions.
[24] Ngo S.-H., Jiang X., Horiguchi S. Adaptive routing and wavelength assignment using ant-based algorithm. Proceedings of the 2004 12th IEEE International Conference on Networks, ICON 2004 - Unity in Diversity, November 2004, pp. 482-486. doi: 10.1109/ICON.2004.1409214. Scopus: 2-s2.0-21644461688.
[25] Vrieze S. I. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
[26] Huang D.-S., Du J.-X. A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks.
[27] North B. V., Curtis D., Sham P. C. Application of logistic regression to case-control association studies involving two causative loci.
[28] Jing P.-J., Shen H.-B. MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies.
[29] Ryman N. CHIFISH: a computer program testing for genetic heterogeneity at multiple loci using chi-square and Fisher's exact test.
[30] Mehta C. R., Patel N. R. A network algorithm for performing Fisher's exact test in r × c contingency tables.
[31] Sobrino B., Brión M., Carracedo A. SNPs in forensic genetics: a review on SNP typing methodologies.
[32] Shoval O., Sheftel H., Shinar G., Hart Y., Ramote O., Mayo A., Dekel E., Kavanagh K., Alon U. Evolutionary trade-offs, pareto optimality, and the geometry of phenotype space.
[33] Huang D.-S., Jiang W. A general CPL-AdS methodology for fixing dynamic parameters in dual environments.
[34] Zhu L., Guo W.-L., Deng S.-P., Huang D.-S. ChIP-PIT: enhancing the analysis of ChIP-seq data using convex-relaxed pair-wise interaction tensor decomposition.
[35] Angione C., Carapezza G., Costanza J., Lio P., Nicosia G. Pareto optimality in organelle energy metabolism analysis.
[36] Fisher R. A. On the interpretation of χ² from contingency tables, and the calculation of P.
[37] Agresti A. A survey of exact inference for contingency tables.
[38] Wenzheng B., Yuehui C., Dong W. Prediction of protein structure classes with flexible neural tree.
[39] Zhu L., You Z.-H., Huang D.-S., Wang B. t-LSE: a novel robust geometric approach for modeling protein-protein interaction networks.
[40] Zheng C.-H., Zhang L., Ng V. T.-Y., Shiu C. K., Huang D.-S. Molecular pattern discovery based on penalized matrix decomposition.
[41] Huang D.-S., Yu H.-J. Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids.
[42] Marchini J., Donnelly P., Cardon L. R. Genome-wide strategies for detecting multiple loci that influence complex diseases.
[43] Jiang R., Tang W., Wu X., Fu W. A random forest approach to the detection of epistatic interactions in case-control studies.
[44] Kruppa J., Ziegler A., König I. R. Risk estimation and risk prediction using machine-learning methods.
[45] Huang D.-S., Zheng C.-H. Independent component analysis-based penalized discriminant method for tumor classification using gene expression data.
[46] Mahley R. W., Weisgraber K. H., Huang Y. Apolipoprotein E4: a causative factor and therapeutic target in neuropathology, including Alzheimer's disease.
[47] Reiman E. M., Webster J. A., Myers A. J., Hardy J., Dunckley T., Zismann V. L., Joshipura K. D., Pearson J. V., Hu-Lince D., Huentelman M., Craig D. W., Coon K. D., Liang W. S., Herbert R. H., Beach T., Rohrer K. C., Zhao A. S., Leung D., Bryden L., Marlowe L., Kaleem M., Mastroeni D., Grover A., Heward C. B., Ravid R., Rogers J., Hutton M. L., Melquist S., Petersen R. C., Alexander G. E., Caselli R., Kukull W., Papassotiropoulos A., Stephan D. A. GAB2 alleles modify Alzheimer's risk in APOE ε4 carriers.
[48] Zheng C.-H., Huang D.-S., Zhang L., Kong X.-Z. Tumor clustering using nonnegative matrix factorization with gene selection.
[49] Deng S.-P., Zhu L., Huang D.-S. Predicting hub genes associated with cervical cancer through gene co-expression networks.
[50] Zhu L., Deng S.-P., Huang D.-S. A two-stage geometric method for pruning unreliable links in protein-protein networks.
[51] Huang D.-S., Zhang L., Han K., Deng S., Yang K., Zhang H. Prediction of protein-protein interactions based on protein-protein correlation using least squares regression.
[52] Huang D.-S.
[53] Huang D.-S. Radial basis probabilistic neural networks: model and application.