Understanding associations between genotypes and complex traits is a fundamental problem in human genetics. A major open problem in mapping phenotypes is identifying sets of interacting genetic variants that might contribute to complex traits. Logic regression (LR) is a powerful multi-variant association tool. Several LR-based approaches have been successfully applied to different datasets. However, these approaches fall short in both accuracy and efficiency. In this paper, we propose a new LR-based approach, called fish-swarm logic regression (FSLR), which improves the logic regression process by incorporating swarm optimization. In our approach, a school of fish agents runs in parallel. Each fish agent holds a regression model, while the school searches for better models through various preset behaviors. The swarm algorithm improves accuracy and efficiency by speeding up convergence and preventing the search from falling into local optima. We apply our approach to a real screening dataset and a series of simulation scenarios. Compared to three existing LR-based approaches, our approach has lower type I and type II error rates, identifies more preset causal sites, and runs faster.
Understanding the genotype-phenotype association is one of the major problems in human genetics. Much effort has been devoted to mapping complex traits with single or pairwise single nucleotide polymorphisms (SNPs). These studies were mainly supported by the “common disease-common variant (CDCV)” hypothesis [
In the broad-sense heritability model, quantitative research focuses on two types of interactions: genotype-by-genotype interactions, also known as epistasis, and genotype-by-environment interactions. In genotype-by-genotype interactions, the effect of one genetic variation is conditional on genotypes at one or more other unlinked loci, while in genotype-by-environment interactions, the effect of one genetic variation is conditional on environmental factors, such as behaviors and temperature [
Along with the growing evidence that genotype-by-genotype interactions are important contributors to genetic variation in complex human diseases, there are many different formulations for modeling both types of interactions [
Genetic studies now generate SNP data with thousands or millions of variants from more than ten thousand sampled individuals. A main deficiency of existing LR-based approaches is that they are not efficient enough to handle large-scale data. These approaches often suffer from slow convergence when searching for optimal solutions in a very large solution space. Because of the design of the logic tree (LT, the basic computational unit in logic regression), the size of the solution space of the logic trees grows factorially with the number of SNPs. One way of speeding up logic regression is to design a better regression algorithm. The greedy strategy [
Motivated by previous studies, in this paper we describe a novel regression algorithm within the logic regression framework. This new algorithm incorporates fish-swarm optimization [
Logic regression (LR), which was first proposed in [
Because of the combinatorial explosion of potential Boolean combinations, the logic tree (LT) model is suggested to represent a Boolean expression, where each leaf of a logic tree corresponds to a SNP site, while the internal nodes are associated with logical operators (e.g., AND or OR). A greedy strategy and a simulated annealing algorithm are designed separately to search for a logic tree that better fits the given genotype-phenotype dataset. Note that every Boolean expression can be represented as a logic tree; see Figure
Logic tree representation of
The search for a better logic tree proceeds by changing the components or modifying the topology of the current logic tree. In basic logic regression approaches, three tree operations are suggested: add, delete, and change; see Figure
However, many Boolean expressions can fit equally or almost equally well, and there are no universal algorithms to reduce the Boolean expressions. Furthermore, the best Boolean expression may be an overfitted expression rather than the true one. This situation occurs more frequently due to noisy data [
Fish-swarm optimization (artificial fish-swarm algorithm, AFSA) is a swarm optimization framework, which was first proposed by Li and others in [
Preying is a basic biological behavior that describes how a fish seeks food. When a fish perceives a concentration of food in the environment, the preying behavior determines its movement toward that concentration. In an AFSA, a concentration of food represents a solution that is better than the current solution where the fish agent is located, and the preying behavior describes how to reach the better solution from the current one. When a single fish or several fish find the concentration, adjacent members can trail them, and thus the swarm reaches the food more quickly. This process is called following. In an AFSA, the following behavior is imitated by comparing solutions among different fish agents, which significantly benefits the convergence speed. To enable the following behavior, the fish must stay assembled as a group to guarantee the existence of the colony and neighborhood relationships. On the other hand, no pair of fish can get too close, because food is limited. Thus, the swarming behavior assembles the fish but prevents them from becoming too dense. This behavior is important in an AFSA because it prevents the fish swarm from falling into local optima.
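The three behaviors can be illustrated on a toy problem. The following is a minimal sketch of a generic AFSA on a one-dimensional objective, not the implementation used in this paper: the objective `food`, the parameter names (`visual`, `step`, `crowd`), and their values are all assumptions chosen for illustration, and each fish simply takes the best position proposed by the three behaviors.

```python
import random

def food(x):
    """Toy objective ("food concentration"); its maximum is at x = 3."""
    return -(x - 3.0) ** 2

def step_toward(x, target, step):
    """Move from x toward target by a random length, never past the target."""
    dist = abs(target - x)
    if dist == 0:
        return x
    move = min(random.random() * step, dist)
    return x + move * (target - x) / dist

def neighbours(i, school, visual):
    """Positions of the other fish within the visual range of fish i."""
    return [y for j, y in enumerate(school) if j != i and abs(y - school[i]) <= visual]

def prey(x, visual, step, tries=5):
    """Preying: probe random points in the visual range; move toward a better one."""
    for _ in range(tries):
        probe = x + random.uniform(-visual, visual)
        if food(probe) > food(x):
            return step_toward(x, probe, step)
    return x  # nothing better was seen, so stay put

def follow(i, school, visual, step, crowd):
    """Following: trail the best neighbour if it is better and not crowded."""
    near = neighbours(i, school, visual)
    if not near:
        return None
    best = max(near, key=food)
    around_best = sum(abs(y - best) <= visual for y in school)
    if food(best) > food(school[i]) and around_best / len(school) < crowd:
        return step_toward(school[i], best, step)
    return None

def swarm(i, school, visual, step, crowd):
    """Swarming: move toward the neighbours' centre unless the area is too dense."""
    near = neighbours(i, school, visual)
    if not near:
        return None
    centre = sum(near) / len(near)
    if food(centre) > food(school[i]) and len(near) / len(school) < crowd:
        return step_toward(school[i], centre, step)
    return None

def afsa(n_fish=12, iters=200, visual=2.0, step=0.5, crowd=0.6, seed=0):
    random.seed(seed)
    school = [random.uniform(-10, 10) for _ in range(n_fish)]
    for _ in range(iters):
        for i in range(n_fish):
            proposals = [follow(i, school, visual, step, crowd),
                         swarm(i, school, visual, step, crowd),
                         prey(school[i], visual, step)]
            school[i] = max((p for p in proposals if p is not None), key=food)
    return max(school, key=food)

best = afsa()
```

The crowding threshold `crowd` is what keeps the school from collapsing onto a single point: a fish declines to follow or swarm when too large a fraction of the school is already gathered at the target.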
As an optimization framework, different behaviors may be considered and implemented for different problems and solution spaces. Overall, AFSA is suggested as one of the best swarm intelligence optimization methods due to its high convergence speed, flexibility, fault tolerance, and many other advantages [
Suppose that we are given a set of
Our new method is the fish-swarm logic regression algorithm. The main motivation for developing this algorithm is to conduct a more efficient and accurate regression process and to extend the algorithm to a parallel framework. To perform initializations, we first generate
Our approach takes advantage of swarm optimization. By incorporating a swarm framework, the algorithm searches the solution space from multiple starting points (different logic trees) instead of repeatedly modifying a single logic tree. This gives a higher probability of converging to a local or global optimum and thus speeds up the search process. In particular, we use the “fish agent” framework rather than other swarm intelligence frameworks because of the strong similarity between the mechanism of a fish swarm and the genotype-phenotype association problem. In a natural school of fish, a fish forages independently in a small space around it, while it might also follow other fish that could lead it to a space with more food. However, each fish always keeps a distance from the other fish, which controls the school density. This arrangement is one of the major differences between the fish swarm and other swarm algorithms. Intuitively, we would like to prevent logic trees from gathering together, because if they do, the algorithm might behave much like an algorithm with only one logic tree, and it could fall into a local optimum rapidly. Moreover, as mentioned before, selecting only the best logic tree is not sufficient; the mechanism of the fish swarm fits well with the problem and its requirements.
The fish-swarm algorithm models the natural environment and animal behaviors; however, we cannot blindly or mechanically copy this framework for two reasons: (1) the solution space that comprises the logic trees is significantly different from the 3D space (the natural environment), and (2) behaviors in the natural environment cannot be applied directly to the logic tree space.
We modify the fish agent framework to fit this specific problem. Suppose that we have generated
First, in the logic regression framework, there are three major unknown parameters: the number of logic trees in one regression model
We assume that the size of every logic tree has a prior distribution of
A logic tree consists not only of SNPs but also of the logic operators that connect them. The logic operators describe complex interactions among the SNPs. Different SNPs could have different functions, including “causal,” “neutral,” and “protective.” The causal variants increase the risk of being a case, while the protective variants decrease the risk. The neutral variants are considered to be independent of the phenotype. For the “additive” genetic model, the “AND” operators are adopted to connect the causal SNPs. For the “dominant” genetic model, the causal SNPs are connected by “OR” operators. If we split a logic tree into two sub-logic trees (subLTs) at an “OR” operator, according to the genetic model, each new subLT affects the phenotype independently, just as in the original logic tree. In other words, these two new subLTs still contain the same information as the original logic tree. This arrangement implies that splitting the logic tree at an “OR” operator will not cause information loss. Thus, if we split a logic tree at the “OR” operators recursively, we obtain a forest (a set of logic trees) that comprises the subLTs with only the “AND” operators.
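The split can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the FSLR implementation: a logic tree is encoded as a nested tuple `(operator, left, right)` with SNP names as leaves, negation is not modeled, and an “AND” node that sits above an “OR” node is handled by distributing the “AND” over the “OR” (a choice made here for the sketch). The function names and SNP labels are hypothetical.

```python
from itertools import product

def evaluate(tree, genotype):
    """Evaluate a logic tree on a dict mapping SNP name -> bool."""
    if isinstance(tree, str):              # leaf: a SNP site
        return genotype[tree]
    op, left, right = tree                 # internal node: ("AND" | "OR", left, right)
    l, r = evaluate(left, genotype), evaluate(right, genotype)
    return (l and r) if op == "AND" else (l or r)

def to_forest(tree):
    """Split a logic tree at its OR operators.

    Returns a set of frozensets; each frozenset holds the SNPs of one
    AND-only subLT (the order inside a subLT is irrelevant for AND).
    """
    if isinstance(tree, str):
        return {frozenset([tree])}
    op, left, right = tree
    left_forest, right_forest = to_forest(left), to_forest(right)
    if op == "OR":
        return left_forest | right_forest
    if op == "AND":
        # Assumption of this sketch: distribute AND over any ORs below it.
        return {a | b for a, b in product(left_forest, right_forest)}
    raise ValueError(f"unknown operator: {op}")

def evaluate_forest(forest, genotype):
    """The forest is true if any AND-only subLT is fully satisfied."""
    return any(all(genotype[snp] for snp in sub) for sub in forest)

tree = ("OR", ("AND", "snp1", "snp2"), ("OR", "snp3", ("AND", "snp4", "snp5")))
forest = to_forest(tree)

# Check on every genotype that the forest carries the same information.
snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
same = all(
    evaluate(tree, dict(zip(snps, bits))) == evaluate_forest(forest, dict(zip(snps, bits)))
    for bits in product([False, True], repeat=len(snps))
)
```

Storing each subLT as an unordered set reflects the point that, once only “AND” operators remain, the topology of a subLT carries no information.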
We highlight the split process for two reasons.
The subLTs in the forest contain only “AND” operator(s), and thus the topologies of these subLTs no longer matter because of the commutative law. The differences between any two subLTs, subLT
The forest represents all of the information of an original logic tree, and thus the differences between two logic trees are computable by measuring the forests that are derived from them. For example, the number of subLTs in the forest is determined by the number of “OR” operators in the logic tree (each split at an “OR” operator adds one subLT).
In summary, we define a three-dimensional hyperspace as follows: the first dimension, the scalar
Once we have the search space, fish agents conduct behaviors that search the solution space simultaneously. Defining the behaviors that regulate the search strategy is therefore another important part of a swarm algorithm. Behaviors often depend on the solution space that they work on. For the specific logic tree space, note that the mapping from the set of all possible logic trees to the solution space is many-to-one rather than bijective: one point in the logic tree space could correspond to multiple logic trees. This correspondence occurs because of the complexity of both the tree topology and the logic operators. For a bijective solution space, defining the swarm behaviors is adequate in most cases; however, in the logic tree space, the fish agent should also have behaviors of its own, in addition to the swarm behaviors, to update the logic tree that it holds even when it keeps its location in the space. In this section, we will describe the behaviors for a fish agent, while in Section
ADD SNP: select a SNP and add it to the LT.
DEL SNP: select a SNP on the LT and remove it from all of the subLTs.
ALT SNP: select a SNP on the LT and replace it with another SNP.
ALT OPT: select an operator on the LT and replace it with the opposite operator.
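The four agent behaviors can be sketched as pure functions on a logic tree. This is an illustrative sketch, not the FSLR code: the nested-tuple encoding `(operator, left, right)`, the function names, the default of attaching a new SNP at the root with “AND”, and the use of a left/right path to pick an operator are all assumptions made for the example.

```python
def add_snp(tree, snp, op="AND"):
    """ADD SNP: attach a new SNP at the root with the given operator."""
    return (op, tree, snp)

def del_snp(tree, snp):
    """DEL SNP: remove every occurrence of snp; returns None if nothing is left."""
    if isinstance(tree, str):
        return None if tree == snp else tree
    op, left, right = tree
    left, right = del_snp(left, snp), del_snp(right, snp)
    if left is None:
        return right          # the operator collapses onto the surviving child
    if right is None:
        return left
    return (op, left, right)

def alt_snp(tree, old, new):
    """ALT SNP: replace every occurrence of SNP `old` with SNP `new`."""
    if isinstance(tree, str):
        return new if tree == old else tree
    op, left, right = tree
    return (op, alt_snp(left, old, new), alt_snp(right, old, new))

def alt_opt(tree, path):
    """ALT OPT: flip AND <-> OR at the internal node reached by `path`.

    `path` is a sequence of 0 (left child) and 1 (right child) and must
    end at an internal node, not at a leaf.
    """
    op, left, right = tree
    if not path:
        return ("OR" if op == "AND" else "AND", left, right)
    if path[0] == 0:
        return (op, alt_opt(left, path[1:]), right)
    return (op, left, alt_opt(right, path[1:]))

tree = ("OR", ("AND", "snp1", "snp2"), "snp3")
added = add_snp(tree, "snp4")
deleted = del_snp(tree, "snp2")
altered = alt_snp(tree, "snp3", "snp9")
flipped = alt_opt(tree, [0])
```

Each function returns a new tree and leaves its input unchanged, so a fish agent can keep its previous logic tree in case the proposed one is rejected.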
The probability distribution of choosing a behavior affects the preferences of the behaviors. The simplest way is to adopt a uniform distribution; that is, each behavior is chosen with the same probability, 25%. However, to accelerate the convergence, it is better to reflect preferences among the behaviors. Suppose that, after one iteration, the fish agent that holds the logic tree with the highest score is announced. Let this fish agent be
Furthermore, for a specific SNP, we should also consider the probability that this SNP will be chosen. We obtain the probability distribution of selecting a SNP by measuring the importance of each SNP. The measurement of importance is a statistic [
To compute the importance, each fish agent records the correctly classified out-of-bag (OOB) observations. Let
Suppose there is an index vector
Here, we continue to introduce the swarm behaviors. The behaviors of fish
have the same size as
have the same SNPs as
To achieve this goal,
If
If
If
“FOLLOW” behavior is illustrated. When
change to a different size from
select different SNPs with
To achieve this goal,
If
If
If
After applying a series of behaviors, each fish agent holds a new logic tree. If the new logic tree obtains a higher score than the previous logic tree, that is, if it explains more genotypes, then it is accepted and replaces the previous one. Otherwise, the new logic tree is rejected with a probability of
Each fish agent could store a local optimal logic tree during the search process, while the whole swarm always announces the current best logic tree (the global optimal). After several iterations, the reversible jump method is implemented. In other words, the acceptance probability of a newly proposed logic tree could decrease, but it might be closer to the best LT in the current iteration.
In addition, we consider a stepwise regression process. The stepwise regression eliminates insignificant SNPs iteratively and drops them off. The stepwise mechanism checks the active SNPs (SNPs not removed) every
Finally, when
We first apply our fish-swarm logic regression (FSLR) approach to a real screening dataset and then apply it to a series of simulated datasets under different configurations to test the performance of our approach compared to other logic regression-based approaches. The software tool, FSLR, is available at
Three existing LR-based approaches are compared, which are Monte Carlo logic regression (MCLR) [
The real dataset is from our own study, which focuses on the genetic association between the dopamine receptor D1 (DRD1) gene polymorphisms and the risk of opioid dependence. Seven possible functional single nucleotide polymorphisms, rs4867798, rs1799914, rs686, rs4532, rs5326, rs10063995, and rs10078866, in the regulatory or coding regions of DRD1 were identified by DNA sequencing in 20 heroin addicts and were further genotyped in 425 heroin addicts and 514 healthy controls.
Several genes that encode dopamine receptors have been confirmed to be associated with a risk of heroin addiction. Our previous studies [
We applied our approach, FSLR, to this dataset. When considering the homozygote mutations, the logic regression model reports the highest score, which is 516 (among 939 individuals). rs4532 is the SNP with the highest importance. Two interactions, rs4532-rs686 and rs10078866-rs4532, are accepted much more often than other interactions. When considering both the homozygote and the heterozygote mutations, two interactions, rs4532-rs1799914 and rs1799914-rs686, are accepted much more often than the others, with the highest score being 518. These results, which are for candidate associations, are supported by clinical knowledge.
For each simulation configuration, we generate 100 datasets. All of the datasets are generated by the
In the following sections, we will present the comparison results on three aspects: (1) the accuracy of each approach (measured by the type I and type II error rates), (2) the performance under different levels of risk and different levels of noise, and (3) the running time. To ensure confidence in the results, we conducted 100 repeats for each configuration used in the comparison.
We first compared the accuracy, measured separately by the type I error rate and the type II error rate. The type I error rate is computed as the percentage of missed causal sites divided by the number of selected SNPs, while the type II error rate is computed as the percentage of wrong selections of noncausal SNPs among all of the SNPs involved in a regression model. The given datasets always have 1000 sites for every genotype, but the number of causal sites varies from 10 to 100 among the 1000 sites. In other words, the proportion of causal variants increases from 1% to 10%.
The results of the type I and type II error rates are compared in Table
Accuracy for different numbers of causal SNPs. The column “Causal” shows the number of causal sites. The type I error rate is the percentage of missed causal sites divided by the number of selected SNPs. The type II error rate is the percentage of wrong selections of noncausal SNPs among all of the SNPs involved in a regression model. For each simulation configuration, the number is computed based on 100 repeats.
Causal  FSLR Type I  FSLR Type II  MCLR Type I  MCLR Type II  FBLR Type I  FBLR Type II  LogicFS Type I  LogicFS Type II
10  0.65%  65.00%  1.38%  88.30%  0.52%  52.00%  0.63%  63.00% 
20  1.38%  69.00%  1.21%  94.75%  1.34%  67.00%  1.47%  73.50% 
30  1.75%  58.33%  1.20%  96.13%  2.15%  71.67%  2.21%  73.67% 
40  2.53%  63.25%  1.18%  97.30%  3.02%  75.50%  3.22%  80.50% 
50  3.72%  69.40%  1.14%  97.64%  4.05%  81.00%  3.98%  79.60% 
60  3.80%  63.33%  1.10%  97.90%  4.73%  78.83%  4.90%  81.67% 
70  4.62%  66.00%  1.08%  98.17%  5.78%  82.57%  5.82%  83.14% 
80  5.40%  67.50%  1.09%  98.48%  6.24%  78.00%  6.58%  82.25% 
90  5.38%  59.79%  1.10%  98.91%  7.24%  80.44%  7.67%  85.22% 
100  6.44%  64.40%  1.05%  98.40%  7.76%  77.60%  8.47%  84.70% 
Comparisons on identifying preset causal sites. The column “Causal” shows the number of causal sites. A column under the name of an approach shows the average number (among 100 repeats) of successfully identified preset causal sites out of the number of causal sites.
Causal  FSLR  MCLR  FBLR  LogicFS
10  3.5  1.23  4.8  3.7 
20  6.2  1.05  6.6  5.3 
30  12.5  1.16  8.5  7.9 
40  14.7  1.08  9.8  7.8 
50  12.8  1.18  9.5  10.2 
60  22.0  1.26  12.7  11.0 
70  23.8  1.28  12.2  11.8 
80  26.0  1.22  17.6  14.2 
90  36.2  0.98  17.5  13.3 
100  35.6  1.60  22.4  15.3 
We also compared the accuracy under different levels of risk and different levels of noise. All of the datasets applied in this group of experiments have a total of 10 preset causal sites among the 1000 sites. We first varied the levels of risk from 5% to 15%; then, we varied the levels of noise on the haplotypes from 1% to 3%.
The results are compared in Tables
Accuracy under different levels of risk and noise. The level of risk is equal to the probability of the phenotype being the same as the output of the Boolean expression. The level of noise is equal to the probability of randomly altering an allelic value from wild type to mutation or from mutation to wild type. The type I and type II error rates are computed in the same way as in the accuracy comparison above. For each simulation configuration, the number is computed based on 100 repeats.
Level  FSLR Type I  FSLR Type II  MCLR Type I  MCLR Type II  FBLR Type I  FBLR Type II  LogicFS Type I  LogicFS Type II
Risk
5%  12.8%  58.80%  1.16%  98.88%  8.70%  73.60%  6.80%  71.80%
10%  12.7%  59.00%  1.12%  98.44%  8.90%  77.60%  6.70%  73.40%
15%  12.3%  59.20%  1.19%  98.92%  8.90%  74.40%  6.60%  73.90%
Noise
1%  12.5%  59.00%  1.17%  98.56%  8.90%  77.80%  6.70%  73.40%
2%  13.5%  58.00%  1.17%  98.76%  9.00%  81.80%  6.80%  74.60%
3%  14.8%  56.60%  1.08%  99.19%  8.90%  78.40%  7.30%  76.60%
Comparisons on identifying preset causal sites with risks and noise. A column under the name of an approach shows the average number (among 100 repeats) of successfully identified preset causal sites under the particular level of risk or noise.
Level  FSLR  MCLR  FBLR  LogicFS
Risk
5%  20.6  0.72  11.1  13.3
10%  21.0  0.62  9.1  13.7
15%  21.7  0.50  10.8  11.7
Noise
1%  20.6  0.56  13.2  14.1
2%  20.5  1.28  11.2  13.3
3%  19.9  0.54  12.8  12.5
In addition, we compare the running time among FSLR, MCLR, FBLR, and LogicFS. We record the average running time on 1000 repeats. Because both MCLR and FBLR rely on Markov chain Monte Carlo (MCMC) to seek a better regression model, the number of MCMC iterations might dominate the running time. Following the suggestions in the papers, we preset 100,000 iterations with an additional 10,000 burn-in iterations. LogicFS is preset to 20 iterations with bootstrap sampling. FSLR is run on a cluster of 12 laptops. The running times are shown in Table
Comparisons on running time. The running time is measured in seconds.
Causal  FSLR  MCLR  FBLR  LogicFS
10  17.43  56.59  1659  12.23 
20  18.48  53.64  1559  12.50 
30  18.72  58.37  1603  12.12 
40  18.96  57.76  1463  11.88 
50  19.45  58.31  1520  12.10 
60  19.94  57.72  1418  12.43 
70  20.69  59.58  1482  12.11 
80  22.49  58.04  1366  12.57 
90  24.35  58.54  1466  12.79 
100  24.35  59.13  1346  12.65 
In this paper, we present a novel logic regression-based approach, fish-swarm logic regression (FSLR), to detect the interacting SNPs that are associated with a phenotype. We designed a new regression algorithm, which incorporates the advantages of a swarm framework, to improve both the accuracy and the efficiency of logic regression. In contrast to previous swarm algorithms, in this approach, we design a specific solution space into which all possible logic trees are mapped. Then, two types of behaviors, agent behaviors and swarm behaviors, are defined to govern the search strategy. A series of simulation experiments is performed to compare the accuracy of FSLR with that of three existing logic regression-based approaches under different scenarios. The running times of the approaches are also collected. Our approach, fish-swarm logic regression, often outperforms the other approaches in terms of accuracy under different simulation configurations, and on a parallel framework it runs faster than MCLR and FBLR.
The authors declare that they have no competing financial interests.
Jiayin Wang, Xuanping Zhang, and Chunxia Yan conducted this research. Jiayin Wang, Aiyuan Yang, and Feng Zhu designed algorithms and experiments. Aiyuan Yang and Zhi Cao developed the software packages and participated in the performance analysis and the experiments on the real dataset. Jiayin Wang, Chunxia Yan, and Zhongmeng Zhao wrote this paper. All authors have read and approved the final manuscript.
This work was supported by the National Science Foundation [CCF1116175], the Ph.D. Programs Foundation of the Ministry of Education of China [20100201110063], and the National Science Foundation of China [81172903].